seven things tagged “language”

Telugu Script Components

Telugu is a phonetic language, written from left to right, with each character representing generally a syllable. There are 52 letters in the Telugu alphabet: 16 Achchulu which denote basic vowel sounds, 36 Hallulu which represent consonants. In addition to these 52 letters, there are several semi-vowel symbols, called Maatralu, which are used in conjunction with Hallulu and, half consonants, called Voththulu, to form clusters of consonants.

Improved Symbol Segmentation for TELUGU Optical Character Recognition

Whence “Gubernatorial”?

I’m put off by the word “gubernatorial” whenever I see it. Seems very silly, saccharine, like something a 5-year-old mispronounced in 1953 that just stuck because it was so cute 🙄

Nope.

“Because, if you go back to where this word came from, in the original Latin, it’s from the verb, gubernare and gubernator, one who governs,” [Lisa McLendon, professor, University of Kansas School of Journalism] says.

Then, “governor, with the ‘v,’ came into English from French in about the 14th century,” she says. “French had taken the Latin and they swapped the ‘b’ for a ‘v.’”

English speakers went back to the “b” about 400 years later, but just for gubernatorial. And, there’s the split.

Where Does The Term ‘Gubernatorial’ Come From?, NPR

250 Bullshit Words by Unknown More Pasta

Here’s some Buzzword Bingo based on these words by the same company.

  • accelerate
  • accountability
  • action items
  • actionable
  • aggregator
  • agile
  • algorithm
  • alignment
  • analytics
  • at the end of the day
  • B2B/B2C
  • bandwidth
  • below the fold
  • best of breed
  • best practices
  • beta
  • big data
  • bleeding edge
  • blueprint
  • boil the ocean
  • bottom line
  • bounce rate
  • brand evangelist
  • bricks and clicks
  • bring to the party
  • bring to the table
  • brogrammer
  • BYOD
  • change agent
  • clickthrough
  • close the loop
  • codify
  • collaboration
  • collateral
  • come to Jesus
  • content strategy
  • convergence
  • coopetition
  • create value
  • credibility
  • cross the chasm
  • cross-platform
  • cross-pollinate
  • crowdfund
  • crowdsource
  • curate
  • cutting-edge
  • data mining
  • deep dive
  • design pattern
  • digital divide
  • digital natives
  • discovery
  • disruptive
  • diversity
  • DNA
  • do more with less
  • dot-bomb
  • downsizing
  • drink the Kool Aid
  • DRM
  • e-commerce hairball
  • eat your own dog food
  • emerging
  • empathy
  • enable
  • end-to-end
  • engagement
  • engaging
  • enterprise
  • entitled
  • epic
  • evangelist
  • exit strategy
  • eyeballs
  • face time
  • fail fast
  • fail forward
  • fanboy
  • finalize
  • first or best
  • flat
  • flow
  • freemium
  • funded
  • funnel
  • fusion
  • game changer
  • gameify
  • gamification
  • glamour metrics
  • globalization
  • green
  • groupthink
  • growth hack
  • guru
  • headlights
  • heads down
  • herding cats
  • high level
  • holistic
  • homerun
  • html5
  • hyperlocal
  • i _______
  • iconic
  • ideation
  • ignite
  • immersive
  • impact
  • impressions
  • in the weeds
  • infographic
  • innovate
  • integrated
  • IoT
  • jellyfish
  • knee deep
  • lean
  • lean in
  • let’s shake it and see what falls off
  • let’s socialize this
  • let’s table that
  • level up
  • leverage
  • like _______ for _______
  • lizard brain
  • long tail
  • low hanging fruit
  • make it pop
  • make the logo bigger
  • maker
  • marketing funnel
  • mashup
  • milestone
  • mindshare
  • mobile-first
  • modernity
  • monetize
  • moving forward
  • multi-channel
  • multi-level
  • MVP
  • netiquette
  • next gen
  • next level
  • ninja
  • no but, yes if
  • offshoring
  • on the runway
  • open the kimono
  • operationalize
  • opportunity
  • optimize
  • organic
  • out of pocket
  • outside the box
  • outsourcing
  • over the top
  • paradigm shift
  • patent pending design
  • peeling the onion
  • ping
  • pipeline
  • pivot
  • pop
  • portal
  • proactive
  • productize
  • proof of concept
  • public facing
  • pull the trigger
  • push the envelope
  • put it in the parking lot
  • qualified leads
  • quick-win
  • reach out
  • Ready. Fire. Aim.
  • real time
  • rearranging the deck chairs on the Titanic
  • reimagining
  • reinvent the wheel
  • responsive
  • revolutionize
  • rich
  • rightshoring
  • rightsizing
  • rockstar
  • ROI
  • run it up the flagpole
  • scalability
  • scratch your own itch
  • scrum
  • sea change
  • seamless
  • SEM
  • SEO
  • sexy
  • shift
  • sizzle
  • slam dunk
  • social currency
  • social media
  • social media expert
  • social proof
  • soft launch
  • solution
  • stakeholder
  • standup
  • startup
  • stealth mode
  • stealth startup
  • sticky
  • storytelling
  • strategery
  • strategy
  • sustainability
  • sweat your assets
  • synergy
  • take it offline
  • team building
  • tee off
  • the cloud
  • the mayor of _________
  • thought leader
  • tiger team
  • tollgate
  • top of mind
  • touch base
  • touchpoints
  • transgenerate
  • transparent
  • trickthrough
  • uber
  • unicorn
  • uniques
  • unpack
  • user
  • usercentric
  • value proposition
  • value-add
  • vertical cross-pollination
  • viral
  • visibility
  • vision
  • Web 2.0
  • webinar
  • what is our solve
  • what’s the ask?
  • win-win
  • wizard

Stop Words

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who”, “The The”, or “Take That”. Other search engines remove some of the most common words—including lexical words, such as “want”—from a query in order to improve performance.

Wikipedia

I was hitting Algolia’s search limits and had to remove words I didn’t care about searching, like “and”, “only”, “there”, or “I’ve”, in an attempt to shrink the size of the posts on this site in the search index. There are quite a few stop-word lists on the internet, and I ended up using a few of them for significant (> 65% on average) size reductions in the search corpora.
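
For illustration, here is a minimal sketch of that kind of filtering in Python. It is not the script used on this site: the `STOP_WORDS` set and the `strip_stop_words` helper are made-up names, and the tiny word list stands in for the much longer published lists.

```python
import re

# Hypothetical stop list for illustration only; a real one would be
# merged from one or more published stop-word lists.
STOP_WORDS = frozenset(
    {"and", "only", "there", "i've", "the", "is", "at", "which", "on"}
)


def strip_stop_words(text: str, stop_words: frozenset = STOP_WORDS) -> str:
    """Return the text with stop words dropped and word order preserved."""
    # Normalise curly apostrophes so that "I’ve" matches "i've".
    normalised = text.replace("\u2019", "'")
    # Keep runs of letters, digits, and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9']+", normalised)
    kept = [token for token in tokens if token.lower() not in stop_words]
    return " ".join(kept)


post = "I’ve only been there once, and the search index is huge"
print(strip_stop_words(post))
print(strip_stop_words("The Who"))
```

The second call shows the phrase-search problem from the Wikipedia excerpt: once “the” is on the list, “The Who” is indexed as just “Who”.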

The Scunthorpe Problem

The Scunthorpe problem (or the Clbuttic Mistake) is the unintentional blocking of websites, e-mails, forum posts or search results by a spam filter or search engine because their text contains a string of letters that appear to have an obscene or otherwise unacceptable meaning.

Wikipedia

Examples would be: shitake mushrooms, Herman I. Libshitz, magna cum laude, Arun Dikshit.
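
The “Clbuttic” nickname comes from filters that replace a blocked string wherever it appears, even inside a longer word. A minimal sketch of the failure, and of one common mitigation (matching whole words only), might look like this. The `REPLACEMENTS` table and both function names are assumptions made up for the example, not taken from any real filter.

```python
import re

# Hypothetical one-entry replacement table, purely for illustration.
REPLACEMENTS = {"ass": "butt"}


def naive_censor(text: str) -> str:
    """Replace blocked strings anywhere, including inside longer words."""
    for bad, replacement in REPLACEMENTS.items():
        text = text.replace(bad, replacement)
    return text


def whole_word_censor(text: str) -> str:
    """Replace blocked strings only when they stand alone as a word."""
    for bad, replacement in REPLACEMENTS.items():
        # \b anchors the match to word boundaries on both sides.
        pattern = rf"\b{re.escape(bad)}\b"
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(naive_censor("a classic mistake"))       # a clbuttic mistake
print(whole_word_censor("a classic mistake"))  # a classic mistake
```

Whole-word matching only reduces the problem. In “magna cum laude” the offending string is a complete word, so a boundary-aware filter would still flag it.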

Naughty Letter Frequencies in English

Here’s a community-maintained “List of Dirty, Naughty, Obscene, and Otherwise Bad Words” across various languages on GitHub. I was curious about a naïve frequency distribution of consonants across the English-language corpus (NSFW, obviously) and wrote a small script. Here are the results:

Letter Count
t 211
s 208
n 193
r 186
l 167
g 147
c 124
b 121
p 116
h 97
d 91
m 91
k 72
y 70
f 48
w 41
v 29
j 21
x 19
z 7
q 5
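
The original script isn’t reproduced here, but a count like this can be sketched in a few lines of Python. The `consonant_counts` function is a made-up name, the sample words are harmless placeholders rather than entries from the real list, and treating “y” as a consonant is an assumption chosen to match the table above.

```python
from collections import Counter

VOWELS = set("aeiou")  # "y" is counted as a consonant, as in the table


def consonant_counts(words):
    """Count how often each ASCII consonant appears across all the words."""
    counts = Counter()
    for word in words:
        for ch in word.lower():
            # Skip vowels, digits, spaces, punctuation, and non-ASCII letters.
            if ch.isascii() and ch.isalpha() and ch not in VOWELS:
                counts[ch] += 1
    return counts


# Placeholder words; the real input would be the word list, one entry per line.
sample = ["darn", "heck", "drat"]
for letter, count in consonant_counts(sample).most_common():
    print(letter, count)
```

`Counter.most_common()` returns the letters sorted from most to least frequent, which is the ordering used in the table.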

Not sure what I’m going to do with this information but here it is. 🤬