four things tagged “language”

Stop Words

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who”, “The The”, or “Take That”. Other search engines remove some of the most common words—including lexical words, such as “want”—from a query in order to improve performance.

Wikipedia

I was hitting Algolia’s search limits and had to remove words I didn’t care about searching, like “and”, “only”, “there”, or “I’ve”, to shrink the size of this site’s posts in the search index. There are quite a few stop word lists on the internet, and I ended up using a few of them for significant (> 65% on average) size reductions in the search corpora.
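
A minimal Python sketch of that kind of pre-index filtering, not the code used on this site. The `STOP_WORDS` set and the `strip_stop_words` helper are hypothetical names for illustration, and a real list would be merged from the published ones.

```python
import re

# Hypothetical stop word list; a real one would be merged from published lists.
STOP_WORDS = {"and", "only", "there", "i've", "the", "is", "at", "which", "on"}


def strip_stop_words(text: str) -> str:
    """Drop stop words from text before it is sent to a search index."""
    # Normalize curly apostrophes so that "I’ve" matches "i've".
    normalized = text.replace("\u2019", "'")
    # Split into word-like tokens, discarding punctuation.
    tokens = re.findall(r"[A-Za-z0-9']+", normalized)
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(kept)


post = "I’ve only been there once, and the search index is huge."
slim = strip_stop_words(post)
print(slim)  # been once search index huge
print(f"size reduction: {1 - len(slim) / len(post):.0%}")
```

The same function turns “The Who” into just “Who”, which is exactly the phrase search problem described in the quote above.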

The Scunthorpe Problem

The Scunthorpe problem (or the Clbuttic Mistake) is the unintentional blocking of websites, e-mails, forum posts or search results by a spam filter or search engine because their text contains a string of letters that appear to have an obscene or otherwise unacceptable meaning.

Wikipedia

Examples would be: shitake mushrooms, Herman I. Libshitz, magna cum laude, Arun Dikshit.
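
Here is a rough Python sketch of why these get caught, using the examples above. The `BLOCKED` list is an assumption inferred from those examples, not taken from any real filter, and both function names are made up for illustration.

```python
import re

# Assumed blocklist, inferred from the examples above; not from any real filter.
BLOCKED = ["shit", "cum"]


def naive_is_blocked(text: str) -> bool:
    """Substring match: the approach that causes the Scunthorpe problem."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED)


# Whole-word pattern: \b anchors each term at word boundaries.
WORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKED)) + r")\b",
    re.IGNORECASE,
)


def word_is_blocked(text: str) -> bool:
    """Whole-word match: only flags a term that stands alone as a word."""
    return WORD_PATTERN.search(text) is not None


examples = ["shitake mushrooms", "Herman I. Libshitz", "magna cum laude", "Arun Dikshit"]
for text in examples:
    print(f"{text!r}: naive={naive_is_blocked(text)}, whole-word={word_is_blocked(text)}")
```

The substring check flags all four. Whole-word matching clears three of them but still flags “magna cum laude”, so word boundaries alone are not a complete fix; an allowlist or some context awareness would be needed on top.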

Naughty Letter Frequencies in English

Here’s a community-maintained "List of Dirty, Naughty, Obscene, and Otherwise Bad Words" across various languages on GitHub. I was curious about a naïve frequency distribution of consonants across the English-language list (NSFW, obviously) and wrote a small script. Here are the results:

Letter Count
t 211
s 208
n 193
r 186
l 167
g 147
c 124
b 121
p 116
h 97
d 91
m 91
k 72
y 70
f 48
w 41
v 29
j 21
x 19
z 7
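
The original script isn’t shown, but a count like the one above could be produced with something along these lines. This is a sketch under assumptions: the `consonant_counts` name is made up, consonants are taken to be every ASCII letter except a, e, i, o, u (which matches the 21 letters in the table), and the `sample` words are harmless placeholders for the real list, which would be read from a file with one entry per line.

```python
from collections import Counter

VOWELS = set("aeiou")


def consonant_counts(words):
    """Count every ASCII consonant across an iterable of words."""
    counts = Counter()
    for word in words:
        for ch in word.lower():
            # Skip vowels, digits, spaces, punctuation, and non-ASCII letters.
            if ch.isascii() and ch.isalpha() and ch not in VOWELS:
                counts[ch] += 1
    return counts


# Placeholder words standing in for the real list.
sample = ["stop", "words", "search", "index"]

# most_common() sorts by count, highest first, like the table above.
for letter, count in consonant_counts(sample).most_common():
    print(letter, count)
```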
q 5

Not sure what I’m going to do with this information but here it is. 🤬