four things tagged “language”
I was hitting Algolia’s search limits and had to remove words I didn’t care about searching, like “and”, “only”, “there”, or “I’ve”, in an attempt to shrink the size of the posts on this site in the search index. There are quite a few lists of these “stop words” on the internet, and I ended up using a few of them for significant (> 65% average) size reductions in the search corpora.
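For illustration, here is a minimal sketch of that kind of stop-word stripping in Python. It is not the code used on this site: the `STOP_WORDS` set is a tiny hypothetical stand-in for the much longer published lists, and `strip_stop_words` is a made-up helper name.

```python
import re

# Hypothetical, tiny stop-word list; the real lists are far longer.
STOP_WORDS = {"and", "only", "there", "i've", "the", "a", "of", "to"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Return text with stop words (and punctuation) removed."""
    # Normalise curly apostrophes so "I’ve" matches "i've" in the list.
    normalised = text.replace("\u2019", "'")
    # Keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[A-Za-z0-9']+", normalised)
    # Compare case-insensitively, but keep the original casing of kept words.
    return " ".join(t for t in tokens if t.lower() not in stop_words)


if __name__ == "__main__":
    print(strip_stop_words("I\u2019ve only been there and back"))
```

The stripped text would only go into the search index; the posts themselves stay untouched.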
Examples would be: shitake mushrooms, Herman I. Libshitz, magna cum laude, Arun Dikshit.
Here’s a community-maintained "List of Dirty, Naughty, Obscene, and Otherwise Bad Words" across various languages on GitHub. I was curious about a naïve frequency distribution of consonants across the English-language corpus (NSFW, obviously) and wrote a small script. Here are the results:
Not sure what I’m going to do with this information but here it is. 🤬
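A naïve consonant count like the one described above could be sketched as follows. This is not the original script: `consonant_frequencies` is a hypothetical helper, it assumes the word list is already loaded as a list of strings, and it treats "y" as a consonant.

```python
from collections import Counter

# Assumption: "y" counts as a consonant; only ASCII letters are considered.
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def consonant_frequencies(words):
    """Map each consonant to its share of all consonants in `words`."""
    counts = Counter(
        ch for word in words for ch in word.lower() if ch in CONSONANTS
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from most to least frequent.
    return {ch: n / total for ch, n in counts.most_common()}


if __name__ == "__main__":
    # Harmless placeholder words instead of the actual (NSFW) list.
    for ch, share in consonant_frequencies(["ball", "bat"]).items():
        print(f"{ch}: {share:.0%}")
```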