Archive for the ‘natural language’ Category
A recent trend on Facebook is to create a wordcloud of all of your posts using an external service. I chose not to use it because I tend to use Facebook for personal interactions among close friends and I didn’t want to send all of my data to another external company.
Twitter is a different matter, however! All of the data is open and it’s very easy to write a computer program to generate Twitter wordclouds without the need for an external service.
I wrote a simple script in R that generates a wordcloud from the most recent 3200 tweets and outputs the top 200 words (get the code on GitHub). The script removes many of the uninteresting words, such as ‘the’, ‘of’ and ‘and’, that would otherwise dominate the cloud. These stopwords come from the Top100Words list of the R package qdap, but I also added a few more, such as ‘just’ and ‘me’, that I seem to use a lot.
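The counting step behind a cloud like this can be sketched in a few lines. The example below is in Python rather than R and is not the script linked above: the sample tweets, the tiny stopword set and the function name `top_words` are all made up for illustration, whereas the real script works on downloaded tweets and uses qdap's Top100Words list.

```python
import re
from collections import Counter

# Illustrative stopword set only; the actual script uses qdap's Top100Words
# plus a few extras such as 'just' and 'me'.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "just", "me", "i", "for"}


def top_words(tweets, n=200, stopwords=STOPWORDS):
    """Return the n most common non-stopword words as (word, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters (and apostrophes) as words.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for downloaded tweets.
sample_tweets = [
    "Just pushed the new Python code",
    "Python and research software for me",
    "The data is new and the code is new",
]

print(top_words(sample_tweets, n=3))
```

The resulting (word, count) pairs are what a wordcloud renderer would scale the font sizes by.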
This is the current wordcloud for my Twitter account, walkingrandomly. My main interests are very clear – Python programming, research software, data and anything that’s new!
Once I had seen my wordcloud, I wondered how things would look for other twitter users who I pay a lot of attention to. This is how it looks for Manchester University’s Nick Higham. Clearly he’s big on SIAM, Manchester, and Matrix Analysis!
I then looked at my manager at Sheffield University, Neil Lawrence. Neil finds data and the city of Sheffield very important and also writes about workshops, science, blog posts and machine learning a lot.
The R code that generated these wordclouds is available on GitHub, but it won’t work out of the box. You’ll need to register with Twitter for app development (it’s free and fairly straightforward) and get various access keys before you can use the code.
Let’s use Mathematica to discover the longest English words where the letters are in alphabetical order. The following command will give all such words:
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]
I’m not going to show all of the output because there are 562 of them (including single letter words such as ‘I’ and ‘a’) as we can see by doing
Length[DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]]
562
The longest of these words has seven characters:
Max[Map[StringLength, DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]]]
7
It turns out that only one such word has the maximum 7 characters
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 7]
{"billowy"}
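For readers without Mathematica, the same test is easy to sketch in Python: a word qualifies when its list of letters equals the sorted list. The word list here is a small hand-picked sample rather than Mathematica's dictionary, the helper names are invented, and letters are lowercased before comparing, which is a simplification of how Mathematica's Sort orders mixed-case characters.

```python
def is_alphabetical(word):
    """True if the letters of word are in non-decreasing alphabetical order."""
    letters = list(word.lower())
    return letters == sorted(letters)


def longest_alphabetical(words):
    """Return all alphabetical words of maximal length."""
    hits = [w for w in words if is_alphabetical(w)]
    if not hits:
        return []
    longest = max(len(w) for w in hits)
    return [w for w in hits if len(w) == longest]


# Small illustrative sample, not the full dictionary.
sample_words = ["billowy", "almost", "chintz", "matrix", "python", "a", "I"]

print([w for w in sample_words if is_alphabetical(w)])
print(longest_alphabetical(sample_words))
```

Pointing `sample_words` at a real word list (one word per entry) would be the natural next step, although the counts will differ from the ones above unless it matches Mathematica's dictionary.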
There are 34 such words that contain 6 characters
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 6]
{"abbess", "Abbott", "abhors", "accent", "accept", "access", "accost", "adders", "almost", "begins", "bellow", "Bellow", "bijoux", "billow", "biopsy", "bloops", "cellos", "chills", "chilly", "chimps", "chinos", "chintz", "chippy", "chivvy", "choosy", "choppy", "Deimos", "effort", "floors", "floppy", "flossy", "gloppy", "glossy", "knotty"}
If you insist on all letters being different, there are 9:
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 6 && Length[Union[Characters[x]]] == Length[Characters[x]]]
{"abhors", "almost", "begins", "bijoux", "biopsy", "chimps", "chinos", "chintz", "Deimos"}
How about words where all the letters are in reverse alphabetical order with no repeats? The longest reverse-ordered words have 7 characters:
Max[Map[StringLength, DictionaryLookup[x__ /; Characters[x] == Reverse[Sort[Characters[x]]]]]]
7
Here they are
DictionaryLookup[x__ /; Characters[x] == Reverse[Sort[Characters[x]]] && StringLength[x] == 7 && Length[Union[Characters[x]]] == Length[Characters[x]]]
{"sponged", "wronged"}
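Requiring distinct letters turns the check into a strict one: each letter must be strictly later (or, for the reverse case, strictly earlier) in the alphabet than the one before it. A minimal Python sketch of that idea follows; it compares adjacent letters directly instead of sorting, and the function names and the short word sample are chosen purely for illustration.

```python
def strictly_ascending(word):
    """True if each letter is strictly later in the alphabet than the previous one."""
    letters = word.lower()
    return all(a < b for a, b in zip(letters, letters[1:]))


def strictly_descending(word):
    """True if each letter is strictly earlier in the alphabet than the previous one."""
    letters = word.lower()
    return all(a > b for a, b in zip(letters, letters[1:]))


# A few words from the Mathematica results, plus two that should fail both tests.
sample_words = ["abhors", "chintz", "floppy", "sponged", "wronged", "spoon"]

print([w for w in sample_words if strictly_ascending(w)])
print([w for w in sample_words if strictly_descending(w)])
```

Here ‘floppy’ and ‘spoon’ are rejected because of their doubled letters, which is the same effect the Length[Union[...]] condition has in the Mathematica version.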