## Getting useful data out of Wolfram Alpha can be difficult

May 22nd, 2009 | Categories: mathematica, Wolfram Alpha

Up until now I have been using Wolfram Alpha as the ultimate geek toy and have been truly delighted with it, but I thought it was high time I considered how one might use it more seriously.  So I set myself a task.  Nothing too complicated, you understand; after all, I am still finding my feet with this new system, but something that might plausibly come up in the real world.  The task I set myself was:

Obtain the actual data points for the Gross Domestic Product (GDP) of the UK from 1970 to 1980 inclusive.  To allow me to import this data into pretty much every analysis program on the planet I’ll want it in a CSV file of the form

1970,GDP of UK for 1970
1971,GDP of UK for 1971
etc

Should be easy, huh?  Wolfram Alpha knows all about the GDP of the UK – if I Wolf GDP UK then I get the following output (among other things).

Fabulous! The data is clearly in there but how do I get it out in the form I want? Let’s try the hopeful UK GDP from 1970 to 1980.  Alas I get the now familiar ‘Wolfram|Alpha isn’t sure what to do with your input.’ Moving on, I tried UK GDP 1970 to 1980 and UK GDP 1970-1980 but they didn’t work either.

I can get at a single datum easily enough.  UK GDP 1970 gives me 123.7 billion for example but how do I get it to give me a list?  Further experimentation showed me that I can get the GDP for any two years if I Wolf for something like (UK GDP 1970) (UK GDP 1971).

By now I feel I am getting somewhere. While playing with Wolfram Alpha (and reading the community forum) I’ve discovered that it will sometimes parse Mathematica code as well as plain English. ‘What I need’, I thought, ‘is a piece of Mathematica code that would generate the query for me’. So I tried

Table["(GDP UK " <> ToString[x] <> ")", {x, 1970, 1980}]

but that didn’t work. I shouldn’t be surprised, though, because Table turns out to be one of the Mathematica functions that Wolfram Alpha doesn’t parse. Ho hum…

I tried a LOT of different inputs but the practical upshot is that the only one that worked was (UK GDP 1970) (UK GDP 1971) (UK GDP 1972) (UK GDP 1973) (UK GDP 1974) (UK GDP 1975) (UK GDP 1976) (UK GDP 1977) (UK GDP 1978) (UK GDP 1979) (UK GDP 1980).  Lord help me if I wanted three times as many data points.
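Since Alpha would not expand a Table for me, the long query can at least be generated outside Alpha and pasted in. The following is a minimal Python sketch of that idea, not anything Wolfram provides; the function name build_query and its parameters are my own invention for illustration, and the only thing taken from the experiments above is the "(UK GDP 1970) (UK GDP 1971)" pattern that Alpha accepted.

```python
def build_query(country: str, measure: str, first_year: int, last_year: int) -> str:
    """Build one '(country measure year)' group per year, separated by spaces.

    build_query is a hypothetical helper; it only assembles text in the
    pattern that worked when typed into Wolfram Alpha by hand.
    """
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    groups = [
        f"({country} {measure} {year})"
        for year in range(first_year, last_year + 1)  # inclusive range
    ]
    return " ".join(groups)


query = build_query("UK", "GDP", 1970, 1980)
print(query)
```

Whether Alpha keeps accepting the query as the list grows is something I have not checked, so treat this as a typing aid rather than a fix.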

For the record, I can get exactly what I wanted in Mathematica 7 with the following two lines of code, and I worked out how to do it with a moment’s thought.  Wolfram Alpha needs to be this easy!

data = Table[{x, CountryData["UK", {"GDP", x}]}, {x, 1970, 1980}];
Export["GDP.csv", data]

So, after some blood, sweat and tears I had some actual numerical data, but how could I export it to something useful?  Wolfram Alpha returns results as images by default, which are not particularly useful if you want to do your own analysis.  I can also get it as copyable plaintext, and for this data set it looks like this:

$123.7 billion per year (US dollars per year) | $139.9 billion per year (US dollars per year)
| $160.8 billion per year (US dollars per year) | $181.5 billion per year (US dollars per year)
| $196 billion per year (US dollars per year) | $234.4 billion per year (US dollars per year)
| $225.2 billion per year (US dollars per year) | $254.4 billion per year (US dollars per year)
| $322.3 billion per year (US dollars per year) | $418.9 billion per year (US dollars per year)
| $537.2 billion per year (US dollars per year)

Hmmm. That’s going to need some pre-processing before I can import it into Excel – a job for a student or a Python script, I think.

Now onto the Source information. It listed its primary sources as ‘Wolfram Alpha Curated data 2009’ and ‘Wolfram Mathematica CountryData’ with a shedload of Secondary sources such as ‘The US CIA WorldFactbook’. I have to say that I was a little surprised at this – how is Wolfram Alpha the Primary source of this data set? They must have got it from somewhere and THAT somewhere would be the primary source (or closer to it at least) IMHO. In all honesty, I feel that putting itself as the primary source for data such as this is a bit like a student writing an essay and under ‘references’ simply putting ‘My head’.

Don’t get me wrong, I am starting to love Wolfram Alpha and think it’s got amazing potential, but when you love someone you always want to see them do better for themselves. In this particular area I think that Wolfram needs to address the following:

- Make it easier to get lists of data out of WA. Being able to parse Table[] might be a good start.
- Allow export of tabular data in popular formats such as CSV and Excel.
- Work on the sources information a little. Wolfram Alpha didn’t actually generate this GDP data – they must have got it from somewhere and that should be listed as primary source.

Wolfram Alpha is a constantly moving target and it is quite possible that all of these issues will be addressed in no time (if Wolfram agrees that they are issues, of course), so feel free to point out if any of the inputs I have linked to give different results from those stated here. I am also aware that I don’t know everything about this system, so if I am being an idiot then feel free to point out how I should have phrased my query. Finally, if any new functionality comes online that makes all of this trivial then I would love to know. Comments are, as always, welcomed.

1. Yes, this is exactly what I was talking about in the comments of your ‘First impressions’ post. WA can give lots of amazing results, but as soon as one tries to get useful data out of it for a practical application, things can get quite difficult. Even when they clearly do have the data. E.g. I wasn’t able to get the top countries by per capita birth rate.

About sources: I (and many others) agree completely.
http://community.wolframalpha.com/story.php?title=a-plea-for-more-citation-of-sources–more-references-please
WA will not be citable until it starts citing sources properly. (Things are made even worse by them using Wikipedia as a source, which is again completely uncitable in a serious work.)

But there’s certainly enormous potential in it. I’m looking forward to lots of improvements!

PS. Are you really *that* mean to students? Parsing that data is a job for a computer, not for a student. (Though unfortunately it’s true that some of them would rather type it by hand than learn how to get the computer to do it for them …)

2. Maybe they plan on offering a “premium” service that lets you actually extract the raw data? It’s a reasonable business strategy since advertising is unlikely to pay for the site. Offer just enough for free to make people madly want a little more functionality. It also seems like you should be able to use the API to make the individual queries automatically, and then parse and format the data for yourself. Much more of a pain than having the website do it for you, though.

3. Hi Szabolcs. It’s just my sense of humour (or lack thereof)… I wouldn’t really ask anyone to do that by hand. When I was a grad student though, there was an ongoing joke among the faculty that if you had a terrible job that needed doing then you asked a student to do it. Student time is cheaper than CPU time ;)

4. Hi meicheni. Sounds like a good business plan to me if that is their intention.
After all, when I couldn’t get what I wanted from Alpha, I turned to their premium product – Mathematica. The issue with the sources remains though – no matter which you use, Alpha or Mathematica.

5. According to the faq, getting data into spreadsheet form “will be possible in Wolfram|Alpha Professional”.

6. I’ve long given up on extracting the kind of data as in your post. Let’s see what their API has to offer. If they really make their data available through it, that would be superexciting. I have some doubts though. At the moment, I just don’t have the imagination to see how they could make money off such a service…

7. To extract the data is fairly simple. Copy the data as string and paste into Mathematica as:
x = "$123.7 billion per year (US dollars per year)|$139.9 billion per year (US dollars per year) \
|$160.8 billion per year (US dollars per year)|$181.5 billion per year (US dollars per year) \
|$196 billion per year (US dollars per year)|$234.4 billion per year (US dollars per year) \
|$225.2 billion per year (US dollars per year)|$254.4 billion per year (US dollars per year) \
|$322.3 billion per year (US dollars per year)|$418.9 billion per year (US dollars per year)\
|$537.2 billion per year (US dollars per year)"
Then: data = StringCases[x, RegularExpression["\\d+(?:\\.\\d+)?"]] // ToExpression;
Plot: ListPlot[data]
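
For anyone without Mathematica to hand, the same clean-up can be sketched in Python, which also gets the data into the year,value CSV form asked for in the post. This is only an illustration: the names PLAINTEXT, parse_billions and to_csv are made up here, the text is the copyable plaintext quoted in the post, and pairing the values with 1970 to 1980 assumes Alpha returned them in the order they were queried.

```python
import csv
import io
import re

# Copyable plaintext as quoted in the post (values are billions of US dollars).
PLAINTEXT = (
    "$123.7 billion per year (US dollars per year) | $139.9 billion per year (US dollars per year) "
    "| $160.8 billion per year (US dollars per year) | $181.5 billion per year (US dollars per year) "
    "| $196 billion per year (US dollars per year) | $234.4 billion per year (US dollars per year) "
    "| $225.2 billion per year (US dollars per year) | $254.4 billion per year (US dollars per year) "
    "| $322.3 billion per year (US dollars per year) | $418.9 billion per year (US dollars per year) "
    "| $537.2 billion per year (US dollars per year)"
)


def parse_billions(text):
    """Pull out every '$<number> billion' figure, with or without a decimal part."""
    return [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?) billion", text)]


def to_csv(first_year, values):
    """Pair consecutive years with values (assumed to be in query order)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for offset, value in enumerate(values):
        writer.writerow([first_year + offset, value])
    return buffer.getvalue()


values = parse_billions(PLAINTEXT)
print(to_csv(1970, values))
```

Writing the returned string to a file such as GDP.csv gives something a spreadsheet can open directly; the figures stay in billions of US dollars, so a header row or a note on units may be worth adding.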

8. Thanks for that Bing. It’s getting the data in the first place that’s really difficult though :(

9. Mike, thought you might be amused by my first try at Wolfram Alpha:
http://billscience.blogspot.com/2009/05/wolfram-alpha.html

10. It’s heartening to realize that things are improving over time.
In Alpha today the query UK GDP from 1970 to 1980 generates a bunch of graphs showing what one wants. Admittedly, however, they are graphs.

Doing the same within Mathematica 8 works even better:
=UK GDP from 1970 to 1980
throws up a table of dates and numbers (with some other stuff that can easily be stripped from the table).

11. This is probably too little too late, but here’s what I think you wanted: