On the pain of getting data from public authorities
I am running a little research project that involves mining a data set made available by the Danish Environmental Authority (see http://www3.mst.dk/Miljoeoplysninger/). The data set covers a significant number of Danish companies, which are required under Danish and EU law to report information on their use of chemical substances, pollutant releases and pollutant transfers.
At first glance - and disregarding the API documentation's layout problems, which make the API spec hard to read - the data set seems easy to access, as the authority has actually published a REST API for fetching the data in XML format. However, it turns out that the API is not really suited for fetching data in large quantities.
For example, to get the basic data about one company (company name, business registration number, etc.), you have to fetch and parse the responses of at least four different requests. To get the actual environmental data reported by that company, you have to fetch and parse another 4-8 requests. With thousands of companies in the database, fetching all available data for all companies at 1-2 requests per second is an exercise in patience.
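To make that concrete, here is a minimal sketch in Python of what assembling the basic record for a single company looks like. The base URL and the resource names are hypothetical placeholders (the real endpoints are defined in the API spec); the point is only the pattern of one logical record spread over several requests.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www3.mst.dk/Miljoeoplysninger/api"  # hypothetical base path

def fetch_xml(path):
    """GET one resource and return the parsed XML root element."""
    with urllib.request.urlopen(f"{BASE}/{path}") as resp:
        return ET.fromstring(resp.read())

def fetch_company_basics(company_id):
    """Assemble one company's basic record from several separate resources.

    The resource names below are placeholders; the real API splits the data
    differently, but the shape of the problem is the same: one logical
    record, several requests.
    """
    parts = {}
    for resource in ("master", "address", "registration", "contacts"):
        doc = fetch_xml(f"company/{company_id}/{resource}")
        parts[resource] = {child.tag: child.text for child in doc}
        time.sleep(0.5)  # keep to roughly 2 requests per second
    return parts
```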
The REST scheme is apparently meant for fetching only small amounts of data related to a single company or substance. That would be fine if you were on a mission to make a mash-up, e.g. showing the data on Google Maps.
So, consider the case of building a web app that exposes this data through direct queries to the API. To show the detailed information for one company, you would have to make (say) 10 client-side HTTP requests, parse the responses and present the data to the user. Even as an AJAX app performing the queries asynchronously, the programming complexity involved and the query latency would kill the idea before it got off the ground.
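For illustration, here is what that fan-out looks like, sketched in Python rather than browser-side JavaScript and reusing the fetch_xml helper from the sketch above; the ten resource names are again placeholders. A browser AJAX client would do the same thing with XMLHttpRequest and end up with the same latency profile.

```python
from concurrent.futures import ThreadPoolExecutor

# The ~10 fragments needed to render one company page; the names are made up.
DETAIL_RESOURCES = [
    "master", "address", "registration", "contacts",
    "substances", "releases-air", "releases-water", "releases-soil",
    "transfers", "waste",
]

def fetch_company_page(company_id):
    """Fire all detail requests concurrently and wait for the slowest one.

    Even with full concurrency the page is only as fast as the slowest of
    the ten responses, and every fragment still needs its own parsing code.
    """
    with ThreadPoolExecutor(max_workers=len(DETAIL_RESOURCES)) as pool:
        docs = pool.map(
            lambda r: fetch_xml(f"company/{company_id}/{r}"),
            DETAIL_RESOURCES,
        )
        return dict(zip(DETAIL_RESOURCES, docs))
```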
It seems the API is meant for fetching tiny amounts of data, and only for retrieving fragmented information about a company. I have no idea why the API designer came up with this REST scheme; it is basically unusable for any practical purpose.
Any real web app that relies on this data is therefore forced to either pre-fetch and cache it, or do what I did: fetch everything (which takes a long time) and store it in a local database.
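My workaround, sketched below under the same assumptions (and reusing the helpers from the earlier sketches), is simply to walk the company list once, at whatever rate the API tolerates, and cache each record in a local SQLite database so a web app can query it instantly. The companies listing resource is hypothetical; the real API may expose the list of ids differently.

```python
import json
import sqlite3

def crawl_to_sqlite(db_path="mst_data.db"):
    """Fetch everything once and cache it locally for fast querying."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS company (id TEXT PRIMARY KEY, data TEXT)"
    )
    listing = fetch_xml("companies")  # hypothetical listing resource
    for node in listing:
        company_id = node.get("id")
        record = fetch_company_basics(company_id)  # the slow part
        con.execute(
            "INSERT OR REPLACE INTO company VALUES (?, ?)",
            (company_id, json.dumps(record)),
        )
        con.commit()
    con.close()
```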
To end this post with some constructive criticism, I would recommend that the API be refactored to allow fetching more data in fewer requests, e.g. returning all basic information about one company in a single query. Or, better yet, make the whole data set available as a single file download. (I should mention that part of the data set is available in the EU database, which can be downloaded as a single file.)