5 Ways to Find Interesting Data Sets
Editor’s note: This post was written as part of a collaboration with Enigma, a public data company. Author India Kerle is a data curator at Enigma.
There is a canon of open data sets used widely in data science projects; you’ve likely come across something making use of the Iris Flower classic or New York’s Citi Bike data. However, it can be difficult to ask novel questions of well-trafficked data sets, and a project built off of one of these classics is unlikely to produce an eye-catching portfolio. It is far easier to answer interesting questions of data sets that have not already been analyzed.
In my current role, I’ve spent a fair amount of time in the nooks and crannies of government open data portals, lurking around sites with data that could be scraped, and on the phone with FOIA departments across the country. Here are some tips and tricks I’ve developed for finding the most interesting data sets.
1. Follow newsletters at the forefront of data
Newsletters are a great way to keep a pulse on the latest in data. The best ones serve as consistent seeds of creativity: scanning through the email should spark at least one new idea. I rely on a mix of newsletters covering data ideas, highlights of unusual visualization or analysis techniques, and reports on the latest in open data policy. My personal favorites include:
- BuzzFeed’s Data Is Plural for a no-fuss list of interesting data sets
- Best in Visual Storytelling for the compelling stories told with data
- Open Data Institute’s This Week in Data to keep a pulse on the open data community at large
- Enigma’s newsletter, Between Two Rows
Last month’s Between Two Rows data visualization on Migrant and Seasonal Agricultural Worker Protection Act data
2. Keep up with media that make use of data
From Bloomberg’s video game on the Demise of the American Shopping Mall to ProPublica’s release of Trump’s White House Visitor Records, cutting-edge media institutions have long used open data for meaningful storytelling. In fact, the New York Times launched a series called What’s Going On in This Graph? to better educate their readers on their data visualizations. These articles are a great place to see what can be done with data and to investigate their open data sources.
The Washington Post’s visualization on the middle of nowhere using data from the Malaria Atlas Project, the Census Bureau and NASA.
3. Listen to prominent voices in the open data space
Not only is the practice of data science evolving, but more data is also being released every day. Data advocacy groups like the Sunlight Foundation, Open Knowledge Foundation, OpenCorporates, and the Open Data Institute are active in shaping the open data space. These organizations often showcase exemplary open data sets and, where transparency is lacking, put pressure on governments to improve. By following their work, you’ll be among the first to learn about newly opened data sets.
4. Request data that’s never seen the light
The Freedom of Information Act (FOIA) allows the public to request government agency documents and other data. Requesting data via a FOIA request will almost guarantee you data that has never been analyzed (although it is often the ultimate test of patience). To figure out what kind of data you want to ask for from federal or state government agencies, take a peek at the FOIA advocacy group MuckRock. Quick tip for those new to FOIA: be as specific as possible. Request the exact name of the file you want (if you know it!), the format you’d prefer it in, and the date ranges you’re interested in. The more specific the request, the more likely you are to get data in return.
Enigma Public’s FOIA correspondence with the Internal Revenue Service.
5. Use metadata to your advantage
A data set accompanied by a data dictionary, or a related set of metadata describing the contents of the data set, signals that the source is serious about its data game. I often investigate other data sets released by the same source, safe in the knowledge that it holds its data sets to a high standard. I am consistently impressed by the team behind the NYC Open Data portal, who often provide data dictionaries in addition to the name of the data set owner, the agency that releases the data, and its update frequency.
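If you want to make this check a habit, a rough sketch along the following lines may help. It records the four signals mentioned above (data dictionary, owner, agency, update frequency) and reports which ones a data set is missing. The class, field names, and scoring are my own hypothetical choices for illustration, not the schema of any real portal.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical record of the metadata signals discussed above.
# Real portals label these fields differently; adapt the names as needed.
@dataclass
class DatasetMetadata:
    name: str
    data_dictionary_url: Optional[str] = None
    owner: Optional[str] = None
    agency: Optional[str] = None
    update_frequency: Optional[str] = None


# The signals we look for when judging how well a data set is documented.
CHECKED_FIELDS = ("data_dictionary_url", "owner", "agency", "update_frequency")


def missing_fields(meta: DatasetMetadata) -> List[str]:
    """Return the names of checked fields that are empty or absent."""
    return [field for field in CHECKED_FIELDS if not getattr(meta, field)]


def documentation_score(meta: DatasetMetadata) -> float:
    """Fraction of checked fields that are present, from 0.0 to 1.0."""
    return 1 - len(missing_fields(meta)) / len(CHECKED_FIELDS)


if __name__ == "__main__":
    # Placeholder values for illustration only, not a real data set.
    example = DatasetMetadata(
        name="example-dataset",
        owner="Example Owner",
        agency="Example Agency",
    )
    print(missing_fields(example))
    print(documentation_score(example))
```

A low score does not mean the data is bad; it is only a prompt to dig further before relying on that source.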
While these tricks have helped me unearth some true data treasures, I’m always on the hunt for other sources of inspiration. If you have any additional advice, do send it my way.
I work for Enigma Public, the world’s broadest collection of public data.