Hackspiration: Great Open Data Sets!

We know that coming up with a hack idea can be one of the most intimidating parts of attending your first (or fifteenth) hackathon.

All the data sets below are free to use, small enough to download onto your laptop and mess around with, and large enough to make something cool. Relax, open up a data set, and see if anything catches your eye. You might see a pattern you’d like to visualize, a system to model, or a need a great mobile app could fill.

GitHub Archive is “a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.” GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client. You could use this data to analyze programmer productivity by hour of the day, or to compare project activity (e.g. commits and pull requests) to project hype (e.g. mentions on social media).
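As a starting point for the hour-of-day idea, here is a minimal sketch that tallies events by hour from one hourly archive. It assumes each archive is gzip-compressed, newline-delimited JSON and that every event carries a `type` field and a `created_at` timestamp shaped like `2015-01-01T15:00:01Z`. The function name `count_events_by_hour` is made up for illustration, so check the GitHub Archive site for the current file layout before relying on it.

```python
import gzip
import json
from collections import Counter


def count_events_by_hour(archive_bytes, event_type=None):
    """Tally events by UTC hour of day from one gzip-compressed archive.

    Assumes newline-delimited JSON where each event has "type" and
    "created_at" fields (an assumption about the archive format).
    """
    counts = Counter()
    text = gzip.decompress(archive_bytes).decode("utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if event_type is not None and event.get("type") != event_type:
            continue  # keep only the requested event type
        # In "2015-01-01T15:00:01Z", characters 11 and 12 hold the hour.
        hour = int(event["created_at"][11:13])
        counts[hour] += 1
    return counts
```

Download an hourly file with whatever HTTP client you like, pass its raw bytes to the function, and add up the counters across many files to get a picture of when people actually push code.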

Open Recipes is an open database of recipe bookmarks. It does not contain preparation instructions, but it lists preparation and cooking times, ingredients, and similar details. These recipes could be used to build a recipe search engine. Interesting features might include the ability to filter out recipes by allergen, or recommendations based on “liked” recipes. Another project idea is to classify recipes by diet type (vegan, vegetarian, paleo, kosher, halal, etc.). Or, if these sound a bit too hard, write a scraper to contribute more data from your favorite cooking blog to the Open Recipes project.
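The allergy filter can start out very simple. The sketch below assumes each recipe is a dictionary with a `name` and a free-text `ingredients` string; those field names, and the helper names `excludes_allergens` and `filter_recipes`, are assumptions for illustration rather than the project's documented schema.

```python
def excludes_allergens(recipe, allergens):
    """Return True if no allergen word appears in the ingredient text."""
    ingredients = recipe.get("ingredients", "").lower()
    return not any(allergen.lower() in ingredients for allergen in allergens)


def filter_recipes(recipes, allergens):
    """Keep only the recipes whose ingredients mention none of the allergens."""
    return [r for r in recipes if excludes_allergens(r, allergens)]
```

Plain substring matching is crude: searching for "nut" also hits "nutmeg" and "coconut", and it misses "almond" entirely. A real hack would probably want a proper allergen-to-ingredient mapping, which could be a project in itself.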

Dronestream provides “real-time and historical data about every reported United States drone strike”. There’s a publicly accessible API, a real-time Twitter account, and a searchable database. Build something that increases awareness among the American public of the killings being conducted from afar on their behalf. Or create some sort of visualization. A heat map of strike locations might be interesting. Can you find out the affiliation of targets? Is there any correlation between that and other data points? Can the frequency of strikes be shown to decrease after periods of heavy criticism in the media?

Football.db provides public domain football data. It is an open source database, usable from any programming language, that covers past and upcoming matches, including scoresheets and player data. It also includes all 2014 Brazil World Cup data and a real-time score HTTP JSON API. Possible hacks include a data analyzer that predicts which players are likely to do well in the future, or a match simulator built on existing data.

Wifi Hotspot Locations maps free and fee-based wireless internet hotspots around NYC. You can also find many more great data sources on the NYC Open Data site.

National Survey on Drug Use and Health “measures the prevalence and correlates of drug use in the United States. The surveys are designed to provide quarterly, as well as annual, estimates. Information is provided on the use of illicit drugs, alcohol, and tobacco among members of United States households aged 12 and older. Questions included age at first use as well as lifetime, annual, and past-month usage for the following drug classes: marijuana, cocaine, hallucinogens, heroin, inhalants, alcohol, tobacco, and nonmedical use of prescription drugs, including pain relievers, tranquilizers, stimulants, and sedatives. The survey covered substance abuse treatment history and perceived need for treatment, and included questions from the Diagnostic and Statistical Manual of Mental Disorders that allow diagnostic criteria to be applied.”

Survey of Inmates in State and Federal Correctional Facilities “provides nationally representative data on inmates held in state prisons and federally-owned and operated prisons. Through personal interviews conducted from October 2003 through May 2004, inmates in both state and federal prisons provided information about their current offense sentence, criminal history, family background and personal characteristics, prior drug and alcohol use and treatment programs, gun possession and use, and prison activities, programs, and services.”

The Marvel Universe Social Network provides a graph of connections between Marvel characters.

WordNet is a large lexical database of English. “WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity.”

Amazon Reviews consists of nearly 35 million reviews of Amazon products. Can you use this data to tell us the most common reasons for 1-star reviews? Or can you show that more frequent reviewers give higher/lower ratings on average?
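For the frequent-reviewer question, a rough first pass might look like the sketch below. It assumes you have already boiled the reviews down to `(reviewer_id, rating)` pairs; the function name `mean_rating_by_activity` and the default threshold of five reviews are arbitrary choices for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_activity(reviews, threshold=5):
    """Compare mean ratings of frequent vs. occasional reviewers.

    `reviews` is an iterable of (reviewer_id, rating) pairs. A reviewer
    counts as "frequent" if they wrote at least `threshold` reviews.
    """
    by_reviewer = defaultdict(list)
    for reviewer_id, rating in reviews:
        by_reviewer[reviewer_id].append(rating)

    frequent, occasional = [], []
    for ratings in by_reviewer.values():
        bucket = frequent if len(ratings) >= threshold else occasional
        bucket.extend(ratings)

    # None signals that a group had no reviews at all.
    return {
        "frequent": mean(frequent) if frequent else None,
        "occasional": mean(occasional) if occasional else None,
    }
```

A gap between the two averages is only a hint. Before claiming a result, try a few different thresholds and check whether the difference holds up within individual product categories.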

Wikipedia Vote Network shows election data for Wikipedia admins. You could become an investigative journalist for a day and expose voting cabals, or complete an analysis of why some candidacies are more successful than others.
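One cheap way to start looking for cabals is to list pairs of users who voted on each other. The sketch below assumes the data arrives as a plain-text edge list with one whitespace-separated `voter candidate` pair per line and `#` marking comment lines; that layout, along with the helper names `parse_edges` and `mutual_votes`, is an assumption to check against the actual download.

```python
def parse_edges(lines):
    """Read (voter, candidate) pairs from edge-list lines, skipping comments."""
    edges = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        voter, candidate = line.split()[:2]
        edges.add((voter, candidate))
    return edges


def mutual_votes(edges):
    """Return sorted pairs of users who each voted on the other."""
    pairs = {
        tuple(sorted((voter, candidate)))
        for voter, candidate in edges
        if voter != candidate and (candidate, voter) in edges
    }
    return sorted(pairs)
```

Mutual votes on their own prove nothing, since two active editors may simply have crossed paths. They do give you a short list of relationships worth a closer look.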