by Hiren Patel
What is Open Data?
In simple terms, Open Data is data that anyone can access, modify, reuse, and share.
Open Data draws on various “open movements” such as open source, open hardware, open government, and open science.
Governments, independent organizations, and agencies have opened the floodgates, publishing more and more open data for free and easy access.
Why Is Open Data Important?
Open data is important because the world has grown increasingly data-driven. But if there are restrictions on the access and use of data, the idea of data-driven business and governance cannot materialize.
Therefore, open data has its own unique place. It can allow a fuller understanding of global problems and universal issues. It can give a big boost to businesses. It can be a great impetus for machine learning. It can help fight global problems such as disease, crime, and famine. Open data can empower citizens and hence strengthen democracy. It can streamline the processes and systems that society and governments have built. It can help transform the way we understand and engage with the world.
So here’s my list of 15 awesome Open Data sources:
1. World Bank Open Data
As a repository of the world’s most comprehensive data on what’s happening in different countries, World Bank Open Data is a vital source of Open Data. It also provides access to other datasets listed in its data catalog.
World Bank Open Data is massive: it holds some 3,000 datasets and 14,000 indicators encompassing microdata, time series statistics, and geospatial data.
Accessing and discovering the data you want is also quite easy. All you need to do is specify indicator names, countries, or topics, and the treasure-house of Open Data opens up for you. You can also download data in different formats such as CSV, Excel, and XML.
If you are a journalist or academic, you will be enthralled by the array of tools available to you. You can access analysis and visualization tools that bolster your research and facilitate a deeper and better understanding of global problems.
There is also an API that helps you create the data visualizations you need, build live combinations with other data sources, and much more.
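As an illustration, here is a minimal Python sketch of calling that API. The v2 URL layout and the indicator code SP.POP.TOTL (total population) follow the API's public documentation; the hardcoded sample payload mimics the API's two-element JSON response so the parsing step can be shown without a network call.

```python
import json
from urllib.request import urlopen

def worldbank_url(country, indicator, per_page=5):
    # World Bank API v2: country and indicator codes go in the path,
    # response format and paging go in the query string.
    return (f"http://api.worldbank.org/v2/country/{country}/indicator/"
            f"{indicator}?format=json&per_page={per_page}")

def parse_observations(payload):
    # The API returns a two-element array: [paging metadata, rows].
    _meta, rows = payload
    return {row["date"]: row["value"] for row in rows}

url = worldbank_url("IN", "SP.POP.TOTL")
# Live call (requires network): payload = json.load(urlopen(url))
sample = [{"page": 1, "pages": 13},
          [{"date": "2020", "value": 1380004385}]]
print(parse_observations(sample))  # {'2020': 1380004385}
```

In practice you would feed the live `payload` to `parse_observations` and feed the result straight into a charting library.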
Therefore, it’s no surprise that World Bank Open Data tops any list of Open Data sources!
2. WHO (World Health Organization)
WHO’s Open Data repository is how WHO keeps track of health-specific statistics for its 194 Member States.
The repository keeps the data systematically organized so that it can be accessed according to different needs. For instance, whether your interest is mortality or the burden of disease, you can access data classified under 100 or more categories, such as the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), non-communicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and equity.
For your specific needs, you can browse the datasets by theme, category, indicator, and country.
The good thing is that you can download whatever data you need in Excel format. You can also monitor and analyze data using its data portal.
An API to the World Health Organization’s data and statistics content is also available.
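That API is exposed as an OData service. As a hedged sketch (the base URL and the SpatialDim/TimeDim/NumericValue field names are assumptions drawn from WHO's Global Health Observatory API at the time of writing; verify them against WHO's documentation), a response can be reduced to the latest value per country like this:

```python
# ASSUMPTIONS: base URL and field names reflect WHO's GHO OData API at
# the time of writing; check WHO's docs before relying on them.
GHO_BASE = "https://ghoapi.azureedge.net/api"

def indicator_url(code):
    # e.g. "WHOSIS_000001" is the code for life expectancy at birth
    return f"{GHO_BASE}/{code}"

def latest_by_country(odata_payload):
    # OData responses wrap rows in a "value" list; keep only the most
    # recent observation per country (SpatialDim holds the country code).
    latest = {}
    for row in odata_payload["value"]:
        country, year = row["SpatialDim"], int(row["TimeDim"])
        if country not in latest or year > latest[country][0]:
            latest[country] = (year, row["NumericValue"])
    return latest

sample = {"value": [
    {"SpatialDim": "IND", "TimeDim": 2015, "NumericValue": 68.3},
    {"SpatialDim": "IND", "TimeDim": 2019, "NumericValue": 70.8},
]}
print(latest_by_country(sample))  # {'IND': (2019, 70.8)}
```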
3. Google Public Data Explorer
Launched in 2010, Google Public Data Explorer helps you explore vast amounts of public-interest datasets. You can visualize and communicate the data for your respective uses.
It makes data from different agencies and sources available. For instance, you can access data from the World Bank, the U.S. Bureau of Labor Statistics, the U.S. Census Bureau, the OECD, the IMF, and others.
Different stakeholders access this data for a variety of purposes. Whether you are a student or a journalist, a policy maker or an academic, you can leverage this tool to create visualizations of public data.
With Data Explorer, you can represent the data as line graphs, bar graphs, maps, and bubble charts.
The best part is that these visualizations are quite dynamic: they change over time, and you can switch topics, highlight different entries, and modify the scale.
They are easily shareable too. Once your chart is ready, you can embed it on your website or blog, or simply share a link with your friends.
4. Registry of Open Data on AWS (RODA)
This is a repository of public datasets that are available through AWS resources.
Through RODA, you can discover and share data that is publicly available.
You can search for the data you are looking for using keywords and tags for common types of data such as genomic data, satellite imagery, and transportation. All of this is possible through a simple web interface.
Every dataset has a detail page with usage examples, license information, and tutorials or applications that use the data.
Using a broad range of compute and data analytics products, you can analyze the open data and build whatever services you want.
While the data you access is available through AWS resources, you need to bear in mind that it is not provided by AWS. This data belongs to different agencies, government organizations, researchers, businesses and individuals.
5. European Union Open Data Portal
The European Union Open Data Portal gives you access, on a single platform, to whatever open data EU institutions, agencies, and other organizations publish.
The portal is home to vital open data across EU policy domains, including the economy, employment, science, the environment, and education.
Around 70 EU institutions, organizations, and departments, such as Eurostat, the European Environment Agency, the Joint Research Centre, and other European Commission Directorates General and EU Agencies, have made their datasets public. To date, these datasets number more than 11,700.
The portal enables easy access: you can search, explore, link, download, and reuse the data through a catalog of common metadata, for commercial or non-commercial purposes.
You can search the metadata catalog through an interactive search engine (Data tab) and SPARQL queries (Linked data tab).
By making use of this catalog, you can gain access to the data stored on the different websites of the EU institutions, agencies and organizations.
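As an illustration, a SPARQL SELECT over the catalog's DCAT metadata can be assembled in Python as below. The endpoint URL is an assumption; check the portal's Linked data tab for the current address. The dcat:/dct: vocabularies are the standard way such catalogs describe datasets.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://data.europa.eu/sparql"  # assumed; verify on the portal

# List a few dataset records and their titles from the DCAT catalog.
query = """
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""

def sparql_request_url(endpoint, sparql):
    # Most SPARQL endpoints accept the query as a GET parameter and can
    # return JSON results when asked via the format parameter.
    return endpoint + "?" + urlencode(
        {"query": sparql, "format": "application/sparql-results+json"})

url = sparql_request_url(SPARQL_ENDPOINT, query)
print(url[:60])
```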
6. FiveThirtyEight
FiveThirtyEight is a great site for data-driven journalism and storytelling.
It provides datasets from its various sources for a variety of sectors such as politics, sports, science, and economics. You can download the data as well.
When you access the data, you will come across a brief explanation regarding each dataset with respect to its source. You will also get to know what it stands for and how to use it.
To keep this data user-friendly, datasets are provided in formats as simple and non-proprietary as possible, such as CSV files. Needless to say, these formats can be easily accessed and processed by humans as well as machines.
With the help of these datasets, you can create stories and visualizations as per your own requirements and preference.
7. U.S. Census Bureau
The U.S. Census Bureau is the largest statistical agency of the federal government. It stores and provides reliable facts and data about the people, places, and economy of America.
The Census Bureau considers it its mission to serve as the most reliable provider of quality data.
Federal, state, local, and tribal governments all make use of census data for a variety of purposes, such as determining the location of new housing and public facilities and examining the demographic characteristics of communities, states, and the USA.
The data is also used in planning transportation systems and roadways, deciding quotas, and creating police and fire precincts. Governments draw on it when delineating localized areas for elections, schools, and utilities. Population information is compiled once a decade, and this data is central to that effort.
There are various tools such as American Fact Finder, Census Data Explorer and Quick Facts which are useful in case you want to search, customize and visualize data.
For instance, Quick Facts alone contains statistics for all the states, counties, cities and even towns with a population of 5000 or more.
Likewise, American Fact Finder can help you discover popular facts such as population, income etc. It provides information that is frequently requested.
The good thing is that you can search, interact with the data, learn about popular statistics, and see related charts through Census Data Explorer. Moreover, you can also use its visual tools to customize data in an interactive map experience.
8. Data.gov
Data.gov is the treasure-house of the US government’s open data. It was only recently that the decision was made to make all government data available for free.
When it launched, there were only 47 datasets; there are now about 180,000.
Data.gov is a great resource because you can find data, tools, and resources there that you can deploy for a variety of purposes: conducting research, developing web and mobile applications, and even designing data visualizations.
All you need to do is enter keywords in the search box and browse through types, tags, formats, groups, organization types, organizations, and categories. This will facilitate easy access to data or datasets that you need.
Data.gov follows the Project Open Data Schema — a set of requisite fields (Title, Description, Tags, Last Update, Publisher, Contact Name, etc.) for every data set displayed on Data.gov.
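As a sketch, a dataset entry can be checked against some of those required keys in a few lines of Python. The field set below is partial and uses the schema's documented JSON keys; consult the Project Open Data schema itself for the authoritative list, and note that the example entry is invented.

```python
# Partial set of required metadata keys from the Project Open Data
# schema (JSON key spellings); see the schema docs for the full list.
REQUIRED_FIELDS = {"title", "description", "keyword", "modified",
                   "publisher", "contactPoint"}

def missing_fields(entry):
    """Return the required metadata fields absent from a dataset entry."""
    return sorted(REQUIRED_FIELDS - entry.keys())

# Hypothetical catalog entry for illustration only.
entry = {
    "title": "Consumer Complaint Database",
    "description": "Complaints received about financial products.",
    "keyword": ["consumer", "finance"],
    "modified": "2019-05-01",
}
print(missing_fields(entry))  # ['contactPoint', 'publisher']
```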
9. DBpedia
As you know, Wikipedia is a great source of information. DBpedia aims to extract structured content from the valuable information that Wikipedia contributors have created.
With DBpedia, you can semantically search and explore the relationships and properties of Wikipedia resources, including links to other related datasets.
There are around 4.58 million entities in the DBpedia dataset, of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.
There are labels and abstracts for these entities in around 125 languages. There are 25.2 million links to images. There are 29.8 million links to external web pages.
To use DBpedia, all you need to do is write SPARQL queries against its endpoint or download its data dumps.
DBpedia has benefitted several enterprises and their prestigious artificial-intelligence projects, such as Apple (via Siri), Google (via Freebase and the Google Knowledge Graph), and IBM (via Watson).
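For instance, a query against the public endpoint at dbpedia.org/sparql returns results in the standard SPARQL JSON format, which can be flattened as below. The film/director query is only an illustrative example, and the hardcoded sample stands in for a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DBPEDIA = "https://dbpedia.org/sparql"

# Ask for a handful of films and their directors (dbo: is DBpedia's ontology).
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film ?director WHERE {
  ?film a dbo:Film ;
        dbo:director ?director .
} LIMIT 5
"""

def bindings_to_rows(result):
    # Standard SPARQL JSON results nest each cell under results.bindings,
    # with the actual value inside a {"type": ..., "value": ...} object.
    return [{var: cell["value"] for var, cell in binding.items()}
            for binding in result["results"]["bindings"]]

# Live call (requires network):
#   result = json.load(urlopen(DBPEDIA + "?" + urlencode(
#       {"query": query, "format": "application/sparql-results+json"})))
sample = {"results": {"bindings": [
    {"film": {"type": "uri",
              "value": "http://dbpedia.org/resource/Jaws_(film)"},
     "director": {"type": "uri",
                  "value": "http://dbpedia.org/resource/Steven_Spielberg"}},
]}}
print(bindings_to_rows(sample))
```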
10. freeCodeCamp Open Data
freeCodeCamp is an open source community. It matters because it enables you to learn to code, build pro bono projects for nonprofits, and get a job as a developer.
To make this happen, the freeCodeCamp.org community generates enormous amounts of data every month, which it has turned into open data.
You will find a variety of things in this repository: datasets, analyses of them, and even demos of projects based on the freeCodeCamp data. You can also find links to external projects involving the freeCodeCamp data.
It can help with a diversity of projects and tasks you may have in mind. Whether it is web analytics, social media analytics, social network analysis, education analysis, data visualization, data-driven web development, or bots, the data offered by this community can be extremely useful and effective.
11. Yelp Open Datasets
The Yelp dataset is basically a subset of Yelp’s own businesses, reviews, and user data, made available for personal, educational, and academic use.
There are 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan areas included in Yelp Open Datasets.
You can use them for different purposes. Since they are available as JSON files, you can use them to teach students about databases, to learn NLP, or as sample production data while you learn how to design mobile apps.
In this dataset, each file is composed of a single object type, with one JSON object per line.
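Such a JSON Lines layout is easy to parse. In this sketch, the field names (business_id, name, stars) follow the business file's documented schema, and a short list of strings stands in for an open file handle:

```python
import json

# Each Yelp dataset file is JSON Lines: one JSON object per line, all of
# the same type, so we can decode the file line by line.
def load_jsonl(lines):
    return [json.loads(line) for line in lines if line.strip()]

# In real use, pass an open file object; these strings mimic two lines
# of the business file (values invented for illustration).
sample = [
    '{"business_id": "abc123", "name": "Cafe One", "stars": 4.5}',
    '{"business_id": "def456", "name": "Diner Two", "stars": 3.0}',
]
businesses = load_jsonl(sample)
print(sum(b["stars"] for b in businesses) / len(businesses))  # 3.75
```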
12. UNICEF Dataset
Since UNICEF concerns itself with a wide variety of critical issues, it has compiled relevant data on education, child labor, child disability, child mortality, maternal mortality, water and sanitation, low birth-weight, antenatal care, pneumonia, malaria, iodine deficiency disorder, female genital mutilation/cutting, and adolescents.
UNICEF’s open datasets published on the IATI Registry (http://www.iatiregistry.org/publisher/unicef) have been extracted directly from UNICEF’s operating system (VISION) and other data systems, and they reflect inputs made by individual UNICEF offices.
The good thing is that these datasets are updated regularly. Every month, the data is refreshed to make it more comprehensive, reliable, and accurate.
You can freely and easily access this data. In order to do so, you can download this data in CSV format. You can also preview sample data prior to downloading it.
While anybody can explore and visualize UNICEF’s datasets, there are three principal publishers:
UNICEF’s Aid Transparency Portal: you can access the datasets far more easily through this portal. It also includes details for each country that UNICEF works in.
d-portal: currently in beta, this portal lets you explore IATI data.
You can search information related to development activities, budgets, and more, and explore it country by country.
The publisher’s data platform: on this platform, you can easily access statistics, charts, and metrics on data accessed via the IATI Registry. Clicking on the headers lets you sort many of the tables you see on the platform. You will also find many of the datasets on the platform in machine-readable JSON format.
13. Kaggle Datasets
Kaggle is great because it promotes the use of different dataset publication formats. Better still, it strongly recommends that dataset publishers share their data in an accessible, non-proprietary format.
The platform supports open and accessible data formats, which matters not just for access but for whatever you want to do with the data. Kaggle therefore clearly defines which file formats are recommended when sharing data.
The unique thing about Kaggle Datasets is that it is not just a data repository: each dataset is also a community where you can discuss the data, find public code and techniques, and conceptualize your own projects in Kernels.
Kaggle supports file types such as CSV, JSON, SQLite, archives, and BigQuery. You can find a variety of resources to start working on your open data project.
The best part is that Kaggle allows you to publish and share datasets privately or publicly.
14. LODUM
LODUM is the Open Data initiative of the University of Münster. Under this initiative, anyone can access public information about the university in machine-readable formats and easily reuse it as per their needs.
Under this project, open data about scientific artifacts is made available, encoded as Linked Data.
With the help of Linked Data, it is possible to share and use data, ontologies, and various metadata standards. It is, in fact, envisaged that Linked Data will become the accepted standard for providing metadata, and the data itself, on the Web.
The SPARQL package makes it possible to connect to a SPARQL endpoint over HTTP and pose a SELECT query or an update query (LOAD, INSERT, DELETE).
15. UCI Machine Learning Repository
The UCI Machine Learning Repository serves as a comprehensive collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.
At present, the repository offers 463 datasets as a service to the machine learning community.
The Center for Machine Learning and Intelligent Systems at the University of California, Irvine hosts and maintains it. David Aha originally created it as a graduate student at UC Irvine.
Since then, students, educators, and researchers all over the world have used it as a reliable source of machine learning datasets.
Each dataset has its own webpage that lists all the known details, including any relevant publications that investigate it. You can download the datasets as ASCII files, often in the useful CSV format.
Dataset details are summarized by aspects such as attribute types, number of instances, number of attributes, and year published, all of which can be sorted and searched.
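Since many of these files are headerless CSVs, the column names come from the dataset's description page. A minimal loading sketch, using the well-known Iris columns as the example:

```python
import csv
import io

# Many UCI files are headerless ASCII/CSV, so column names must be taken
# from the dataset's description page. The Iris attributes below are the
# classic example; substitute those of whichever dataset you download.
COLUMNS = ["sepal_length", "sepal_width",
           "petal_length", "petal_width", "species"]

def load_uci_csv(text, columns):
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader if row]

# Two lines in the Iris file's format (values are illustrative).
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
rows = load_uci_csv(sample, COLUMNS)
print(rows[0]["species"])  # Iris-setosa
```

In practice you would read the downloaded file's text into `load_uci_csv` and convert the numeric columns to floats before analysis.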
Open Data Portals and Search Engines:
While plenty of datasets are published by numerous agencies every year, very few become recognized and established.
The reason so few datasets endure as useful resources is that it is a challenge to develop, manage, and provide data in a way that people and organizations find useful and easy to use.
Below, however, is a list of a few other important open data portals and platforms that let users access open data quite easily, study its impact, and glean valuable insights.
Open data is the order of the day. The world has gradually started moving towards open systems and open data is rightly in sync with that.
Businesses and organizations that leverage open data will gain a competitive edge and will be able to dominate the future.