by Quincy Larson
We just released 3 years of freeCodeCamp chat history as Open Data — all 5 million messages of it
Two years ago, our nonprofit started a tradition of releasing large public datasets for researchers and data scientists. And today I’m thrilled to announce the release of our biggest open dataset yet.
Gitter.im is an open source chat platform designed specifically for open source communities. Unlike Slack or Discord, Gitter is truly public. Anyone can join a chatroom, and anyone can observe a chatroom without even needing to create a Gitter account.
Gitter also has a robust public API, which we’ve used to export 3 years of chat history — more than 5 million messages — into a few conveniently-organized files.
Do you prefer CSV or JSON? Take your pick. 👍 You can download the full dataset from Kaggle.
What the world can learn from this dataset
This dataset is a record of activity from freeCodeCamp’s general chatroom, which the Gitter team has told me was once the most active room on all of Gitter. (We’ve closed the room to instead focus on more specialized rooms.)
The dataset contains posts from learners, bots, moderators, and contributors between December 31, 2014 and December 9, 2017.
A couple of data scientists have already started analyzing the dataset.
For example, here’s a chart of overall activity in this room.
Activity peaked in July 2017, when we stopped sending new campers to the room after they created their freeCodeCamp.org account, opting instead to let campers discover the chatroom system for themselves.
Even though the total volume of posts has dropped, conversations have become richer, and the room still averages about 1 post per minute.
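If you want to reproduce an activity-over-time view like this yourself, here is a minimal Python sketch that counts messages per calendar month. It assumes each message carries an ISO 8601 timestamp string. The function name and the sample timestamps are made up for illustration, so check the actual files on Kaggle for the real field names and formats.

```python
from collections import Counter
from datetime import datetime


def messages_per_month(timestamps):
    """Count messages per calendar month from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # Swap a trailing "Z" (UTC) for an explicit offset so that
        # datetime.fromisoformat accepts it on older Python versions.
        sent = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        counts[sent.strftime("%Y-%m")] += 1
    # Sort by month key so the result reads chronologically.
    return dict(sorted(counts.items()))


# Made-up sample timestamps, NOT real rows from the dataset.
sample = [
    "2017-06-30T23:59:10.000Z",
    "2017-07-01T00:00:05.000Z",
    "2017-07-15T12:30:00.000Z",
    "2017-08-02T08:00:00.000Z",
]
monthly = messages_per_month(sample)
print(monthly)
```

Feeding the real timestamp column through the same function should give you the numbers behind a monthly activity chart.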
And an analyst at Kaggle asked questions of the data like: who are the chatroom regulars, and how many of the messages come from them?
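One simple way to approach a question like that is to count messages per author and see what share comes from the most active few. The sketch below is only an illustration, not the analyst's actual method. The helper name, the `top_n` cutoff, and the sample usernames are all assumptions.

```python
from collections import Counter


def regulars_share(usernames, top_n):
    """Return the top_n most active authors and their share of all messages."""
    counts = Counter(usernames)
    top = counts.most_common(top_n)
    total = sum(counts.values())
    # Guard against an empty input so we never divide by zero.
    share = sum(n for _, n in top) / total if total else 0.0
    return [user for user, _ in top], share


# Made-up author names, one entry per message. NOT real dataset values.
sample_authors = ["ana", "bo", "ana", "cy", "ana", "bo", "dee", "ana"]
regulars, share = regulars_share(sample_authors, top_n=2)
print(regulars, f"{share:.0%}")
```

On the full dataset you would pass in the column that holds each message's author, whatever it is named in the file you downloaded.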
But these analyses are just the tip of the iceberg.
“There is lots more to do with this dataset. I haven’t even taken a look at by far the most interesting feature of all: its textual NLP content. This dataset is a record of millions of messages from novice computer science learners and learning community members; what can you discover about the language they use, and/or about freeCodeCamp itself, by examining the text content of their speech? The possibilities are major!” — Aleksey Bilogur
Who is this dataset for?
If you’ve been looking for an open dataset for training natural language processing algorithms, give this a shot. It’s several times larger than the Brown Corpus, and its messages are authored by tens of thousands of people from all around the world.
Also, if you’re interested in education — specifically technology education — then this dataset may be particularly relevant to your research. (You may want to check out our 2017 New Coder Survey dataset as well.)
And if you’re learning data science or machine learning, this dataset is a great place to start. The data is well-organized, the contents are diverse, and there are undoubtedly a lot of insights about human behavior in there.
To get your creative juices flowing, here are some fun articles Evaristo Caraballo has written based on insights from this dataset:
The Emoji developers use most — based on my analysis of 3.5GB of chat logs
Emoji have drastically changed the way we communicate in social media.
The 12 YouTube videos new developers mention the most
The freeCodeCamp community generates gigabytes of data each week. One of the most active parts of the community is the…
The 10 GitHub repos new developers mention the most
The freeCodeCamp community generates gigabytes of data each week. One of the most active parts of the community is the…
That’s all. Have fun, and happy crunching!
And if you found this article interesting, you should follow me on Twitter. I only tweet about programming and technology, and I won’t waste your time :)