by Felipe Hoffa
Want people to actually answer your Stack Overflow question? Add a question mark.
Last week, my team at Google announced that we’d be hosting all of Stack Overflow’s Q&A data on BigQuery.
Here are some of the most interesting insights about Stack Overflow that we’ve uncovered so far.
Setting up the data dump
Nick Craver at Stack Overflow announced a new dataset dump on Friday:
We quickly loaded the full data dump into BigQuery:
If you want an answer, use a question mark
Sara Robinson discovered that only 22% of Stack Overflow questions end with a question mark.
So I thought — hm… that’s interesting. But does adding a “?” actually help you get answers?
I counted how many questions got an “accepted answer,” then grouped them by whether or not they ended with a question mark.
It turns out that in 2016, 78% of questions ending in “?” got an accepted answer, versus only 73% of questions that didn’t end in “?”. And this pattern remains consistent if you look back through the years.
So if you want people to actually answer your Stack Overflow questions, end them with a question mark.
What about the number of answers a given question gets? Do questions that end with a “?” get more replies?
Yes, they do:
In 2015 and 2016, questions ending with a question mark received at least 7% more answers on average. The gap is even more noticeable in 2008 and 2009, when questions with a “?” received 23% more answers than questions without one.
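For readers who want to reproduce the “X% more answers” figures from the query output: they read as a relative difference between two yearly averages. Here is a minimal Python sketch, assuming that is how the percentages were derived (the post doesn't spell it out). The function name and the sample averages are hypothetical, not values from the dataset.

```python
def relative_uplift(avg_with_qmark: float, avg_without_qmark: float) -> float:
    """Percent by which the first average exceeds the second."""
    if avg_without_qmark <= 0:
        raise ValueError("baseline average must be positive")
    return (avg_with_qmark / avg_without_qmark - 1) * 100


# Made-up averages, for illustration only.
uplift = relative_uplift(1.50, 1.25)
print(f"{uplift:.1f}% more answers")
```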
Here’s the query I ran to get these results:
#standardSQL
SELECT
  EXTRACT(YEAR FROM creation_date) year,
  IF(title LIKE '%?', 'ends with ?', 'does not') ends_with_question,
  ROUND(COUNT(accepted_answer_id) * 100 / COUNT(*), 2) AS answered,
  ROUND(AVG(answer_count), 3) AS avg_answers
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE creation_date < (
  SELECT TIMESTAMP_SUB(MAX(creation_date), INTERVAL 24*90 HOUR)
  FROM `bigquery-public-data.stackoverflow.posts_questions`
)
GROUP BY 1, 2
ORDER BY 1, 2
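If you'd like to see what that query computes without opening BigQuery, here is a rough Python sketch of the same steps on a handful of made-up rows. It drops questions from the last 90 days before the newest one (the `WHERE` clause), groups by year and by whether the title ends in “?”, then reports the share with an accepted answer and the average answer count. The sample rows and the `summarize` helper are hypothetical, and this is an approximation of the SQL, not the code behind the charts.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical rows: (title, creation_date, accepted_answer_id, answer_count).
QUESTIONS = [
    ("How do I parse JSON in Go?", datetime(2016, 3, 1), 101, 3),
    ("Parsing JSON in Go", datetime(2016, 4, 2), None, 1),
    ("Why is my loop slow?", datetime(2016, 5, 9), None, 2),
    ("Segfault in C program", datetime(2016, 6, 1), 102, 1),
    ("Too recent to count?", datetime(2016, 12, 1), None, 0),
]


def summarize(rows, cutoff_days=90):
    """Group by (year, ends with '?'); report % accepted and mean answers."""
    # Mirror the WHERE clause: keep rows older than max(creation_date) - 90 days.
    cutoff = max(created for _, created, _, _ in rows) - timedelta(days=cutoff_days)
    groups = defaultdict(list)
    for title, created, accepted_id, answer_count in rows:
        if created < cutoff:
            label = "ends with ?" if title.endswith("?") else "does not"
            groups[(created.year, label)].append((accepted_id, answer_count))

    result = {}
    for key in sorted(groups):  # same ordering idea as ORDER BY 1, 2
        members = groups[key]
        accepted = sum(1 for accepted_id, _ in members if accepted_id is not None)
        result[key] = {
            "answered": round(accepted * 100 / len(members), 2),
            "avg_answers": round(sum(n for _, n in members) / len(members), 3),
        }
    return result


for key, stats in summarize(QUESTIONS).items():
    print(key, stats)
```

The last sample row is newer than the cutoff, so it is excluded from both groups, just as the most recent questions are excluded by the query.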
I built the above visualizations using re:dash.
Here’s a bonus visualization I did of how long it takes to get an answer depending on which programming language you’re asking about — and the total volume of questions and answers for each language:
Here’s an interactive version.
And here’s the query I ran to get these results:
#standardSQL
SELECT
  tag,
  COUNT(*) c,
  COUNT(DISTINCT b.owner_user_id) answerers,
  AVG(TIMESTAMP_DIFF(b.creation_date, a.creation_date, MINUTE)) time_to_answer
FROM (
  SELECT *
  FROM (
    SELECT
      id,
      EXTRACT(YEAR FROM creation_date) year,
      SPLIT(tags, '|') tags,
      accepted_answer_id,
      creation_date
    FROM `bigquery-public-data.stackoverflow.posts_questions`
  ), UNNEST(tags) tag
  WHERE accepted_answer_id IS NOT NULL
) a
LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` b
ON a.accepted_answer_id = b.id
GROUP BY 1
HAVING c > 300
ORDER BY 2 DESC
LIMIT 1000
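Here is a small Python sketch of what this second query does, again on made-up rows. It splits the `|`-separated tags, emits one row per tag (the `UNNEST`), looks up each accepted answer (the `LEFT JOIN`), and averages the minutes between question and accepted answer per tag. The sample data, the `time_to_answer_by_tag` helper, and its `threshold` parameter are hypothetical, and the whole-minute arithmetic only approximates `TIMESTAMP_DIFF`.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-ins for posts_questions and posts_answers.
QUESTIONS = [
    # (id, tags, accepted_answer_id, creation_date)
    (1, "python|pandas", 10, datetime(2016, 1, 1, 12, 0)),
    (2, "python", 11, datetime(2016, 1, 2, 9, 0)),
    (3, "go", None, datetime(2016, 1, 3, 8, 0)),  # no accepted answer: skipped
]
ANSWERS = {
    # id: (owner_user_id, creation_date)
    10: (500, datetime(2016, 1, 1, 12, 30)),
    11: (501, datetime(2016, 1, 2, 9, 10)),
}


def time_to_answer_by_tag(questions, answers, threshold=300, limit=1000):
    """Per tag: question count, distinct answerers, mean minutes to accepted answer."""
    per_tag = defaultdict(list)
    for _id, tags, accepted_id, asked_at in questions:
        if accepted_id is None:
            continue  # WHERE accepted_answer_id IS NOT NULL
        owner, minutes = None, None
        match = answers.get(accepted_id)  # LEFT JOIN: the answer row may be missing
        if match is not None:
            owner, answered_at = match
            minutes = int((answered_at - asked_at).total_seconds() // 60)
        for tag in tags.split("|"):  # SPLIT + UNNEST: one row per tag
            per_tag[tag].append((owner, minutes))

    rows = []
    for tag, members in per_tag.items():
        if len(members) <= threshold:  # HAVING c > threshold
            continue
        waits = [m for _, m in members if m is not None]
        rows.append({
            "tag": tag,
            "c": len(members),
            "answerers": len({o for o, _ in members if o is not None}),
            "time_to_answer": sum(waits) / len(waits) if waits else None,
        })
    rows.sort(key=lambda r: r["c"], reverse=True)  # ORDER BY 2 DESC
    return rows[:limit]


# The toy data is tiny, so lower the threshold to see any rows at all.
for row in time_to_answer_by_tag(QUESTIONS, ANSWERS, threshold=0):
    print(row)
```

Note that a question with several tags counts once toward each of them, so the per-tag totals add up to more than the number of questions.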
Here’s Stack Overflow’s CEO announcing the fully query-able dataset:
Stack Overflow's entire data set is freely available, and now you can query it using Google's BigQuery. https://t.co/x1EAmY6NnD (Joel Spolsky, @spolsky, December 16, 2016)
One final interesting study: Graham Polley wrote a great post showing how to take Stack Overflow comments from BigQuery, run a sentiment analysis process on them with our Natural Language API and Dataflow, then bring them back to BigQuery to discover the most positive/negative communities.
Want to learn more?
Check out the GCP Big Data blog post, which includes queries that JOIN Stack Overflow’s data with other datasets like Hacker News and GitHub.
- InfoWorld’s report: Google BigQuery provides insight into Stack Overflow discussion data
- Comments from the Hacker News discussion
- Kaitlin Pike’s announcement on the Stack Overflow official blog