<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ bigquery - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ bigquery - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 26 May 2026 10:37:30 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/bigquery/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Google BigQuery Beginner's Guide – How to Analyze Large Datasets ]]>
                </title>
                <description>
                    <![CDATA[ By Ambreen Khan Gone are the days of storing your data in a CSV file or an Excel spreadsheet. If you want to quickly analyze millions of data rows in seconds, BigQuery is the way to go. In this getting started guide, we'll learn about BigQuery and ho... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/google-bigquery-beginners-guide/</link>
                <guid isPermaLink="false">66d45d99aad1510d0766b5e3</guid>
                
                    <category>
                        <![CDATA[ bigquery ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Google ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 12 Jul 2021 11:26:39 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/07/web-1.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ambreen Khan</p>
<p>Gone are the days of storing your data in a CSV file or an Excel spreadsheet. If you want to quickly analyze millions of data rows in seconds, BigQuery is the way to go.</p>
<p>In this getting started guide, we'll learn about BigQuery and how we can use it to query and analyze data.</p>
<h2 id="heading-what-is-bigquery">What is BigQuery?</h2>
<p>BigQuery is an enterprise data warehouse that many companies use who need a fully-managed cloud based solution for their massive datasets. </p>
<p>BigQuery's serverless architecture allows you to quickly execute standard SQL queries and analyze millions of data rows in seconds.  You can then store your data both in Google Cloud Storage in files and buckets or in BigQuery storage. </p>
<p>BigQuery also has excellent integrations with other GCP products, like Data Flow and Data Studio that makes it a great choice for data analytics tasks.</p>
<h2 id="heading-before-you-begin">Before You Begin:</h2>
<p>We are going to query tables in a public dataset that Google has provided to try out BigQuery using the Google Cloud Platform. Therefore, this guide assumes that:</p>
<ul>
<li>You have an access on <a target="_blank" href="https://cloud.google.com/free/?gclid=CjwKCAjw55-HBhAHEiwARMCsziVtllCq8mRIWlXVVztmn6HkzAlkuajtZeYMInLQmykNGfbEjz2tfRoCFs0QAvD_BwE&amp;gclsrc=aw.ds">Google Cloud Platform</a>.</li>
<li>You have already created a <a target="_blank" href="https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui#before-you-begin">Google Cloud project</a>.</li>
<li>Google sandbox environment is up and running. </li>
</ul>
<h2 id="heading-how-to-access-a-public-dataset">How to Access a Public Dataset</h2>
<p>A public dataset is available to the general public through the <a target="_blank" href="https://cloud.google.com/public-datasets">Google Cloud Public Dataset Program</a>. We'll use a Hacker News dataset that contains all stories and comments from Hacker News from its launch in 2006 to present. Let's get started.</p>
<p>Navigate to <a target="_blank" href="https://console.cloud.google.com/marketplace/product/y-combinator/hacker-news">Hacker News dataset</a> and click the VIEW DATASET button. It will take you to the Google Cloud Platform login screen. Login to the account and it will open the BigQuery Editor window with the dataset. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-51.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-the-bigquery-interface-is-organized">How the BigQuery Interface Is Organized</h2>
<p>BigQuery is structured as a hierarchy with 4 levels:</p>
<ul>
<li>Projects: Top-level containers that store the data</li>
<li>Datasets: Within projects, datasets allow you to organize your data and hold one or more tables of data</li>
<li>Tables: Within datasets, tables hold actual data.</li>
<li>Jobs: task performed on data such as running queries, loading data, and exporting data.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-53.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Note:</strong> Please note that while working with tables, you'll also notice that:</p>
<ul>
<li>Tables are broken out by day meaning that you will need to use a wildcard, or * to pull a larger date range.</li>
<li>There is also an “intraday” table that will give you data for the last 24 hours.</li>
</ul>
<h2 id="heading-how-to-check-the-table-schema">How to Check the Table Schema</h2>
<p>Click on the table name. This will allow you to see what columns are in the table, as well as some buttons to perform various operations on the table.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-55.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-preview-the-data">How to Preview the Data</h2>
<p>Use the preview button to get a sample of some rows in the table. <a target="_blank" href="https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_select_">Don’t do a <code>SELECT *</code> in BigQuery</a>:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-56.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-query-big-data">How to Query Big Data</h2>
<p>SQL statements are used to perform various database tasks, such as querying data, creating tables, and updating databases.</p>
<h3 id="heading-basic-queries">Basic Queries</h3>
<p>Basic queries contain the following components:</p>
<ul>
<li><code>SELECT</code> (required): identifies the columns to be included in the query</li>
<li><code>FROM</code> (required): the table that contains the columns in the SELECT statement</li>
<li><code>WHERE</code>: a condition for filtering records</li>
<li><code>ORDER BY</code>: Used to sort the result-set in ascending or descending order.</li>
<li><code>GROUP BY</code>: how to aggregate data in the result set</li>
</ul>
<h2 id="heading-how-to-compose-a-query-in-bigquery">How to Compose a Query in BigQuery</h2>
<p>For our first query, let’s find out what are the top 5 domains shared in Hacker News in year 2021 so far (query executed on July 9th 2021).</p>
<p>Click the <strong>Compose New query</strong> button. It will open the editor tab.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-41.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Write your first query as below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> REGEXP_EXTRACT(<span class="hljs-keyword">url</span>, <span class="hljs-string">'//([^/]*)/?'</span>) <span class="hljs-keyword">domain</span>, <span class="hljs-keyword">COUNT</span>(*) total
<span class="hljs-keyword">FROM</span> <span class="hljs-string">`bigquery-public-data.hacker_news.full`</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">url</span>!=<span class="hljs-string">''</span> <span class="hljs-keyword">AND</span> <span class="hljs-keyword">EXTRACT</span>(<span class="hljs-keyword">YEAR</span> <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">timestamp</span>)=<span class="hljs-number">2021</span>
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">domain</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> total <span class="hljs-keyword">DESC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">5</span>
</code></pre>
<p>You'll notice that BigQuery debugs your code as you construct it. If the query is valid, then a check mark appears along with the amount of data that the query will process. This helps you determine the cost of running the query. </p>
<p>If the query is invalid, then an exclamation point appears along with an error message.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-59.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To run this query, click on the Run button. In a few seconds, you should see results returned from the query:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-60.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can click on the <strong>JSON</strong> tab if you want the results in JSON format. You'll also find interesting details under the 'Execution details' column.</p>
<h2 id="heading-how-to-query-multiple-tables-using-a-wildcard-table"><strong>How to Query Multiple Tables Using a Wildcard Table</strong></h2>
<p>Wildcard tables enable you to query multiple tables using concise SQL statements. A wildcard table represents a union of all the tables that match the wildcard expression:</p>
<p><code>FROM</code>tablename.stories_*`` </p>
<h3 id="heading-tablesuffix-pseudo-column">_TABLE_SUFFIX Pseudo Column</h3>
<p>Queries with wildcard tables support the <code>_TABLE_SUFFIX</code> pseudo column in the <code>WHERE</code> clause. To restrict a query so that it scans only a specified set of tables, use the <code>_TABLE_SUFFIX</code> pseudo column in a <code>WHERE</code> clause with a condition that is a constant expression.</p>
<p>Using <code>_TABLE_SUFFIX</code> can greatly reduce the number of bytes scanned, which helps reduce the cost of running your queries.</p>
<h3 id="heading-how-to-get-data-by-providing-a-date-range">How to Get Data by Providing a Date Range</h3>
<pre><code>WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL <span class="hljs-number">36</span> MONTH))
    AND
    FORMAT_DATE(‘%Y%m%d’,DATE_SUB(CURRENT_DATE(), INTERVAL <span class="hljs-number">1</span> DAY))
</code></pre><h3 id="heading-how-to-use-unnest-to-flatten-the-date">How to Use UNNEST to Flatten the Date</h3>
<p>To convert an <code>ARRAY</code> into a set of rows, also known as "flattening," use the <a target="_blank" href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unnest_operator"><code>UNNEST</code></a> operator. <code>UNNEST</code> takes an <code>ARRAY</code> and returns a table with a single row for each element in the <code>ARRAY</code>:</p>
<pre><code>SELECT * FROM UNNEST ([<span class="hljs-string">'Ambreen'</span>, <span class="hljs-string">'Abdul'</span>, <span class="hljs-string">'Adam'</span>, <span class="hljs-string">'David'</span>]) AS names;
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-45.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-save-and-share-queries">How to Save and Share Queries</h2>
<p>You can save your queries for later use. There are 3 types of saved queries:</p>
<ul>
<li><strong>Private:</strong> Private saved queries are visible only to the user who creates them.</li>
<li><strong>Project-level:</strong> Project-level saved queries are visible to members of the predefined BigQuery IAM roles with the required <a target="_blank" href="https://cloud.google.com/bigquery/docs/saving-sharing-queries#permissions">permissions</a>.</li>
<li><strong>Public:</strong> Public saved queries are visible to anyone with a link to the query.</li>
</ul>
<h2 id="heading-summary">Summary</h2>
<p>BigQuery is much more sophisticated than what we explored in this simple tutorial. You can also export Firebase Analytics data to BigQuery, which will let you run sophisticated ad hoc queries against your analytics data. </p>
<p>And with BigQuery ML, you can create and execute machine learning models using standard SQL queries. </p>
<p>If you’re feeling excited and want to learn more about BigQuery, check out the links below.</p>
<h2 id="heading-resources">Resources:</h2>
<ul>
<li><a target="_blank" href="https://support.google.com/analytics/answer/4419694?hl=en#zippy=%2Cin-this-article">BigQuery cookbook</a> </li>
<li><a target="_blank" href="https://cloud.google.com/bigquery/docs/querying-wildcard-tables#filtering_selected_tables_using_table_suffix">Filtering selected tables using _TABLE_SUFFIX</a> </li>
<li><a target="_blank" href="https://firebase.googleblog.com/2017/03/bigquery-tip-unnest-function.html">BigQuery Tip: The UNNEST Function</a></li>
<li><a target="_blank" href="https://towardsdatascience.com/bigquery-unnest-how-to-work-with-nested-data-in-bigquery-f27006a64c3">BigQuery UNNEST: How to work with nested data in BigQuery</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Who contributed the most to open source in 2017 and 2018? Let’s analyze GitHub’s data and find out. ]]>
                </title>
                <description>
                    <![CDATA[ By Felipe Hoffa For this analysis we’ll look at all the PushEvents published by GitHub during 2017. For each GitHub user we’ll have to make our best guess to determine to which organization they belong. We’ll only look at repositories that have recei... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-top-contributors-to-github-2017-be98ab854e87/</link>
                <guid isPermaLink="false">66c362a41a1cf73cbc81f131</guid>
                
                    <category>
                        <![CDATA[ bigquery ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ open source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 24 Oct 2017 21:01:28 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/12/1_Qgsj8DRAKucO4yMpUiwHvw-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Felipe Hoffa</p>
<p>For this analysis we’ll look at all the <code>PushEvents</code> published by GitHub during 2017. For each GitHub user we’ll have to make our best guess to determine to which organization they belong. We’ll only look at repositories that have received at least 20 stars this year.</p>
<p>Here are the results I got, which you can <a target="_blank" href="https://datastudio.google.com/open/0ByGAKP3QmCjLU1JzUGtJdTlNOG8">tinker with in my the interactive Data Studio report</a>.</p>
<h3 id="heading-comparing-the-top-cloud-providers">Comparing the top cloud providers</h3>
<p>Looking at GitHub during 2017:</p>
<ul>
<li>Microsoft appears to have ~1,300 employees actively pushing code to 825 top repositories on GitHub.</li>
<li>Google displays ~900 employees active on GitHub, who are pushing code to ~1,100 top repositories.</li>
<li>Amazon appears to have only 134 active employees on GitHub, pushing code to only 158 top projects.</li>
<li>Not all projects are equal: While Googlers are contributing code to 25% more repositories than Microsoft, these repositories have collected way more stars (530,000 vs 260,000). Amazon repositories sum of 2017 stars? 27,000.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*EfhT-K6feRjyifX_K49AFg.png" alt="Image" width="800" height="608" loading="lazy"></p>
<h3 id="heading-redhat-ibm-pivotal-intel-and-facebook">RedHat, IBM, Pivotal, Intel, and Facebook</h3>
<p>If Amazon seems so far behind Microsoft and Google — what are the companies in between? According to this ranking RedHat, Pivotal, and Intel are pushing great contributions to GitHub:</p>
<p>Note that the following table combines all of IBM regional domains — while the individual regions still show up in the subsequent tables.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*KnaOtVpdmPFabCtk-saYUw.png" alt="Image" width="800" height="434" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Dy08nNIdjxBQRqQ6zXTThg.png" alt="Image" width="800" height="616" loading="lazy"></p>
<p>Facebook and IBM (US) have a similar number of GitHub users than Amazon, but the projects they contribute to have collected more stars (especially Facebook):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ZJP36ojAFyo7BcZnJ-PT3Q.png" alt="Image" width="800" height="611" loading="lazy"></p>
<p>Followed by Alibaba, Uber, and Wix:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*yG3X8Sq35S8Z9mNLv9pliA.png" alt="Image" width="800" height="602" loading="lazy"></p>
<p>GitHub itself, Apache, Tencent:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Ij2hSTZiQndHdFRsFNwb-g.png" alt="Image" width="800" height="591" loading="lazy"></p>
<p>Baidu, Apple, Mozilla:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ZRjQ0fNe39-qox3cy6OGUQ.png" alt="Image" width="800" height="626" loading="lazy"></p>
<p>Oracle, Stanford, Mit, Shopify, MongoDb, Berkeley, VmWare, Netflix, Salesforce, Gsa.gov:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*mi1gdgVUYRbTBoBuo14gtA.png" alt="Image" width="800" height="591" loading="lazy"></p>
<p>LinkedIn, Broad Institute, Palantir, Yahoo, MapBox, Unity3d, Automattic, Sandia, Travis-ci, Spotify:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*yQzsoab7AFbQ2BTnPCGbXg.png" alt="Image" width="800" height="572" loading="lazy"></p>
<p>Chromium, UMich, Zalando, Esri, IBM (UK), SAP, EPAM, Telerik, UK Cabinet Office, Stripe:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*TCbZaq4sgpjFQ9f4yFoWoQ.png" alt="Image" width="800" height="563" loading="lazy"></p>
<p>Cern, Odoo, Kitware, Suse, Yandex, IBM (Canada), Adobe, AirBnB, Chef, The Guardian:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*zXxtygHJUi4tdNr1JRNlyg.png" alt="Image" width="800" height="598" loading="lazy"></p>
<p>Arm, Macports, Docker, Nuxeo, NVidia, Yelp, Elastic, NYU, WSO2, Mesosphere, Inria</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*f6AK5xHrJIAhEn7t9569lQ.png" alt="Image" width="800" height="661" loading="lazy"></p>
<p>Puppet, Stanford (CS), DatadogHQ, Epfl, NTT Data, Lawrence Livermore Lab:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*RP5nyYdwn2d2pb05xnMxyA.png" alt="Image" width="800" height="348" loading="lazy"></p>
<h3 id="heading-my-methodology">My Methodology</h3>
<h4 id="heading-how-i-linked-github-users-to-companies">How I linked GitHub users to companies</h4>
<p>Determining the organization to which each GitHub user belongs it’s not easy — but we can use the email domains that show up in each commit message contained on PushEvents:</p>
<ul>
<li>The same email can show up in more than one user, so I only considered GitHub users able to push code to GitHub projects with more than 20 stars during the period.</li>
<li>I only counted GitHub users with more than 3 pushes during the period.</li>
<li>Users pushing code to GitHub can display many different emails on their pushes — part of how Git works. To determine the organization for each user, I looked into the email their pushes shows up most frequently.</li>
<li>Not everyone uses their organization email on GitHub. There are a lot of gmail.com, users.noreply.github.com, and other email hosting providers. Sometimes the reason for this is anonymity and protecting their corporate inboxes — but if I couldn’t see their email domain, I couldn’t count them. Sorry.</li>
<li>Sometimes employees switch organizations. I assigned them to the one that got the more pushes according to these rules.</li>
</ul>
<h4 id="heading-my-query">My query</h4>
<pre><code>#standardSQLWITHperiod AS (  SELECT *  FROM <span class="hljs-string">`githubarchive.month.2017*`</span> a),repo_stars AS (  SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, <span class="hljs-number">1</span>)[OFFSET(<span class="hljs-number">0</span>)].value repo_name   FROM period  WHERE type=<span class="hljs-string">'WatchEvent'</span>  GROUP BY <span class="hljs-number">1</span>  HAVING stars&gt;<span class="hljs-number">20</span>), pushers_guess_emails_and_top_projects AS (  SELECT *    # , REGEXP_EXTRACT(email, r<span class="hljs-string">'@(.*)'</span>) domain    , REGEXP_REPLACE(REGEXP_EXTRACT(email, r<span class="hljs-string">'@(.*)'</span>), r<span class="hljs-string">'.*.ibm.com'</span>, <span class="hljs-string">'ibm.com'</span>) domain  FROM (    SELECT actor.id      , APPROX_TOP_COUNT(actor.login,<span class="hljs-number">1</span>)[OFFSET(<span class="hljs-number">0</span>)].value login      , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, <span class="hljs-string">'$.commits[0].author.email'</span>),<span class="hljs-number">1</span>)[OFFSET(<span class="hljs-number">0</span>)].value email      , COUNT(*) c      , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos    FROM period a    JOIN repo_stars b    ON a.repo.id=b.id    WHERE type=<span class="hljs-string">'PushEvent'</span>    GROUP BY  <span class="hljs-number">1</span>    HAVING c&gt;<span class="hljs-number">3</span>  ))SELECT * FROM (  SELECT domain    , githubers    , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to    , ARRAY(        SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, <span class="hljs-string">'$.repo_name'</span>) repo_name        , CAST(JSON_EXTRACT_SCALAR(repo, <span class="hljs-string">'$.stars'</span>) AS INT64) stars        , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo         GROUP BY <span class="hljs-number">1</span>, <span class="hljs-number">2</span>         HAVING githubers_from_domain&gt;<span class="hljs-number">1</span>         ORDER BY stars DESC LIMIT <span class="hljs-number">3</span>      ) top    , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, <span class="hljs-string">'$.stars'</span>) AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to  FROM (    SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos    FROM pushers_guess_emails_and_top_projects    #WHERE domain IN UNNEST(SPLIT(<span class="hljs-string">'google.com|microsoft.com|amazon.com'</span>, <span class="hljs-string">'|'</span>))    WHERE domain NOT IN UNNEST(SPLIT(<span class="hljs-string">'gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru|126.com|protonmail.com'</span>, <span class="hljs-string">'|'</span>)) # email hosters    GROUP BY <span class="hljs-number">1</span>    HAVING githubers &gt; <span class="hljs-number">30</span>  )  WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo  GROUP BY repo))&gt;<span class="hljs-number">4</span> # second filter email hosters)ORDER BY githubers DESC
</code></pre><h3 id="heading-faq">FAQ</h3>
<h4 id="heading-if-an-organization-has-1500-repositories-why-do-you-only-count-200-if-a-repository-has-7000-stars-why-do-you-only-show-1500"><strong>If an organization has 1,500 repositories, why do you only count 200? If a repository has 7,000 stars, why do you only show 1,500?</strong></h4>
<p>I’m filtering for relevancy. I’m only counting stars given during 2017. For example, Apache has &gt;1,500 repositories on GitHub, but only 205 have received more than 20 stars this year.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*wf86s1GygY1u283nA6LoYQ.png" alt="Image" width="800" height="678" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*vjycrF8zFYdJIBCV2HEkCg.png" alt="Image" width="800" height="316" loading="lazy"></p>
<h4 id="heading-is-this-the-state-of-open-source">Is this the state of open source?</h4>
<p>Note that analyzing GitHub doesn’t include top communities like Android, Chromium, GNU, Mozilla, nor the the Apache or Eclipse Foundation, and <a target="_blank" href="https://developers.google.com/open-source/organizations">other</a> projects that choose to run most of their activities outside of GitHub.</p>
<h4 id="heading-you-were-unfair-to-my-organization"><strong>You were unfair to my organization.</strong></h4>
<p>I can only count what I can see. Please challenge my assumptions and tell me how you would measure things in a better way. Working queries would be the best way.</p>
<p>For example, see how their ranking changes when I combine IBM’s region-based domains into their top one with one SQL transformation:</p>
<pre><code>SELECT *, REGEXP_REPLACE(REGEXP_EXTRACT(email, r<span class="hljs-string">'@(.*)'</span>), r<span class="hljs-string">'.*.ibm.com'</span>, <span class="hljs-string">'ibm.com'</span>) domain
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/1*sKjuzOO2OYPcKGAzq9jDYw.png" alt="Image" width="800" height="455" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ywkHH3kMMVdGhXe6LDq7IA.png" alt="Image" width="800" height="449" loading="lazy">
<em>IBM’s relative position moves significantly when you combine their regional email domains.</em></p>
<h4 id="heading-reactions">Reactions</h4>
<p><a target="_blank" href="https://redmonk.com/jgovernor/2017/10/25/some-thoughts-on-the-top-contributors-to-github-2017/"><strong>Some thoughts on "the top contributors to GitHub 2017".</strong></a><br><a target="_blank" href="https://redmonk.com/jgovernor/2017/10/25/some-thoughts-on-the-top-contributors-to-github-2017/">_Yesterday Felipe Hoffa from the Google Dev Rel team published some interesting research looking at corporate usage of…_redmonk.com</a></p>
<h3 id="heading-next-steps">Next steps</h3>
<p>I’ve been wrong before — and it will probably happen again. Please take a look at all the raw data available and question all my assumptions — it will be cool to see what results you get.</p>
<p><a target="_blank" href="https://datastudio.google.com/open/0ByGAKP3QmCjLU1JzUGtJdTlNOG8">Play with the interactive Data Studio report</a>.</p>
<p>Thanks to <a target="_blank" href="https://www.freecodecamp.org/news/the-top-contributors-to-github-2017-be98ab854e87/undefined">Ilya Grigorik</a> for keeping <a target="_blank" href="http://githubarchive.org">GitHub Archive</a> well fed and full of GitHub data all these years!</p>
<p>Want more stories? Check my <a target="_blank" href="http://medium.com/@hoffa/">Medium</a>, <a target="_blank" href="http://twitter.com/felipehoffa">follow me on twitter</a>, and subscribe to <a target="_blank" href="https://reddit.com/r/bigquery">reddit.com/r/bigquery</a>. And <a target="_blank" href="https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/">try BigQuery</a> — every month you get a full terabyte of analysis for <a target="_blank" href="https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card">free</a>.</p>
<p><a target="_blank" href="https://hackernoon.com/winning-arguments-with-data-leading-with-commas-in-sql-672b3b81eac9"><strong>Leading with commas — ugly or efficient? An investigation over 320 GB of SQL code</strong></a><br><a target="_blank" href="https://hackernoon.com/winning-arguments-with-data-leading-with-commas-in-sql-672b3b81eac9">_Winning arguments with data: Let’s analyze 320 Gigabytes of open source SQL code to determine if we should use trailing…_hackernoon.com</a><a target="_blank" href="https://hackernoon.com/some-coders-like-it-hot-but-most-prefer-colder-climates-4703c3f02fbb"><strong>Some coders like it hot — but most prefer colder climates</strong></a><br><a target="_blank" href="https://hackernoon.com/some-coders-like-it-hot-but-most-prefer-colder-climates-4703c3f02fbb">_Previously we found some of the major concentration of open source coders around pretty cold places (Iceland, Sweden…_hackernoon.com</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
