<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ data migration - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ data migration - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 05 May 2026 22:20:39 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/data-migration/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Perform Database Migrations using Go Migrate ]]>
                </title>
                <description>
                    <![CDATA[ By Rwitesh Bera Since its introduction, the programming language Go (also known as Golang) has become increasingly popular. It is known for its simplicity and efficient performance, similar to that of a lower-level language such as C++.  While workin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/database-migration-golang-migrate/</link>
                <guid isPermaLink="false">66d460c7052ad259f07e4b2c</guid>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 26 Jan 2023 23:17:14 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/Blue-and-Pink-3D-Elements-Student-Part-Time-Graphic-Designer-Video-Resume-Talking-Presentation.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Rwitesh Bera</p>
<p>Since its introduction, the programming language Go (also known as Golang) has become increasingly popular. It is known for its simplicity and efficient performance, similar to that of a lower-level language such as C++. </p>
<p>While working with a database, schema migration is one of the most important tasks that developers do throughout the project lifecycle. In this article, I will explain what database migration is and how to manage it using <a target="_blank" href="https://github.com/golang-migrate/migrate">go-migrate</a>.</p>
<h2 id="heading-what-is-a-database-migration">What is a Database Migration?</h2>
<p>A database migration, also known as a schema migration, is a set of changes to be made to the structure of objects within a relational database. </p>
<p>It is a way to manage and implement incremental changes to the structure of data in a controlled, programmatic manner. These changes are often reversible, meaning they can be undone or rolled back if required. </p>
<p>The process of migration helps to change the database schema from its current state to a new desired state, whether it involves adding tables and columns, removing elements, splitting fields, or changing types and constraints. </p>
<p>By managing these changes in a programmatic way, it becomes easier to maintain consistency and accuracy in the database, as well as keep track of the history of modifications made to it.</p>
<h2 id="heading-setup-and-installation">Setup and Installation</h2>
<p><a target="_blank" href="https://github.com/golang-migrate/migrate">migrate</a> is a CLI tool that you can use to run migrations. You can easily install it on various operating systems such as Linux, Mac, and Windows using curl, Homebrew, and Scoop, respectively. </p>
<p>For more information on how to install and use the tool, you can refer to the official documentation.</p>
<p>To install the migrate CLI tool using <a target="_blank" href="https://scoop.sh/">scoop</a> on Windows, you can follow these steps:</p>
<pre><code class="lang-bash">$ scoop install migrate
</code></pre>
<p>To install the migrate CLI tool using <strong>curl</strong> on Linux, you can follow these steps: </p>
<pre><code class="lang-bash">$ curl -L https://packagecloud.io/golang-migrate/migrate/gpgkey| apt-key add -
$ <span class="hljs-built_in">echo</span> <span class="hljs-string">"deb https://packagecloud.io/golang-migrate/migrate/ubuntu/ <span class="hljs-subst">$(lsb_release -sc)</span> main"</span> &gt; /etc/apt/sources.list.d/migrate.list
$ apt-get update
$ apt-get install -y migrate
</code></pre>
<p>To install the migrate CLI tool using <strong>Homebrew</strong> on Mac, you can follow these steps: </p>
<pre><code class="lang-bash">$ brew install golang-migrate
</code></pre>
<h2 id="heading-how-to-create-a-new-migration">How to Create a New Migration</h2>
<p>Create a directory like <code>database/migration</code> to store all the migration files.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-263.png" alt="Image" width="600" height="400" loading="lazy">
<em>Source files structure in GoLand IDE</em></p>
<p>Next, create migration files using the following command:</p>
<pre><code class="lang-bash">$ migrate create -ext sql -dir database/migration/ -seq init_mg
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-267.png" alt="Image" width="600" height="400" loading="lazy">
<em>Terminal output displaying successful creation of migration</em></p>
<p>You use <code>-seq</code> to generate a sequential version number, and <code>init_mg</code> is the name of the migration.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-269.png" alt="Image" width="600" height="400" loading="lazy">
<em>Source files structure in GoLand IDE</em></p>
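<p>With sequential versioning, the command above should produce an empty pair of files along these lines:</p>
<pre><code class="lang-bash">database/migration/000001_init_mg.up.sql
database/migration/000001_init_mg.down.sql
</code></pre>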
<p>A migration typically consists of two distinct files, one for moving the database to a new state (referred to as "up") and another for reverting the changes made to the previous state (referred to as "down"). </p>
<p>The "up" file is used to implement the desired changes to the database, while the "down" file is used to undo those changes and return the database to its previous state.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/Database-migration.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Flow of database migration</em></p>
<p>The format of these files for SQL is:</p>
<pre><code class="lang-bash">{version}_{title}.down.sql
{version}_{title}.up.sql
</code></pre>
<p>When you create migration files, they will be empty by default. To implement the changes you want, you will need to fill them with the appropriate SQL queries.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-282.png" alt="Image" width="600" height="400" loading="lazy">
<em>SQL queries for migrating data</em></p>
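<p>As a minimal illustration (the table and columns here are just an example, not the exact ones from the screenshot), a first migration pair might contain:</p>
<pre><code class="lang-sql">-- 000001_init_mg.up.sql
CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL
);

-- 000001_init_mg.down.sql
DROP TABLE IF EXISTS users;
</code></pre>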
<h3 id="heading-how-to-run-migration-up">How to Run Migration Up</h3>
<p>In order to execute the SQL statements in the migration files, migrate requires a valid connection to a Postgres database. </p>
<p>To accomplish this, you will need to provide a connection string in the proper format.</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> -verbose up
</code></pre>
<p>Now in your Postgres shell, you can list the newly created tables with <code>\d+</code>, or describe a specific table by passing its name:</p>
<pre><code class="lang-bash">\d+

\d+ table_name
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-286.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displaying table data in Postgres</em></p>
<h3 id="heading-how-to-rollback-migrations">How to Rollback Migrations</h3>
<p>If you want to revert the migration, you can do that using the <code>down</code> command:</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> -verbose down
</code></pre>
<p>It will drop the <code>email</code> column from both tables, reversing the change that the <code>000002_init_mg.up.sql</code> file introduced (the corresponding statements live in <code>000002_init_mg.down.sql</code>).</p>
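<p>For reference, that second pair of files could look something like this (the table names are purely illustrative):</p>
<pre><code class="lang-sql">-- 000002_init_mg.up.sql
ALTER TABLE users ADD COLUMN email VARCHAR(100);
ALTER TABLE orders ADD COLUMN email VARCHAR(100);

-- 000002_init_mg.down.sql
ALTER TABLE users DROP COLUMN email;
ALTER TABLE orders DROP COLUMN email;
</code></pre>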
<p>Now, let's check the database and see if <code>email</code> has been deleted or not:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/Screenshot_20230126_102731.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displaying updated table data in Postgres</em></p>
<h3 id="heading-how-to-resolve-migration-errors">How to Resolve Migration Errors</h3>
<p>If a migration contains an error and is executed, migrate will prevent any further migrations from being run on the same database. </p>
<p>An error message like <code>Dirty database version 1. Fix and force version</code> will be displayed, even after the error in the migration is fixed. This indicates that the database is "dirty" and needs to be investigated. </p>
<p>It is necessary to determine if the migration was applied partially or not at all. Once you've determined this, the database version should be forced to reflect its true state using the force command.</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> force &lt;VERSION&gt;
</code></pre>
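<p>When investigating whether the migration was applied partially, it can also help to peek at migrate's bookkeeping table in Postgres. By default it is called <code>schema_migrations</code> and stores the current version along with a dirty flag:</p>
<pre><code class="lang-sql">SELECT version, dirty FROM schema_migrations;
</code></pre>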
<h3 id="heading-how-to-add-commands-in-a-makefile">How to Add Commands in a Makefile</h3>
<pre><code class="lang-makefile"><span class="hljs-section">migration_up: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" -verbose up</span>

<span class="hljs-section">migration_down: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" -verbose down</span>

<span class="hljs-section">migration_fix: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" force VERSION</span>
</code></pre>
<p>Now, you can run <code>$ make migration_up</code> for 'up', <code>$ make migration_down</code> for 'down', and <code>$ make migration_fix</code> to fix the migration issue. </p>
<p>Before running the makefile, ensure that the correct version number is included in the <code>migration_fix</code> command.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Migration systems typically generate files that can be shared across developers and multiple teams. They can also be applied to multiple databases and maintained in version control. </p>
<p>Keeping a record of changes to the database makes it possible to track the history of modifications made to it. This way, the database schema and the application's understanding of that structure can evolve together.</p>
<p>That concludes our discussion on database migration. I hope you found the information to be useful and informative. </p>
<p>If you enjoyed reading this article, please consider sharing it with your colleagues and friends on social media. Additionally, please follow me on <a target="_blank" href="https://twitter.com/RwiteshBera/">Twitter</a> for more updates on technology and coding. Thank you for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Migrate a Database in PHP Using Phinx ]]>
                </title>
                <description>
                    <![CDATA[ Building modern web applications usually involves a lot of data. Managing these data (databases) during development and production can be a lot.  This is especially true if there's more than one developer, and multiple environments where changes have... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/easy-database-migrations-in-php-using-phinx/</link>
                <guid isPermaLink="false">66c4c6964173ed342943d0ac</guid>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PHP ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Zubair Idris Aweda ]]>
                </dc:creator>
                <pubDate>Wed, 30 Mar 2022 23:59:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/03/0-ddWHLcHqIojSq_GO.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Building modern web applications usually involves a lot of data. Managing this data (and the databases that hold it) during development and production can be a lot of work. </p>
<p>This is especially true when there's more than one developer and multiple environments where changes have to be applied manually.</p>
<p>Database migrations help developers manage these changes easily, across multiple environments and developers. </p>
<p>This article explains:</p>
<ul>
<li>What database migrations are.</li>
<li>How to get started with database migrations in PHP using Phinx.</li>
<li>How to manage tables in your database.</li>
</ul>
<p>This article is meant for readers with basic PHP knowledge. It will help you learn to easily (and better) manage your databases.</p>
<h2 id="heading-what-are-database-migrations">What Are Database Migrations?</h2>
<p>In basic terms, migrations contain changes that you wish to make to your database. These changes could be creating or dropping a table, adding or removing some field(s) from a table, changing column types, and many more. </p>
<p>These files make it easy to make these same changes across multiple systems as anyone with the files can just run them, and have their database updated.</p>
<p>So in a real life scenario, some developer on the team could make a change to the <em>users</em> table to allow the <em>gender</em> field to accept more than the default <em>male</em> and <em>female</em> options, maybe a third <em>other</em> option. </p>
<p>After making this change, the developer creates a migration. This migration includes changes that they have made to the database – in this case a change to a column on a table – and other developers can easily get this change to their own local databases by running the migrations.</p>
<blockquote>
<p>Migrations are like version control for your database, allowing your team to define and share the application's database schema definition. If you have ever had to tell a teammate to manually add a column to their local database schema after pulling in your changes from source control, you've faced the problem that database migrations solve. - <a target="_blank" href="https://laravel.com/docs/9.x/">Laravel</a></p>
</blockquote>
<p>Many popular web frameworks already have support for migrations built in. But in this article, we explore using migrations in vanilla PHP.</p>
<p>Learn more about database migrations <a target="_blank" href="https://www.cloudbees.com/blog/database-migration">here</a>.</p>
<h2 id="heading-what-is-phinx">What Is Phinx?</h2>
<blockquote>
<p>Phinx is a PHP library that makes it ridiculously easy to manage the database migrations for your PHP app. - Phinx</p>
</blockquote>
<p>Phinx makes it possible to manage migrations easily regardless of whether you're using a PHP framework or not. It is also very easy to install (as we will see later on). </p>
<p>It ships with a couple of commands to make operations easier. It is fully customisable (you can do whatever you want with it 🙃). It also works in multiple environments, meaning you can have some production migrations, testing migrations, and dev migrations.</p>
<h2 id="heading-phinx-installation">Phinx Installation</h2>
<p>You can add Phinx to any PHP project using composer.</p>
<pre><code class="lang-bash">$ mkdir php-migrations
$ <span class="hljs-built_in">cd</span> php-migrations
$ composer init
</code></pre>
<p>The first command creates a folder in your current directory, <code>php-migrations</code>, and the second command moves into it. The last command starts an interactive shell.</p>
<p>Follow the prompt, filling in the details as required (the default values are fine). You can set the project description, author name (or contributors' names), minimum stability for dependencies, project type, license, and define your dependencies.</p>
<p>When you get to the dependencies part, install the <code>robmorgan/phinx</code> package as a dependency.</p>
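<p>As an aside, if you skip that prompt you can always pull the package in later with Composer's <code>require</code> command:</p>
<pre><code class="lang-bash">$ composer require robmorgan/phinx
</code></pre>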
<p>Accept the other defaults and proceed to generating the <code>composer.json</code> file. The generated file should look like this currently:</p>
<pre><code class="lang-php">{
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"zubair/php-migrations"</span>,
    <span class="hljs-string">"description"</span>: <span class="hljs-string">"A simple tutorial on how to use and manage migrations in PHP applications."</span>,
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"project"</span>,
    <span class="hljs-string">"require"</span>: {
        <span class="hljs-string">"robmorgan/phinx"</span>: <span class="hljs-string">"^0.12.10"</span>
    },
    <span class="hljs-string">"license"</span>: <span class="hljs-string">"ISC"</span>,
    <span class="hljs-string">"autoload"</span>: {
        <span class="hljs-string">"psr-4"</span>: {
            <span class="hljs-string">"Zubs\\"</span>: <span class="hljs-string">"src/"</span>
        }
    },
    <span class="hljs-string">"authors"</span>: [
        {
            <span class="hljs-string">"name"</span>: <span class="hljs-string">"Zubs"</span>,
            <span class="hljs-string">"email"</span>: <span class="hljs-string">"zubairidrisaweda@gmail.com"</span>
        }
    ]
}
</code></pre>
<h2 id="heading-init-phinx">Init Phinx</h2>
<p>After installing Phinx, you need to initialise it. You can do this very easily using its binary installed in the <code>vendor</code> folder.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx init
</code></pre>
<p>This creates Phinx's configuration file as a PHP file. It could be created as a JSON file too. I prefer JSON for configurations, so I will use the JSON format.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx init --format=json
</code></pre>
<p>Here's what the default configuration file looks like:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"paths"</span>: {
        <span class="hljs-attr">"migrations"</span>: <span class="hljs-string">"%%PHINX_CONFIG_DIR%%/db/migrations"</span>,
        <span class="hljs-attr">"seeds"</span>: <span class="hljs-string">"%%PHINX_CONFIG_DIR%%/db/seeds"</span>
    },
    <span class="hljs-attr">"environments"</span>: {
        <span class="hljs-attr">"default_migration_table"</span>: <span class="hljs-string">"phinxlog"</span>,
        <span class="hljs-attr">"default_environment"</span>: <span class="hljs-string">"development"</span>,
        <span class="hljs-attr">"production"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"production_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        },
        <span class="hljs-attr">"development"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"development_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        },
        <span class="hljs-attr">"testing"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"testing_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        }
    },
    <span class="hljs-attr">"version_order"</span>: <span class="hljs-string">"creation"</span>
}
</code></pre>
<p>In this configuration file, notice how Phinx expects that you have a <code>db/migrations</code> path (for your migrations) by default. You can change this if you want, but I think it's fine and I'll be keeping it.</p>
<pre><code class="lang-bash">$ mkdir db &amp;&amp; db/migrations
</code></pre>
<p>You can read more about these configurations in the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/configuration.html">official documentation</a>.</p>
<p>Phinx also ships with commands for different actions to make it easier to use in our projects.</p>
<h2 id="heading-how-to-create-a-migration">How to Create A Migration</h2>
<p>Phinx uses classes for its migrations. To create a new migration (say, one to create a <em>posts</em> table), use the <code>create</code> command with the name of the migration.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx create PostsTableMigration
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-28-at-13.22.17.png" alt="Image" width="600" height="400" loading="lazy">
<em>Creating A Migration</em></p>
<p>This creates a <code>20220328122134_posts_table_migration.php</code> file in the <code>db/migrations</code> directory created earlier. This file is named using the <code>YYYYMMDDHHMMSS_my_new_migration.php</code> format. In this format, the first 14 characters, <em>YYYYMMDDHHMMSS</em>, are representations of the current timestamp.</p>
<p>The <code>20220328122134_posts_table_migration.php</code> looks like this currently:</p>
<pre><code class="lang-php"><span class="hljs-meta">&lt;?php</span>
<span class="hljs-keyword">declare</span>(strict_types=<span class="hljs-number">1</span>);

<span class="hljs-keyword">use</span> <span class="hljs-title">Phinx</span>\<span class="hljs-title">Migration</span>\<span class="hljs-title">AbstractMigration</span>;

<span class="hljs-keyword">final</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PostsTableMigration</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AbstractMigration</span>
</span>{
    <span class="hljs-comment">/**
     * Change Method.
     *
     * Write your reversible migrations using this method.
     *
     * More information on writing migrations is available here:
     * https://book.cakephp.org/phinx/0/en/migrations.html#the-change-method
     *
     * Remember to call "create()" or "update()" and NOT "save()" when working
     * with the Table class.
     */</span>
    <span class="hljs-keyword">public</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">change</span>(<span class="hljs-params"></span>): <span class="hljs-title">void</span>
    </span>{

    }
}
</code></pre>
<p>This file (and all other migrations created using Phinx) extends the <code>Phinx\Migration\AbstractMigration</code> class. This class has all the methods you need to interact with your database.</p>
<p>This migration file also includes a <code>change</code> method. This method was introduced to Phinx in version 0.2.0 to implement Phinx's idea of reversible migrations. </p>
<p>These are migration files with just one method, <em>change</em>, that contains the logic for performing some action, leaving Phinx to figure out how to reverse it, rather than the traditional approach of using two methods, <em>up</em> and <em>down</em>, to apply and reverse changes.</p>
<blockquote>
<p>Phinx still allows you to use <em>up</em> and <em>down</em> methods. But when they are used together with <em>change</em>, the <em>change</em> method takes preference and the <em>up</em> and <em>down</em> methods are ignored.</p>
</blockquote>
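<p>For comparison, a rough sketch of the traditional style with explicit <em>up</em> and <em>down</em> methods might look like this (the table and column are only an example):</p>
<pre><code class="lang-php">public function up(): void
{
    // Apply the change: create the table.
    $this-&gt;table('posts')
        -&gt;addColumn('title', 'string', ['limit' =&gt; 20])
        -&gt;create();
}

public function down(): void
{
    // Reverse the change: drop the table again.
    $this-&gt;table('posts')-&gt;drop()-&gt;save();
}
</code></pre>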
<h2 id="heading-how-to-manage-tables">How to Manage Tables</h2>
<p>Tables are the basis on which structured databases are built and are the most important part of what Phinx offers. </p>
<p>You can easily manage database tables using PHP code with Phinx. Phinx offers a powerful <code>table()</code> method. This method retrieves an instance of the <em>Table</em> object.</p>
<h3 id="heading-how-to-create-a-table">How to Create a Table</h3>
<p>Creating a table is really easy using Phinx. You create a new instance of the <em>Table</em> object using the <code>table()</code> method with the table name.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);
</code></pre>
<p>Next, you can add columns with their settings.</p>
<pre><code class="lang-php">$table-&gt;addColumn(<span class="hljs-string">'title'</span>, <span class="hljs-string">'string'</span>, [<span class="hljs-string">'limit'</span> =&gt; <span class="hljs-number">20</span>])
    -&gt;addColumn(<span class="hljs-string">'body'</span>, <span class="hljs-string">'text'</span>)
    -&gt;addColumn(<span class="hljs-string">'cover_image'</span>, <span class="hljs-string">'string'</span>)
    -&gt;addTimestamps()
    -&gt;addIndex([<span class="hljs-string">'title'</span>], [<span class="hljs-string">'unique'</span> =&gt; <span class="hljs-literal">true</span>]);
</code></pre>
<p>Here, I've created columns <code>title</code>, <code>body</code>, <code>cover_image</code>, <code>created_at</code>, and <code>updated_at</code>. I also set the type of the <code>title</code> to be a <em>string</em> with 20 or fewer characters. </p>
<p>I set the <code>body</code> to be a text field, so it can hold long posts. The <code>cover_image</code> is also a <em>string</em> field that uses the default size of a string (255). </p>
<p>The <code>created_at</code> and <code>updated_at</code> fields are timestamps automatically generated in the <code>addTimestamps()</code> method. </p>
<p>Finally, I set the <code>title</code> field to be unique (as it would be in a real blog).</p>
<p>You can get all the available column types by checking the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/migrations.html#valid-column-types">Valid Column Types</a>. You can also get all the available column options by checking the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/migrations.html#valid-column-options">Valid Column Options</a>.</p>
<p>Finally, you can say that the table should be created by using the <code>create</code> method.</p>
<pre><code class="lang-php">$table-&gt;create();
</code></pre>
<p>In the end, your migration file's <em>change</em> method should look like this:</p>
<pre><code class="lang-php"><span class="hljs-keyword">public</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">change</span>(<span class="hljs-params"></span>): <span class="hljs-title">void</span>
</span>{
    $table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

    $table-&gt;addColumn(<span class="hljs-string">'title'</span>, <span class="hljs-string">'string'</span>, [<span class="hljs-string">'limit'</span> =&gt; <span class="hljs-number">20</span>])
        -&gt;addColumn(<span class="hljs-string">'body'</span>, <span class="hljs-string">'text'</span>)
        -&gt;addColumn(<span class="hljs-string">'cover_image'</span>, <span class="hljs-string">'string'</span>)
        -&gt;addTimestamps()
        -&gt;addIndex([<span class="hljs-string">'title'</span>], [<span class="hljs-string">'unique'</span> =&gt; <span class="hljs-literal">true</span>]);

     $table-&gt;create();
}
</code></pre>
<p>We can now run this migration to create our table.</p>
<h2 id="heading-how-to-run-migrations">How to Run Migrations</h2>
<p>After creating migrations, the next step is to apply the desired changes to the database. Running migrations does exactly that.</p>
<pre><code class="lang-php">$ ./vendor/bin/phinx migrate
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.54.56.png" alt="Image" width="600" height="400" loading="lazy">
<em>Running a migration</em></p>
<p>This image shows the result of the migration. You can see the time taken to run the migration.</p>
<p>Now, if you check your database GUI tool, you'll notice that the <em>posts</em> table was created with an additional field, the <em>id</em> field. This field is the primary key by default, and it auto-increments.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-19.00.27.png" alt="Image" width="600" height="400" loading="lazy">
<em>posts table.</em></p>
<p>You may change the primary key to some other key by either specifying some other field as the primary field, or by mapping the <em>id</em> field to the desired primary field. The latter includes the auto incrementing ability of the normal <em>id</em> field.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>, [
    <span class="hljs-string">'id'</span> =&gt; <span class="hljs-literal">false</span>,
    <span class="hljs-string">'primary_key'</span> =&gt; [<span class="hljs-string">'posts_key'</span>]
]);

$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>, [
    <span class="hljs-string">'id'</span> =&gt; <span class="hljs-string">'posts_key'</span>,
]);
</code></pre>
<p>In the first method, the primary key to be used has to be a column on the table (it is not auto-created).</p>
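<p>In other words, with the first approach you would also define that column yourself, roughly like this (the column name and type are just an example):</p>
<pre><code class="lang-php">$table = $this-&gt;table('posts', [
    'id' =&gt; false,
    'primary_key' =&gt; ['posts_key']
]);

// posts_key is not auto-created, so add it explicitly before creating the table
$table-&gt;addColumn('posts_key', 'integer')
    -&gt;create();
</code></pre>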
<p>You may also set the environment in which you want to run the migrations. </p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx migrate -e testing
</code></pre>
<h3 id="heading-how-to-reversemigrations">How to ReverseMigrations</h3>
<p>Migrations can be reversed by being <em>run down</em>. This is the reverse of migrating <em>up</em>. The table previously created will be dropped, columns added will be removed, and the database will be returned to its initial pre-migration state.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx rollback
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.56.46.png" alt="Image" width="600" height="400" loading="lazy">
<em>Reversing a migration</em></p>
<h3 id="heading-how-to-check-migration-status">How to Check Migration Status</h3>
<p>As your application size increases, it is expected that your database migrations will increase. Due to this, at some point, you may wish to check the status of your migrations, to know which have been run, and which have not.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx status
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.58.07.png" alt="Image" width="600" height="400" loading="lazy">
<em>Checking migration status</em></p>
<h3 id="heading-how-to-drop-a-table">How to Drop a Table</h3>
<p>To drop a table, call the <code>drop</code> method on the <em>Table</em> object, followed by the <code>save</code> method to persist the change.</p>
<pre><code class="lang-php"><span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>)-&gt;drop()-&gt;save();
</code></pre>
<h3 id="heading-how-to-rename-a-table">How to Rename a Table</h3>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

$table-&gt;rename(<span class="hljs-string">'articles'</span>)
    -&gt;update();
</code></pre>
<p>To rename a table, get the table, then use the <code>rename</code> method with the new name, followed by the <code>update</code> method to persist the change.</p>
<h3 id="heading-how-to-change-a-tables-primary-key">How to Change a Table's Primary Key</h3>
<p>You can also change a table's primary key very easily.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

$table-&gt;changePrimaryKey(<span class="hljs-string">'new_primary_key'</span>);

$table-&gt;update();
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Now you know how to set up migrations in your PHP applications.</p>
<p>If you have any questions or relevant advice, please get in touch with me to share them.</p>
<p>To read more of my articles or follow my work, you can connect with me on <a target="_blank" href="https://www.linkedin.com/in/idris-aweda-zubair-5433121a3/">LinkedIn</a>, <a target="_blank" href="https://twitter.com/AwedaIdris">Twitter</a>, and <a target="_blank" href="https://github.com/Zubs">Github</a>. It’s quick, it’s easy, and it’s free!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to migrate from Elasticsearch 1.7 to 6.8 with zero downtime ]]>
                </title>
                <description>
                    <![CDATA[ By dor sever My last task at BigPanda was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1. In this post, I will share how we migrated from Elasticsearch 1.6 to 6.8 with harsh constraints... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-migrate-from-elasticsearch-1-7-to-6-8-with-zero-downtime/</link>
                <guid isPermaLink="false">66d45e414a7504b7409c338a</guid>
                
                    <category>
                        <![CDATA[ availability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ elasticsearch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 25 Dec 2019 09:45:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/12/es-3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By dor sever</p>
<p>My last task at <a target="_blank" href="https://www.bigpanda.io">BigPanda</a> was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1.</p>
<p>In this post, I will share how we migrated from Elasticsearch 1.7 to 6.8 with harsh constraints like zero downtime, no data loss, and zero bugs. I'll also provide you with a script that does the migration for you.</p>
<p>This post contains 6 chapters (and one is optional):</p>
<ul>
<li>What’s in it for me? --&gt; What were the new features that led us to upgrade our version?</li>
<li>The constraints --&gt; What were our business requirements?</li>
<li>Problem solving --&gt; How did we address the constraints?</li>
<li>Moving forward --&gt; The plan.</li>
<li>[Optional chapter] --&gt; How did we handle the infamous mapping explosion problem?</li>
<li>Finally --&gt; How to do data migration between clusters.</li>
</ul>
<h1 id="heading-chapter-1-whats-in-it-for-me">Chapter 1 — What’s in it for me?</h1>
<p>What problems were we expecting to solve by upgrading our data store?</p>
<p>There were a couple of reasons:</p>
<ol>
<li>Performance and stability issues — We were experiencing a huge number of outages with long MTTR that caused us a lot of headaches. This was reflected in frequent high latencies, high CPU usage, and more issues.</li>
<li>Non-existent support in old Elasticsearch versions — We were missing some operative knowledge in Elasticsearch, and when we searched for outside consulting we were encouraged to migrate forward to receive support.</li>
<li>Dynamic mappings in our schema — Our current schema in Elasticsearch 1.7 used a feature called dynamic mappings that made our cluster <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/6.1/mapping.html#mapping-limit-settings">explode</a> multiple times. So we wanted to address this issue.</li>
<li>Poor visibility on our existing cluster — We wanted a better view under the hood and saw that later versions had great metrics exporting tools.</li>
</ol>
<h1 id="heading-chapter-2-the-constraints">Chapter 2 — The constraints</h1>
<ul>
<li>ZERO downtime migration — We have active users on our system, and we could not afford for the system to be down while we were migrating.</li>
<li>Recovery plan — We could not afford to “lose” or “corrupt” data, no matter the cost. So we needed to prepare a recovery plan in case our migration failed.</li>
<li>Zero bugs — We could not change existing search functionality for end-users.</li>
</ul>
<h1 id="heading-chapter-3-problem-solving-and-thinking-of-a-plan">Chapter 3 — Problem solving and thinking of a plan</h1>
<p>Let’s tackle the constraints from the simplest to the most difficult:</p>
<h2 id="heading-zero-bugs">Zero bugs</h2>
<p>In order to address this requirement, I studied all the possible requests the service receives and what its outputs were. Then I added unit-tests where needed.</p>
<p>In addition, I added multiple metrics (to the <code>Elasticsearch Indexer</code> and the <code>new Elasticsearch Indexer</code> ) to track latency, throughput, and performance, which allowed me to validate that we only improved them.</p>
<h2 id="heading-recovery-plan">Recovery plan</h2>
<p>This means that I needed to address the following situation: I deployed the new code to production and stuff was not working as expected. What could I do about it then?</p>
<p>Since I was working on a service that used <a target="_blank" href="https://www.youtube.com/watch?v=STKCRSUsyP0">event-sourcing</a>, I could add another listener (diagram attached below) and start writing to a new Elasticsearch cluster without affecting production.</p>
<h2 id="heading-zero-downtime-migration">Zero downtime migration</h2>
<p>The current service is in live mode and cannot be “deactivated” for periods longer than 5–10 minutes. The trick to getting this right is this:</p>
<ul>
<li>Store a log of all the actions your service is handling (we use Kafka in production)</li>
<li>Start the migration process offline (and keep track of the offset before you started the migration)</li>
<li>When the migration ends, start the new service against the log with the recorded offset and catch up the lag</li>
<li>When the lag finishes, change your frontend to query against the new service and you are done</li>
</ul>
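<p>As one way to handle the offset bookkeeping in step two, Kafka's own CLI can show a consumer group's current offsets just before you start the offline migration (the group name below is made up):</p>
<pre><code class="lang-bash"># Note the indexer's current offsets on the command topic before migrating
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group es-indexer
</code></pre>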
<h1 id="heading-chapter-4-the-plan">Chapter 4 — The plan</h1>
<p>Our current service uses the following architecture (based on message passing in Kafka):</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/indxr2.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<ol>
<li><code>Event topic</code> contains events produced by other applications (for example, <code>UserId 3 created</code>)</li>
<li><code>Command topic</code> contains the translation of these events into specific commands used by this application (for example: <code>Add userId 3</code>)</li>
<li>Elasticsearch 1.7 — The datastore of the <code>command Topic</code> read by the <code>Elasticsearch Indexer</code>.</li>
</ol>
<p>We planned to add another consumer (<code>new Elasticsearch Indexer</code>) to the <code>command topic</code>, which will read the same exact messages and write them in parallel to Elasticsearch 6.8.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/indxr.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<h1 id="heading-where-should-i-start">Where should I start?</h1>
<p>To be honest, I considered myself a newbie Elasticsearch user. To feel confident to perform this task, I had to think about the best way to approach this topic and learn it. A few things that helped were:</p>
<ol>
<li>Documentation — It’s an insanely useful resource for everything Elasticsearch. Take the time to read it and take notes (don’t miss: <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">Mapping</a> and <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">QueryDsl</a>).</li>
<li>HTTP API — everything under <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html">CAT</a> API. This was super useful to debug things locally and see how Elasticsearch responds (don’t miss: cluster health, cat indices, search, delete index).</li>
<li>Metrics (❤️) — From the first day, we configured a shiny new dashboard with lots of cool metrics (taken from <a target="_blank" href="https://github.com/justwatchcom/elasticsearch_exporter"><em>elasticsearch-exporter-for-Prometheus</em></a>) that helped and pushed us to understand more about Elasticsearch.</li>
</ol>
<h1 id="heading-the-code">The code</h1>
<p>Our codebase was using a library called <a target="_blank" href="https://github.com/sksamuel/elastic4s">elastic4s</a> and was using the oldest release available in the library — a really good reason to migrate! So the first thing to do was just to migrate versions and see what broke.</p>
<p>There are a few tactics for doing this code migration. The tactic we chose was to first restore the existing functionality in the new Elasticsearch version, without rewriting all the code from scratch. In other words, to reach existing functionality, but on a newer version of Elasticsearch.</p>
<p>Luckily for us, the code already contained almost full testing coverage so our task was much much simpler, and that took around 2 weeks of development time.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/you_need_some_tests_yo.jpg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>It's important to note that, if that wasn't the case, we would have had to invest some time in filling that coverage up. Only then would we be able to migrate since one of our constraints was to not break existing functionality.</em></p>
<h1 id="heading-chapter-5-the-mapping-explosion-problem">Chapter 5 — The mapping explosion problem</h1>
<p>Let’s describe our use-case in more detail. This is our model:</p>
<p><code>class InsertMessageCommand(tags: Map[String,String])</code></p>
<p>And for example, an instance of this message would be:</p>
<p><code>new InsertMessageCommand(Map("name"-&gt;"dor","lastName"-&gt;"sever"))</code></p>
<p>And given this model, we needed to support the following query requirements:</p>
<ol>
<li>Query by value</li>
<li>Query by tag name and value</li>
</ol>
<p>The way this was modeled in our Elasticsearch 1.7 schema was using a dynamic template schema (since the tag keys are dynamic and cannot be modeled in advance).</p>
<p>The dynamic template caused us multiple outages due to the mapping explosion problem, and the schema looked like this:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/_template/my_template?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d <span class="hljs-string">'
{
    "index_patterns": [
        "your-index-names*"
    ],
    "mappings": {
            "_doc": {
                "dynamic_templates": [
                    {
                        "tags": {
                            "mapping": {
                                "type": "text"
                            },
                            "path_match": "actions.tags.*"
                        }
                    }
                ]
            }
        },
    "aliases": {}
}'</span>  

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "John",
        "lname" : "Smith"
    }
  }
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/2?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "Dor",
        "lname" : "Sever"
  }
}
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/3?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "AnotherName",
        "lname" : "AnotherLastName"
  }
}
}
'</span>
</code></pre>
<pre><code class="lang-bash">
curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "match" : {
            "actions.tags.name" : {
                "query" : "John"
            }
        }
    }
}
'</span>
<span class="hljs-comment"># returns 1 match(doc 1)</span>


curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "match" : {
            "actions.tags.lname" : {
                "query" : "John"
            }
        }
    }
}
'</span>
<span class="hljs-comment"># returns zero matches</span>

<span class="hljs-comment"># search by value</span>
curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "query_string" : {
            "fields": ["actions.tags.*" ],
            "query" : "Dor"
        }
    }
}
'</span>
</code></pre>
<h2 id="heading-nested-documents-solution">Nested documents solution</h2>
<p>Our first instinct in solving the mapping explosion problem was to use nested documents.</p>
<p>We read the nested data type tutorial in the Elastic docs and defined the following schema and queries:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/my_index?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
        "mappings": {
            "_doc": {
            "properties": {
            "tags": {
                "type": "nested" 
                }                
            }
        }
        }
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/my_index/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "tags" : [
    {
      "key" : "John",
      "value" :  "Smith"
    },
    {
      "key" : "Alice",
      "value" :  "White"
    }
  ]
}
'</span>


<span class="hljs-comment"># Query by tag key and value</span>
curl -X GET <span class="hljs-string">"localhost:9200/my_index/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
    "nested": {
      "path": "tags",
      "query": {
        "bool": {
          "must": [
            { "match": { "tags.key": "Alice" }},
            { "match": { "tags.value":  "White" }} 
          ]
        }
      }
    }
  }
}
'</span>

<span class="hljs-comment"># Returns 1 document</span>


curl -X GET <span class="hljs-string">"localhost:9200/my_index/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
    "nested": {
      "path": "tags",
      "query": {
        "bool": {
          "must": [
            { "match": { "tags.value":  "Smith" }} 
          ]
        }
      }
    }
  }
}
'</span>

<span class="hljs-comment"># Query by tag value</span>
<span class="hljs-comment"># Returns 1 result</span>
</code></pre>
<p>And this solution worked. However, when we tried to insert real customer data we saw that the number of documents in our index increased by around 500 times.</p>
<p>We thought about the following problems and went on to find a better solution:</p>
<ol>
<li>The number of documents we had in our cluster was around 500 million. This meant that, with the new schema, we were going to reach two hundred and fifty billion documents (that's 250,000,000,000 documents!).</li>
<li>We read this really good blog post — <a target="_blank" href="https://blog.gojekengineering.com/elasticsearch-the-trouble-with-nested-documents-e97b33b46194">https://blog.gojekengineering.com/elasticsearch-the-trouble-with-nested-documents-e97b33b46194</a> which highlights that nested documents can cause high latency in queries and heap usage problems.</li>
<li>Testing — Since we were converting 1 document in the old cluster to an unknown number of documents in the new cluster, it would be much harder to track if the migration process worked without any data loss. If our conversion was 1:1, we could assert that the count in the old cluster equalled the count in the new cluster.</li>
</ol>
<h2 id="heading-avoiding-nested-documents">Avoiding nested documents</h2>
<p>The real trick in this was to focus on what supported queries we were running: search by tag value, and search by tag key and value.</p>
<p>The first query does not require nested documents since it works on a single field. For the latter, we did the following trick: we created a field that contains the combination of the key and the value. Whenever a user queries for a key and value match, we translate their request into the corresponding concatenated text and query against that field.</p>
<p>Example:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/my_index_2?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "mappings": {
        "_doc": {
            "properties": {
                "tags": {
                    "type": "object",
                    "properties": {
                        "keyToValue": {
                            "type": "keyword"
                        },
                        "value": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }
    }
}
'</span>


curl -X PUT <span class="hljs-string">"localhost:9200/my_index_2/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "tags" : [
    {
      "keyToValue" : "John:Smith",
      "value" : "Smith"
    },
    {
      "keyToValue" : "Alice:White",
      "value" : "White"
    }
  ]
}
'</span>

<span class="hljs-comment"># Query by key,value</span>
<span class="hljs-comment"># User queries for key: Alice, and value : White , we then query elastic with this query:</span>

curl -X GET <span class="hljs-string">"localhost:9200/my_index_2/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
        "bool": {
          "must": [ { "match": { "tags.keyToValue": "Alice:White" }}]
  }}}
'</span>

<span class="hljs-comment"># Query by value only</span>
curl -X GET <span class="hljs-string">"localhost:9200/my_index_2/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
        "bool": {
          "must": [ { "match": { "tags.value": "White" }}]
  }}}
'</span>
</code></pre>
<h1 id="heading-chapter-6-the-migration-process">Chapter 6 — The migration process</h1>
<p>We planned to migrate about 500 million documents with zero downtime. To do that we needed:</p>
<ol>
<li>A strategy on how to transfer data from the old Elastic to the new Elasticsearch</li>
<li>A strategy on how to close the lag between the start of the migration and the end of it</li>
</ol>
<p>And our two options in closing the lag:</p>
<ol>
<li>Our messaging system is Kafka based. We could have just taken the current offset before the migration started, and after the migration ended, start consuming from that specific offset. This solution requires some manual tweaking of offsets and some other stuff, but will work.</li>
<li>Another approach to solving this issue was to start consuming messages from the beginning of the topic in Kafka and make our actions on Elasticsearch idempotent — meaning, if the change was “applied” already, nothing would change in Elastic store.</li>
</ol>
<p>The requests made by our service against Elastic were already idempotent, so we chose option 2 because it required zero manual work (no need to take specific offsets, and then set them afterward in a new consumer group).</p>
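<p>To make the idempotency point concrete: indexing with an explicit document ID means a replayed command simply overwrites the same document instead of creating a duplicate. The index name and ID below are made up:</p>
<pre><code class="lang-bash"># Running this twice leaves the index in the same state as running it once
curl -X PUT "localhost:9200/users_index/_doc/user-3?pretty" -H 'Content-Type: application/json' -d'
{
  "userId": 3
}
'
</code></pre>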
<h2 id="heading-how-can-we-migrate-the-data">How can we migrate the data?</h2>
<p>These were the options we thought of:</p>
<ol>
<li>If our Kafka contained all messages from the beginning of time, we could just replay from the start and the end state would be equal. But since we apply retention to our topics, this was not an option.</li>
<li>Dump messages to disk and then ingest them to Elastic directly – This solution looked kind of weird. Why store them in disk instead of just writing them directly to Elastic?</li>
<li>Transfer messages from the old Elastic to the new Elastic — This meant writing some sort of “script” (did anyone say Python?) that would connect to the old Elasticsearch cluster, query for items, transform them to the new schema, and index them in the new cluster.</li>
</ol>
<p>We chose the last option. These were the design choices we had in mind:</p>
<ol>
<li>Let’s not try to think about error handling unless we need to. Let’s try to write something super simple, and if errors occur, let’s try to address them. In the end, we did not need to address this issue since no errors occurred during the migration.</li>
<li>It’s a one-off operation, so whatever works first / KISS.</li>
<li>Metrics — Since the migration processes can take hours to days, we wanted the ability from day 1 to be able to monitor the error count and to track the current progress and copy rate of the script.</li>
</ol>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/python.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We thought long and hard and chose Python as our weapon of choice. The final version of the code is below:</p>
<pre><code class="lang-yml"><span class="hljs-string">dictor==0.1.2</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">copy</span> <span class="hljs-string">and</span> <span class="hljs-string">transform</span> <span class="hljs-string">our</span> <span class="hljs-string">Elasticsearch</span> <span class="hljs-string">documentselasticsearch==1.9.0</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">connect</span> <span class="hljs-string">to</span> <span class="hljs-string">"old"</span> <span class="hljs-string">Elasticsearchelasticsearch6==6.4.2</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">connect</span> <span class="hljs-string">to</span> <span class="hljs-string">the</span> <span class="hljs-string">"new"</span> <span class="hljs-string">Elasticsearchstatsd==3.3.0</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">report</span> <span class="hljs-string">metrics</span>
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> elasticsearch <span class="hljs-keyword">import</span> Elasticsearch
<span class="hljs-keyword">from</span> elasticsearch6 <span class="hljs-keyword">import</span> Elasticsearch <span class="hljs-keyword">as</span> Elasticsearch6
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">from</span> elasticsearch.helpers <span class="hljs-keyword">import</span> scan
<span class="hljs-keyword">from</span> elasticsearch6.helpers <span class="hljs-keyword">import</span> parallel_bulk
<span class="hljs-keyword">import</span> statsd

ES_SOURCE = Elasticsearch(sys.argv[<span class="hljs-number">1</span>])
ES_TARGET = Elasticsearch6(sys.argv[<span class="hljs-number">2</span>])
INDEX_SOURCE = sys.argv[<span class="hljs-number">3</span>]
INDEX_TARGET = sys.argv[<span class="hljs-number">4</span>]
QUERY_MATCH_ALL = {<span class="hljs-string">"query"</span>: {<span class="hljs-string">"match_all"</span>: {}}}
SCAN_SIZE = <span class="hljs-number">1000</span>
SCAN_REQUEST_TIMEOUT = <span class="hljs-string">'3m'</span>
REQUEST_TIMEOUT = <span class="hljs-number">180</span>
MAX_CHUNK_BYTES = <span class="hljs-number">15</span> * <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>
RAISE_ON_ERROR = <span class="hljs-literal">False</span>
BULK_SIZE = <span class="hljs-number">500</span>  <span class="hljs-comment"># chunk size for parallel_bulk (the exact value is an assumption)</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform_item</span>(<span class="hljs-params">item, index_target</span>):</span>
    <span class="hljs-comment"># implement your logic transformation here</span>
    transformed_source_doc = item.get(<span class="hljs-string">"_source"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"_index"</span>: index_target,
            <span class="hljs-string">"_type"</span>: <span class="hljs-string">"_doc"</span>,
            <span class="hljs-string">"_id"</span>: item[<span class="hljs-string">'_id'</span>],
            <span class="hljs-string">"_source"</span>: transformed_source_doc}


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transformedStream</span>(<span class="hljs-params">es_source, match_query, index_source, index_target, transform_logic_func</span>):</span>
    <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> scan(es_source, query=match_query, index=index_source, size=SCAN_SIZE,
                     timeout=SCAN_REQUEST_TIMEOUT):
        <span class="hljs-keyword">yield</span> transform_logic_func(item, index_target)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">index_source_to_target</span>(<span class="hljs-params">es_source, es_target, match_query, index_source, index_target, bulk_size, statsd_client,
                           logger, transform_logic_func</span>):</span>
    ok_count = <span class="hljs-number">0</span>
    fail_count = <span class="hljs-number">0</span>
    count_response = es_source.count(index=index_source, body=match_query)
    count_result = count_response[<span class="hljs-string">'count'</span>]
    statsd_client.gauge(stat=<span class="hljs-string">'elastic_migration_document_total_count,index={0},type=success'</span>.format(index_target),
                        value=count_result)
    <span class="hljs-keyword">with</span> statsd_client.timer(<span class="hljs-string">'elastic_migration_time_ms,index={0}'</span>.format(index_target)):
        actions_stream = transformedStream(es_source, match_query, index_source, index_target, transform_logic_func)
        <span class="hljs-keyword">for</span> (ok, item) <span class="hljs-keyword">in</span> parallel_bulk(es_target,
                                        chunk_size=bulk_size,
                                        max_chunk_bytes=MAX_CHUNK_BYTES,
                                        actions=actions_stream,
                                        request_timeout=REQUEST_TIMEOUT,
                                        raise_on_error=RAISE_ON_ERROR):
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok:
                logger.error(<span class="hljs-string">"got error on index {} which is : {}"</span>.format(index_target, item))
                fail_count += <span class="hljs-number">1</span>
                statsd_client.incr(<span class="hljs-string">'elastic_migration_document_count,index={0},type=failure'</span>.format(index_target),
                                   <span class="hljs-number">1</span>)
            <span class="hljs-keyword">else</span>:
                ok_count += <span class="hljs-number">1</span>
                statsd_client.incr(<span class="hljs-string">'elastic_migration_document_count,index={0},type=success'</span>.format(index_target),
                                   <span class="hljs-number">1</span>)

    <span class="hljs-keyword">return</span> ok_count, fail_count


statsd_client = statsd.StatsClient(host=<span class="hljs-string">'localhost'</span>, port=<span class="hljs-number">8125</span>)
logger = logging.getLogger(<span class="hljs-string">"elastic_migration"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    logging.basicConfig(level=logging.INFO)
    index_source_to_target(ES_SOURCE, ES_TARGET, QUERY_MATCH_ALL, INDEX_SOURCE, INDEX_TARGET, BULK_SIZE,
                           statsd_client, logger, transform_item)
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Migrating data in a live production system is a complicated task that requires a lot of attention and careful planning. I recommend taking the time to work through the steps listed above and figure out what works best for your needs.</p>
<p>As a rule of thumb, always try to reduce your requirements as much as possible. For example, is a zero-downtime migration required? Can you afford data loss?</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/enjoy-the-ride.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Upgrading data stores is usually a marathon and not a sprint, so take a deep breath and try to enjoy the ride.</p>
<ul>
<li>The whole process listed above took me around 4 months of work</li>
<li>All of the Elasticsearch examples that appear in this post have been tested against version 6.8.1</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to import Google BigQuery tables to AWS Athena ]]>
                </title>
                <description>
                    <![CDATA[ By Aftab Ansari As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is somethi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539/</link>
                <guid isPermaLink="false">66c352c5c2631756f9f063d7</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 11 Mar 2019 18:55:49 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*518Z4MAe36ZqeLX2LTzeDA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aftab Ansari</p>
<p>As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is something you might have already faced or will face at some point.</p>
<p>In this post, I will show how I imported Google BigQuery tables to AWS Athena. If you only need a list of tools to be used with some very high-level guidance, you can quickly look at this <a target="_blank" href="https://amazon-aws-big-data-demystified.ninja/2018/05/27/how-to-export-data-from-google-big-query-into-aws-s3-emr-hive/">post that shows how to import a single BigQuery table into Hive metastore</a>. In this article, I will show one way of importing a full BigQuery project (multiple tables) into both Hive and Athena metastore.</p>
<p>There are a few export limitations: for example, when you export data from partitioned tables, you cannot export individual partitions. Please check the <a target="_blank" href="https://cloud.google.com/bigquery/docs/exporting-data">limitations</a> before starting the process.</p>
<p>In order to successfully import Google BigQuery tables to Athena, I performed the steps shown below. I used AVRO format when dumping data and the schemas from Google BigQuery and loading them into AWS Athena.</p>
<p><a class="post-section-overview" href="#264f">Step 1. Dump BigQuery data to Google Cloud Storage</a></p>
<p><a class="post-section-overview" href="#3af9">Step 2. Transfer data from Google Cloud Storage to AWS S3</a></p>
<p><a class="post-section-overview" href="#c089">Step 3. Extract AVRO schema from AVRO files stored in S3</a></p>
<p><a class="post-section-overview" href="#cc2d">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</a></p>
<p><a class="post-section-overview" href="#fbf3">Step 5. Extract Hive table definition from Hive tables</a></p>
<p><a class="post-section-overview" href="#c6f6">Step 6. Use the output of Step 3 and 5 to create Athena tables</a></p>
<p>So why do I have to create Hive tables in the first place although the end goal is to have data in Athena? This is because:</p>
<ul>
<li>Athena does not support using <code>avro.schema.url</code> to specify table schema.</li>
<li>Athena requires you to explicitly specify field names and their data types in CREATE statement.</li>
<li>Athena also requires the AVRO schema in JSON format under <code>avro.schema.literal</code>.</li>
<li>You can check this AWS <a target="_blank" href="https://docs.aws.amazon.com/athena/latest/ug/avro.html">doc</a> for more details.</li>
</ul>
<p>So, Hive tables can be created directly by pointing to AVRO schema files stored on S3. But to have the same in Athena, columns and schema are required in the CREATE TABLE statement.</p>
<p>One way to overcome this is to first extract schema from AVRO data to be supplied as <code>avro.schema.literal</code> . Second, for field names and data types required for CREATE statement, create Hive tables based on AVRO schemas stored in S3 and use <code>SHOW CREATE TABLE</code> to dump/export Hive table definitions which contain field names and datatypes. Finally, create Athena tables by combining the extracted AVRO schema and Hive table definition. I will discuss in detail in subsequent sections.</p>
<p>For the demonstration, I have the following BigQuery tables that I would like to import to Athena.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/klOvEMVXS8X9k5YaxGkacXgjhdsxMGwmnupj" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So, let’s get started!</p>
<h3 id="heading-step-1-dump-bigquery-data-to-google-cloud-storage">Step 1. Dump BigQuery data to Google Cloud Storage</h3>
<p>It is possible to dump BigQuery data in Google storage with the help of the Google cloud UI. However, this can become a tedious task if you have to dump several tables manually.</p>
<p>To tackle this problem, I used Google Cloud Shell. In Cloud Shell, you can combine regular shell scripting with BigQuery commands and dump multiple tables relatively fast. You can activate Cloud Shell as shown in the picture below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/o3Vj-2DY5je5jS1wZeE84Dh6OuO37CrR106-" alt="Image" width="600" height="400" loading="lazy"></p>
<p>From Cloud Shell, the following operation provides the BigQuery <code>extract</code> commands to dump each table of the “backend” dataset to Google Cloud Storage.</p>
<pre><code>bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY &lt;dataset&gt;.@ gs:<span class="hljs-comment">//&lt;bucket&gt;@</span>
</code></pre><p>In my case it prints:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_daily_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_detailed_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-*.avro</span>
</code></pre><p>Please note the <code>--compression SNAPPY</code> flag: it is important, as large uncompressed files can cause the <code>gsutil</code> command (used later to transfer data to AWS S3) to get stuck. The wildcard (<strong>*</strong>) makes <code>bq extract</code> split bigger tables (&gt;1GB) into multiple output files. Running those commands in Cloud Shell copies the data to the following Google Storage directory.</p>
<pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/table_name/table_name-*.avro</span>
</code></pre><p>Let’s do <code>ls</code> to see the dumped AVRO file.</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil ls gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2</span>
</code></pre><pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>I can also browse from the UI and find the data like shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/bdhhLC9Dyuv59VpLeMhLhHNm0Ul-sAe7L8f8" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-transfer-data-from-google-cloud-storage-to-aws-s3">Step 2. Transfer data from Google Cloud Storage to AWS S3</h3>
<p>Transferring data from Google Storage to AWS S3 is straightforward. First, set up your S3 credentials. On Cloud Shell, create or edit <code>.boto</code> file ( <code>vi ~/.boto</code>) and add these:</p>
<pre><code>[Credentials]
aws_access_key_id = &lt;your aws access key ID&gt;
aws_secret_access_key = &lt;your aws secret access key&gt;
[s3]
host = s3.us-east-1.amazonaws.com
use-sigv4 = True
</code></pre><p>Please note: <strong>s3.us-east-1.amazonaws.com</strong> has to correspond with the region where the bucket is.</p>
<p>After setting up the credentials, execute <code>gsutil</code> to transfer data from Google Storage to AWS S3. For example:</p>
<pre><code>gsutil rsync -r gs:<span class="hljs-comment">//your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schema</span>
</code></pre><p>Add the <strong><em>-n</em></strong> flag to the command above to display the operations that would be performed using the specified command without actually running them.</p>
<p>In this case, to transfer the data to S3, I used the following:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil rsync -r gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend</span>
</code></pre><pre><code>Building synchronization state…
Starting synchronization…
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-000000000000.avro [Content-Type=application/octet-stream]...
| [3 files][987.8 KiB/987.8 KiB]
Operation completed over 3 objects/987.8 KiB.
</code></pre><p>Let’s check if the data got transferred to S3. I verified that from my local machine:</p>
<pre><code>aws s3 ls --recursive  s3:<span class="hljs-comment">//my-bucket/bq_data/backend --profile smoke | awk '{print $4}'</span>
</code></pre><pre><code>bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro
bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro
</code></pre><h3 id="heading-step-3-extract-avro-schema-from-avro-files-stored-in-s3">Step 3. Extract AVRO schema from AVRO files stored in S3</h3>
<p>To extract schema from AVRO data, you can use the Apache <code>avro-tools-&lt;version&gt;.jar</code> with the <code>getschema</code> parameter. The benefit of using this tool is that it returns the schema in the form you can use directly in the <code>WITH SERDEPROPERTIES</code> statement when creating Athena tables.</p>
<p>You noticed I got only one <code>.avro</code> file per table when dumping BigQuery tables. This was because of small data volume — otherwise, I would have gotten several files per table. Regardless of single or multiple files per table, it’s enough to run avro-tools against any single file per table to extract that table’s schema.</p>
<p>I downloaded the latest version of avro-tools which is <code>avro-tools-1.8.2.jar</code>. I first copied all <code>.avro</code> files from s3 to local disk:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp s3:<span class="hljs-comment">//my-bucket/bq_data/backend/ bq_data/backend/ --recursive</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro to bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro to bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro to bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>The avro-tools command should look like <code>java -jar avro-tools-1.8.2.jar getschema your_data.avro &gt; schema_file.avsc</code>. This can become tedious if you have several AVRO files (in reality, I’ve done this for a project with many more tables). Again, I used a shell script to generate the commands. I created <code>extract_schema_avro.sh</code> with the following content:</p>
<pre><code>schema_avro=(bq_data/backend/*)
for i in ${!schema_avro[*]}; do
  input_file=$(find ${schema_avro[$i]} -type f)
  output_file=$(ls -l ${schema_avro[$i]} | tail -n+2 \
    | awk -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9);
    print $9 }')
  commands=$(
    echo "java -jar avro-tools-1.8.2.jar getschema " \
      $input_file" &gt; bq_data/schemas/backend/avro/"$output_file
  )
  echo $commands
done
</code></pre><p>Running <code>extract_schema_avro.sh</code> provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh extract_schema_avro.sh
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_daily_phase2/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_phase2/sessions_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><p>Executing the above commands copies the extracted schema under <code>bq_data/schemas/backend/avro/</code> :</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/schemas/backend/avro<span class="hljs-comment">/* | awk '{print $9}'</span>
</code></pre><pre><code>bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
</code></pre><p>Let’s also check what’s inside an <code>.avsc</code> file.</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>{
  "type" : "record",
  "name" : "Root",
  "fields" : [ {
    "name" : "uid",
    "type" : [ "null", "string" ]
  }, {
    "name" : "platform",
    "type" : [ "null", "string" ]
  }, {
    "name" : "version",
    "type" : [ "null", "string" ]
  }, {
    "name" : "country",
    "type" : [ "null", "string" ]
  }, {
    "name" : "sessions",
    "type" : [ "null", "long" ]
  }, {
    "name" : "active_days",
    "type" : [ "null", "long" ]
  }, {
    "name" : "session_time_minutes",
    "type" : [ "null", "double" ]
  } ]
}
</code></pre><p>As you can see, the schema is in the form that can be directly used in Athena <code>WITH SERDEPROPERTIES</code>. But before Athena, I used the AVRO schemas to create Hive tables. If you want to avoid Hive table creation, you can read the <code>.avsc</code> files to extract field names and data types, but then you have to map the data types yourself from AVRO format to Athena table creation DDL.</p>
<p>The complexity of the mapping task depends on how complex the data types are in your tables. For simplicity (and to cover most simple to complex data types), I let Hive do the mapping for me. So I created the tables first in Hive metastore. Then I used <code>SHOW CREATE TABLE</code> to get the field names and data types part of the DDL.</p>
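<p>For reference, if you did want to skip the Hive step, the manual mapping could be sketched roughly like below. It only handles the simple <code>["null", type]</code> unions and primitive types that appear in the schemas above; the type table and file name are assumptions for illustration, and nested or logical types would need extra handling (which is exactly why I let Hive do it).</p>
<pre><code class="lang-python">import json

# Minimal AVRO-to-Athena type table (an assumption covering only the
# primitive types present in the .avsc files shown above).
AVRO_TO_ATHENA = {"string": "string", "long": "bigint", "double": "double",
                  "int": "int", "float": "float", "boolean": "boolean"}


def columns_from_avsc(path):
    with open(path) as schema_file:
        schema = json.load(schema_file)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # Fields are declared as ["null", "string"]-style unions in these schemas.
        if isinstance(avro_type, list):
            avro_type = [t for t in avro_type if t != "null"][0]
        columns.append("`{0}` {1}".format(field["name"], AVRO_TO_ATHENA[avro_type]))
    return ",\n  ".join(columns)


print(columns_from_avsc("sessions_detailed_phase2-000000000000.avsc"))
</code></pre>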
<h3 id="heading-step-4-create-hive-tables-on-top-of-avro-data-use-schema-from-step-3">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</h3>
<p>As discussed earlier, Hive allows creating tables by using <code>avro.schema.url</code>. So once you have schema (<code>.avsc</code> file) extracted from AVRO data, you can create tables as follows:</p>
<pre><code>CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://your-aws-bucket/your-target-path/avro_data'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://your-aws-bucket/your-target-path/your-avro-schema'</span>);
</code></pre><p>First, upload the extracted schemas to S3 so that <code>avro.schema.url</code> can refer to their S3 locations:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp bq_data/schemas s3:<span class="hljs-comment">//my-bucket/bq_data/schemas --recursive</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc</span>
</code></pre><p>After having both AVRO data and schema in S3, DDL for Hive table can be created using the template shown at the beginning of this section. I used another shell script <code>create_tables_hive.sh</code> (shown below) to cover any number of tables:</p>
<pre><code>schema_avro=$(ls -l bq_data/backend | tail -n+2 | awk '{print $9}')
for table_name in $schema_avro; do
  file_name=$(ls -l bq_data/backend/$table_name | tail -n+2 | awk \
    -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9); print $9 }')
  table_definition=$(
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS backend."$table_name"\\nSTORED AS AVRO""\\nLOCATION 's3://my-bucket/bq_data/backend/"$table_name"'""\\nTBLPROPERTIES('avro.schema.url'='s3://my-bucket/bq_data/\
schemas/backend/avro/"$file_name"');"
  )
  printf "\n$table_definition\n"
done
</code></pre><p>Running the script provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh create_tables_hive.sh
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
</code></pre><p>I ran the above on Hive console to actually create the Hive tables:</p>
<pre><code>[hadoop@ip-10-0-10-205 tmpAftab]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
</code></pre><pre><code>hive&gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
    &gt; STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
OK
Time taken: 4.24 seconds
</code></pre><pre><code>hive&gt;
    &gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2 STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
    &gt; TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
OK
Time taken: 0.563 seconds
</code></pre><pre><code>hive&gt;
    &gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
    &gt; STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
OK
Time taken: 0.386 seconds
</code></pre><p>So I have created the Hive tables successfully. To verify that the tables work, I ran this simple query:</p>
<pre><code>hive&gt; select count(*) from backend.sessions_detailed_phase2;
Query ID = hadoop_20190214152548_2316cb5b-29f1-4416-922e-a6ff02ec1775
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1550010493995_0220)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================&gt;&gt;] 100%  ELAPSED TIME: 8.17 s
----------------------------------------------------------------------------------------------
OK
6130
</code></pre><p>So it works!</p>
<h3 id="heading-step-5-extract-hive-table-definition-from-hive-tables">Step 5. Extract Hive table definition from Hive tables</h3>
<p>As discussed earlier, Athena requires you to explicitly specify field names and their data types in the <code>CREATE</code> statement. In Step 3, I extracted the AVRO schema, which can be used in <code>WITH SERDEPROPERTIES</code> of the Athena table DDL, but I also have to specify all the field names and their (Hive) data types. Now that I have the tables in the Hive metastore, I can easily get those by running <code>SHOW CREATE TABLE</code>. First, prepare the Hive DDL queries for all tables:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/backend | tail -n+<span class="hljs-number">2</span> | awk <span class="hljs-string">'{print "hive -e '</span>\<span class="hljs-string">''</span>SHOW CREATE TABLE backend.<span class="hljs-string">"$9"</span><span class="hljs-string">'\''</span> &gt; bq_data/schemas/backend/hql/backend.<span class="hljs-string">"$9"</span>.hql;<span class="hljs-string">" }'</span>
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_daily_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_daily_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_detailed_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_phase2.hql;
</code></pre><p>Executing the above commands copies Hive table definitions under <code>bq_data/schemas/backend/hql/</code>. Let’s see what’s inside:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES (
  'avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc',
  'transient_lastDdlTime'='1550157659')
</code></pre><p>By now all the building blocks needed for creating AVRO tables in Athena are there:</p>
<ul>
<li>Field names and data types can be obtained from the Hive table DDL (to be used in columns section of <code>CREATE</code> statement)</li>
<li>AVRO schema (JSON) can be obtained from the extracted <code>.avsc</code> files (to be supplied in <code>WITH SERDEPROPERTIES</code>).</li>
</ul>
<h3 id="heading-step-6-use-the-output-of-steps-3-and-5-to-create-athena-tables">Step 6. Use the output of Steps 3 and 5 to Create Athena tables</h3>
<p>If you are still with me, you have done a great job coming this far. I am now going to perform the final step which is creating Athena tables. I used the following script to combine <code>.avsc</code> and <code>.hql</code> files to construct Athena table definitions:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat create_tables_athena.sh
</code></pre><pre><code># directory where extracted avro schemas are stored
schema_avro=(bq_data/schemas/backend/avro/*)
# directory where extracted HQL schemas are stored
schema_hive=(bq_data/schemas/backend/hql/*)
for i in ${!schema_avro[*]}; do
  schema=$(awk -F '{print $0}' '/CREATE/{flag=1}/STORED/{flag=0}\
   flag' ${schema_hive[$i]})
  location=$(awk -F '{print $0}' '/LOCATION/{flag=1; next}\
  /TBLPROPERTIES/{flag=0} flag' ${schema_hive[$i]})
  properties=$(cat ${schema_avro[$i]})
  table=$(echo $schema '\n' \
    "WITH SERDEPROPERTIES ('avro.schema.literal'='\n"$properties \
    "\n""')STORED AS AVRO \n" \
    "LOCATION" $location";\n\n")
  printf "\n$table\n"
done \
  &gt; bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql
</code></pre><p>Running the above script copies Athena table definitions to <code>bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql</code>. In my case it contains:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> all_athena_tables]$ cat all_athena_tables.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_daily_phase2`(
  `uid` string COMMENT '',
  `activity_date` string COMMENT '',
  `sessions` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "activity_date", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2';
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "platform", "type" : [ "null", "string" ] }, { "name" : "version", "type" : [ "null", "string" ] }, { "name" : "country", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] } ')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2';
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_phase2`(
  `uid` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2';
</code></pre><p>And finally, I ran the above scripts in Athena to create the tables:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9TNErBinL98R9qH9pasv6a99jQWPNANLXPKP" alt="Image" width="600" height="400" loading="lazy"></p>
<p>There you have it.</p>
<p>I feel that the process is a bit lengthy. However, this has worked well for me. The other approach would be to use AWS Glue wizard to crawl the data and infer the schema. If you have used AWS Glue wizard, please share your experience in the comment section below.</p>
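<p>For completeness, a Glue crawler over the same S3 prefix can also be created programmatically. The snippet below is a rough sketch using boto3; the crawler name, IAM role, Glue database, and S3 path are placeholders rather than values from this walkthrough, and the role must already have the necessary Glue and S3 permissions.</p>
<pre><code class="lang-python">import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholders: the role ARN and the "backend" Glue database are assumed to exist.
glue.create_crawler(
    Name="bq-backend-avro-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-service-role",
    DatabaseName="backend",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/bq_data/backend/"}]},
)

# Run the crawler; the inferred tables then appear in the Glue Data Catalog,
# which Athena queries directly.
glue.start_crawler(Name="bq-backend-avro-crawler")
</code></pre>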
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
