<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ data migration - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ data migration - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 05 May 2026 22:20:39 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/data-migration/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Perform Database Migrations using Go Migrate ]]>
                </title>
                <description>
                    <![CDATA[ By Rwitesh Bera Since its introduction, the programming language Go (also known as Golang) has become increasingly popular. It is known for its simplicity and efficient performance, similar to that of a lower-level language such as C++.  While workin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/database-migration-golang-migrate/</link>
                <guid isPermaLink="false">66d460c7052ad259f07e4b2c</guid>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 26 Jan 2023 23:17:14 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/Blue-and-Pink-3D-Elements-Student-Part-Time-Graphic-Designer-Video-Resume-Talking-Presentation.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Rwitesh Bera</p>
<p>Since its introduction, the programming language Go (also known as Golang) has become increasingly popular. It is known for its simplicity and efficient performance, similar to that of a lower-level language such as C++. </p>
<p>While working with a database, schema migration is one of the most important tasks that developers do throughout the project lifecycle. In this article, I will explain what database migration is and how to manage it using <a target="_blank" href="https://github.com/golang-migrate/migrate">go-migrate</a>.</p>
<h2 id="heading-what-is-a-database-migration">What is a Database Migration?</h2>
<p>A database migration, also known as a schema migration, is a set of changes to be made to the structure of objects within a relational database. </p>
<p>It is a way to manage and implement incremental changes to the structure of data in a controlled, programmatic manner. These changes are often reversible, meaning they can be undone or rolled back if required. </p>
<p>The process of migration helps to change the database schema from its current state to a new desired state, whether it involves adding tables and columns, removing elements, splitting fields, or changing types and constraints. </p>
<p>By managing these changes in a programmatic way, it becomes easier to maintain consistency and accuracy in the database, as well as keep track of the history of modifications made to it.</p>
<h2 id="heading-setup-and-installation">Setup and Installation</h2>
<p><a target="_blank" href="https://github.com/golang-migrate/migrate">migrate</a> is a CLI tool that you can use to run migrations. You can easily install it on various operating systems such as Linux, Mac, and Windows using curl, Homebrew, and Scoop, respectively. </p>
<p>For more information on how to install and use the tool, you can refer to the official documentation.</p>
<p>To install the migrate CLI tool using <a target="_blank" href="https://scoop.sh/">scoop</a> on Windows, you can follow these steps:</p>
<pre><code class="lang-bash">$ scoop install migrate
</code></pre>
<p>To install the migrate CLI tool using <strong>curl</strong> on Linux, you can follow these steps: </p>
<pre><code class="lang-bash">$ curl -L https://packagecloud.io/golang-migrate/migrate/gpgkey| apt-key add -
$ <span class="hljs-built_in">echo</span> <span class="hljs-string">"deb https://packagecloud.io/golang-migrate/migrate/ubuntu/ <span class="hljs-subst">$(lsb_release -sc)</span> main"</span> &gt; /etc/apt/sources.list.d/migrate.list
$ apt-get update
$ apt-get install -y migrate
</code></pre>
<p>To install the migrate CLI tool using <strong>Homebrew</strong> on Mac, you can follow these steps: </p>
<pre><code class="lang-bash">$ brew install golang-migrate
</code></pre>
<h2 id="heading-how-to-create-a-new-migration">How to Create a New Migration</h2>
<p>Create a directory like <code>database/migration</code> to store all the migration files.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-263.png" alt="Image" width="600" height="400" loading="lazy">
<em>Source files structure in GoLand IDE</em></p>
<p>Next, create migration files using the following command:</p>
<pre><code class="lang-bash">$ migrate create -ext sql -dir database/migration/ -seq init_mg
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-267.png" alt="Image" width="600" height="400" loading="lazy">
<em>Terminal output displaying successful creation of migration</em></p>
<p>You use <code>-seq</code> to generate a sequential version number, and <code>init_mg</code> is the name of the migration.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-269.png" alt="Image" width="600" height="400" loading="lazy">
<em>Source files structure in GoLand IDE</em></p>
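<p>With sequential versioning, the command above should produce an empty pair of files along these lines:</p>
<pre><code class="lang-bash">database/migration/000001_init_mg.up.sql
database/migration/000001_init_mg.down.sql
</code></pre>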
<p>A migration typically consists of two distinct files, one for moving the database to a new state (referred to as "up") and another for reverting the changes made to the previous state (referred to as "down"). </p>
<p>The "up" file is used to implement the desired changes to the database, while the "down" file is used to undo those changes and return the database to its previous state.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/Database-migration.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Flow of database migration</em></p>
<p>The format of these files for SQL is:</p>
<pre><code class="lang-bash">{version}_{title}.down.sql
{version}_{title}.up.sql
</code></pre>
<p>When you create migration files, they will be empty by default. To implement the changes you want, you will need to fill them with the appropriate SQL queries.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-282.png" alt="Image" width="600" height="400" loading="lazy">
<em>SQL queries for migrating data</em></p>
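<p>As a minimal illustration (the table and columns here are just an example, not the exact ones from the screenshot), a first migration pair might contain:</p>
<pre><code class="lang-sql">-- 000001_init_mg.up.sql
CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL
);

-- 000001_init_mg.down.sql
DROP TABLE IF EXISTS users;
</code></pre>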
<h3 id="heading-how-to-run-migration-up">How to Run Migration Up</h3>
<p>In order to execute the SQL statements in the migration files, migrate requires a valid connection to a Postgres database. </p>
<p>To accomplish this, you will need to provide a connection string in the proper format.</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> -verbose up
</code></pre>
<p>Now in your Postgres shell, you can list the newly created tables with <code>\d+</code>, or describe a specific table by passing its name:</p>
<pre><code class="lang-bash">\d+

\d+ table_name
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/image-286.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displaying table data in Postgres</em></p>
<h3 id="heading-how-to-rollback-migrations">How to Rollback Migrations</h3>
<p>If you want to revert the migration, you can do that using the <code>down</code> command:</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> -verbose down
</code></pre>
<p>It will drop the <code>email</code> column from both tables, reversing the change that the <code>000002_init_mg.up.sql</code> file introduced (the corresponding statements live in <code>000002_init_mg.down.sql</code>).</p>
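<p>For reference, that second pair of files could look something like this (the table names are purely illustrative):</p>
<pre><code class="lang-sql">-- 000002_init_mg.up.sql
ALTER TABLE users ADD COLUMN email VARCHAR(100);
ALTER TABLE orders ADD COLUMN email VARCHAR(100);

-- 000002_init_mg.down.sql
ALTER TABLE users DROP COLUMN email;
ALTER TABLE orders DROP COLUMN email;
</code></pre>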
<p>Now, let's check the database and see if <code>email</code> has been deleted or not:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/Screenshot_20230126_102731.png" alt="Image" width="600" height="400" loading="lazy">
<em>Displaying updated table data in Postgres</em></p>
<h3 id="heading-how-to-resolve-migration-errors">How to Resolve Migration Errors</h3>
<p>If a migration contains an error and is executed, migrate will prevent any further migrations from being run on the same database. </p>
<p>An error message like <code>Dirty database version 1. Fix and force version</code> will be displayed, even after the error in the migration is fixed. This indicates that the database is "dirty" and needs to be investigated. </p>
<p>It is necessary to determine if the migration was applied partially or not at all. Once you've determined this, the database version should be forced to reflect its true state using the force command.</p>
<pre><code class="lang-bash">$ migrate -path database/migration/ -database <span class="hljs-string">"postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable"</span> force &lt;VERSION&gt;
</code></pre>
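<p>When investigating whether the migration was applied partially, it can also help to peek at migrate's bookkeeping table in Postgres. By default it is called <code>schema_migrations</code> and stores the current version along with a dirty flag:</p>
<pre><code class="lang-sql">SELECT version, dirty FROM schema_migrations;
</code></pre>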
<h3 id="heading-how-to-add-commands-in-a-makefile">How to Add Commands in a Makefile</h3>
<pre><code class="lang-makefile"><span class="hljs-section">migration_up: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" -verbose up</span>

<span class="hljs-section">migration_down: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" -verbose down</span>

<span class="hljs-section">migration_fix: migrate -path database/migration/ -database "postgresql://username:secretkey@localhost:5432/database_name?sslmode=disable" force VERSION</span>
</code></pre>
<p>Now, you can run <code>$ make migration_up</code> for 'up', <code>$ make migration_down</code> for 'down', and <code>$ make migration_fix</code> to fix the migration issue. </p>
<p>Before running the makefile, ensure that the correct version number is included in the <code>migration_fix</code> command.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Migration systems typically generate files that can be shared across developers and multiple teams. They can also be applied to multiple databases and maintained in version control. </p>
<p>Keeping a record of changes to the database makes it possible to track the history of modifications made to it. This way, the database schema and the application's understanding of that structure can evolve together.</p>
<p>That concludes our discussion on database migration. I hope you found the information to be useful and informative. </p>
<p>If you enjoyed reading this article, please consider sharing it with your colleagues and friends on social media. Additionally, please follow me on <a target="_blank" href="https://twitter.com/RwiteshBera/">Twitter</a> for more updates on technology and coding. Thank you for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Migrate a Database in PHP Using Phinx ]]>
                </title>
                <description>
                    <![CDATA[ Building modern web applications usually involves a lot of data. Managing these data (databases) during development and production can be a lot.  This is especially true if there's more than one developer, and multiple environments where changes have... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/easy-database-migrations-in-php-using-phinx/</link>
                <guid isPermaLink="false">66c4c6964173ed342943d0ac</guid>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PHP ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Zubair Idris Aweda ]]>
                </dc:creator>
                <pubDate>Wed, 30 Mar 2022 23:59:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/03/0-ddWHLcHqIojSq_GO.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Building modern web applications usually involves a lot of data. Managing this data (and the databases that hold it) during development and production can be a lot of work. </p>
<p>This is especially true when there's more than one developer and multiple environments where changes have to be applied manually.</p>
<p>Database migrations help developers manage these changes easily, across multiple environments and developers. </p>
<p>This article explains:</p>
<ul>
<li>What database migrations are.</li>
<li>How to get started with database migrations in PHP using Phinx.</li>
<li>How to manage tables in your database.</li>
</ul>
<p>This article is meant for readers with basic PHP knowledge. It will help you learn to easily (and better) manage your databases.</p>
<h2 id="heading-what-are-database-migrations">What Are Database Migrations?</h2>
<p>In basic terms, migrations contain changes that you wish to make to your database. These changes could be creating or dropping a table, adding or removing some field(s) from a table, changing column types, and many more. </p>
<p>These files make it easy to make these same changes across multiple systems as anyone with the files can just run them, and have their database updated.</p>
<p>So in a real life scenario, some developer on the team could make a change to the <em>users</em> table to allow the <em>gender</em> field to accept more than the default <em>male</em> and <em>female</em> options, maybe a third <em>other</em> option. </p>
<p>After making this change, the developer creates a migration. This migration includes changes that they have made to the database – in this case a change to a column on a table – and other developers can easily get this change to their own local databases by running the migrations.</p>
<blockquote>
<p>Migrations are like version control for your database, allowing your team to define and share the application's database schema definition. If you have ever had to tell a teammate to manually add a column to their local database schema after pulling in your changes from source control, you've faced the problem that database migrations solve. - <a target="_blank" href="https://laravel.com/docs/9.x/">Laravel</a></p>
</blockquote>
<p>Many popular web frameworks already have support for migrations built in. But in this article, we explore using migrations in vanilla PHP.</p>
<p>Learn more about database migrations <a target="_blank" href="https://www.cloudbees.com/blog/database-migration">here</a>.</p>
<h2 id="heading-what-is-phinx">What Is Phinx?</h2>
<blockquote>
<p>Phinx is a PHP library that makes it ridiculously easy to manage the database migrations for your PHP app. - Phinx</p>
</blockquote>
<p>Phinx makes it possible to manage migrations easily regardless of whether you're using a PHP framework or not. It is also very easy to install (as we will see later on). </p>
<p>It ships with a couple of commands to make operations easier. It is fully customisable (you can do whatever you want with it 🙃). It also works in multiple environments, meaning you can have some production migrations, testing migrations, and dev migrations.</p>
<h2 id="heading-phinx-installation">Phinx Installation</h2>
<p>You can add Phinx to any PHP project using composer.</p>
<pre><code class="lang-bash">$ mkdir php-migrations
$ <span class="hljs-built_in">cd</span> php-migrations
$ composer init
</code></pre>
<p>The first command creates a folder in your current directory, <code>php-migrations</code>, and the second command moves into it. The last command starts an interactive shell.</p>
<p>Follow the prompt, filling in the details as required (the default values are fine). You can set the project description, author name (or contributors' names), minimum stability for dependencies, project type, license, and define your dependencies.</p>
<p>When you get to the dependencies part, install the <code>robmorgan/phinx</code> package as a dependency.</p>
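<p>As an aside, if you skip that prompt you can always pull the package in later with Composer's <code>require</code> command:</p>
<pre><code class="lang-bash">$ composer require robmorgan/phinx
</code></pre>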
<p>Accept the other defaults and proceed to generating the <code>composer.json</code> file. The generated file should look like this currently:</p>
<pre><code class="lang-php">{
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"zubair/php-migrations"</span>,
    <span class="hljs-string">"description"</span>: <span class="hljs-string">"A simple tutorial on how to use and manage migrations in PHP applications."</span>,
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"project"</span>,
    <span class="hljs-string">"require"</span>: {
        <span class="hljs-string">"robmorgan/phinx"</span>: <span class="hljs-string">"^0.12.10"</span>
    },
    <span class="hljs-string">"license"</span>: <span class="hljs-string">"ISC"</span>,
    <span class="hljs-string">"autoload"</span>: {
        <span class="hljs-string">"psr-4"</span>: {
            <span class="hljs-string">"Zubs\\"</span>: <span class="hljs-string">"src/"</span>
        }
    },
    <span class="hljs-string">"authors"</span>: [
        {
            <span class="hljs-string">"name"</span>: <span class="hljs-string">"Zubs"</span>,
            <span class="hljs-string">"email"</span>: <span class="hljs-string">"zubairidrisaweda@gmail.com"</span>
        }
    ]
}
</code></pre>
<h2 id="heading-init-phinx">Init Phinx</h2>
<p>After installing Phinx, you need to initialise it. You can do this very easily using its binary installed in the <code>vendor</code> folder.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx init
</code></pre>
<p>This creates Phinx's configuration file as a PHP file. It could be created as a JSON file too. I prefer JSON for configurations, so I will use the JSON format.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx init --format=json
</code></pre>
<p>Here's what the default configuration file looks like:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"paths"</span>: {
        <span class="hljs-attr">"migrations"</span>: <span class="hljs-string">"%%PHINX_CONFIG_DIR%%/db/migrations"</span>,
        <span class="hljs-attr">"seeds"</span>: <span class="hljs-string">"%%PHINX_CONFIG_DIR%%/db/seeds"</span>
    },
    <span class="hljs-attr">"environments"</span>: {
        <span class="hljs-attr">"default_migration_table"</span>: <span class="hljs-string">"phinxlog"</span>,
        <span class="hljs-attr">"default_environment"</span>: <span class="hljs-string">"development"</span>,
        <span class="hljs-attr">"production"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"production_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        },
        <span class="hljs-attr">"development"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"development_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        },
        <span class="hljs-attr">"testing"</span>: {
            <span class="hljs-attr">"adapter"</span>: <span class="hljs-string">"mysql"</span>,
            <span class="hljs-attr">"host"</span>: <span class="hljs-string">"localhost"</span>,
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"testing_db"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"root"</span>,
            <span class="hljs-attr">"pass"</span>: <span class="hljs-string">""</span>,
            <span class="hljs-attr">"port"</span>: <span class="hljs-number">3306</span>,
            <span class="hljs-attr">"charset"</span>: <span class="hljs-string">"utf8"</span>
        }
    },
    <span class="hljs-attr">"version_order"</span>: <span class="hljs-string">"creation"</span>
}
</code></pre>
<p>In this configuration file, notice how Phinx expects that you have a <code>db/migrations</code> path (for your migrations) by default. You can change this if you want, but I think it's fine and I'll be keeping it.</p>
<pre><code class="lang-bash">$ mkdir db &amp;&amp; db/migrations
</code></pre>
<p>You can read more about these configurations in the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/configuration.html">official documentation</a>.</p>
<p>Phinx also ships with commands for different actions to make it easier to use in our projects.</p>
<h2 id="heading-how-to-create-a-migration">How to Create A Migration</h2>
<p>Phinx uses classes for its migrations. To create a new migration (say, one to create a <em>posts</em> table), use the <code>create</code> command with the name of the migration.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx create PostsTableMigration
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-28-at-13.22.17.png" alt="Image" width="600" height="400" loading="lazy">
<em>Creating A Migration</em></p>
<p>This creates a <code>20220328122134_posts_table_migration.php</code> file in the <code>db/migrations</code> directory created earlier. This file is named using the <code>YYYYMMDDHHMMSS_my_new_migration.php</code> format. In this format, the first 14 characters, <em>YYYYMMDDHHMMSS</em>, are representations of the current timestamp.</p>
<p>The <code>20220328122134_posts_table_migration.php</code> looks like this currently:</p>
<pre><code class="lang-php"><span class="hljs-meta">&lt;?php</span>
<span class="hljs-keyword">declare</span>(strict_types=<span class="hljs-number">1</span>);

<span class="hljs-keyword">use</span> <span class="hljs-title">Phinx</span>\<span class="hljs-title">Migration</span>\<span class="hljs-title">AbstractMigration</span>;

<span class="hljs-keyword">final</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PostsTableMigration</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AbstractMigration</span>
</span>{
    <span class="hljs-comment">/**
     * Change Method.
     *
     * Write your reversible migrations using this method.
     *
     * More information on writing migrations is available here:
     * https://book.cakephp.org/phinx/0/en/migrations.html#the-change-method
     *
     * Remember to call "create()" or "update()" and NOT "save()" when working
     * with the Table class.
     */</span>
    <span class="hljs-keyword">public</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">change</span>(<span class="hljs-params"></span>): <span class="hljs-title">void</span>
    </span>{

    }
}
</code></pre>
<p>This file (and all other migrations created using Phinx) extends the <code>Phinx\Migration\AbstractMigration</code> class. This class has all the methods you need to interact with your database.</p>
<p>This migration file also includes a <code>change</code> method. This method was introduced to Phinx in version 0.2.0 to implement Phinx's idea of reversible migrations. </p>
<p>These are migration files with just one method, <em>change</em>, that contains the logic for performing some action, leaving Phinx to figure out how to reverse it, rather than the traditional approach of using two methods, <em>up</em> and <em>down</em>, to apply and reverse changes.</p>
<blockquote>
<p>Phinx still allows you to use <em>up</em> and <em>down</em> methods. But when they are used together with <em>change</em>, the <em>change</em> method takes preference and the <em>up</em> and <em>down</em> methods are ignored.</p>
</blockquote>
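<p>For comparison, a rough sketch of the traditional style with explicit <em>up</em> and <em>down</em> methods might look like this (the table and column are only an example):</p>
<pre><code class="lang-php">public function up(): void
{
    // Apply the change: create the table.
    $this-&gt;table('posts')
        -&gt;addColumn('title', 'string', ['limit' =&gt; 20])
        -&gt;create();
}

public function down(): void
{
    // Reverse the change: drop the table again.
    $this-&gt;table('posts')-&gt;drop()-&gt;save();
}
</code></pre>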
<h2 id="heading-how-to-manage-tables">How to Manage Tables</h2>
<p>Tables are the basis on which structured databases are built and are the most important part of what Phinx offers. </p>
<p>You can easily manage database tables using PHP code with Phinx. Phinx offers a powerful <code>table()</code> method. This method retrieves an instance of the <em>Table</em> object.</p>
<h3 id="heading-how-to-create-a-table">How to Create a Table</h3>
<p>Creating a table is really easy using Phinx. You create a new instance of the <em>Table</em> object using the <code>table()</code> method with the table name.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);
</code></pre>
<p>Next, you can add columns with their settings.</p>
<pre><code class="lang-php">$table-&gt;addColumn(<span class="hljs-string">'title'</span>, <span class="hljs-string">'string'</span>, [<span class="hljs-string">'limit'</span> =&gt; <span class="hljs-number">20</span>])
    -&gt;addColumn(<span class="hljs-string">'body'</span>, <span class="hljs-string">'text'</span>)
    -&gt;addColumn(<span class="hljs-string">'cover_image'</span>, <span class="hljs-string">'string'</span>)
    -&gt;addTimestamps()
    -&gt;addIndex([<span class="hljs-string">'title'</span>], [<span class="hljs-string">'unique'</span> =&gt; <span class="hljs-literal">true</span>]);
</code></pre>
<p>Here, I've created columns <code>title</code>, <code>body</code>, <code>cover_image</code>, <code>created_at</code>, and <code>updated_at</code>. I also set the type of the <code>title</code> to be a <em>string</em> with 20 or fewer characters. </p>
<p>I set the <code>body</code> to be a text field, so it can hold long posts. The <code>cover_image</code> is also a <em>string</em> field that uses the default size of a string (255). </p>
<p>The <code>created_at</code> and <code>updated_at</code> fields are timestamps automatically generated in the <code>addTimestamps()</code> method. </p>
<p>Finally, I set the <code>title</code> field to be unique (as it would be in a real blog).</p>
<p>You can get all the available column types by checking the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/migrations.html#valid-column-types">Valid Column Types</a>. You can also get all the available column options by checking the <a target="_blank" href="https://book.cakephp.org/phinx/0/en/migrations.html#valid-column-options">Valid Column Options</a>.</p>
<p>Finally, you can say that the table should be created by using the <code>create</code> method.</p>
<pre><code class="lang-php">$table-&gt;create();
</code></pre>
<p>In the end, your migration file's <em>change</em> method should look like this:</p>
<pre><code class="lang-php"><span class="hljs-keyword">public</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">change</span>(<span class="hljs-params"></span>): <span class="hljs-title">void</span>
</span>{
    $table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

    $table-&gt;addColumn(<span class="hljs-string">'title'</span>, <span class="hljs-string">'string'</span>, [<span class="hljs-string">'limit'</span> =&gt; <span class="hljs-number">20</span>])
        -&gt;addColumn(<span class="hljs-string">'body'</span>, <span class="hljs-string">'text'</span>)
        -&gt;addColumn(<span class="hljs-string">'cover_image'</span>, <span class="hljs-string">'string'</span>)
        -&gt;addTimestamps()
        -&gt;addIndex([<span class="hljs-string">'title'</span>], [<span class="hljs-string">'unique'</span> =&gt; <span class="hljs-literal">true</span>]);

     $table-&gt;create();
}
</code></pre>
<p>We can now run this migration to create our table.</p>
<h2 id="heading-how-to-run-migrations">How to Run Migrations</h2>
<p>After creating migrations, the next step is to apply the desired changes to the database. Running migrations does exactly that.</p>
<pre><code class="lang-php">$ ./vendor/bin/phinx migrate
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.54.56.png" alt="Image" width="600" height="400" loading="lazy">
<em>Running a migration</em></p>
<p>This image shows the result of the migration. You can see the time taken to run the migration.</p>
<p>Now, if you check your database GUI tool, you'll notice that the <em>posts</em> table was created with an additional field, the <em>id</em> field. This field is the primary key by default, and it auto-increments.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-19.00.27.png" alt="Image" width="600" height="400" loading="lazy">
<em>posts table.</em></p>
<p>You may change the primary key to some other key by either specifying some other field as the primary field, or by mapping the <em>id</em> field to the desired primary field. The latter includes the auto incrementing ability of the normal <em>id</em> field.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>, [
    <span class="hljs-string">'id'</span> =&gt; <span class="hljs-literal">false</span>,
    <span class="hljs-string">'primary_key'</span> =&gt; [<span class="hljs-string">'posts_key'</span>]
]);

$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>, [
    <span class="hljs-string">'id'</span> =&gt; <span class="hljs-string">'posts_key'</span>,
]);
</code></pre>
<p>In the first method, the primary key to be used has to be a column on the table (it is not auto-created).</p>
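<p>In other words, with the first approach you would also define that column yourself, roughly like this (the column name and type are just an example):</p>
<pre><code class="lang-php">$table = $this-&gt;table('posts', [
    'id' =&gt; false,
    'primary_key' =&gt; ['posts_key']
]);

// posts_key is not auto-created, so add it explicitly before creating the table
$table-&gt;addColumn('posts_key', 'integer')
    -&gt;create();
</code></pre>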
<p>You may also set the environment in which you want to run the migrations. </p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx migrate -e testing
</code></pre>
<h3 id="heading-how-to-reversemigrations">How to ReverseMigrations</h3>
<p>Migrations can be reversed by being <em>run down</em>. This is the reverse of migrating <em>up</em>. The table previously created will be dropped, columns added will be removed, and the database will be returned to its initial pre-migration state.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx rollback
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.56.46.png" alt="Image" width="600" height="400" loading="lazy">
<em>Reversing a migration</em></p>
<h3 id="heading-how-to-check-migration-status">How to Check Migration Status</h3>
<p>As your application size increases, it is expected that your database migrations will increase. Due to this, at some point, you may wish to check the status of your migrations, to know which have been run, and which have not.</p>
<pre><code class="lang-bash">$ ./vendor/bin/phinx status
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/03/Screenshot-2022-03-29-at-18.58.07.png" alt="Image" width="600" height="400" loading="lazy">
<em>Checking migration status</em></p>
<h3 id="heading-how-to-drop-a-table">How to Drop a Table</h3>
<p>To drop a table, call the <code>drop</code> method on the <em>Table</em> object, followed by the <code>save</code> method to persist the change.</p>
<pre><code class="lang-php"><span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>)-&gt;drop()-&gt;save();
</code></pre>
<h3 id="heading-how-to-rename-a-table">How to Rename a Table</h3>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

$table-&gt;rename(<span class="hljs-string">'articles'</span>)
    -&gt;update();
</code></pre>
<p>To rename a table, get the table, then use the <code>rename</code> method with the new name, followed by the <code>update</code> method to persist the change.</p>
<h3 id="heading-how-to-change-a-tables-primary-key">How to Change a Table's Primary Key</h3>
<p>You can also change a table's primary key very easily.</p>
<pre><code class="lang-php">$table = <span class="hljs-keyword">$this</span>-&gt;table(<span class="hljs-string">'posts'</span>);

$table-&gt;changePrimaryKey(<span class="hljs-string">'new_primary_key'</span>);

$table-&gt;update();
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Now you know how to set up migrations in your PHP applications.</p>
<p>If you have any questions or relevant advice, please get in touch with me to share them.</p>
<p>To read more of my articles or follow my work, you can connect with me on <a target="_blank" href="https://www.linkedin.com/in/idris-aweda-zubair-5433121a3/">LinkedIn</a>, <a target="_blank" href="https://twitter.com/AwedaIdris">Twitter</a>, and <a target="_blank" href="https://github.com/Zubs">Github</a>. It’s quick, it’s easy, and it’s free!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to migrate from Elasticsearch 1.7 to 6.8 with zero downtime ]]>
                </title>
                <description>
                    <![CDATA[ By dor sever My last task at BigPanda was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1. In this post, I will share how we migrated from Elasticsearch 1.6 to 6.8 with harsh constraints... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-migrate-from-elasticsearch-1-7-to-6-8-with-zero-downtime/</link>
                <guid isPermaLink="false">66d45e414a7504b7409c338a</guid>
                
                    <category>
                        <![CDATA[ availability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ elasticsearch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 25 Dec 2019 09:45:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2019/12/es-3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By dor sever</p>
<p>My last task at <a target="_blank" href="https://www.bigpanda.io">BigPanda</a> was to upgrade an existing service that was using Elasticsearch version 1.7 to a newer Elasticsearch version, 6.8.1.</p>
<p>In this post, I will share how we migrated from Elasticsearch 1.7 to 6.8 with harsh constraints like zero downtime, no data loss, and zero bugs. I'll also provide you with a script that does the migration for you.</p>
<p>This post contains 6 chapters (and one is optional):</p>
<ul>
<li>What’s in it for me? --&gt; What were the new features that led us to upgrade our version?</li>
<li>The constraints --&gt; What were our business requirements?</li>
<li>Problem solving --&gt; How did we address the constraints?</li>
<li>Moving forward --&gt; The plan.</li>
<li>[Optional chapter] --&gt; How did we handle the infamous mapping explosion problem?</li>
<li>Finally --&gt; How to do data migration between clusters.</li>
</ul>
<h1 id="heading-chapter-1-whats-in-it-for-me">Chapter 1 — What’s in it for me?</h1>
<p>What problems were we expecting to solve by upgrading our data store?</p>
<p>There were a couple of reasons:</p>
<ol>
<li>Performance and stability issues — We were experiencing a huge number of outages with long MTTR that caused us a lot of headaches. This was reflected in frequent high latencies, high CPU usage, and more issues.</li>
<li>Non-existent support in old Elasticsearch versions — We were missing some operative knowledge in Elasticsearch, and when we searched for outside consulting we were encouraged to migrate forward to receive support.</li>
<li>Dynamic mappings in our schema — Our current schema in Elasticsearch 1.7 used a feature called dynamic mappings that made our cluster <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/6.1/mapping.html#mapping-limit-settings">explode</a> multiple times. So we wanted to address this issue.</li>
<li>Poor visibility on our existing cluster — We wanted a better view under the hood and saw that later versions had great metrics exporting tools.</li>
</ol>
<h1 id="heading-chapter-2-the-constraints">Chapter 2 — The constraints</h1>
<ul>
<li>ZERO downtime migration — We have active users on our system, and we could not afford for the system to be down while we were migrating.</li>
<li>Recovery plan — We could not afford to “lose” or “corrupt” data, no matter the cost. So we needed to prepare a recovery plan in case our migration failed.</li>
<li>Zero bugs — We could not change existing search functionality for end-users.</li>
</ul>
<h1 id="heading-chapter-3-problem-solving-and-thinking-of-a-plan">Chapter 3 — Problem solving and thinking of a plan</h1>
<p>Let’s tackle the constraints from the simplest to the most difficult:</p>
<h2 id="heading-zero-bugs">Zero bugs</h2>
<p>In order to address this requirement, I studied all the possible requests the service receives and what its outputs were. Then I added unit-tests where needed.</p>
<p>In addition, I added multiple metrics (to the <code>Elasticsearch Indexer</code> and the <code>new Elasticsearch Indexer</code> ) to track latency, throughput, and performance, which allowed me to validate that we only improved them.</p>
<h2 id="heading-recovery-plan">Recovery plan</h2>
<p>This means that I needed to address the following situation: I deployed the new code to production and stuff was not working as expected. What could I do about it then?</p>
<p>Since I was working on a service that used <a target="_blank" href="https://www.youtube.com/watch?v=STKCRSUsyP0">event-sourcing</a>, I could add another listener (diagram attached below) and start writing to a new Elasticsearch cluster without affecting production.</p>
<h2 id="heading-zero-downtime-migration">Zero downtime migration</h2>
<p>The current service is in live mode and cannot be “deactivated” for periods longer than 5–10 minutes. The trick to getting this right is this:</p>
<ul>
<li>Store a log of all the actions your service is handling (we use Kafka in production)</li>
<li>Start the migration process offline (and keep track of the offset before you started the migration)</li>
<li>When the migration ends, start the new service against the log with the recorded offset and catch up the lag</li>
<li>When the lag finishes, change your frontend to query against the new service and you are done</li>
</ul>
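<p>As one way to handle the offset bookkeeping in step two, Kafka's own CLI can show a consumer group's current offsets just before you start the offline migration (the group name below is made up):</p>
<pre><code class="lang-bash"># Note the indexer's current offsets on the command topic before migrating
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group es-indexer
</code></pre>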
<h1 id="heading-chapter-4-the-plan">Chapter 4 — The plan</h1>
<p>Our current service uses the following architecture (based on message passing in Kafka):</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/indxr2.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<ol>
<li><code>Event topic</code> contains events produced by other applications (for example, <code>UserId 3 created</code>)</li>
<li><code>Command topic</code> contains the translation of these events into specific commands used by this application (for example: <code>Add userId 3</code>)</li>
<li>Elasticsearch 1.7 — The datastore of the <code>command Topic</code> read by the <code>Elasticsearch Indexer</code>.</li>
</ol>
<p>We planned to add another consumer (<code>new Elasticsearch Indexer</code>) to the <code>command topic</code>, which will read the same exact messages and write them in parallel to Elasticsearch 6.8.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/indxr.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<h1 id="heading-where-should-i-start">Where should I start?</h1>
<p>To be honest, I considered myself a newbie Elasticsearch user. To feel confident to perform this task, I had to think about the best way to approach this topic and learn it. A few things that helped were:</p>
<ol>
<li>Documentation — It’s an insanely useful resource for everything Elasticsearch. Take the time to read it and take notes (don’t miss: <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">Mapping</a> and <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">QueryDsl</a>).</li>
<li>HTTP API — everything under <a target="_blank" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html">CAT</a> API. This was super useful to debug things locally and see how Elasticsearch responds (don’t miss: cluster health, cat indices, search, delete index).</li>
<li>Metrics (❤️) — From the first day, we configured a shiny new dashboard with lots of cool metrics (taken from <a target="_blank" href="https://github.com/justwatchcom/elasticsearch_exporter"><em>elasticsearch-exporter-for-Prometheus</em></a>) that helped and pushed us to understand more about Elasticsearch.</li>
</ol>
<h1 id="heading-the-code">The code</h1>
<p>Our codebase was using a library called <a target="_blank" href="https://github.com/sksamuel/elastic4s">elastic4s</a> and was using the oldest release available in the library — a really good reason to migrate! So the first thing to do was just to migrate versions and see what broke.</p>
<p>There are a few tactics for doing this code migration. The tactic we chose was to first restore the existing functionality in the new Elasticsearch version, without rewriting all the code from scratch. In other words, to reach existing functionality, but on a newer version of Elasticsearch.</p>
<p>Luckily for us, the code already contained almost full testing coverage so our task was much much simpler, and that took around 2 weeks of development time.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/you_need_some_tests_yo.jpg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>It's important to note that, if that wasn't the case, we would have had to invest some time in filling that coverage up. Only then would we be able to migrate since one of our constraints was to not break existing functionality.</em></p>
<h1 id="heading-chapter-5-the-mapping-explosion-problem">Chapter 5 — The mapping explosion problem</h1>
<p>Let’s describe our use-case in more detail. This is our model:</p>
<p><code>class InsertMessageCommand(tags: Map[String,String])</code></p>
<p>And for example, an instance of this message would be:</p>
<p><code>new InsertMessageCommand(Map("name"-&gt;"dor","lastName"-&gt;"sever"))</code></p>
<p>And given this model, we needed to support the following query requirements:</p>
<ol>
<li>Query by value</li>
<li>Query by tag name and value</li>
</ol>
<p>The way this was modeled in our Elasticsearch 1.7 schema was using a dynamic template schema (since the tag keys are dynamic and cannot be modeled in advance).</p>
<p>The dynamic template caused us multiple outages due to the mapping explosion problem, and the schema looked like this:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/_template/my_template?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d <span class="hljs-string">'
{
    "index_patterns": [
        "your-index-names*"
    ],
    "mappings": {
            "_doc": {
                "dynamic_templates": [
                    {
                        "tags": {
                            "mapping": {
                                "type": "text"
                            },
                            "path_match": "actions.tags.*"
                        }
                    }
                ]
            }
        },
    "aliases": {}
}'</span>  

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "John",
        "lname" : "Smith"
    }
  }
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/2?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "Dor",
        "lname" : "Sever"
  }
}
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/your-index-names-1/_doc/3?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "actions": {
    "tags" : {
        "name": "AnotherName",
        "lname" : "AnotherLastName"
  }
}
}
'</span>
</code></pre>
<pre><code class="lang-bash">
curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "match" : {
            "actions.tags.name" : {
                "query" : "John"
            }
        }
    }
}
'</span>
<span class="hljs-comment"># returns 1 match(doc 1)</span>


curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "match" : {
            "actions.tags.lname" : {
                "query" : "John"
            }
        }
    }
}
'</span>
<span class="hljs-comment"># returns zero matches</span>

<span class="hljs-comment"># search by value</span>
curl -X GET <span class="hljs-string">"localhost:9200/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "query": {
        "query_string" : {
            "fields": ["actions.tags.*" ],
            "query" : "Dor"
        }
    }
}
'</span>
</code></pre>
<h2 id="heading-nested-documents-solution">Nested documents solution</h2>
<p>Our first instinct in solving the mapping explosion problem was to use nested documents.</p>
<p>We read the nested data type tutorial in the Elastic docs and defined the following schema and queries:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/my_index?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
        "mappings": {
            "_doc": {
            "properties": {
            "tags": {
                "type": "nested" 
                }                
            }
        }
        }
}
'</span>

curl -X PUT <span class="hljs-string">"localhost:9200/my_index/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "tags" : [
    {
      "key" : "John",
      "value" :  "Smith"
    },
    {
      "key" : "Alice",
      "value" :  "White"
    }
  ]
}
'</span>


<span class="hljs-comment"># Query by tag key and value</span>
curl -X GET <span class="hljs-string">"localhost:9200/my_index/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
    "nested": {
      "path": "tags",
      "query": {
        "bool": {
          "must": [
            { "match": { "tags.key": "Alice" }},
            { "match": { "tags.value":  "White" }} 
          ]
        }
      }
    }
  }
}
'</span>

<span class="hljs-comment"># Returns 1 document</span>


curl -X GET <span class="hljs-string">"localhost:9200/my_index/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
    "nested": {
      "path": "tags",
      "query": {
        "bool": {
          "must": [
            { "match": { "tags.value":  "Smith" }} 
          ]
        }
      }
    }
  }
}
'</span>

<span class="hljs-comment"># Query by tag value</span>
<span class="hljs-comment"># Returns 1 result</span>
</code></pre>
<p>And this solution worked. However, when we tried to insert real customer data we saw that the number of documents in our index increased by around 500 times.</p>
<p>We thought about the following problems and went on to find a better solution:</p>
<ol>
<li>The number of documents we had in our cluster was around 500 million. This meant that, with the new schema, we were going to reach two hundred and fifty billion documents (that's 250,000,000,000 documents!).</li>
<li>We read this really good blog post — <a target="_blank" href="https://blog.gojekengineering.com/elasticsearch-the-trouble-with-nested-documents-e97b33b46194">https://blog.gojekengineering.com/elasticsearch-the-trouble-with-nested-documents-e97b33b46194</a> which highlights that nested documents can cause high latency in queries and heap usage problems.</li>
<li>Testing — Since we were converting 1 document in the old cluster to an unknown number of documents in the new cluster, it would be much harder to track if the migration process worked without any data loss. If our conversion was 1:1, we could assert that the count in the old cluster equalled the count in the new cluster.</li>
</ol>
<h2 id="heading-avoiding-nested-documents">Avoiding nested documents</h2>
<p>The real trick in this was to focus on what supported queries we were running: search by tag value, and search by tag key and value.</p>
<p>The first query does not require nested documents since it works on a single field. For the latter, we did the following trick: we created a field that contains the combination of the key and the value. Whenever a user queries for a key and value match, we translate their request into the corresponding concatenated text and query against that field.</p>
<p>Example:</p>
<pre><code class="lang-bash">curl -X PUT <span class="hljs-string">"localhost:9200/my_index_2?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
    "mappings": {
        "_doc": {
            "properties": {
                "tags": {
                    "type": "object",
                    "properties": {
                        "keyToValue": {
                            "type": "keyword"
                        },
                        "value": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }
    }
}
'</span>


curl -X PUT <span class="hljs-string">"localhost:9200/my_index_2/_doc/1?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "tags" : [
    {
      "keyToValue" : "John:Smith",
      "value" : "Smith"
    },
    {
      "keyToValue" : "Alice:White",
      "value" : "White"
    }
  ]
}
'</span>

<span class="hljs-comment"># Query by key,value</span>
<span class="hljs-comment"># User queries for key: Alice, and value : White , we then query elastic with this query:</span>

curl -X GET <span class="hljs-string">"localhost:9200/my_index_2/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
        "bool": {
          "must": [ { "match": { "tags.keyToValue": "Alice:White" }}]
  }}}
'</span>

<span class="hljs-comment"># Query by value only</span>
curl -X GET <span class="hljs-string">"localhost:9200/my_index_2/_search?pretty"</span> -H <span class="hljs-string">'Content-Type: application/json'</span> -d<span class="hljs-string">'
{
  "query": {
        "bool": {
          "must": [ { "match": { "tags.value": "White" }}]
  }}}
'</span>
</code></pre>
<h1 id="heading-chapter-6-the-migration-process">Chapter 6 — The migration process</h1>
<p>We planned to migrate about 500 million documents with zero downtime. To do that we needed:</p>
<ol>
<li>A strategy on how to transfer data from the old Elastic to the new Elasticsearch</li>
<li>A strategy on how to close the lag between the start of the migration and the end of it</li>
</ol>
<p>And our two options in closing the lag:</p>
<ol>
<li>Our messaging system is Kafka based. We could have just taken the current offset before the migration started, and after the migration ended, start consuming from that specific offset. This solution requires some manual tweaking of offsets and some other stuff, but will work.</li>
<li>Another approach to solving this issue was to start consuming messages from the beginning of the topic in Kafka and make our actions on Elasticsearch idempotent — meaning, if the change was “applied” already, nothing would change in Elastic store.</li>
</ol>
<p>The requests made by our service against Elastic were already idempotent, so we chose option 2 because it required zero manual work (no need to take specific offsets, and then set them afterward in a new consumer group).</p>
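<p>To make the idempotency point concrete: indexing with an explicit document ID means a replayed command simply overwrites the same document instead of creating a duplicate. The index name and ID below are made up:</p>
<pre><code class="lang-bash"># Running this twice leaves the index in the same state as running it once
curl -X PUT "localhost:9200/users_index/_doc/user-3?pretty" -H 'Content-Type: application/json' -d'
{
  "userId": 3
}
'
</code></pre>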
<h2 id="heading-how-can-we-migrate-the-data">How can we migrate the data?</h2>
<p>These were the options we thought of:</p>
<ol>
<li>If our Kafka contained all messages from the beginning of time, we could just replay from the start and the end state would be equal. But since we apply retention to our topics, this was not an option.</li>
<li>Dump messages to disk and then ingest them to Elastic directly – This solution looked kind of weird. Why store them in disk instead of just writing them directly to Elastic?</li>
<li>Transfer messages from the old Elastic to the new Elastic — This meant writing some sort of “script” (did anyone say Python?) that would connect to the old Elasticsearch cluster, query for items, transform them to the new schema, and index them in the new cluster.</li>
</ol>
<p>We chose the last option. These were the design choices we had in mind:</p>
<ol>
<li>Let’s not try to think about error handling unless we need to. Let’s try to write something super simple, and if errors occur, let’s try to address them. In the end, we did not need to address this issue since no errors occurred during the migration.</li>
<li>It’s a one-off operation, so whatever works first / KISS.</li>
<li>Metrics — Since the migration processes can take hours to days, we wanted the ability from day 1 to be able to monitor the error count and to track the current progress and copy rate of the script.</li>
</ol>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/python.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We thought long and hard and chose Python as our weapon of choice. The final version of the code is below:</p>
<pre><code class="lang-yml"><span class="hljs-string">dictor==0.1.2</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">copy</span> <span class="hljs-string">and</span> <span class="hljs-string">transform</span> <span class="hljs-string">our</span> <span class="hljs-string">Elasticsearch</span> <span class="hljs-string">documentselasticsearch==1.9.0</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">connect</span> <span class="hljs-string">to</span> <span class="hljs-string">"old"</span> <span class="hljs-string">Elasticsearchelasticsearch6==6.4.2</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">connect</span> <span class="hljs-string">to</span> <span class="hljs-string">the</span> <span class="hljs-string">"new"</span> <span class="hljs-string">Elasticsearchstatsd==3.3.0</span> <span class="hljs-bullet">-</span> <span class="hljs-string">to</span> <span class="hljs-string">report</span> <span class="hljs-string">metrics</span>
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> elasticsearch <span class="hljs-keyword">import</span> Elasticsearch
<span class="hljs-keyword">from</span> elasticsearch6 <span class="hljs-keyword">import</span> Elasticsearch <span class="hljs-keyword">as</span> Elasticsearch6
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">from</span> elasticsearch.helpers <span class="hljs-keyword">import</span> scan
<span class="hljs-keyword">from</span> elasticsearch6.helpers <span class="hljs-keyword">import</span> parallel_bulk
<span class="hljs-keyword">import</span> statsd

ES_SOURCE = Elasticsearch(sys.argv[<span class="hljs-number">1</span>])
ES_TARGET = Elasticsearch6(sys.argv[<span class="hljs-number">2</span>])
INDEX_SOURCE = sys.argv[<span class="hljs-number">3</span>]
INDEX_TARGET = sys.argv[<span class="hljs-number">4</span>]
QUERY_MATCH_ALL = {<span class="hljs-string">"query"</span>: {<span class="hljs-string">"match_all"</span>: {}}}
SCAN_SIZE = <span class="hljs-number">1000</span>
SCAN_REQUEST_TIMEOUT = <span class="hljs-string">'3m'</span>
REQUEST_TIMEOUT = <span class="hljs-number">180</span>
MAX_CHUNK_BYTES = <span class="hljs-number">15</span> * <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>
RAISE_ON_ERROR = <span class="hljs-literal">False</span>
BULK_SIZE = <span class="hljs-number">500</span>  <span class="hljs-comment"># chunk size for parallel_bulk (the exact value is an assumption)</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform_item</span>(<span class="hljs-params">item, index_target</span>):</span>
    <span class="hljs-comment"># implement your logic transformation here</span>
    transformed_source_doc = item.get(<span class="hljs-string">"_source"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"_index"</span>: index_target,
            <span class="hljs-string">"_type"</span>: <span class="hljs-string">"_doc"</span>,
            <span class="hljs-string">"_id"</span>: item[<span class="hljs-string">'_id'</span>],
            <span class="hljs-string">"_source"</span>: transformed_source_doc}


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transformedStream</span>(<span class="hljs-params">es_source, match_query, index_source, index_target, transform_logic_func</span>):</span>
    <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> scan(es_source, query=match_query, index=index_source, size=SCAN_SIZE,
                     timeout=SCAN_REQUEST_TIMEOUT):
        <span class="hljs-keyword">yield</span> transform_logic_func(item, index_target)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">index_source_to_target</span>(<span class="hljs-params">es_source, es_target, match_query, index_source, index_target, bulk_size, statsd_client,
                           logger, transform_logic_func</span>):</span>
    ok_count = <span class="hljs-number">0</span>
    fail_count = <span class="hljs-number">0</span>
    count_response = es_source.count(index=index_source, body=match_query)
    count_result = count_response[<span class="hljs-string">'count'</span>]
    statsd_client.gauge(stat=<span class="hljs-string">'elastic_migration_document_total_count,index={0},type=success'</span>.format(index_target),
                        value=count_result)
    <span class="hljs-keyword">with</span> statsd_client.timer(<span class="hljs-string">'elastic_migration_time_ms,index={0}'</span>.format(index_target)):
        actions_stream = transformedStream(es_source, match_query, index_source, index_target, transform_logic_func)
        <span class="hljs-keyword">for</span> (ok, item) <span class="hljs-keyword">in</span> parallel_bulk(es_target,
                                        chunk_size=bulk_size,
                                        max_chunk_bytes=MAX_CHUNK_BYTES,
                                        actions=actions_stream,
                                        request_timeout=REQUEST_TIMEOUT,
                                        raise_on_error=RAISE_ON_ERROR):
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok:
                logger.error(<span class="hljs-string">"got error on index {} which is : {}"</span>.format(index_target, item))
                fail_count += <span class="hljs-number">1</span>
                statsd_client.incr(<span class="hljs-string">'elastic_migration_document_count,index={0},type=failure'</span>.format(index_target),
                                   <span class="hljs-number">1</span>)
            <span class="hljs-keyword">else</span>:
                ok_count += <span class="hljs-number">1</span>
                statsd_client.incr(<span class="hljs-string">'elastic_migration_document_count,index={0},type=success'</span>.format(index_target),
                                   <span class="hljs-number">1</span>)

    <span class="hljs-keyword">return</span> ok_count, fail_count


statsd_client = statsd.StatsClient(host=<span class="hljs-string">'localhost'</span>, port=<span class="hljs-number">8125</span>)
logger = logging.getLogger(<span class="hljs-string">"elastic_migration"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    logging.basicConfig(level=logging.INFO)
    index_source_to_target(ES_SOURCE, ES_TARGET, QUERY_MATCH_ALL, INDEX_SOURCE, INDEX_TARGET, BULK_SIZE,
                           statsd_client, logger, transform_item)
</code></pre>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Migrating data in a live production system is a complicated task that requires a lot of attention and careful planning. I recommend taking the time to work through the steps listed above and figure out what works best for your needs.</p>
<p>As a rule of thumb, always try to reduce your requirements as much as possible. For example, is a zero-downtime migration required? Can you afford data loss?</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/12/enjoy-the-ride.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Upgrading data stores is usually a marathon and not a sprint, so take a deep breath and try to enjoy the ride.</p>
<ul>
<li>The whole process listed above took me around 4 months of work</li>
<li>All of the Elasticsearch examples that appear in this post have been tested against version 6.8.1</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to import Google BigQuery tables to AWS Athena ]]>
                </title>
                <description>
                    <![CDATA[ By Aftab Ansari As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is somethi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539/</link>
                <guid isPermaLink="false">66c352c5c2631756f9f063d7</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 11 Mar 2019 18:55:49 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*518Z4MAe36ZqeLX2LTzeDA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aftab Ansari</p>
<p>As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is something you might have already faced or will face at some point.</p>
<p>In this post, I will show how I imported Google BigQuery tables to AWS Athena. If you only need a list of tools to be used with some very high-level guidance, you can quickly look at this <a target="_blank" href="https://amazon-aws-big-data-demystified.ninja/2018/05/27/how-to-export-data-from-google-big-query-into-aws-s3-emr-hive/">post that shows how to import a single BigQuery table into Hive metastore</a>. In this article, I will show one way of importing a full BigQuery project (multiple tables) into both Hive and Athena metastore.</p>
<p>There are a few export limitations: for example, when you export data from partitioned tables, you cannot export individual partitions. Please check the <a target="_blank" href="https://cloud.google.com/bigquery/docs/exporting-data">limitations</a> before starting the process.</p>
<p>In order to successfully import Google BigQuery tables to Athena, I performed the steps shown below. I used AVRO format when dumping data and the schemas from Google BigQuery and loading them into AWS Athena.</p>
<p><a class="post-section-overview" href="#264f">Step 1. Dump BigQuery data to Google Cloud Storage</a></p>
<p><a class="post-section-overview" href="#3af9">Step 2. Transfer data from Google Cloud Storage to AWS S3</a></p>
<p><a class="post-section-overview" href="#c089">Step 3. Extract AVRO schema from AVRO files stored in S3</a></p>
<p><a class="post-section-overview" href="#cc2d">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</a></p>
<p><a class="post-section-overview" href="#fbf3">Step 5. Extract Hive table definition from Hive tables</a></p>
<p><a class="post-section-overview" href="#c6f6">Step 6. Use the output of Step 3 and 5 to create Athena tables</a></p>
<p>So why do I have to create Hive tables in the first place although the end goal is to have data in Athena? This is because:</p>
<ul>
<li>Athena does not support using <code>avro.schema.url</code> to specify table schema.</li>
<li>Athena requires you to explicitly specify field names and their data types in CREATE statement.</li>
<li>Athena also requires the AVRO schema in JSON format under <code>avro.schema.literal</code>.</li>
<li>You can check this AWS <a target="_blank" href="https://docs.aws.amazon.com/athena/latest/ug/avro.html">doc</a> for more details.</li>
</ul>
<p>So, Hive tables can be created directly by pointing to AVRO schema files stored on S3. But to have the same in Athena, columns and schema are required in the CREATE TABLE statement.</p>
<p>One way to overcome this is to first extract schema from AVRO data to be supplied as <code>avro.schema.literal</code> . Second, for field names and data types required for CREATE statement, create Hive tables based on AVRO schemas stored in S3 and use <code>SHOW CREATE TABLE</code> to dump/export Hive table definitions which contain field names and datatypes. Finally, create Athena tables by combining the extracted AVRO schema and Hive table definition. I will discuss in detail in subsequent sections.</p>
<p>For the demonstration, I have the following BigQuery tables that I would like to import to Athena.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/klOvEMVXS8X9k5YaxGkacXgjhdsxMGwmnupj" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So, let’s get started!</p>
<h3 id="heading-step-1-dump-bigquery-data-to-google-cloud-storage">Step 1. Dump BigQuery data to Google Cloud Storage</h3>
<p>It is possible to dump BigQuery data in Google storage with the help of the Google cloud UI. However, this can become a tedious task if you have to dump several tables manually.</p>
<p>To tackle this problem, I used Google Cloud Shell. In Cloud Shell, you can combine regular shell scripting with BigQuery commands and dump multiple tables relatively fast. You can activate Cloud Shell as shown in the picture below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/o3Vj-2DY5je5jS1wZeE84Dh6OuO37CrR106-" alt="Image" width="600" height="400" loading="lazy"></p>
<p>From Cloud Shell, the following operation provides the BigQuery <code>extract</code> commands to dump each table of the “backend” dataset to Google Cloud Storage.</p>
<pre><code>bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY &lt;dataset&gt;.@ gs:<span class="hljs-comment">//&lt;bucket&gt;@</span>
</code></pre><p>In my case it prints:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_daily_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_detailed_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-*.avro</span>
</code></pre><p>Please note the <code>--compression SNAPPY</code> flag: it is important, as large uncompressed files can cause the <code>gsutil</code> command (used later to transfer data to AWS S3) to get stuck. The wildcard (<strong>*</strong>) makes <code>bq extract</code> split bigger tables (&gt;1GB) into multiple output files. Running those commands in Cloud Shell copies the data to the following Google Storage directory.</p>
<pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/table_name/table_name-*.avro</span>
</code></pre><p>Let’s do <code>ls</code> to see the dumped AVRO file.</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil ls gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2</span>
</code></pre><pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>I can also browse from the UI and find the data like shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/bdhhLC9Dyuv59VpLeMhLhHNm0Ul-sAe7L8f8" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-transfer-data-from-google-cloud-storage-to-aws-s3">Step 2. Transfer data from Google Cloud Storage to AWS S3</h3>
<p>Transferring data from Google Storage to AWS S3 is straightforward. First, set up your S3 credentials. On Cloud Shell, create or edit <code>.boto</code> file ( <code>vi ~/.boto</code>) and add these:</p>
<pre><code>[Credentials]
aws_access_key_id = &lt;your aws access key ID&gt;
aws_secret_access_key = &lt;your aws secret access key&gt;
[s3]
host = s3.us-east-1.amazonaws.com
use-sigv4 = True
</code></pre><p>Please note: <strong>s3.us-east-1.amazonaws.com</strong> has to correspond with the region where the bucket is.</p>
<p>After setting up the credentials, execute <code>gsutil</code> to transfer data from Google Storage to AWS S3. For example:</p>
<pre><code>gsutil rsync -r gs:<span class="hljs-comment">//your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schema</span>
</code></pre><p>Add the <strong><em>-n</em></strong> flag to the command above to display the operations that would be performed using the specified command without actually running them.</p>
<p>In this case, to transfer the data to S3, I used the following:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil rsync -r gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend</span>
</code></pre><pre><code>Building synchronization state…
Starting synchronization…
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-000000000000.avro [Content-Type=application/octet-stream]...
| [3 files][987.8 KiB/987.8 KiB]
Operation completed over 3 objects/987.8 KiB.
</code></pre><p>Let’s check if the data got transferred to S3. I verified that from my local machine:</p>
<pre><code>aws s3 ls --recursive  s3:<span class="hljs-comment">//my-bucket/bq_data/backend --profile smoke | awk '{print $4}'</span>
</code></pre><pre><code>bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro
bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro
</code></pre><h3 id="heading-step-3-extract-avro-schema-from-avro-files-stored-in-s3">Step 3. Extract AVRO schema from AVRO files stored in S3</h3>
<p>To extract schema from AVRO data, you can use the Apache <code>avro-tools-&lt;version&gt;.jar</code> with the <code>getschema</code> parameter. The benefit of using this tool is that it returns the schema in the form you can use directly in the <code>WITH SERDEPROPERTIES</code> statement when creating Athena tables.</p>
<p>You noticed I got only one <code>.avro</code> file per table when dumping BigQuery tables. This was because of small data volume — otherwise, I would have gotten several files per table. Regardless of single or multiple files per table, it’s enough to run avro-tools against any single file per table to extract that table’s schema.</p>
<p>I downloaded the latest version of avro-tools which is <code>avro-tools-1.8.2.jar</code>. I first copied all <code>.avro</code> files from s3 to local disk:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp s3:<span class="hljs-comment">//my-bucket/bq_data/backend/ bq_data/backend/ --recursive</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro to bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro to bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro to bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>The avro-tools command should look like <code>java -jar avro-tools-1.8.2.jar getschema your_data.avro &gt; schema_file.avsc</code>. This can become tedious if you have several AVRO files (in reality, I’ve done this for a project with many more tables). Again, I used a shell script to generate the commands. I created <code>extract_schema_avro.sh</code> with the following content:</p>
<pre><code>schema_avro=(bq_data/backend/*)
for i in ${!schema_avro[*]}; do
  input_file=$(find ${schema_avro[$i]} -type f)
  output_file=$(ls -l ${schema_avro[$i]} | tail -n+2 \
    | awk -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9);
    print $9 }')
  commands=$(
    echo "java -jar avro-tools-1.8.2.jar getschema " \
      $input_file" &gt; bq_data/schemas/backend/avro/"$output_file
  )
  echo $commands
done
</code></pre><p>Running <code>extract_schema_avro.sh</code> provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh extract_schema_avro.sh
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_daily_phase2/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_phase2/sessions_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><p>Executing the above commands copies the extracted schema under <code>bq_data/schemas/backend/avro/</code> :</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/schemas/backend/avro<span class="hljs-comment">/* | awk '{print $9}'</span>
</code></pre><pre><code>bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
</code></pre><p>Let’s also check what’s inside an <code>.avsc</code> file.</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>{
  "type" : "record",
  "name" : "Root",
  "fields" : [ {
    "name" : "uid",
    "type" : [ "null", "string" ]
  }, {
    "name" : "platform",
    "type" : [ "null", "string" ]
  }, {
    "name" : "version",
    "type" : [ "null", "string" ]
  }, {
    "name" : "country",
    "type" : [ "null", "string" ]
  }, {
    "name" : "sessions",
    "type" : [ "null", "long" ]
  }, {
    "name" : "active_days",
    "type" : [ "null", "long" ]
  }, {
    "name" : "session_time_minutes",
    "type" : [ "null", "double" ]
  } ]
}
</code></pre><p>As you can see, the schema is in the form that can be directly used in Athena <code>WITH SERDEPROPERTIES</code>. But before Athena, I used the AVRO schemas to create Hive tables. If you want to avoid Hive table creation, you can read the <code>.avsc</code> files to extract field names and data types, but then you have to map the data types yourself from AVRO format to Athena table creation DDL.</p>
<p>The complexity of the mapping task depends on how complex the data types are in your tables. For simplicity (and to cover most simple to complex data types), I let Hive do the mapping for me. So I created the tables first in Hive metastore. Then I used <code>SHOW CREATE TABLE</code> to get the field names and data types part of the DDL.</p>
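<p>For reference, if you did want to skip the Hive step, the manual mapping could be sketched roughly like below. It only handles the simple <code>["null", type]</code> unions and primitive types that appear in the schemas above; the type table and file name are assumptions for illustration, and nested or logical types would need extra handling (which is exactly why I let Hive do it).</p>
<pre><code class="lang-python">import json

# Minimal AVRO-to-Athena type table (an assumption covering only the
# primitive types present in the .avsc files shown above).
AVRO_TO_ATHENA = {"string": "string", "long": "bigint", "double": "double",
                  "int": "int", "float": "float", "boolean": "boolean"}


def columns_from_avsc(path):
    with open(path) as schema_file:
        schema = json.load(schema_file)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # Fields are declared as ["null", "string"]-style unions in these schemas.
        if isinstance(avro_type, list):
            avro_type = [t for t in avro_type if t != "null"][0]
        columns.append("`{0}` {1}".format(field["name"], AVRO_TO_ATHENA[avro_type]))
    return ",\n  ".join(columns)


print(columns_from_avsc("sessions_detailed_phase2-000000000000.avsc"))
</code></pre>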
<h3 id="heading-step-4-create-hive-tables-on-top-of-avro-data-use-schema-from-step-3">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</h3>
<p>As discussed earlier, Hive allows creating tables by using <code>avro.schema.url</code>. So once you have schema (<code>.avsc</code> file) extracted from AVRO data, you can create tables as follows:</p>
<pre><code>CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://your-aws-bucket/your-target-path/avro_data'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://your-aws-bucket/your-target-path/your-avro-schema'</span>);
</code></pre><p>First, upload the extracted schemas to S3 so that <code>avro.schema.url</code> can refer to their S3 locations:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp bq_data/schemas s3:<span class="hljs-comment">//my-bucket/bq_data/schemas --recursive</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc</span>
</code></pre><p>After having both AVRO data and schema in S3, DDL for Hive table can be created using the template shown at the beginning of this section. I used another shell script <code>create_tables_hive.sh</code> (shown below) to cover any number of tables:</p>
<pre><code>schema_avro=$(ls -l bq_data/backend | tail -n+2 | awk '{print $9}')
for table_name in $schema_avro; do
  file_name=$(ls -l bq_data/backend/$table_name | tail -n+2 | awk \
    -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9); print $9 }')
  table_definition=$(
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS backend."$table_name"\\nSTORED AS AVRO""\\nLOCATION 's3://my-bucket/bq_data/backend/"$table_name"'""\\nTBLPROPERTIES('avro.schema.url'='s3://my-bucket/bq_data/\
schemas/backend/avro/"$file_name"');"
  )
  printf "\n$table_definition\n"
done
</code></pre><p>Running the script provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh create_tables_hive.sh
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
</code></pre><p>I ran the above on Hive console to actually create the Hive tables:</p>
<pre><code>[hadoop@ip-10-0-10-205 tmpAftab]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
</code></pre><pre><code>hive&gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
    &gt; STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
OK
Time taken: 4.24 seconds
</code></pre><pre><code>hive&gt;
    &gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2 STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
    &gt; TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
OK
Time taken: 0.563 seconds
</code></pre><pre><code>hive&gt;
    &gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
    &gt; STORED AS AVRO
    &gt; LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
OK
Time taken: 0.386 seconds
</code></pre><p>So I have created the Hive tables successfully. To verify that the tables work, I ran this simple query:</p>
<pre><code>hive&gt; select count(*) from backend.sessions_detailed_phase2;
Query ID = hadoop_20190214152548_2316cb5b-29f1-4416-922e-a6ff02ec1775
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1550010493995_0220)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================&gt;&gt;] 100%  ELAPSED TIME: 8.17 s
----------------------------------------------------------------------------------------------
OK
6130
</code></pre><p>So it works!</p>
<h3 id="heading-step-5-extract-hive-table-definition-from-hive-tables">Step 5. Extract Hive table definition from Hive tables</h3>
<p>As discussed earlier, Athena requires you to explicitly specify field names and their data types in the <code>CREATE</code> statement. In Step 3, I extracted the AVRO schema, which can be used in <code>WITH SERDEPROPERTIES</code> of the Athena table DDL, but I also have to specify all the field names and their (Hive) data types. Now that I have the tables in the Hive metastore, I can easily get those by running <code>SHOW CREATE TABLE</code>. First, prepare the Hive DDL queries for all tables:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/backend | tail -n+<span class="hljs-number">2</span> | awk <span class="hljs-string">'{print "hive -e '</span>\<span class="hljs-string">''</span>SHOW CREATE TABLE backend.<span class="hljs-string">"$9"</span><span class="hljs-string">'\''</span> &gt; bq_data/schemas/backend/hql/backend.<span class="hljs-string">"$9"</span>.hql;<span class="hljs-string">" }'</span>
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_daily_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_daily_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_detailed_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_phase2.hql;
</code></pre><p>Executing the above commands copies Hive table definitions under <code>bq_data/schemas/backend/hql/</code>. Let’s see what’s inside:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES (
  'avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc',
  'transient_lastDdlTime'='1550157659')
</code></pre><p>By now all the building blocks needed for creating AVRO tables in Athena are there:</p>
<ul>
<li>Field names and data types can be obtained from the Hive table DDL (to be used in columns section of <code>CREATE</code> statement)</li>
<li>AVRO schema (JSON) can be obtained from the extracted <code>.avsc</code> files (to be supplied in <code>WITH SERDEPROPERTIES</code>).</li>
</ul>
<h3 id="heading-step-6-use-the-output-of-steps-3-and-5-to-create-athena-tables">Step 6. Use the output of Steps 3 and 5 to Create Athena tables</h3>
<p>If you are still with me, you have done a great job coming this far. I am now going to perform the final step which is creating Athena tables. I used the following script to combine <code>.avsc</code> and <code>.hql</code> files to construct Athena table definitions:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat create_tables_athena.sh
</code></pre><pre><code># directory where extracted avro schemas are stored
schema_avro=(bq_data/schemas/backend/avro/*)
# directory where extracted HQL schemas are stored
schema_hive=(bq_data/schemas/backend/hql/*)
for i in ${!schema_avro[*]}; do
  schema=$(awk -F '{print $0}' '/CREATE/{flag=1}/STORED/{flag=0}\
   flag' ${schema_hive[$i]})
  location=$(awk -F '{print $0}' '/LOCATION/{flag=1; next}\
  /TBLPROPERTIES/{flag=0} flag' ${schema_hive[$i]})
  properties=$(cat ${schema_avro[$i]})
  table=$(echo $schema '\n' \
    "WITH SERDEPROPERTIES ('avro.schema.literal'='\n"$properties \
    "\n""')STORED AS AVRO \n" \
    "LOCATION" $location";\n\n")
  printf "\n$table\n"
done \
  &gt; bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql
</code></pre><p>Running the above script copies Athena table definitions to <code>bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql</code>. In my case it contains:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> all_athena_tables]$ cat all_athena_tables.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_daily_phase2`(
  `uid` string COMMENT '',
  `activity_date` string COMMENT '',
  `sessions` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "activity_date", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2';
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "platform", "type" : [ "null", "string" ] }, { "name" : "version", "type" : [ "null", "string" ] }, { "name" : "country", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] } ')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2';
</code></pre><pre><code>CREATE EXTERNAL TABLE `backend.sessions_phase2`(
  `uid` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2';
</code></pre><p>And finally, I ran the above scripts in Athena to create the tables:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9TNErBinL98R9qH9pasv6a99jQWPNANLXPKP" alt="Image" width="600" height="400" loading="lazy"></p>
<p>There you have it.</p>
<p>I feel that the process is a bit lengthy. However, this has worked well for me. The other approach would be to use AWS Glue wizard to crawl the data and infer the schema. If you have used AWS Glue wizard, please share your experience in the comment section below.</p>
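<p>For completeness, a Glue crawler over the same S3 prefix can also be created programmatically. The snippet below is a rough sketch using boto3; the crawler name, IAM role, Glue database, and S3 path are placeholders rather than values from this walkthrough, and the role must already have the necessary Glue and S3 permissions.</p>
<pre><code class="lang-python">import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholders: the role ARN and the "backend" Glue database are assumed to exist.
glue.create_crawler(
    Name="bq-backend-avro-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-service-role",
    DatabaseName="backend",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/bq_data/backend/"}]},
)

# Run the crawler; the inferred tables then appear in the Glue Data Catalog,
# which Athena queries directly.
glue.start_crawler(Name="bq-backend-avro-crawler")
</code></pre>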
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
