apache - freeCodeCamp.org

How to Launch an EC2 Instance and Set Up a Web Server Using HTTPD

Kedar Makode — Tue, 05 Nov 2024 13:32:15 +0000

Hey there! Have you ever thought about creating your own web server on the cloud? Well, you’re in for a treat because in this article, we’re going to explore how you can launch an EC2 instance and use HTTPD to host a simple web server.

Don’t worry – it’s simpler than it sounds, and I promise to walk you through it step-by-step with a bit of fun along the way.

By the end of this guide, you’ll feel like a cloud wizard, casting spells that make servers appear out of thin air (well, out of Amazon’s data centers, but you get the point).

Ready? Let’s dive in!

Table of Contents

What Is EC2?
What is HTTPD?
Step 1: How to Launch Your EC2 Instance
Step 2: How to Connect to Your EC2 Instance
Step 3: How to Install and Start HTTPD (Apache Web Server)
Step 4: How to Host Your Custom Web Page
Wrapping Up

What Is EC2?

Think of EC2 (Elastic Compute Cloud) as a hotel room in the cloud. Instead of booking a physical server to store your website, you’re renting one from Amazon’s magical cloud infrastructure. This room (or instance) comes with all the amenities you need to host a website. Today, we’ll install HTTPD (a web server software) in our “room” to make our website live. 🏨✨

What is HTTPD?

At its core, HTTPD stands for Hypertext Transfer Protocol Daemon. Let’s break that down:
Hypertext Transfer Protocol (HTTP): This is the standard protocol used on the web. When you type a URL into your browser or click a link, you’re using HTTP to tell the server, “Hey, send me this web page!”
Daemon (D): A daemon is just a fancy term for a background process that runs continuously on a server. In this case, the daemon is responsible for responding to requests from web browsers (like Chrome or Firefox) and sending back the appropriate content.
So, HTTPD is a program that listens for incoming HTTP requests (like when you visit a webpage) and serves back the data (HTML, CSS, images, and so on) needed to display that page.
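To make the "listen and serve" idea concrete, here is a minimal sketch of a tiny HTTP daemon written with Python's standard library. It is not how Apache works internally, and the names HelloHandler and fetch_once are made up for this illustration. It only shows the same request-and-response loop on a very small scale.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET request with one small HTML page."""

    def do_GET(self):
        body = b"<h1>Hello from a tiny HTTP daemon</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging to keep the demo quiet.
        pass


def fetch_once():
    """Start the server on a free local port, request one page, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        connection = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (response.status, response.read().decode("utf-8"))
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```

The background thread plays the role of the daemon: it waits for requests while the rest of the program carries on.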

HTTPD vs. Apache2: Different Names, Same Game

Depending on your Linux distribution, you may encounter different names for the same basic software:

On RPM-based distributions (like Red Hat, CentOS, or Fedora), it’s called httpd.
On Debian-based distributions (like Ubuntu or Debian itself), it’s referred to as apache2.

Let’s look at the steps you can use to launch your EC2 instance, and how to set up a web server using HTTPD.

Step 1: How to Launch Your EC2 Instance

First things first, let’s launch our EC2 instance. You’ll need an AWS account—signing up is free, and AWS offers a free tier, so this won’t cost you a dime for small-scale experiments.

Head over to the AWS Management Console and log in. From the search bar, type “EC2” and click on EC2 Dashboard.

Create a new instance by clicking on the orange Launch Instance button.

Next, choose the Amazon Machine Image (AMI) by selecting the Amazon Linux AMI, which is free-tier eligible and super reliable. Don’t forget to give your instance a unique name!

Adding a "Name" tag with a value like "MyFirstInstance" or "ProductionServer" helps you keep track of multiple instances while adding a personal touch to your cloud workspace.

Also, remember to check the default username for the AMI you select. Since you’ve chosen Amazon Linux, the default username is ec2-user. Keep this in mind for connecting to your instance later!

Select an Instance Type: The t2.micro is your best buddy here. Again, it's free-tier eligible and perfect for our needs.

Key Pair for SSH Access: This is where you need a .pem file to securely connect to your instance. This file holds the private half of your key pair and acts like the secret key to your cloud “hotel room,” allowing you to log in via SSH.

If you already have a .pem file for a previously created key pair, go ahead and choose that from the dropdown menu.

If you don’t have a .pem file, no worries! Create a new key pair by clicking Create New Key Pair, and download the .pem file to your computer. Make sure to store this file safely—you’ll need it to log in, and if you lose it, you won’t be able to access your EC2 instance!

Why is this file important? The .pem file is your private key, and AWS uses it to verify that you are the rightful owner trying to connect to the instance. You won’t get access without it, just like how you can’t get into a hotel room without the key.

Configure Security Group: AWS EC2 security groups are like virtual firewalls that control traffic in and out of your instance, ensuring only specific types of access. To allow web visitors, set up an HTTP rule on port 80, and for secure server logins, enable SSH on port 22 with restricted IPs.

You can reuse security groups across instances, making configuration easier and more consistent. Regularly review these settings to keep your instance secure and organized.
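If the firewall analogy feels abstract, the following sketch models inbound rules in plain Python. It is a simplified illustration and not the AWS API: the InboundRule class, the is_allowed helper, and the example address 203.0.113.10 are all assumptions made up for this example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    port: int
    cidr: str  # source address range allowed to reach the port


# Hypothetical rules mirroring the text: HTTP open to everyone,
# SSH restricted to a single example address.
RULES = [
    InboundRule(port=80, cidr="0.0.0.0/0"),
    InboundRule(port=22, cidr="203.0.113.10/32"),
]


def is_allowed(source_ip, port, rules=RULES):
    """Return True if any rule admits this source address on this port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule.port == port and address in ipaddress.ip_network(rule.cidr)
        for rule in rules
    )


if __name__ == "__main__":
    print(is_allowed("198.51.100.7", 80))   # any visitor on HTTP
    print(is_allowed("198.51.100.7", 22))   # a stranger trying SSH
    print(is_allowed("203.0.113.10", 22))   # the one trusted address on SSH
```

Anything that does not match a rule is rejected, which is why a port you never opened stays unreachable.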

Launch the instance: Boom! You’ve just launched your very own server in the cloud.

Wait a minute or two for your instance to come online. Now that we have our EC2 instance running, let’s move to the next step of setting up our web server.

Step 2: How to Connect to Your EC2 Instance

To connect, we’ll use the .pem file (key pair) we created earlier. If you’re on a Mac or Linux machine, this is super simple with SSH. For Windows folks, I recommend using MobaXterm—it’s a user-friendly terminal with SSH built-in.

If you’re new to connecting EC2 instances using MobaXterm, I’ve written a detailed guide in my previous blog post. You can check it out here, where I show how to set up and connect to an EC2 instance using MobaXterm.

For now, here’s a quick overview of the connection process using SSH:

ssh -i "your-key.pem" ec2-user@your-ec2-public-ip

Replace "your-key.pem" with the name of your key pair and "your-ec2-public-ip" with the public IP of your instance (you can find this in the EC2 dashboard).

If you’ve connected successfully, congratulations! 🎉 You’re inside your cloud server.

Step 3: How to Install and Start HTTPD (Apache Web Server)

Alright, time to install our web server software (HTTPD)! We’ll be using Apache, one of the most popular web servers around. Don’t worry, you don’t need a degree in IT to get this working.

After you successfully connect to your EC2 instance from MobaXterm, you should be all set to start the installation. You’re just a few commands away from having your web server up and running!

It’s always good practice to make sure your server is up to date. To update your server, run:

sudo dnf update -y

Next, we’ll install HTTPD (Apache):

sudo dnf install httpd -y

Then start the HTTPD service. Run this command to get the server running.

sudo systemctl start httpd

Next, enable it to start on boot so that every time your EC2 instance reboots, your web server comes back to life automatically.

sudo systemctl enable httpd

Time to test it out! Open your browser and type in your instance’s public IP. If you see the Apache test page, give yourself a high-five. 🖐️ You’ve just launched a web server!
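If you would rather check from a script than from a browser, here is a small sketch using only Python's standard library. The function name check_site is made up for this example. It treats any HTTP status coming back as proof that the server is reachable, and returns None when nothing answers.

```python
import http.client


def check_site(host, port=80, path="/", timeout=5):
    """Return the HTTP status code from host, or None if it cannot be reached."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", path)
        return connection.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures, and bad replies.
        return None
    finally:
        connection.close()
```

You could call it as check_site("your-ec2-public-ip"). If it returns None, the security group rule for port 80 is a good first thing to check.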

Step 4: How to Host Your Custom Web Page

Now, let’s get creative! Instead of the default web server message, let’s host your very own custom web page in just one step. This will allow you to display a unique message on your site in no time.

Run the following command in your EC2 instance to create and display a simple, personalized web page:

echo "Welcome to the Cloud! You’re now hosting your own custom web server using AWS EC2 and Apache!" | sudo tee /var/www/html/index.html

What does this command do?

The echo command outputs the text: "Welcome to the Cloud! You’re now hosting your own custom web server using AWS EC2 and Apache!".
The | symbol pipes this output to the next command.
sudo tee writes that text to a file with root privileges, which is needed because ec2-user cannot write to /var/www/html directly.
/var/www/html/index.html is the path to the file where the message is saved. This file is the homepage of your web server.

By running this command, you're replacing the default Apache test page with your custom message.

Now, select your EC2 instance, and you’ll find its public IP address. Open your browser, enter that IP, refresh the page, and boom! Your custom message is live on the site. 🎉

Feel free to modify the text to make it uniquely yours!

Wrapping Up

And there you have it – you’ve just launched an EC2 instance and set up a simple web server using HTTPD! With these steps, you’ve not only spun up a server in the cloud but also configured it to be accessible to the world. By following along, you’ve learned the essentials of creating instances, setting up security groups, connecting via SSH, and installing Apache to serve up web content.

Keep exploring EC2’s features, and don’t hesitate to test new configurations and ideas. Each step adds to your cloud skills, bringing you one step closer to mastering AWS. So keep building, experimenting, and, most importantly, enjoying the journey. Happy cloud computing!


How to Orchestrate an ETL Data Pipeline with Apache Airflow

freeCodeCamp — Wed, 01 Mar 2023 22:42:42 +0000

By Aviator Ifeanyichukwu

Data Orchestration involves using different tools and technologies together to extract, transform, and load (ETL) data from multiple sources into a central repository.

Data orchestration typically involves a combination of technologies such as data integration tools and data warehouses.

Apache Airflow is a tool for data orchestration.

With Airflow, data teams can schedule, monitor, and manage the entire data workflow. Airflow makes it easier for organizations to manage their data, automate their workflows, and gain valuable insights from their data.

In this guide, you will be writing an ETL data pipeline. It will download data from Twitter, transform the data into a CSV file, and load the data into a Postgres database, which will serve as a data warehouse.

External users or applications will be able to connect to the database to build visualizations and make policy decisions.

What you will learn

How to extract data from Twitter
How to write a DAG script
How to load data into a database
How to use Airflow Operators

What you need

To follow along with this tutorial, you'll need the following:

Apache Airflow installed on your machine
Airflow development environment up and running
An understanding of the building blocks of Apache Airflow (Tasks, Operators, etc)
An IDE of your choice. Mine is VS Code.

Sounds interesting, yeah? Let’s begin.

How to Get the Data from Twitter

Twitter is a social media platform where users gather to share information and discuss trending world events/topics. Tons of data is generated daily through this platform. This will be your data source.

To get data from Twitter, you need a way to query it. Numerous libraries make it easy to connect to the Twitter API. For this guide, we'll use snscrape, which scrapes public tweets without needing API credentials. You will also need Pandas, a Python library for data exploration and transformation.

Installation

Make sure your Airflow virtual environment is currently active.

pip install snscrape pandas

Inside the Airflow dags folder, create two files: extract.py and transform.py.

extract.py:

import snscrape.modules.twitter as sntwitter
import pandas as pd
from transform import transform_data


def extract_data():
  # List to collect tweet data in
  tweets_list = []

  # Scrape tweets and append them to the list
  for i, tweet in enumerate(sntwitter.TwitterSearchScraper('Chatham House since:2023-01-14').get_items()):
    if i > 1000:
      break
    tweets_list.append([tweet.date, tweet.user.username, tweet.rawContent,
                        tweet.sourceLabel, tweet.user.location])

  # Convert the tweets into a dataframe
  tweets_df = pd.DataFrame(tweets_list, columns=['datetime', 'username', 'text', 'source', 'location'])

  # Hand the dataframe to the transform step, which also loads it into the database
  transform_data(tweets_df)

transform.py:

import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Load clean data into the Postgres database
def task_data_upload(data):
  print(data.head())

  # Convert the dataframe into a list of row tuples
  rows = list(data.itertuples(index=False, name=None))

  postgres_sql_upload = PostgresHook(postgres_conn_id="postgres_connection")
  # Only the listed columns are filled, so Postgres generates the SERIAL id itself
  postgres_sql_upload.insert_rows(
    'twitter_etl_table',
    rows,
    target_fields=['datetime', 'username', 'text', 'source', 'location']
  )

  return True

# Perform data cleaning and transformation
def transform_data(tweets_df):
  print(tweets_df.info())
  # Transformation happens here

  # Load the transformed data into the database
  task_data_upload(tweets_df)

The Database

Airflow comes with a SQLite3 database. To store your data, you'll use PostgreSQL as a database.

You should have PostgreSQL installed and running on your machine.

Install the libraries

pip install psycopg2

If this fails, try installing the binary version like this:

pip install psycopg2-binary

Install the provider package for the Postgres database like this:

pip install apache-airflow-providers-postgres

How to Set Up the DAG Script

Create a file named etl_pipeline.py inside the dags folder.

Start by importing the different airflow operators like this:

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime, timedelta

with DAG(
  'etl_twitter_pipeline',
  description="A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow",
  start_date=datetime(year=2023, month=2, day=5),
  schedule_interval=timedelta(minutes=2)
) as dag:

  start_pipeline = EmptyOperator(
    task_id='start_pipeline',
  )

start_pipeline

With a dag_id named 'etl_twitter_pipeline', this dag is scheduled to run every two minutes, as defined by the schedule interval.
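To see what a two-minute interval means in practice, here is a small sketch in plain Python (no Airflow required). It only divides time from the start date into fixed windows. The helper name first_run_windows is made up for this illustration, and it deliberately ignores scheduler details such as catchup.

```python
from datetime import datetime, timedelta


def first_run_windows(start_date, interval, count):
    """List the first `count` (window_start, window_end) pairs for a fixed interval."""
    windows = []
    for i in range(count):
        window_start = start_date + i * interval
        windows.append((window_start, window_start + interval))
    return windows


if __name__ == "__main__":
    for start, end in first_run_windows(datetime(2023, 2, 5), timedelta(minutes=2), 3):
        print(start.isoformat(), "->", end.isoformat())
```

Each pair is one window of time the DAG is responsible for, starting from the start_date used in the script above.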

How to View the Web UI

Start the scheduler with this command:

airflow scheduler

Then start the web server with this command:

airflow webserver

Open the browser on localhost:8080 to view the UI.

Search for a dag named ‘etl_twitter_pipeline’, and click on the toggle icon on the left to start the dag.

Airflow UI showing created dags

How to Set Up a Postgres Database Connection

You should already have apache-airflow-providers-postgres and psycopg2 or psycopg2-binary installed in your virtual environment.

From the UI, navigate to Admin -> Connections. Click on the plus sign at the top left corner of your screen to add a new connection and specify the connection parameters. Click on test to verify the connection to the database server. Once completed, scroll to the bottom of the screen and click on Save.

PostgreSQL database connection

Inside the Airflow directory created in the virtual environment, open the airflow.cfg file in your text editor, locate the variable named sql_alchemy_conn, and set the PostgreSQL connection string:

sql_alchemy_conn = postgresql+psycopg2://postgres:1234@localhost:5432/test
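If the connection string looks cryptic, this short sketch splits it into its parts with Python's standard library. The describe_connection helper is made up for this example and is not part of Airflow or SQLAlchemy. The username, password, port, and database name are simply the values from the line above, so swap in your own.

```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Break a SQLAlchemy-style database URL into named parts."""
    parts = urlsplit(url)
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_connection("postgresql+psycopg2://postgres:1234@localhost:5432/test"))
```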

The Airflow executor is currently set to SequentialExecutor. Change this to LocalExecutor:

executor = LocalExecutor

Airflow DAG Executor

The Airflow UI is currently cluttered with samples of example dags. In the airflow.cfg config file, find the load_examples variable, and set it to False.

load_examples = False

Disable example dags

Restart the webserver, reload the web UI, and you should now have a clean UI:

Airflow UI

How to Use the Postgres Operator

Start by importing the different Airflow operators. You'll also need to import the extract and transform Python files.

etl_pipeline.py

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

from datetime import datetime, timedelta

from extract import extract_data



with DAG(
  'etl_twitter_pipeline',
  description="A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow",
  start_date=datetime(year=2023, month=2, day=5),
  schedule_interval=timedelta(minutes=5)
) as dag:

  start_pipeline = EmptyOperator(
        task_id='start_pipeline',
    )

  create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='postgres_connection',
    sql='sql/create_table.sql'
  )


  etl = PythonOperator(
    task_id = 'extract_data',
    python_callable = extract_data
  )


  clean_table = PostgresOperator(
      task_id='clean_sql_table',
      postgres_conn_id='postgres_connection',
      sql=["""TRUNCATE TABLE twitter_etl_table"""]
  )

  end_pipeline = EmptyOperator(
      task_id='end_pipeline',
  )

sql/create_table.sql

CREATE TABLE IF NOT EXISTS twitter_etl_table(
  id SERIAL PRIMARY KEY,
  datetime DATE NOT NULL,
  username VARCHAR(200) NOT NULL,
  text TEXT,
  source VARCHAR(200),
  location VARCHAR(200)
);

The create_table task makes a connection to Postgres to create a table.

The etl task calls the extract_data() function, which is where our ETL data processing takes place.

The clean_table task invokes the PostgresOperator, which truncates the table's previous contents before new contents are inserted into the Postgres table.

The end_pipeline task marks the end of the pipeline.

How to Create Dependencies Between Tasks

The last step is to create dependencies between tasks so that Airflow knows the order in which to schedule them.

start_pipeline >> create_table >> clean_table >> etl >> end_pipeline
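The >> chain above describes a small dependency graph. As a rough illustration of how an order can be derived from such a graph, here is a sketch that uses Python's built-in graphlib module (Python 3.9 or newer). It is not Airflow's scheduler. It only shows that the chain leaves exactly one valid order, using the variable names from the DAG script.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts,
# mirroring: start_pipeline >> create_table >> clean_table >> etl >> end_pipeline
upstream = {
    "create_table": {"start_pipeline"},
    "clean_table": {"create_table"},
    "etl": {"clean_table"},
    "end_pipeline": {"etl"},
}

run_order = list(TopologicalSorter(upstream).static_order())
print(run_order)
# ['start_pipeline', 'create_table', 'clean_table', 'etl', 'end_pipeline']
```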

How to Test the Workflow

To start, click on the 'etl_twitter_pipeline' dag. Click on the graph view option, and you can now see the flow of your ETL pipeline and the dependencies between tasks.

Airflow running data pipeline

And there you have it – your ETL data pipeline in Airflow. I hope you found it useful and yours is working properly.

Conclusion

Apache Airflow is an easy-to-use orchestration tool for scheduling and monitoring data pipelines. With your knowledge of Python, you can write DAG scripts to schedule and monitor your data pipeline.

In this guide, you learned how to set up an ETL pipeline using Airflow and also how to schedule and monitor the pipeline.

You have also seen how to use some Airflow operators, such as PythonOperator, PostgresOperator, and EmptyOperator.

I hope you learned something from this guide.

How to Configure a Laravel Project with a Custom Domain Name on Windows with XAMPP

freeCodeCamp — Tue, 14 Feb 2023 02:58:05 +0000

By Abdulwahab Ashimi

Laravel's simplicity and MVC architecture make it an ideal PHP framework for building web applications.

In this article, I will show you how to set up Laravel on your Windows machine and configure it to run on a custom domain name.

This guide is best suited for a beginner trying to get Laravel up and running quickly and easily. But even as an advanced programmer, you'll likely find fresh insights into how you can simplify the process of configuring a Laravel project. So let's dive in!

How to Install and Start Xampp

Xampp is an open-source tool that allows you to run an Apache server, MySQL database, and other tools from a single interface for development.

You can download and install Xampp from here: https://www.apachefriends.org/download.html.

First, launch your Xampp Interface and start your Apache and MySQL Server.

The Xampp Interface

Next, click on Explorer to launch your Xampp htdocs folder. Delete the files and folders inside the folder. Now you can set up your Laravel application.

How to Set Up Laravel

Inside the htdocs folder, you can clone your existing Laravel application or set up a fresh installation using composer create-project laravel/laravel example-app. In this case, "example-app" is your project name but you can replace it with your preferred name for the project.

The Laravel Directory Structure on htdocs

Open the htdocs folder in your preferred code editor. I will be using VS Code for my example.

Replace the APP_URL value in the .env file of your Laravel project with the custom domain name:

APP_URL=http://project.test

You can replace "project.test" with your preferred test domain name.

How to Configure Your Hosts File

In your Windows file explorer, navigate to the "hosts" file located at C:\Windows\System32\drivers\etc\hosts and open it with VS Code (or whatever editor you're using). I'd advise that you use VS Code with admin privileges.

etc directory containing the hosts file and other files

Add the following line to the file:

127.0.0.1 project.test

This will map the hostname "project.test" to the local IP address "127.0.0.1".
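To illustrate what that mapping amounts to, here is a small sketch that reads hosts-file text and builds a hostname-to-IP lookup. The parse_hosts function is made up for this example and is not how Windows resolves names internally. Keeping the first entry for a repeated hostname is an assumption of this sketch.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue  # skip blank and comment-only lines
        ip_address, *hostnames = line.split()
        for hostname in hostnames:
            mapping.setdefault(hostname, ip_address)  # keep the first entry seen
    return mapping


if __name__ == "__main__":
    sample = "# local development\n127.0.0.1 localhost\n127.0.0.1 project.test\n"
    print(parse_hosts(sample)["project.test"])
```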

Now, if you launch your Apache server and visit project.test in your browser, it loads an "Index of" directory listing for the project.

'Index of' the Laravel directory in the browser

This is because for your Laravel application to work, it needs to load the public folder. If you load project.test/public in your browser, you will be taken to the Laravel project. To fix that, you can configure the Apache root directory.

How to Configure Your Apache Root Directory

In your Windows file explorer, navigate to and open the "httpd.conf" file which contains the Apache configuration. It's located at C:\xampp\apache\conf\httpd.conf. You should also use VS Code with admin privileges in this case.

Right below # Virtual hosts, add the following:


<VirtualHost *:80>
    ServerName project.test
    DocumentRoot "C:/xampp/htdocs/project/public"
    <Directory "C:/xampp/htdocs/project/public">
        Options Indexes FollowSymLinks Includes ExecCGI
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>

Note: Replace project.test with your custom domain name and C:/xampp/htdocs/project/public with the path to your public folder.

Stop and restart the Apache server from your Xampp interface and try visiting "http://project.test" on your browser to see the Laravel project's homepage.

Conclusion

You can have multiple projects with their own custom domains by setting them up in different directories inside the htdocs directory and specifying their individual Apache configurations.

If this article was helpful to you, share it with friends or drop me a shout-out on Twitter @adebowale1st.

How to Install Apache Airflow on Windows without Docker

freeCodeCamp — Thu, 02 Feb 2023 00:18:32 +0000

By Aviator Ifeanyichukwu

Apache Airflow is a tool that helps you manage and schedule data pipelines. According to the documentation, it lets you "programmatically author, schedule, and monitor workflows."

Airflow is a crucial tool for data engineers and scientists. In this article, I'll show you how to install it on Windows without Docker.

Although it's recommended to run Airflow with Docker, this method works for low-memory machines that are unable to run Docker.

Prerequisites:

This article assumes that you're familiar with using the command line and can set up your development environment as directed.

Requirements:

You need Python 3.8 or higher, Windows 10 or higher, and the Windows Subsystem for Linux (WSL2) to follow this tutorial.

What is Windows Subsystem for Linux (WSL2)?

WSL2 allows you to run Linux commands and programs on a Windows operating system.

It provides a Linux-compatible environment that runs natively on Windows, enabling users to use Linux command-line tools and utilities on a Windows machine.

You can read more here to install WSL2 on your machine.

With Python and WSL2 installed and activated on your machine, launch the terminal by searching for Ubuntu from the start menu.

Step 1: Set Up the Virtual Environment

To work with Airflow on Windows, you need to set up a virtual environment. To do this, you'll need to install the virtualenv package.

Note: Make sure you are at the root of the terminal by typing:

cd ~

pip install virtualenv

Create the virtual environment like this:

virtualenv airflow_env

And then activate the environment:

 source airflow_env/bin/activate

Step 2: Set Up the Airflow Directory

Create a folder named airflow. Mine will be located at c/Users/[Username]. You can put yours wherever you prefer.

If you do not know how to navigate the terminal, you can follow the steps in the image below:

Create an Airflow directory from the terminal

Now that you have created this folder, you have to set it as an environment variable. Open a .bashrc script from the terminal with the command:

nano ~/.bashrc

Then write the following:

export AIRFLOW_HOME=/c/Users/[YourUsername]/airflow

Setup Airflow directory path as an environment variable

Press Ctrl+S to save and Ctrl+X to exit the nano editor.

The path to the Airflow directory is now saved permanently as an environment variable. Any time you open a new terminal, you can use it to get back to the directory by typing:

cd $AIRFLOW_HOME

Navigate to Airflow directory using the environment variable

Step 3: Install Apache Airflow

With the virtual environment still active and the current directory pointing to the created Airflow folder, install Apache Airflow:

 pip install apache-airflow

Initialize the database:

airflow db init

Create a folder named dags inside the airflow folder. This will be used to store all Airflow scripts.

View files and folders generated by Airflow db init

Step 4: Create an Airflow User

When Airflow is newly installed, you'll need to create a user. This user will be used to log in to the Airflow UI and perform some admin functions.

airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email youremail@email.com

Check the created user:

airflow users list

Create an Airflow user and list the created user

Step 5: Run the Webserver

Run the scheduler with this command:

airflow scheduler

Launch another terminal, activate the airflow virtual environment, cd to $AIRFLOW_HOME, and run the webserver:

airflow webserver

If the default port 8080 is in use, change the port by typing:

airflow webserver --port <port>

In the UI, you can view pre-created DAGs that come with Airflow by default.

How to Create the first DAG

A DAG is a Python script for organizing and managing tasks in a workflow.

To create a DAG, navigate into the dags folder created inside the $AIRFLOW_HOME directory. Create a file named "hello_world_dag.py". Use VS Code if it's available.

Enter the code from the image below, and save it:

Example DAG script in VS Code editor

Go to the Airflow UI and search for hello_world_dag. If it does not show up, try refreshing your browser.

That's it. This completes the installation of Apache Airflow on Windows.

Wrapping Up

This guide covered how to install Apache Airflow on a Windows machine without Docker and how to write a DAG script.

I do hope the steps outlined above helped you install airflow on your Windows machine without Docker.

In subsequent articles, you will learn about Apache Airflow concepts and components.

Follow me on Twitter or LinkedIn for more Analytics Engineering content.

How to Create Better Policy with Open Policy Agent and the Apache APISIX OPA Plugin

freeCodeCamp — Tue, 24 Jan 2023 19:28:56 +0000

By Njoku Samson Ebere

One common thing in every organisation is policy. Policies define how an organisation operates.

They are essential to the long-term success of an organisation. They preserve significant knowledge about how to comply with matters such as legal requirements, work within technical constraints, and avoid repeating mistakes.

Software follows the same pattern by adhering to rules that govern its behavior. These rules (or policies) may specify the application's environments, permitted network routes, allowed dependency versions, and when micro-services receive API requests. Usually, developers create them manually using documents like spreadsheets.

The issue with this method is that it gradually becomes bulky. If each part of an application has its policy, things like authorization will be hard to manage across the whole application. There might also be the unnecessary repetition of policies across different parts of the application.

Aside from that, updating any policy will require the redeployment of the whole application. Fortunately, Open Policy Agent (OPA) offers a way to fix these issues.

This article will explain what OPA is, how it works, what the OPA plugin entails, and how to use it.

Let’s get started!

What is OPA?

OPA is an open-source general-purpose policy engine. It can replace built-in policy function modules in software and help users decouple services from the policy engine.

OPA provides a way to build applications separate from their policies and for them to be reusable in many applications.

The OPA policy handling method reduces complexity and gives more control to the application owner. OPA can also be integrated with other services, such as program libraries and HTTP APIs.

How OPA Works

OPA mediates between applications and policies to decide the rule to apply in handling a request. The image below describes its operation:

Here is a breakdown of the image above:

A service (let’s say it is an authentication micro-service) receives a request (like a login request). For the service to decide how to handle the request, it needs to get the policy guiding authentication. That takes us to the next step.
The service sends a query (this can be in any JSON format) to OPA, requesting the policy to follow in handling the request it received.
OPA now compares the data and policies it has access to and makes the right decision.
Finally, OPA returns the policy decision (this can be in any JSON format) reached to the service.

That is a summary of how OPA works. You can imagine many services attached to OPA and OPA helping them decide how to handle requests or events instead of each service managing its policies. It provides a more robust system that is easy to maintain.

Apache APISIX decided to integrate with OPA by providing the OPA plugin. That's what we'll discuss now.

Apache APISIX OPA Plugin

The plugin allows Apache APISIX users to conveniently introduce the policy capabilities provided by OPA when using Apache APISIX. It enables flexible authentication and access control features.

How It Works

Apache APISIX OPA Plugin follows two main steps to carry out its task:

First, APISIX re-constructs any request data it receives into acceptable JSON data and makes a policy query to OPA with it. The query is usually referred to as an APISIX to OPA service request. See the following example:


{
    "type": "http",
    "request": {
        "scheme": "http",
        "path": "\/get",
        "headers": {
            "user-agent": "curl\/7.68.0",
            "accept": "*\/*",
            "host": "127.0.0.1:9080"
        },
        "query": {},
        "port": 9080,
        "method": "GET",
        "host": "127.0.0.1"
    },
    "var": {
        "timestamp": 1701234567,
        "server_addr": "127.0.0.1",
        "server_port": "9080",
        "remote_port": "port",
        "remote_addr": "ip address"
    },
    "route": {},
    "service": {},
    "consumer": {}
}

The JSON data above tells OPA that a user has made an HTTP request using the GET method via 127.0.0.1:9080/get at timestamp 1701234567 (Wednesday, 29 November 2023 05:09:27 UTC).
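The timestamp in the request is a Unix timestamp, meaning seconds counted from 1 January 1970 in UTC. If you want to check the conversion yourself, this short sketch does it with Python's standard library:

```python
from datetime import datetime, timezone

# Interpret the value from the "var" section as seconds since the Unix epoch.
moment = datetime.fromtimestamp(1701234567, tz=timezone.utc)
print(moment.strftime("%A, %d %B %Y %H:%M:%S %Z"))
# Wednesday, 29 November 2023 05:09:27 UTC
```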

OPA now has to help Apache APISIX decide how to handle the request.

Next, OPA checks the policies and data available, compares them, and reaches the decision in JSON format below:

{
    "result": {
        "allow": true,
        "reason": "test",
        "headers": {
            "an": "header"
        },
        "status_code": 401
    }
}

The policy decision above is an OPA service to APISIX response. It tells APISIX to accept the request due to the reason (test) given. When allow is false, Apache APISIX rejects it.

The following is an explanation of some of the keys in the request and response above:

type indicates the request type (HTTP or stream).
request is used when the type is HTTP and contains the basic request information like URL and headers.
var holds the basic information about the requested connection (IP, port, server details, and request timestamp).
route, service, and consumer contain the same data stored in APISIX. They require configuration for a user to see them after a transaction.
allow is required and indicates whether the request is authorised to pass through APISIX.
reason, headers, and status_code are optional and are returned when you configure a custom response.
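
To make the role of these keys concrete, here is a rough Python sketch of how a gateway could interpret such a decision. The function name interpret_decision is hypothetical and this is not APISIX's actual implementation. The 403 fallback is an assumption based on the rejected test responses later in this article:

```python
def interpret_decision(decision):
    """Turn an OPA decision into (allowed, status, headers, body)."""
    result = decision.get("result", {})
    if result.get("allow") is True:
        # Allowed: the request is forwarded, so the custom fields are not needed
        return True, None, {}, None
    # Rejected: assume 403 when no custom status code is configured
    status = result.get("status_code", 403)
    headers = result.get("headers", {})
    body = result.get("reason")
    return False, status, headers, body


rejected = {
    "result": {
        "allow": False,
        "reason": "test",
        "headers": {"an": "header"},
        "status_code": 401,
    }
}
print(interpret_decision(rejected))
```

Only allow is read unconditionally. The other three keys are looked up with defaults because they are optional.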

How to Use the Plugin

This section will introduce you to some of the features of the plugin. You will see how to use Docker to build OPA services, create policy, create users’ data, create a custom route, test requests, and enable and disable the plugin.

How to use Docker to build OPA services

Use the command below to launch the OPA environment on port 8181:

docker run -d --name opa -p 8181:8181 openpolicyagent/opa:0.35.0 run -s

We will be using curl for the rest of this article. If you are new to it or you are coming from other programming languages, you can paste the request or response code into a curl converter to translate it to your preferred language.

For the configuration requests, we will stick to the -H and -d flags instead of --header and --data-raw, respectively.

How to create a policy

Creating a policy follows the format below:

curl -X PUT '127.0.0.1:8181/v1/policies/example1' \
    -H 'Content-Type: text/plain' \
    -d 'package example

import input.request

default allow = false

allow {
    # HTTP method must be GET
    request.method == "GET"
}'

The code above came about through the following steps:

State the route: 127.0.0.1:8181/v1/policies/example1.
Import Request: import input.request.
State that no request is allowed: default allow = false.
Specify what is permissible:


allow {
    # HTTP method must be GET
    request.method == "GET"
}

The code above states that the only acceptable HTTP method is GET. Every line in the body of the allow rule is enforced as part of the policy, aside from lines that begin with #, which are comments.
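
If you would rather work in Python than curl, the same policy upload can be sketched with the standard library's urllib. This is only an illustration of how the -X, -H, and -d flags map onto a request object. The actual send is left commented out because it needs a running OPA instance:

```python
import urllib.request

# Same Rego policy as in the curl command above
policy = """package example

import input.request

default allow = false

allow {
    # HTTP method must be GET
    request.method == "GET"
}
"""

req = urllib.request.Request(
    "http://127.0.0.1:8181/v1/policies/example1",
    data=policy.encode("utf-8"),             # curl -d
    headers={"Content-Type": "text/plain"},  # curl -H
    method="PUT",                            # curl -X PUT
)

# Uncomment to send the request to a running OPA instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```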

You can add as many conditions as you want based on the policies you have in mind. For example, the allow rule below contains five conditions, all of which must hold for a request to be allowed:

# Create policy
curl -X PUT '127.0.0.1:8181/v1/policies/example1' \
    -H 'Content-Type: text/plain' \
    -d 'package example

import input.request
import data.users

default allow = false

allow {
    # The request has a test-header header with the value only-for-test
    request.headers["test-header"] == "only-for-test"

    # The request method is GET
    request.method == "GET"

    # The request path starts with /get
    startswith(request.path, "/get")

    # GET parameter test exists and is not equal to abcd
    request.query["test"] != "abcd"

    # GET parameter user exists
    request.query["user"]
}'
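
The conditions inside the allow rule are combined with a logical AND. As a rough mental model only (this is not how OPA evaluates Rego internally), the same checks could be written in Python like this, where the request dictionary has the shape of the APISIX-to-OPA request shown earlier:

```python
def allow(request):
    """Python mirror of the Rego allow rule: every condition must hold."""
    headers = request.get("headers", {})
    query = request.get("query", {})
    return (
        headers.get("test-header") == "only-for-test"
        and request.get("method") == "GET"
        and request.get("path", "").startswith("/get")
        and "test" in query and query["test"] != "abcd"
        and "user" in query
    )


good = {
    "method": "GET",
    "path": "/get",
    "headers": {"test-header": "only-for-test"},
    "query": {"test": "none", "user": "dylon"},
}
print(allow(good))
```

If any single condition fails, the whole rule fails and the default allow = false applies.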

With the configuration we have made so far, everything will work fine. But what happens when our users get something wrong and an error they don’t understand is returned to them? They will become frustrated and left with a bad user experience. We can avoid that by adding a custom response.

A custom response provides extra details (body, header, and status code) about the result of a transaction. Our request now becomes:


# Create policy
curl -X PUT '127.0.0.1:8181/v1/policies/example1' \
    -H 'Content-Type: text/plain' \
    -d 'package example

import input.request
import data.users

default allow = false

allow {
    # The request has a test-header header with the value only-for-test
    request.headers["test-header"] == "only-for-test"
    # The request method is GET
    request.method == "GET"
    # The request path starts with /get
    startswith(request.path, "/get")
    # GET parameter test exists and is not equal to abcd
    request.query["test"] != "abcd"
    # GET parameter user exists
    request.query["user"]
}

# custom response body (Accepts a string or an object, the object will respond as JSON format)
reason = users[request.query["user"]].reason {
    not allow
    request.query["user"]
}

# custom response header (The data of the object can be written in this way)
headers = users[request.query["user"]].headers {
    not allow
    request.query["user"]
}

# custom response status code
status_code = users[request.query["user"]].status_code {
    not allow
    request.query["user"]
}'

When a user gets an error, it becomes easier to debug because the error comes with a reason, header details, and a status code.

How to create users’ data

The users' data is an object of objects. Each user's entry is an object of custom details (body, headers, and status code) that are returned when that user's request is rejected.

The code below is an example of users' data containing four users with different details:

# Create test user data
curl -X PUT '127.0.0.1:8181/v1/data/users' \
    -H 'Content-Type: application/json' \
    -d '{

    "alice": {
        "headers": {
            "Location": "http://example.com/auth"
        },
        "status_code": 302
    },

    "bob": {
        "headers": {
            "test": "abcd",
            "abce": "test"
        }
    },

    "carla": {
        "reason": "Give you a string reason"
    },

    "dylon": {
        "headers": {
            "Content-Type": "application/json"
        },
        "reason": {
            "code": 40001,
            "desc": "Give you a object reason"
        }
    }
}'

Notice that each user’s custom details are optional and may differ for every user.
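
Because every field is optional, whatever consumes this data has to cope with missing keys. Here is a small Python sketch of that lookup. The helper custom_response is hypothetical and only mirrors the idea behind the reason, headers, and status_code rules in the policy:

```python
users = {
    "alice": {"headers": {"Location": "http://example.com/auth"}, "status_code": 302},
    "bob": {"headers": {"test": "abcd", "abce": "test"}},
    "carla": {"reason": "Give you a string reason"},
    "dylon": {
        "headers": {"Content-Type": "application/json"},
        "reason": {"code": 40001, "desc": "Give you a object reason"},
    },
}


def custom_response(user):
    """Return only the custom fields that are defined for this user."""
    details = users.get(user, {})
    return {k: details[k] for k in ("reason", "headers", "status_code") if k in details}


print(custom_response("alice"))
print(custom_response("carla"))
```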

How to create a custom route and enable the plugin

The APISIX OPA plugin's flexibility makes it possible for users to customize their route like in the code below:

curl -X PUT 'http://127.0.0.1:9080/apisix/admin/routes/r1' \
    -H 'X-API-KEY: ' \
    -H 'Content-Type: application/json' \
    -d '{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {},
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'

For this to work, the plugin has to be enabled. Enter the needed configuration into the plugins object to turn it on. So we have:


curl -X PUT 'http://127.0.0.1:9080/apisix/admin/routes/r1' \
    -H 'X-API-KEY: ' \
    -H 'Content-Type: application/json' \
    -d '{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {
        "opa": {
            "host": "http://127.0.0.1:8181",
            "policy": "example1"
        }
    },
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'

Now that the plugin is enabled, you can use your route as you see fit.

How to test the requests

We have been able to create policies, users’ data, and custom routes and enabled the Apache APISIX OPA plugin so far. Let’s now test these setups and see the response we get for different scenarios:

Here's a test for when a request is allowed:

Request:


curl -XGET '127.0.0.1:9080/get?test=abcd1&user=dylon' \
    --header 'test-header: only-for-test'

Response:

{
    "args": {
        "test": "abcd1",
        "user": "dylon"
    },
    "headers": {
        "Test-Header": "only-for-test",
        "with": "more"
    },
    "origin": "127.0.0.1",
    "url": "http://127.0.0.1/get?test=abcd1&user=dylon"
}

Here's a test for when a request is rejected and the status code and response headers are rewritten:

Request:


curl -XGET '127.0.0.1:9080/get?test=abcd&user=alice' \
    --header 'test-header: only-for-test'

Response:


HTTP/1.1 302 Moved Temporarily
Date: Mon, 20 Dec 2021 09:37:35 GMT
Content-Type: text/html
Content-Length: 142
Connection: keep-alive
Location: http://example.com/auth
Server: APISIX/2.11.0

Here's a test for when a request is rejected and a custom response header is returned:

Request:


curl -XGET '127.0.0.1:9080/get?test=abcd&user=bob' \
    --header 'test-header: only-for-test'

Response:


HTTP/1.1 403 Forbidden
Date: Mon, 20 Dec 2021 09:38:27 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 150
Connection: keep-alive
abce: test
test: abcd
Server: APISIX/2.11.0

Here's a test for when a request is rejected and a custom response (string) is returned:

Request:


curl -XGET '127.0.0.1:9080/get?test=abcd&user=carla' \
    --header 'test-header: only-for-test'

Response:


HTTP/1.1 403 Forbidden
Date: Mon, 20 Dec 2021 09:38:58 GMT
Content-Type: text/plain; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Server: APISIX/2.11.0

Give you a string reason

And here's a test for when a request is rejected and a custom response (JSON) is returned:

Request:


curl -XGET '127.0.0.1:9080/get?test=abcd&user=dylon' \
    --header 'test-header: only-for-test'

Response:


HTTP/1.1 403 Forbidden
Date: Mon, 20 Dec 2021 09:42:12 GMT
Content-Type: application/json
Transfer-Encoding: chunked
Connection: keep-alive
Server: APISIX/2.11.0

{"code":40001,"desc":"Give you a object reason"}

How to disable the plugin

To disable the APISIX OPA plugin, remove all the configurations we added when we set up a custom route and enabled the plugin. We now have:


curl -X PUT 'http://127.0.0.1:9080/apisix/admin/routes/r1' \
    -H 'X-API-KEY: ' \
    -H 'Content-Type: application/json' \
    -d '{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {},
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'

With an empty plugins object, the OPA plugin is no longer enabled on the route. It is that easy because of Apache APISIX’s dynamic nature: the change takes effect without restarting the service.

Conclusion

This article aimed to introduce you to the Apache APISIX OPA plugin and walk you through some of its features.

We began by looking at what OPA is and why APISIX adopted it by employing a plugin. Then we discussed how the plugin works and how we can use it.

Apache APISIX currently has more than ten authentication and authorization-related plugins that support interfacing with mainstream authentication/authorization services in the industry.

If you need to interface with other authentication authorities, you can visit Apache APISIX's GitHub and leave your suggestions via an issue or subscribe to Apache APISIX's mailing list to express your ideas.

I hope this article helps you understand how to use OPA in Apache APISIX so you can start adopting it yourself. I also encourage you to take the time to visit the Apache APISIX OPA plugin documentation to see other use cases for the plugin. The more you practice with it, the better you get at using it.

Happy Policy Making!

How to Use Apache Airflow to Schedule and Manage Workflows

Sameer Shukla — Fri, 13 May 2022 15:11:17 +0000

Apache Airflow is an open-source workflow management system that makes it easy to write, schedule, and monitor workflows.

A workflow is a sequence of operations, from start to finish. Workflows in Airflow are authored as Directed Acyclic Graphs (DAGs) using standard Python programming.

You can configure when a DAG should start execution and when it should finish. You can also set up workflow monitoring through the very intuitive Airflow UI.

You can be up and running on Airflow in no time – it’s easy to use and you only need some basic Python knowledge. It's also completely open source.

Apache Airflow also has a helpful collection of operators that work easily with the Google Cloud, Azure, and AWS platforms.

In this article we are going to cover:

What are Directed Acyclic Graphs (DAGs)?
What are Operators?
How to Create your First DAG
A Use-Case for DAGs
How to Set Up Cloud Composer
How to Create and Run the Pipeline on Composer

What are Directed Acyclic Graphs, or DAGs?

DAGs, or Directed Acyclic Graphs, have nodes and edges. DAGs should not contain any loops and their edges should always be directed.

In short, a DAG is a data pipeline and each node in a DAG is a task. Some examples of nodes are downloading a file from GCS (Google Cloud Storage) to the local filesystem, applying business logic to a file using Pandas, querying a database, making a REST call, or uploading a file back to a GCS bucket.

Visualizing DAGs

Correct DAG with no loops

Incorrect DAG with Loop
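
To make the "no loops" rule concrete, here is a small sketch that checks a task graph for cycles with Python's built-in graphlib module (Python 3.9+). This is plain Python, not Airflow code, and the task names are hypothetical:

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_dag(edges):
    """Return True if the task graph has no cycles.

    `edges` maps each task to the set of tasks it depends on.
    """
    try:
        # static_order() raises CycleError when a loop exists
        list(TopologicalSorter(edges).static_order())
        return True
    except CycleError:
        return False


# Hypothetical task names, for illustration only
pipeline = {"transform": {"download"}, "upload": {"transform"}}
looped = {"transform": {"download"}, "upload": {"transform"}, "download": {"upload"}}

print(is_valid_dag(pipeline))
print(is_valid_dag(looped))
```

The second graph is rejected because download waits for upload, which in turn waits (through transform) for download.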

You can schedule DAGs in Airflow using the schedule_interval attribute. If it’s set to None, the DAG will not be scheduled and can be run only by triggering it manually, for example from the Airflow UI.

You can schedule the DAG to run once every hour, every day, once a week, monthly, yearly or whatever you wish using the cron preset options (@hourly, @daily, @weekly, @monthly, @yearly).

If you need to run the DAG every 5 minutes, every 10 minutes, every day at 14:00, or at a specific time on a specific day, like every Thursday at 10:00am, then you should use cron-based expressions like these:

*/5 * * * * = Every 5 minutes

0 14 * * * = Every day at 14:00
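
If cron syntax is new to you, the following simplified sketch shows how the five fields (minute, hour, day of month, month, day of week) are matched against a point in time. It is not Airflow's scheduler and only supports *, */n, and single numbers. It also requires every field to match, which is a simplification of real cron behavior when both day fields are restricted:

```python
from datetime import datetime


def field_matches(field, value, minimum=0):
    """Match one cron field; supports '*', '*/n', and a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        # Steps are counted from the smallest value the field can take
        return (value - minimum) % int(field[2:]) == 0
    return int(field) == value


def cron_matches(expression, moment):
    """Check whether `moment` falls on a five-field cron schedule."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day, minimum=1)
        and field_matches(month, moment.month, minimum=1)
        and field_matches(weekday, cron_weekday)
    )


# 27 January 2022 was a Thursday
print(cron_matches("0 10 * * 4", datetime(2022, 1, 27, 10, 0)))
```

Here 0 10 * * 4 is the expression for every Thursday at 10:00am.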

What are Operators?

A DAG consists of multiple tasks. You can create tasks in a DAG using operators which are nodes in the graph.

There are various ready to use operators available in Airflow, such as:

LocalFilesystemToGCSOperator – use it to upload a file from Local to GCS bucket.
PythonOperator – use it to execute Python callables.
EmailOperator – use it to send email.
SimpleHttpOperator – use it to make an HTTP request.

How to Create Your First DAG

The example DAG we are going to create consists of only one operator (the Python operator) which executes a Python function.

from airflow import DAG
from datetime import datetime
from airflow.operators.python_operator import PythonOperator

def message():
    print("First DAG executed Successfully!!")

with DAG(dag_id="FirstDAG", start_date=datetime(2022,1,23), schedule_interval="@hourly",
         catchup=False) as dag:

    task = PythonOperator(
        task_id="task",
        python_callable=message)

task

The first step is to import the modules required for DAG development. The with DAG(...) statement defines the DAG, the data pipeline itself, with basic parameters like dag_id, start_date, and schedule_interval.

The schedule_interval is configured as @hourly which indicates that the DAG will run every hour.

The task in the DAG is to print a message in the logs. We have used the PythonOperator here. This operator is used to execute any callable Python function.

Once the execution is complete, we should see the message “First DAG executed Successfully” in the logs. We are going to execute all our DAGs on GCP Cloud Composer.

Airflow UI

After successful execution, the message is printed on the logs:

Logs

A Use-Case for DAGs

The use-case we are going to cover in this article involves a three-step process.

In step one, we will upload a .csv file to an input GCS bucket. This file will be processed by the PythonOperator in the DAG. The function executed by the PythonOperator contains Pandas code, which shows how you can use Pandas to transform data in an Airflow data pipeline.

In step two, we'll upload the transformed .csv file to another GCS bucket. This task will be handled by the GCSToGCSOperator.

Step three is to send a status email indicating that the pipeline execution is complete, which will be handled by the EmailOperator.

In this use-case we will also cover how to notify the team via email in case any step of the execution failed.

How to Set Up Cloud Composer

In GCP, Cloud Composer is a managed service built on Apache Airflow. Cloud Composer has default integration with other GCP Services such as GCS, BigQuery, Cloud Dataflow and so on.

First, we need to create the Cloud Composer Environment. So search for Cloud Composer on the search bar and click on "Create Environment" as shown below:

Create Environment

In the Environments option, I am selecting the "Composer 1" option as we don’t need auto-scaling.

Once we select the type of composer we need, we'll need to do some basic configuration just like in any GCP managed service ("Instance Name", "Location", and so on).

The node count here should always be 3 as GCP will set up the 3 services needed for Airflow.

Once we're done with that, it'll set up an Airflow instance for us. To upload a DAG, we need to open the DAGs folder shown in ‘DAGs folder’ section.

Airflow Instance

If you go to the "Kubernetes Engine" section on GCP, you can see 3 services up and running:

Kubernetes Engine

All DAGs will reside in a bucket created for the Cloud Composer environment.

Airflow Instance bucket for DAGs

How to Create and Run the Pipeline on Composer

In the Pipeline, we have two buckets. input_csv will contain the csv which requires some transformation, and the transformed_csv bucket will be the location where the file will be uploaded once the transformation is done.

The entire pipeline code is the following:

from airflow import DAG
from datetime import datetime
import pandas as pd

from airflow.utils.email import send_email
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator


def transformation():
    trainDetailsDF = pd.read_csv('gs://input_csv/Event_File_03_16_2022.csv')
    print(trainDetailsDF.head())


with DAG(
        dag_id="pipeline_demo",
        schedule_interval="@hourly",
        start_date=datetime(2022, 1, 23),
        catchup=False
) as dag:
    buisness_logic_task = PythonOperator(
        task_id='ApplyBusinessLogic',
        python_callable=transformation,
        dag=dag)

    upload_task = GCSToGCSOperator(
        task_id='upload_task',
        source_bucket='input_csv',
        destination_bucket='transformed_csv',
        source_object='Event_File_03_16_2022.csv',
        move_object=True,
        dag=dag
    )

    email_task = EmailOperator(
        task_id="SendStatusEmail",
        depends_on_past=True,
        to='youremail',
        subject='Pipeline Status!',
        html_content='Hi Everyone, Process completed Successfully! ',
        dag=dag)

    buisness_logic_task >> upload_task >> email_task

In the first task, all we are doing is creating a DataFrame from the input file and printing the head elements. In the logs it looks like this:

DataFrame Head

In the second task, GCSToGCSOperator, we have used the attribute move_object=True which will delete the file from the Source bucket.

Once we upload the DAG file to the DAGs bucket, we can see that the DAG is being scheduled. The name of the DAG is "pipeline_demo".

DAGs

Note that you may encounter "import errors" after uploading or executing a DAG. If you do, you can install the missing packages through the "PyPI Packages" option in GCP. The environment will be updated after a few minutes.

Updating environment with missing Packages

To open the Airflow UI, click on the "Airflow" link under Airflow webserver.

Airflow Instance, click Airflow link to Open UI

The Airflow UI looks like this:

Upon successful execution of Pipeline, here's what you should see:

In order to send email if a task fails, you can use the on_failure_callback like this:

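def notify_email(contextDict, **kwargs):
    # The Airflow context has no "task_name" key, so read the task ID from the task instance
    task_name = contextDict["task_instance"].task_id
    title = "Airflow alert: {} Failed".format(task_name)
    body = """
    Task Name: {} Failed.

    """.format(task_name)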
    send_email('youremail', title, body)



buisness_logic_task = PythonOperator(
    task_id='ApplyBusinessLogic',
    python_callable=transformation,
    on_failure_callback=notify_email,
    dag=dag)

We're configuring the notification email on Composer through SendGrid. Also, once you are done with Cloud Composer, don't forget to delete the environment, because it cannot be stopped and will keep incurring costs.

Conclusion

Apache Airflow is a fairly easy-to-use tool. There's also a lot of help now available on the internet and the community is growing.

GCP simplified working with Airflow a lot by creating a separate managed service for it.

The Apache Cassandra Beginner Tutorial

freeCodeCamp — Thu, 15 Jul 2021 13:13:02 +0000

By Sebastian Sigl

There are lots of data-storage options available today. You have to choose between managed or unmanaged, relational or NoSQL, write- or read-optimized, proprietary or open-source — and it doesn't end there.

Once you begin your search, you will end up in the universe that is database marketing. All of the vendors will tell you why their database is fantastic.

Unfortunately, it's difficult to find out when not to use a specific database, because this is not an attractive selling point.

If you know what questions to ask, you will eventually understand all the essential properties of a given system. In the end, your choice will depend on your expertise and your requirements.

In this tutorial I will introduce you to Apache Cassandra, a distributed, horizontally scalable, open-source database. Or as Cassandra users like to describe Cassandra: "It's a database that puts you in the driver seat."

I will share the essential gotchas and provide references to documentation. I’ll also provide insights based on my experience of running Cassandra on a large scale at work, with executable examples wherever possible.

Along the way, you will learn to ask fundamental questions that will help you to choose a database that suits your needs. You'll also learn about other popular databases like Spanner, CockroachDB, or FaunaDB, and how they can serve different use-cases.

Here’s an overview of everything you'll learn:

How to Set Up a Cassandra Cluster
Cassandra Architecture
Data Modeling
Running a Cluster
- Fully Managed Cassandra
- Self-Managed Cassandra
Other Learnings
Conclusion
References

How to Set Up a Cassandra Cluster

To execute the examples of this tutorial, you'll need a running Cassandra cluster. You can get this up and running quickly by using Docker.

Required Docker settings

Your device should have a minimum of 8GB of memory and at least 8GB of free disk space. Your Docker settings should be updated to be able to use at least 6GB of memory, or better, 8GB.

To apply these suggestions, open your Docker preferences, go to Resources, and increase your memory threshold.

Cassandra is built for scale, and some features only work on a multi-node Cassandra cluster, so let’s start one locally.

For Linux and Mac, run the following commands:

# Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2 cassandra:3.7
INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)

For Windows, run the following commands in PowerShell:

# Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
$INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
$INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2 cassandra:3.7
$INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)

The startup process can take a few minutes.

You can verify if everything is done and ready by executing a Cassandra utility tool called nodetool via docker exec on a node:

$ docker exec cassandra-3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.17.0.3  112.69 KiB  256     68.7%             bb5ef231-0dd2-4762-a447-806a45f710ac  rack1
UN  172.17.0.2  107.96 KiB  256     68.3%             d7392374-8daa-4292-b724-cb790b0ee6ad  rack1
UN  172.17.0.4  93.93 KiB   256     63.0%             386d094f-5483-4945-a1a7-2bb3975d6167  rack1

UN means Up and Normal. Here, all 3 nodes are running and healthy.

In this tutorial we will send lots of queries to Cassandra. I recommend starting a new shell and connecting to one node using cqlsh. Here's how to start a cqlsh shell in Docker:

$ docker exec -it cassandra-1 cqlsh

Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

And to execute your first query:

cqlsh> DESCRIBE keyspaces;

system_traces  system_schema  system_auth  system  system_distributed

The response shows all the existing keyspaces. Keyspaces group tables and are similar to a database in a traditional relational database system. In other systems, groups of certain items are also known as namespaces.

Before you begin creating tables and inserting data, first create a keyspace in your local datacenter, which should replicate data 3 times:

cqlsh> CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };

A keyspace with a replication factor of 3 using the NetworkTopologyStrategy was created. The strategy defines how data is replicated in different datacenters. This is the recommended strategy for all user-created keyspaces.

Why should you start with 3 nodes?

It’s recommended to have at least 3 nodes. One reason is that, if you need strong consistency, reads and writes must be confirmed by at least 2 nodes. Another is availability: if 1 node goes down, your cluster is still available because the 2 remaining nodes are up and running.
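
The arithmetic behind this can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that "confirmed data" means a majority (quorum) of the replicas:

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3):
    print(rf, quorum(rf), tolerated_failures(rf))
```

With a replication factor of 3, a quorum is 2 replicas, so 1 replica can be down. With a replication factor of 2, a quorum is still 2, so no replica can be down.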

You don’t need to fully understand this yet. After reading through the rest of this tutorial, things should be more clear.

Now, all the nodes are up and healthy. You have a 3-node Cassandra setup listening on ports 9042, 9043, and 9044 for client requests. This is a realistic setup for a small cluster.

In production, the instances would run on different machines to maximize performance.

Before you start creating tables, reading, and writing data, it's helpful to understand the basics of designing tables for scalability.

In this tutorial, you will create tables with different settings for a to-do list application. If you want to get your hands dirty straight away, you can jump directly to the next cqlsh example.

Cassandra Architecture

Cassandra is a decentralized multi-node database that can physically span separate locations and uses replication and partitioning to scale reads and writes horizontally.

Decentralization

Cassandra is decentralized because no node is superior to other nodes, and every node acts in different roles as needed without any central controller. We'll get into examples of decentralization a bit later in this section.

Cassandra's decentralized property is what allows it to handle situations easily in case one node becomes unavailable or a new node is added.

Every Node Is a Coordinator

Data is replicated to different nodes. If certain data is requested, a request can be processed from any node.

This initial request receiver becomes the coordinator node for that request. If other nodes need to be checked to ensure consistency then the coordinator requests the required data from replica nodes.

The coordinator can calculate which node contains the data using a so-called consistent hashing algorithm.
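
To give you a feel for consistent hashing, here is a deliberately simplified Python sketch. It is not Cassandra's implementation: MD5 stands in for Cassandra's partitioner, each node gets a single token (the nodetool output earlier shows 256 tokens per node), and replication is ignored:

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Each node owns the range that ends at its own token
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def node_for(self, partition_key):
        """Walk clockwise to the first node token at or after the key's token."""
        index = bisect.bisect_left(self.tokens, token(partition_key))
        # Wrap around to the first node when the key is past the last token
        return self.ring[index % len(self.ring)][1]


ring = HashRing(["cassandra-1", "cassandra-2", "cassandra-3"])
print(ring.node_for("UK"))
```

Any node can run the same calculation and arrive at the same owner, which is why no central lookup service is needed. When a node is added, only the keys that fall into the new node's range move.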

Every node can be a coordinator

The coordinator is responsible for many things, such as request batching, repairing data, or retries for reads and writes.

Data Partitioning

“[Partitioning] is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store larger datasets and handle additional requests.”

How Sharding Works by Jeeyoung Kim

As with many other databases, you store data in Cassandra in a predefined schema. You need to define a table with columns and types for each column.

Additionally, you need to think about the primary key of your table. A primary key is mandatory and ensures data is uniquely identifiable by one or multiple columns.

The concept of primary keys is more complex in Cassandra than in traditional databases like MySQL. In Cassandra, the primary key consists of 2 parts:

a mandatory partition key and
an optional set of clustering columns.

You will learn more about the partition key and clustering columns in the data modeling section.

For now, let's focus on the partition key and its impact on data partitioning.

Consider the following table:

Table Users | Legend: p - Partition-Key, c - Clustering Column

country (p) | user_email (c)  | first_name | last_name | age
----------------------------------------------------------------
US          | john@email.com  | John       | Wick      | 55  
UK          | peter@email.com | Peter      | Clark     | 65  
UK          | bob@email.com   | Bob        | Sandler   | 23 
UK          | alice@email.com | Alice      | Brown     | 26

Together, the columns user_email and country make up the primary key.

The country column is the partition key (p). The CREATE-statement for the table looks like this:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
);

The first group of the primary key defines the partition key. All other elements of the primary key are clustering columns.

Let’s fill the table with some data:

cqlsh> 
INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('US', 'john@email.com', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',65);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',26);

If you’re used to designing traditional relational database tables like it’s taught in school or university, you might be surprised. Why would you use country as an essential part of the primary key?

This example will make sense after you understand the basics of partitioning in Cassandra.

Partitioning is the foundation for scalability, and it is based on the partition key. In this example, partitions are created based on country. All rows with the country US are placed in a partition. All other rows with the country UK will be stored in another partition.

In the context of partitioning, the words partition and shard can be used interchangeably.

Partitions are created and filled based on partition key values. They are used to distribute data to different nodes. By distributing data to other nodes, you get scalability. You read and write data to and from different nodes by their partition key.
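
As a mental model only (this is plain Python, not how Cassandra stores data on disk), you can picture partitioning as grouping rows by their partition key value. The sketch below uses the four example rows and sorts each partition by user_email, on the assumption that the clustering column uses the default ascending order:

```python
from collections import defaultdict

rows = [
    {"country": "US", "user_email": "john@email.com"},
    {"country": "UK", "user_email": "peter@email.com"},
    {"country": "UK", "user_email": "bob@email.com"},
    {"country": "UK", "user_email": "alice@email.com"},
]


def group_by_partition_key(rows, partition_key):
    """Collect rows that share a partition key value into one partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row["user_email"])
    # Within a partition, rows are ordered by the clustering column
    return {key: sorted(emails) for key, emails in partitions.items()}


partitions = group_by_partition_key(rows, "country")
print(partitions)
```

The US partition holds one row and the UK partition holds three. Each partition can then be placed on a different node.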

The distribution of data is a crucial point to understand when designing applications that store data based on partitions. It may take a while to get fully accustomed to this concept, especially if you are used to relational databases.

Instead of starting from a normalized relational model, think about how you read and write data and how partitioning should be done to scale horizontally.

What does horizontal scaling mean?

Horizontal scaling means you can increase throughput by adding more nodes. If your data is distributed to more servers, then more CPU, memory, and network capacity is available.

You might ask why you even need user_email in the primary key.

The answer is that the primary key defines what columns are used to identify rows. You need to add all columns that are required to identify a row uniquely to the primary key. Using only the country would not identify rows uniquely.
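
Here is a small Python sketch of why that matters. It relies on the fact that a Cassandra INSERT behaves like an upsert: writing a row whose primary key already exists replaces the existing values. The dictionary below is only a stand-in for a table:

```python
rows = [
    ("US", "john@email.com"),
    ("UK", "peter@email.com"),
    ("UK", "bob@email.com"),
    ("UK", "alice@email.com"),
]


def stored_rows(rows, key_of):
    """Simulate upserts: a later row with the same key replaces the earlier one."""
    table = {}
    for row in rows:
        table[key_of(row)] = row
    return table


by_country_only = stored_rows(rows, lambda row: row[0])
by_country_and_email = stored_rows(rows, lambda row: (row[0], row[1]))

print(len(by_country_only))
print(len(by_country_and_email))
```

With country as the only key, the three UK users collapse into a single row. With country and user_email together, all four users are kept.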

The partition key is vital to distribute data evenly between nodes and essential when reading the data. The previously defined schema is designed to be queried by country because country is the partition key.

A query that selects rows by country performs well:

cqlsh> 
  SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

In your cqlsh shell, you will send a request only to a single Cassandra node by default. This is called a consistency level of one, which enables excellent performance and scalability.

If you access Cassandra differently, the default consistency level might not be one.

What does consistency level of one mean?

A consistency level of one means that only a single node is asked to return the data. With this approach, you will lose strong consistency guarantees and instead experience eventual consistency.

We’ll dive deeper into consistency levels later on.

Let's create another table. This one has a partition defined only by the user_email column:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_email (
    user_email text,
    country text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY (user_email)
);

Now let’s fill this table with some records:

cqlsh> 
INSERT INTO learn_cassandra.users_by_email (user_email, country,first_name,last_name,age)
  VALUES('john@email.com', 'US', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('peter@email.com', 'UK', 'Peter','Clark',65); 

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('bob@email.com', 'UK', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('alice@email.com', 'UK', 'Alice','Brown',26);

This time, each row is put in its own partition.

This is not bad, per se. If you want to optimize for getting data by email only, it's a good idea:

cqlsh> 
  SELECT * FROM learn_cassandra.users_by_email WHERE user_email='alice@email.com';

If you set up your table with a partition key for user_email and want to get all users by age, you would need to get the data from all partitions because the partitions were created by user_email.

Talking to all nodes is expensive and can cause performance issues on a large cluster.

Cassandra tries to avoid harmful queries. If you want to filter by a column that is not a partition key, you need to tell Cassandra explicitly that you want to filter by a non-partition key column:

cqlsh> 
SELECT * FROM learn_cassandra.users_by_email WHERE age=26 ALLOW FILTERING;

Without ALLOW FILTERING, the query is rejected so that you cannot harm the cluster by accidentally running expensive queries. Queries without conditions (that is, without a WHERE clause) or with conditions that don’t use the partition key are costly and should be avoided to prevent performance bottlenecks.

But how do you get all the rows from the table in a scalable way?

If you can, partition by a value like country. If you know all the countries, you can then iterate over all available countries, send a query for each one, and collect the results in your application.

In terms of scalability, it’s worse to just select all rows, because when you use a table partitioned by user_email, all the data is collected in 1 request in a single coordinator.

This is OK as long as you have no performance issues.

By comparison, sending multiple requests by country distributes the effort to different coordinator nodes, which scales a lot better.
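The per-partition approach can be sketched in Python. The `fetch_partition` function below is a hypothetical stand-in for running a `SELECT ... WHERE country = ?` query through whatever driver you use. Here it reads from an in-memory dictionary so the example runs on its own.

```python
# In-memory stand-in for the users_by_country table, keyed by partition key.
USERS_BY_COUNTRY = {
    "US": [{"user_email": "john@email.com", "age": 55}],
    "UK": [
        {"user_email": "alice@email.com", "age": 26},
        {"user_email": "bob@email.com", "age": 23},
        {"user_email": "peter@email.com", "age": 65},
    ],
}

def fetch_partition(country):
    """Hypothetical helper: one query that touches a single partition."""
    return USERS_BY_COUNTRY.get(country, [])

def fetch_all_users(known_countries):
    """Collect all rows by sending one request per known partition key."""
    results = []
    for country in known_countries:
        for row in fetch_partition(country):
            results.append({"country": country, **row})
    return results

all_users = fetch_all_users(["US", "UK"])
print(len(all_users))
```

In a real application each call could go to a different coordinator, and the calls could run concurrently. The sketch keeps them sequential for readability.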

If you still need access to all of the data, there is an excellent integration between Spark and Cassandra that allows efficient reads and writes for massive datasets. The Spark connector for Cassandra groups your data by partition key and can execute queries very efficiently.

Replication

Scalability using partitioning alone is limited.

Consider a lot of write requests arriving for a single partition. All requests would be sent to a single node with technical limitations such as CPU, memory, and bandwidth. Additionally, you want to handle read and write requests if this node is not available.

That is where the concept of replication comes in. By duplicating data to different nodes, so-called replicas, you can serve more data simultaneously from other nodes to improve latency and throughput. It also enables your cluster to perform reads and writes in case a replica is not available.

In Cassandra, you need to define a replication factor for every keyspace. At the beginning of our example, you created a keyspace with a replication factor of 3 for our default datacenter:

cqlsh> CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };

A replication factor of one means there’s only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved.

A replication factor of two means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.

As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.

Usually, it's recommended to use a replication factor of 3 for production use cases. It makes sure your data is very unlikely to get lost or become inaccessible because there are three copies available. Also, if data is not consistent between replicas at any point in time, you can check which state is held by the majority of replicas.

In your local cluster setup, the majority means 2 out of 3 replicas. This allows us to use some powerful query options that you will see in the next section.
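As a quick check of the arithmetic, a majority is the smallest whole number greater than half of the replicas. The helper name `quorum` below is just a label for this sketch:

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

for rf in (1, 2, 3, 5):
    print(f"replication factor {rf}: majority is {quorum(rf)}")
```

Note that a replication factor of 2 needs both replicas for a majority, which is one reason why 3 is the more practical choice.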

Consistency Level

Now that you know about partitioning and replication, you are ready to think about consistency levels. Cassandra has a truly outstanding feature called tunable consistency.

You can define the consistency level of your read and write queries. You can check the Cassandra docs for all available settings.

Let’s focus on the most popular settings and try to understand when to choose each consistency level.

Let’s assume you have 3 replicas defined.

The first question you need to answer is, do you need strong consistency?

What does strong consistency mean?

In contrast to eventual consistency, strong consistency means only one state of your data can be observed at any time in any location.

For example, when consistency is critical, like in a banking domain, you want to be sure that everything is correct. You would rather accept a decrease in availability and increase of latency to ensure correctness.

It all comes down to the CAP theorem. You cannot be available and consistent at the same time when there are connection issues between the nodes of your cluster.

Let's think through the following example:

You want to write a single value to a table. The data is replicated in 2 nodes, and the connection between the nodes is interrupted. First, a write-request is sent to node 1. Then, data is read from node 2.

How do you manage this situation?

Should you disallow writes to all nodes to ensure consistency? This means availability would be sacrificed to ensure consistency and correctness.
Or should you accept the write to node 1 and keep serving reads from both nodes? This would keep the system available, but depending on which node you read from, the answer will be different, which means sacrificing consistency for availability.

You can simplify the problem to make crucial decisions for your application: Do you want consistency or availability?

Another factor is latency. By talking to more nodes to ensure consistency, you need to wait longer to receive all nodes’ responses.

Tune for Consistency by Setting up a Strong Consistency Application

There is a very important formula that, if true, guarantees strong consistency:

[read-consistency-level] + [write-consistency-level] > [replication-factor]

What does consistency level mean?

Consistency level means how many nodes need to acknowledge a read or a write query.

You can shift read and write consistency levels in your favor if you want to keep strong consistency. Or you can even give up strong consistency for better performance, which is called eventual consistency.

For a read-heavy system, it’s recommended to keep read consistency low because reads happen more often than writes. Let's say you have a replication factor of 3. The formula would look like this:

1 + [write-consistency-level] > 3

Therefore, the write consistency has to be set to 3 to have a strongly consistent system.

For a write-heavy system, you can do the same. Set the write consistency level to 1 and the read consistency level to 3.

You either check every node for a read to ensure all nodes have received the last updated state, or, for a write, you ensure that all nodes have written the update to their local storage. Both will make sure that data for reading and writing is correct.
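The formula is easy to try out for every combination. This small Python sketch (the function name is made up for illustration) prints which read and write levels give strong consistency for a replication factor of 3:

```python
def is_strongly_consistent(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when the replicas read and the replicas written must overlap."""
    return read_cl + write_cl > replication_factor

RF = 3
for read_cl in range(1, RF + 1):
    for write_cl in range(1, RF + 1):
        verdict = "strong" if is_strongly_consistent(read_cl, write_cl, RF) else "eventual"
        print(f"read={read_cl} write={write_cl} -> {verdict}")
```

Besides the 1 + 3 and 3 + 1 combinations described above, reading from 2 replicas and writing to 2 replicas also satisfies the formula, because 2 + 2 > 3.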

This decision needs to be reflected in all the applications that access your Cassandra data because, on a query level, you need to set the required consistency level.

You set a replication factor of 3. Therefore, you can use a consistency level of ALL or THREE:

cqlsh> 
   CONSISTENCY ALL;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

If just one of your applications violates the required consistency strategy, you quickly risk either losing consistency or putting more pressure on the cluster than required.

Tune for Performance by Using Eventual Consistency

If you don't need to be strongly consistent, you can reduce the consistency level for queries to 1 to gain performance:

cqlsh> 
   CONSISTENCY ONE;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

Eventually, the data will be spread to all replicas and this will ensure eventual consistency. How fast data will be made consistent depends on different mechanics that sync data between nodes.

Various features can be tuned in Cassandra, like read-repairs and external processes that repair data continuously.

Optimize Data Storage for Reading or Writing

Writes are cheaper than reads in Cassandra due to its storage engine. Writing data means simply appending something to a so-called commit-log.

Commit-logs are append-only logs of all mutations local to a Cassandra node and reduce the required I/O to a minimum.

Reading is more expensive, because it might require checking different disk locations until all the query data is eventually found.
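The difference in cost can be illustrated with a toy log-structured store in Python. This is a heavily simplified sketch, not Cassandra's storage engine: the class name, the in-memory table, and the flush threshold of 2 are assumptions made for the example.

```python
class TinyStore:
    """Toy log-structured store: append on write, search newest-first on read."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []      # append-only record of every mutation
        self.memtable = {}        # recent writes, held in memory
        self.sstables = []        # immutable flushed tables, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(self.memtable))  # flush to an "SSTable"
            self.memtable = {}

    def read(self, key):
        """Return (value, places_checked)."""
        checked = 1
        if key in self.memtable:
            return self.memtable[key], checked
        for table in reversed(self.sstables):   # newest SSTable first
            checked += 1
            if key in table:
                return table[key], checked
        return None, checked

store = TinyStore()
for i in range(5):
    store.write(f"k{i}", i)
print(store.read("k4"))
print(store.read("k0"))
```

Every write is a single append plus an in-memory update. The read of the most recent key is answered from one place, while the read of the oldest key has to look through three places before it finds the value.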

But this does not mean Cassandra is terrible at reading. Instead, Cassandra's storage engine can be tuned for reading performance or writing performance.

Understanding Compaction

For every write operation, data is written to disk to provide durability. This means that if something goes wrong, like a power outage, data is not lost.

The foundation for storing data is the so-called SSTable. SSTables are immutable data files Cassandra uses to persist data on disk.

You can set various strategies for a table that define how data should be merged and compacted. These strategies affect read and write performance:

SizeTieredCompactionStrategy is the default, and is especially performant if you have more writes than reads
LeveledCompactionStrategy optimizes for reads over writes. This optimization can be costly and needs to be tried out in production carefully
TimeWindowCompactionStrategy is for time-series data

By default, tables use the SizeTieredCompactionStrategy:

cqlsh> 
   DESCRIBE TABLE learn_cassandra.users_by_country;

CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

Although you can alter the compaction strategy of an existing table, I would not suggest doing so, because all Cassandra nodes start this migration simultaneously. This will lead to significant performance issues in a production system.

Instead, define the compaction strategy explicitly during table creation of your new table:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
) WITH
  compaction = { 'class' :  'LeveledCompactionStrategy'  };

Let’s check the result:

cqlsh> 
   DESCRIBE TABLE learn_cassandra.users_by_country_with_leveled_compaction;

CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

The strategies define when and how compaction is executed. Compaction means rearranging data on disk to remove old data and keep performance as good as possible when more data needs to be stored.
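The core idea of merging can be sketched in a few lines of Python. This is only an illustration of "keep the newest version of each value", assuming every value carries a write timestamp. It ignores tombstones and everything that distinguishes the actual strategies from each other.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest (value, timestamp) per key."""
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged

old = {"alice": ("v1", 1), "bob": ("v1", 1)}
new = {"alice": ("v2", 2), "carol": ("v1", 2)}
print(compact([old, new]))
```

After the merge, the outdated value for alice is gone, so a later read only has to look at one table instead of two.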

Check out the excellent DataStax documentation about compaction for details. There may even be better strategies in the future for the performance of your use-case.

Presorting Data on Cassandra Nodes

A table always requires a primary key. A primary key consists of 2 parts:

At least one column as the partition key
Zero or more clustering columns that order the rows within a partition

All columns of the partition key together are used to identify partitions. All primary key columns, meaning partition key and clustering columns, identify a specific row within a partition.

In Cassandra, data is already sorted on disk. So if you want to avoid sorting data later, you can make sure sorting is applied as needed. This can be ensured on the table level and avoids having to sort data in the client applications that query Cassandra.

In our users_by_country table, you can define age as another clustering column to sort stored data:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country_sorted_by_age_asc (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), age, user_email)
) WITH CLUSTERING ORDER BY (age ASC, user_email ASC);

Let’s add the same users again, this time with different ages:

cqlsh> 
INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('US','john@email.com', 'John','Wick',10);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',30);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',20);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',40);

And get the data by country:

cqlsh> 
      SELECT * FROM learn_cassandra.users_by_country_sorted_by_age_asc WHERE country='UK';

 country | age | user_email      | first_name | last_name
---------+-----+-----------------+------------+-----------
      UK |  20 | bob@email.com   |        Bob |   Sandler
      UK |  30 | peter@email.com |      Peter |     Clark
      UK |  40 | alice@email.com |      Alice |     Brown

(3 rows)

In this example, the clustering columns are age and user_email. So the data is first sorted by age and then by user_email. At its core, Cassandra is still like a key-value store. Therefore, you can only query the table by:

country
country and age
country, age, and user_email

But never by country and user_email alone, because that would skip the age clustering column.
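The rule behind this list is that a query has to restrict the partition key plus a left-to-right prefix of the clustering columns. Here is a simplified Python sketch of that check. It only considers equality restrictions, and the function name is made up for this example.

```python
PARTITION_KEY = ["country"]
CLUSTERING_COLUMNS = ["age", "user_email"]

def can_query_without_filtering(restricted_columns):
    """Check that the restrictions cover the partition key plus a
    left-to-right prefix of the clustering columns."""
    restricted = set(restricted_columns)
    if not set(PARTITION_KEY) <= restricted:
        return False
    remaining = restricted - set(PARTITION_KEY)
    prefix_len = 0
    for column in CLUSTERING_COLUMNS:
        if column in remaining:
            prefix_len += 1
        else:
            break
    return remaining == set(CLUSTERING_COLUMNS[:prefix_len])

for columns in (["country"], ["country", "age"],
                ["country", "age", "user_email"], ["country", "user_email"]):
    print(columns, can_query_without_filtering(columns))
```

The first three combinations pass the check. The last one fails because user_email is restricted while age, which comes before it in the clustering order, is not.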

After learning about partitioning, replication and consistency levels, let's head into data modeling and have more fun with the Cassandra cluster.

Data Modeling

You've already learned a lot about the fundamentals of Cassandra.

Let's put your knowledge into practice and design a to-do list application that receives many more reads than writes.

The best approach is to analyze some user stories you want to fulfill with your table design:

As a user, I want to create a to-do element

Note: This is only about creating data. For now, you can delay some decisions because you want to focus on how data is read.

As a user, I want to list all my to-do elements in ascending order

First, you need to query by user_email. Create a table called todo_by_user_email.

You need 1 table that contains all the information of a to-do element of a user. Data should be partitioned by user_email for efficient reads and writes by user_email.

Also, the to-dos should be sorted by creation date, which means using creation_date as a clustering column. Note that the table below declares a descending clustering order, which returns the newest records first. Declare ASC instead to get the ascending order from the user story. The creation_date also ensures uniqueness:

cqlsh> 
CREATE TABLE learn_cassandra.todo_by_user_email (
    user_email text,
    name text,
    creation_date timestamp,
    PRIMARY KEY ((user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

As a user, I want to share a to-do element with another user

To get all the to-dos shared with a user, you need to create a table called todos_shared_by_target_user_email to display all shared to-dos for the target user.

The table contains the to-do name to display it.

But the user also wants to see the to-dos they shared with other users. This is another table, todos_shared_by_source_user_email.

Both tables have, according to the use-case, the required user_email as partition keys to allow efficient queries. Also, creation_date is added as a clustering column for sorting and uniqueness:

cqlsh> 
CREATE TABLE learn_cassandra.todos_shared_by_target_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((target_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

CREATE TABLE learn_cassandra.todos_shared_by_source_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((source_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

This type of modeling is different than thinking about foreign keys and primary keys that you might know from traditional databases. In the beginning, it's all about defining tables and thinking about what values you want to filter and need to display.

You need to set a partition key to ensure the data is organized for efficient read and write operations. Also, you need to set clustering columns to ensure uniqueness, sort order, and optional query parameters.

Keep Data in Sync Using `BATCH` Statements

Due to the duplication, you need to take care to keep data consistent. In Cassandra, you can do that by using BATCH statements that give you an all-at-once guarantee, also called atomicity.

This might sound like a lot of work, and yes, it is a lot of work! If you have a table schema with many relationships, you will have more work compared to a normalized table schema.

What is a normalized table schema?

A normalized table schema is optimized to contain no duplications. Instead, data is referenced by ID and needs to be joined later.

In Cassandra, you try to avoid normalized tables. It is not even possible to write a query that contains a join.

Batch statements are cheap on a single partition, but dangerous when you execute them on different partitions, because:

Data mutations are not applied to all partitions at the same time, and there is no isolation
It is expensive for the coordinator node, because it has to talk to multiple nodes and keep track of the batch (Cassandra writes it to a batchlog) so it can be completed if something goes wrong
There is a batch size limit of 50 KB to avoid overloading the coordinator. This limit can be increased, but this is not recommended

In general, batches are costly.

There are other ways to apply changes eventually. If you need to execute them very often, consider using async queries instead with a proper retry mechanism.

Depending on the way you access your Cassandra, the driver might already offer you retry capabilities.

Still, this approach requires thinking about what will happen if a query is never executed. If every query really needs to be executed eventually, how can you make sure that it does not get lost if your service goes down?
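If your driver does not retry for you, a retry with exponential backoff could look roughly like the following Python sketch. `execute_with_retry` and `flaky_write` are hypothetical helpers written for this example, not part of any Cassandra driver. The sketch also assumes the write is safe to repeat.

```python
import time

def execute_with_retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run operation(); on failure wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)

calls = {"count": 0}

def flaky_write():
    """Stand-in for a write that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("node unavailable")
    return "applied"

print(execute_with_retry(flaky_write))
```

The retries only live in the memory of the running process. If the service goes down between attempts, the pending write is lost, which is exactly the open question raised above.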

The topic itself needs much more time to explain, and might be the main topic of another Cassandra tutorial.

The key learning here is:

Single-partition batches are cheap and should be used
Batches that include different partitions are expensive, and if there are a lot of reads/writes, this might be the reason why a Cassandra cluster is exhausted
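To get a feeling for which kind of batch you are about to send, you can count the partitions it touches. The following Python sketch uses hypothetical helper names and describes each mutation as a (table, partition key) pair:

```python
def partitions_touched(mutations):
    """Return the distinct (table, partition key) pairs a batch writes to."""
    return {(table, partition_key) for table, partition_key in mutations}

def is_single_partition(mutations):
    return len(partitions_touched(mutations)) == 1

same_user = [
    ("todo_by_user_email", "alice@email.com"),
    ("todo_by_user_email", "alice@email.com"),
]
across_tables = [
    ("todo_by_user_email", "alice@email.com"),
    ("todos_shared_by_target_user_email", "bob@email.com"),
    ("todos_shared_by_source_user_email", "alice@email.com"),
]
print(is_single_partition(same_user))
print(is_single_partition(across_tables))
```

Two writes for the same user in the same table stay within one partition. Writing the same to-do to all three tables touches three partitions, so it belongs to the expensive category.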

Let’s create a BATCH statement that contains a to-do element that is shared with a user:

cqlsh> 

BEGIN BATCH
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('alice@email.com', toTimestamp(now()), 'My first todo entry')

  INSERT INTO learn_cassandra.todos_shared_by_target_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com',toTimestamp(now()), 'My first todo entry')

  INSERT INTO learn_cassandra.todos_shared_by_source_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com', toTimestamp(now()), 'My first todo entry')

APPLY BATCH;

Let’s look into one of the tables:

cqlsh>          
 SELECT * FROM learn_cassandra.todos_shared_by_target_user_email WHERE target_user_email='bob@email.com';

 target_user_email | creation_date  | name                | source_user_email
-------------------+----------------+---------------------+-------------------
 bob@email.com     | 2021-05-24 ... | My first todo entry | alice@email.com

All the data exists and can be accessed in a performant way using all the defined tables.

Use Foreign Keys Instead of Duplicating Data in Cassandra

You might consider using foreign keys instead of duplicating data.

Traditionally, in a relational database, foreign keys are ID references to an entity that is located in another table. They guarantee that the referenced ID exists.

In Cassandra, this might feel good because you have less duplicated data. At this point, think again about why you use Cassandra. Usually, the answer is high traffic and scalability.

Cassandra can scale enormously and comes with top performance when used correctly.

Normalizing tables is against a lot of principles in Cassandra. You can reference data by ID, but keep in mind this means you need to join the data yourself. This also means reading and writing data to multiple partitions at once.

Cassandra is built for scale. If you start normalizing your schema to reduce duplication, then you sacrifice horizontal scalability.

If you still want to use foreign keys instead of data duplication, you might want to use another database. But, everything comes with trade-offs.

Instead of using Cassandra, you could use a database that sacrifices performance and availability, and gives more consistency guarantees. In cases like this, I can recommend Cloud Spanner or CockroachDB for a scalable relational database.

Indexes in Cassandra

There are index-like features in Cassandra that can reduce the number of tables you need to maintain on your own. One feature is called secondary indexes.

I cannot recommend them because they only operate locally to a node.

Using a secondary index means talking to all nodes because the coordinator doesn’t know which nodes contain the data if you use other columns to query data than the actual partition key.

Materialized Views

Materialized views were designed with scalability in mind.

They make it easier to duplicate tables with different partition keys so you can query data by different column combinations. They also simplify the process of creating a new table and ensuring data integrity for mutations.

There is only one drawback — the source table's full primary key needs to be part of the materialized view's primary key, and optionally, one other column.

The columns that act as partition keys can be different.

Running a Cluster

Running a Cassandra cluster can be intense. It contains your business-critical data and is usually under heavy pressure.

I won't go into details because I am more a Cassandra user than an expert in cluster maintenance. Still, I want to share my knowledge.

Fully Managed Cassandra

DataStax started a fully managed Cassandra product called Astra. They promise a lot:

Start in minutes with a free tier, no credit card needed.

Eliminate the overhead to install, operate, and scale Cassandra clusters.

Build faster with REST, GraphQL, CQL, and JSON/Document APIs.

Built on open-source Apache Cassandra™, used by the best of the internet.

Scale elastically — apps are viral ready from Day 1.

Deploy multi-cloud, multi-tenant or dedicated clusters on AWS, Azure, or GCP.

Ensure enterprise-level reliability, security, and management.

Quoted from the Astra docs

I have no experience with their offering. But I would give it a try! Their pricing sounds reasonable.

Self-Managed Cassandra

Cassandra is built with Java. So knowing the basics of running JVM applications is very beneficial.

If you run Kubernetes, then definitely check out K8ssandra. It bundles all the helpful tools around Cassandra like:

Stargate.io for REST, GraphQL, and API Documentation
Reaper for easier repair management
Medusa for backups
Metrics collector for monitoring
Traefik for ingress

This stack of tools is fully open source and can be used without any additional monetary costs.

For developers, there is one very beneficial tool called nodetool. It can inspect and provide insights into how many nodes are up, what size certain tables have, how many SSTables and tombstones exist. Nodetool can also repair your data to enforce eventual consistency.

Other Learnings

Even after years of using Cassandra, there are still things to learn that let you use Cassandra more efficiently. In this section, I want to share various topics that you will experience eventually.

Data Migrations

If you have worked with other databases before, you might know database migration tools like Flyway or Liquibase. Since version 4.0 RC-1, there is basic Liquibase support.

Additionally, the community worked on something similar with Cassandra-migration. It already supports advanced features such as leader election, for when multiple services start at the same time.

Any type of export and import can be done using DSBulk, which allows loading and unloading data from and to Cassandra in CSV and JSON formats.

Tombstones

Cassandra is a multi-node cluster that contains replicated data on different nodes. Therefore, a delete cannot simply remove a particular record.

For a delete operation, a new entry is added to the commit-log like for any other insert and update mutation. These deletes are called tombstones, and they flag a specific value for deletion.

Tombstones exist only on disk and can be analyzed and traced as described in this blog post: About Deletes and Tombstones in Cassandra.

In Cassandra, you can set a time to live (TTL) on inserted data. After the time has passed, the record is automatically deleted. When you set a TTL, a tombstone is created with a date in the future.

In comparison, a regular delete query works the same way, except that the tombstone's timestamp is set to the moment the delete is executed.
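The relationship between a TTL and a delete can be modeled in a few lines of Python. This is a mental model rather than Cassandra's storage format: the `Cell` class is invented for this sketch, and all times are plain seconds passed in by the caller.

```python
class Cell:
    """A stored value with an optional expiry time (all times in seconds)."""

    def __init__(self, value, written_at, ttl=None):
        self.value = value
        self.expires_at = None if ttl is None else written_at + ttl

    def delete(self, now):
        self.expires_at = now          # a delete expires the cell immediately

    def read(self, now):
        if self.expires_at is not None and now >= self.expires_at:
            return None                # hidden; storage is reclaimed later
        return self.value

    def remaining_ttl(self, now):
        if self.expires_at is None:
            return None
        return max(0, self.expires_at - now)

cell = Cell("This entry should be removed soon", written_at=0, ttl=60)
print(cell.remaining_ttl(now=17))
print(cell.read(now=59))
print(cell.read(now=60))
```

With a TTL of 60 seconds, 43 seconds remain after 17 seconds, the value is still readable at second 59, and it is hidden from second 60 onward. Calling `delete` simply moves the expiry to the current moment.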

Let’s create a tombstone by setting a TTL in seconds, which basically functions as a delayed delete:

cqlsh>     
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'This entry should be removed soon') USING TTL 60;

And the data is stored like regular data:

cqlsh>      
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 user_email     | creation_date | name
----------------+---------------+-----------------------------------
 john@email.com | 2021-05-30... | This entry should be removed soon

(1 rows)

You can also read the TTL from the database for a given column:

cqlsh> 
 SELECT TTL(name) FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 ttl(name)
-----------
        43

(1 rows)

After 60 seconds, the row is gone.

cqlsh>  
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                  

 user_email | creation_date | name
------------+---------------+------

(0 rows)

Setting a TTL is one of many ways to create tombstones.

Unfortunately, there are also others.

For example, when you insert a null value, a tombstone is created for the given cell. And as mentioned for delete requests, different types of tombstones are stored.

By default, after 10 days, data that is marked by a tombstone is freed during a compaction. This time can be configured and reduced using the gc_grace_seconds table option.

When is a compaction executed?

When the operation is executed depends mainly on the selected strategy. In general, a compaction execution takes SSTables and creates new SSTables out of them.

The most common executions are:

Automatically, when data is inserted and the strategy's conditions for a compaction are met

Manually, by running a major compaction with nodetool

Sometimes, tombstones pile up for the following reasons:

Null values mark values to be deleted and are stored as tombstones. This can be avoided by either replacing null with a static value, or not setting the value at all if the value is null
Empty lists and sets are similar to null for Cassandra and create a tombstone, so don’t insert them if they’re empty. Take care to avoid null pointer exceptions when storing and retrieving data in your application
Updated lists and sets create tombstones. If you update an entity and the list or set does not change, it still creates a tombstone to empty the list and set the same values. Therefore, only update necessary fields to avoid issues. The good thing is, they are compacted due to the new values

If you have many tombstones, you might run into another Cassandra issue that prevents a query from being executed.

This happens when the tombstone_failure_threshold is reached, which is set by default to 100,000 tombstones. This means that, when a query has iterated over more than 100,000 tombstones, it will be aborted.
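The effect of the threshold can be sketched in Python as a scan that counts tombstones and gives up once there are too many. The exception class and the `scan` function are made up for this illustration. Only the default value of 100,000 comes from the text above.

```python
TOMBSTONE = object()   # marker for a deleted cell

class TooManyTombstonesError(Exception):
    """Raised by this sketch when a scan passes the failure threshold."""

def scan(cells, failure_threshold=100_000):
    """Return live cells, aborting once too many tombstones were iterated."""
    live, tombstones = [], 0
    for cell in cells:
        if cell is TOMBSTONE:
            tombstones += 1
            if tombstones > failure_threshold:
                raise TooManyTombstonesError(
                    f"scanned more than {failure_threshold} tombstones")
        else:
            live.append(cell)
    return live

print(scan(["a", TOMBSTONE, "b"], failure_threshold=1))
```

The scan has to walk over every tombstone even though none of them ends up in the result, which is why a partition full of deleted data is slow to read long before the query is aborted.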

The issue here is, once a query stops executing, it’s not easy to tidy things up because Cassandra will stop even when you execute a delete, as it has reached the tombstone limit.

Usually you would never have that many tombstones. But mistakes happen, and you should take care to avoid this case.

There is a handy operation metric that you should observe called TombstoneScannedHistogram to avoid unexpected issues in production.
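To make the failure threshold more concrete, here is a toy Python model of a read that counts the tombstones it passes and gives up once the limit is exceeded. It is a sketch under assumptions, not Cassandra's implementation: TOMBSTONE, QueryAborted, and read_live_cells are hypothetical names, and only the default of 100,000 comes from the text above.

```python
TOMBSTONE = object()  # stand-in for a cell that was deleted or set to null


class QueryAborted(Exception):
    """Hypothetical error for this sketch; not a Cassandra class."""


def read_live_cells(cells, failure_threshold=100_000):
    """Collect live cells, aborting once more than failure_threshold tombstones were scanned."""
    live = []
    tombstones_scanned = 0
    for cell in cells:
        if cell is TOMBSTONE:
            tombstones_scanned += 1
            if tombstones_scanned > failure_threshold:
                raise QueryAborted(f"scanned more than {failure_threshold} tombstones")
            continue
        live.append(cell)
    return live


partition = ["todo-1", TOMBSTONE, "todo-2", TOMBSTONE, TOMBSTONE]
print(read_live_cells(partition, failure_threshold=3))  # three tombstones are tolerated
try:
    read_live_cells(partition, failure_threshold=2)     # the third tombstone aborts the read
except QueryAborted as err:
    print("aborted:", err)
```

The tiny thresholds are only there to keep the example short. The point is that the tombstones themselves, not the live data, decide whether the read completes.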

`UPDATE`s Are Just `INSERT`s, and Vice Versa

In Cassandra, everything is append-only. There is no difference between an update and insert.

You already learned that a primary key defines the uniqueness of a row. If there is no entry yet, a new row will appear, and if there is already an entry, the entry will be updated. It does not matter whether you execute an update or an insert query.

The primary key in our example consists of user_email and creation_date, which together define the uniqueness of a record.

Let’s insert a new record:

cqlsh>      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email, creation_date, name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query');

And execute an update with a new creation_date:

cqlsh>    
  UPDATE learn_cassandra.todo_by_user_email SET 
    name = 'Update query'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:10:19.622+0000';

2 new rows appear in our table:

cqlsh>    
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                                                                                            

 user_email     | creation_date                   | name
----------------+---------------------------------+--------------
 john@email.com | 2021-03-14 16:10:19.622000+0000 | Update query
 john@email.com | 2021-03-14 16:07:19.622000+0000 | Insert query

(2 rows)

So you inserted a row using an update, and you can also use an insert to update:

cqlsh>       
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query updated');

Let’s check our updated row:

cqlsh>   
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 user_email     | creation_date            | name
----------------+--------------------------+----------------------
 john@email.com | 2021-03-14 16:10:19.62   |         Update query
 john@email.com | 2021-03-14 16:07:19.62   | Insert query updated


(2 rows)

So UPDATE and INSERT are technically the same. Don’t think that an INSERT fails if there is already a row with the same primary key.

The same applies to an UPDATE — it will be executed, even if the row doesn’t exist.

The reason is that, by design, Cassandra rarely reads before writing, which keeps performance high. The only exceptions are described in the next section about lightweight transactions.
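The upsert behavior described above can be mimicked with a small Python model in which the table is a dictionary keyed by the primary key. This is only a sketch: the write function and the dictionary layout are assumptions made for illustration and say nothing about how Cassandra stores data internally.

```python
def write(table, primary_key, **columns):
    """Blind write: create the row if it is missing, otherwise overwrite the given columns."""
    table.setdefault(primary_key, {}).update(columns)


# In this toy model, INSERT and UPDATE are literally the same operation.
insert = update = write

todo_by_user_email = {}
first = ("john@email.com", "2021-03-14 16:07:19.622+0000")
second = ("john@email.com", "2021-03-14 16:10:19.622+0000")

insert(todo_by_user_email, first, name="Insert query")
update(todo_by_user_email, second, name="Update query")         # an update that creates a row
insert(todo_by_user_email, first, name="Insert query updated")  # an insert that overwrites a row

for key, row in todo_by_user_email.items():
    print(key, row)
```

Neither call checks whether the key already exists, which is why no read is needed before the write.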

But there are restrictions on which actions you can execute with an update or an insert:

Counters can only be changed with UPDATE, not with INSERT
IF NOT EXISTS can only be used in combination with an INSERT
IF EXISTS can only be used in combination with an UPDATE

You will learn more about conditions in queries within the next section.

Lightweight Transactions

You can use conditions in queries using a feature called lightweight transactions (LWTs), which execute a read to check a certain condition before executing the write.

Let’s only update if an entry already exists, by using IF EXISTS:

cqlsh>     
  UPDATE learn_cassandra.todo_by_user_email SET
    name = 'Update query with LWT'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:07:19.622+0000' IF EXISTS;

 [applied]
-----------
      True

The same works for an insert query using IF NOT EXISTS:

cqlsh>      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'Yet another entry') IF NOT EXISTS;

 [applied]
-----------
      True

Those executions are expensive compared to simple UPDATE and INSERT queries. Still, if it’s business-critical, they are an excellent way to achieve transactional safety.
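The read-before-write idea behind these conditions can be sketched in Python on top of a plain dictionary. The function names below are hypothetical, and the sketch runs in a single process, so it ignores the coordination between replicas that makes real lightweight transactions expensive.

```python
def update_if_exists(table, primary_key, **columns):
    """Apply the update only when the row is already there; return the [applied] flag."""
    if primary_key not in table:  # the extra read that a plain UPDATE skips
        return False
    table[primary_key].update(columns)
    return True


def insert_if_not_exists(table, primary_key, **columns):
    """Create the row only when it is missing; return the [applied] flag."""
    if primary_key in table:      # the extra read that a plain INSERT skips
        return False
    table[primary_key] = dict(columns)
    return True


existing = ("john@email.com", "2021-03-14 16:07:19.622+0000")
missing = ("john@email.com", "2021-03-14 18:00:00.000+0000")
table = {existing: {"name": "Insert query updated"}}

print(update_if_exists(table, existing, name="Update query with LWT"))  # row exists, applied
print(update_if_exists(table, missing, name="Never written"))           # no row, not applied
print(insert_if_not_exists(table, missing, name="Yet another entry"))   # no row, applied
print(insert_if_not_exists(table, missing, name="Duplicate"))           # row exists, not applied
```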

Conclusion

I hope you enjoyed the article.

If you liked it and feel the need to give me a round of applause, or just want to get in touch, follow me on Twitter.

I work at eBay Kleinanzeigen, one of the world’s biggest classifieds companies. By the way, we are hiring!

Special thanks go to Roger Sheen, Michael de la Fontaine, Christian Baer, Thomas Uebel and Swen Fuhrmann for excellent feedback and proof-reading.

Apache Flink Batch Example in Java

freeCodeCamp — Sun, 09 Feb 2020 23:27:00 +0000

Flink Batch Example JAVA

Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.

Prerequisites

Unix-like environment (Linux, Mac OS X, Cygwin)
git
Maven (we recommend version 3.0.4)
Java 7 or 8
IntelliJ IDEA or Eclipse IDE

git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests # this will take up to 10 minutes

Datasets

For the batch-processing data, we’ll be using the datasets linked here: datasets. In this example we’ll use movies.csv and ratings.csv. Create a new Java project and put the files in a folder in the application base.

Example

We’re going to make an execution where we retrieve the average rating by movie genre of the entire dataset we have.

Environment and datasets

First, create a new Java file. I’m going to name it AverageRating.java.

The first thing we’ll do is create the execution environment and load the CSV files into datasets, like this:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// movies.csv columns: movieId, title, genres
DataSet<Tuple3<Long, String, String>> movies = env.readCsvFile("ml-latest-small/movies.csv")
  .ignoreFirstLine()
  .parseQuotedStrings('"')
  .ignoreInvalidLines()
  .types(Long.class, String.class, String.class);

// ratings.csv columns: userId, movieId, rating, timestamp; keep only movieId and rating
DataSet<Tuple2<Long, Double>> ratings = env.readCsvFile("ml-latest-small/ratings.csv")
  .ignoreFirstLine()
  .includeFields(false, true, true, false)
  .types(Long.class, Double.class);

Here we build a dataset of <Long, String, String> tuples for the movies, ignoring invalid lines, quotes, and the header line, and a dataset of <Long, Double> tuples for the movie ratings, ignoring the header line and keeping only the movie ID and rating columns.

Flink Processing

Here we will process the datasets with Flink. The result will be a List of (String, Double) tuples, where the String holds the genre and the Double holds the average rating.

First we’ll join the ratings dataset with the movies dataset on the movie ID present in each dataset. With this we’ll create a new tuple with the movie name, genre, and score (only the first genre listed for each movie is used). Then we group these tuples by genre and add up the scores within each genre. Finally, we divide the total score by the number of ratings, and we have our desired result.

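List<Tuple2<String, Double>> distribution = movies.join(ratings)
  .where(0)
  .equalTo(0)
  .with(new JoinFunction<Tuple3<Long, String, String>, Tuple2<Long, Double>, Tuple3<StringValue, StringValue, DoubleValue>>() {
    // Reused output objects, so no new tuple is allocated per joined record
    private StringValue name = new StringValue();
    private StringValue genre = new StringValue();
    private DoubleValue score = new DoubleValue();
    private Tuple3<StringValue, StringValue, DoubleValue> result = new Tuple3<>(name, genre, score);

    @Override
    public Tuple3<StringValue, StringValue, DoubleValue> join(Tuple3<Long, String, String> movie, Tuple2<Long, Double> rating) throws Exception {
      name.setValue(movie.f1);
      // Genres are pipe-separated; keep only the first one
      genre.setValue(movie.f2.split("\\|")[0]);
      score.setValue(rating.f1);
      return result;
    }
  })
  .groupBy(1)
  .reduceGroup(new GroupReduceFunction<Tuple3<StringValue, StringValue, DoubleValue>, Tuple2<String, Double>>() {
    @Override
    public void reduce(Iterable<Tuple3<StringValue, StringValue, DoubleValue>> iterable, Collector<Tuple2<String, Double>> collector) throws Exception {
      StringValue genre = null;
      int count = 0;
      double totalScore = 0;
      for (Tuple3<StringValue, StringValue, DoubleValue> movie : iterable) {
        genre = movie.f1;
        totalScore += movie.f2.getValue();
        count++;
      }

      // Emit one (genre, average rating) pair per group
      collector.collect(new Tuple2<>(genre.getValue(), totalScore / count));
    }
  })
  .collect();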

With this, you’ll have a working batch-processing Flink application. Enjoy!

How Apache Nifi works — surf on your dataflow, don’t drown in it

freeCodeCamp — Fri, 03 May 2019 15:42:14 +0000

By François Paupier

Introduction

That’s a crazy flow of water. Just like your application deals with a crazy stream of data. Routing data from one storage system to another, applying validation rules, and addressing questions of data governance and reliability in a Big Data ecosystem are hard to get right if you do it all by yourself.

Good news: you don’t have to build your dataflow solution from scratch. Apache NiFi has got your back!

At the end of this article, you’ll be a NiFi expert — ready to build your data pipeline.

What I will cover in this article:

What Apache NiFi is, in which situation you should use it, and what are the key concepts to understand in NiFi.

What I won’t cover:

Installation, deployment, monitoring, security, and administration of a NiFi cluster.

For your convenience, here is the table of contents. Feel free to go straight to where your curiosity takes you. If you’re a NiFi first-timer, going through this article in the indicated order is advised.

What is Apache NiFi?

On the website of the Apache Nifi project, you can find the following definition:

An easy to use, powerful, and reliable system to process and distribute data.

Let’s analyze the keywords there.

Defining NiFi

Process and distribute data
That’s the gist of Nifi. It moves data around systems and gives you tools to process this data.

NiFi can deal with a great variety of data sources and formats. You take data in from one source, transform it, and push it to a different data sink.

Ten-thousand-foot view of Apache NiFi: NiFi pulls data from multiple data sources, enriches it, and transforms it to populate a key-value store.

Easy to use
Processors (the boxes) linked by connectors (the arrows) create a flow. NiFi offers a flow-based programming experience.

Nifi makes it possible to understand, at a glance, a set of dataflow operations that would take hundreds of lines of source code to implement.

Consider the pipeline below:

An overly minimalist data pipeline

To translate the data flow above into NiFi, you go to the NiFi graphical user interface, drag and drop three components onto the canvas, and that’s it. It takes two minutes to build.

A simple validation data flow as seen through Nifi canvas

Now, if you write code to do the same thing, it’s likely to take several hundred lines to achieve a similar result.

You don’t capture the essence of the pipeline through code as you do with a flow-based approach. NiFi is a more expressive way to build a data pipeline; it’s designed to do that.

Powerful
NiFi provides many processors out of the box (293 in Nifi 1.9.2). You’re on the shoulders of a giant. Those standard processors handle the vast majority of use cases you may encounter.

NiFi is highly concurrent, yet its internals encapsulate the associated complexity. Processors offer you a high-level abstraction that hides the inherent complexity of parallel programming. Processors run simultaneously, and you can spawn multiple threads of a processor to cope with the load.

Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently shields the pipeline builder from the complexities of concurrency.

Reliable
The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models like SEDA.

For a dataflow system, one of the main topics to address is reliability. You want to be sure that data sent somewhere is effectively received.

NiFi achieves a high level of reliability through multiple mechanisms that keep track of the state of the system at any point in time. Those mechanisms are configurable so you can make the appropriate tradeoffs between latency and throughput required by your applications.

NiFi tracks the history of each piece of data with its lineage and provenance features. It makes it possible to know what transformation happens on each piece of information.

The data lineage solution proposed by Apache Nifi proves to be an excellent tool for auditing a data pipeline. Data lineage features are essential to bolster confidence in big data and AI systems in a context where transnational actors such as the European Union propose guidelines to support accurate data processing.

Why use NiFi?

First, I want to make it clear I’m not here to evangelize NiFi. My goal is to give you enough elements so you can make an informed decision on the best way to build your data pipeline.

It’s useful to keep in mind the four Vs of big data when dimensioning your solution.

The four Vs of Big Data

Volume — At what scale do you operate? In order of magnitude, are you closer to a few GigaBytes or hundreds of PetaBytes?
Variety — How many data sources do you have? Are your data structured? If yes, does the schema vary often?
Velocity — What is the frequency of the events you process? Is it credit cards payments? Is it a daily performance report sent by an IoT device?
Veracity — Can you trust the data? Alternatively, do you need to apply multiple cleaning operations before manipulating it?

NiFi seamlessly ingests data from multiple data sources and provides mechanisms to handle different schema in the data. Thus, it shines when there is a high variety in the data.

NiFi is particularly valuable if data is of low veracity, since it provides multiple processors to clean and format the data.

With its configuration options, Nifi can address a broad range of volume/velocity situations.

An increasing list of applications for data routing solutions

New regulations, the rise of the Internet of Things and the flow of data it generates emphasize the relevance of tools such as Apache NiFi.

Microservices are trendy. In those loosely coupled services, the data is the contract between the services. Nifi is a robust way to route data between those services.
The Internet of Things brings a multitude of data to the cloud. Ingesting and validating data from the edge to the cloud poses a lot of new challenges that NiFi can efficiently address (primarily through MiNiFi, the NiFi subproject for edge devices).
New guidelines and regulations are put in place to readjust the Big Data economy. In this context of increasing monitoring, it is vital for businesses to have a clear overview of their data pipeline. NiFi data lineage, for example, can be helpful in a path towards compliance to regulations.

Bridge the gap between big data experts and the others

As you can see by the user interface, a dataflow expressed in NiFi is excellent to communicate about your data pipeline. It can help members of your organization become more knowledgeable about what’s going on in the data pipeline.

Is an analyst asking why this data arrives here in that form? Sit together and walk through the flow. In five minutes you can give someone a strong understanding of the Extract, Transform, and Load (ETL) pipeline.
You want feedback from your peers on a new error handling flow you created? NiFi makes it a design decision to consider error paths as likely as valid outcomes. Expect the flow review to be shorter than a traditional code review.

Should you use it? Yes, No, Maybe?

NiFi brands itself as easy to use. Still, it is an enterprise dataflow platform. It offers a complete set of features from which you may only need a reduced subset. Adding a new tool to the stack is not benign.

If you are starting from scratch and manage a small amount of data from trusted data sources, you may be better off setting up your own Extract, Transform, and Load (ETL) pipeline. Maybe a change data capture from a database and some data preparation scripts are all you need.

On the other hand, if you work in an environment with existing big data solutions in use (be it for storage, processing or messaging ), NiFi integrates well with them and is more likely to be a quick win. You can leverage the out of the box connectors to those other Big Data solutions.

It’s easy to be hyped by new solutions. List your requirements and choose the solution that answers your needs as simply as possible.

Now that we have seen the high-level picture of Apache NiFi, let’s take a look at its key concepts and dissect its internals.

Apache Nifi under the microscope

“NiFi is boxes and arrow programming” may be ok to communicate the big picture. However, if you have to operate with NiFi, you may want to understand a bit more about how it works.

In this second part, I explain the critical concepts of Apache NiFi with schemas. This black box model won’t be a black box to you afterward.

Unboxing Apache NiFi

When you start NiFi, you land on its web interface. The web UI is the blueprint on which you design and control your data pipeline.

Apache NiFi user interface: build your pipeline by dragging and dropping components onto the interface

In Nifi, you assemble processors linked together by connections. In the sample dataflow introduced previously, there are three processors.

Three processors linked together by two queues

The NiFi canvas user interface is the framework in which the pipeline builder evolves.

Making sense of Nifi terminology

To express your dataflow in Nifi, you must first master its language. No worries, a few terms are enough to grasp the concept behind it.

The black boxes are called processors, and they exchange chunks of information named FlowFiles through queues that are named connections. Finally, the Flow Controller is responsible for managing the resources between those components.

Processor, FlowFile, Connection, and Flow Controller: four essential concepts in NiFi

Let’s take a look at how this works under the hood.

FlowFile

In NiFi, the FlowFile is the information packet moving through the processors of the pipeline.

Anatomy of a FlowFile — It contains attributes of the data as well as a reference to the associated data

A FlowFile comes in two parts:

Attributes, which are key/value pairs. For example, the file name, file path, and a unique identifier are standard attributes.
Content, a reference to the stream of bytes that composes the FlowFile content.

The FlowFile does not contain the data itself. That would severely limit the throughput of the pipeline.

Instead, a FlowFile holds a pointer that references data stored at some place in the local storage. This place is called the Content Repository.

The Content Repository stores the content of the FlowFile

To access the content, the FlowFile claims the resource from the Content Repository. The latter keeps track of the exact disk offset where the content is stored and streams it back to the FlowFile.

Not all processors need to access the content of the FlowFile to perform their operations. For example, aggregating the content of two FlowFiles doesn’t require loading their content into memory.

When a processor modifies the content of a FlowFile, the previous data is kept. NiFi uses copy-on-write: it modifies the content while copying it to a new location. The original information is left intact in the Content Repository.

Example
Consider a processor that compresses the content of a FlowFile. The original content remains in the Content Repository, and a new entry is created for the compressed content.

The Content Repository finally returns the reference to the compressed content. The FlowFile is updated to point to the compressed data.

The drawing below sums up the example with a processor that compresses the content of FlowFiles.

Copy-on-write in NiFi — The original content is still present in the repository after a FlowFile modification.
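The compression example can be sketched as a toy model in Python. The class and function names are invented for illustration and do not correspond to NiFi’s Java API: the repository is an append-only list, and a content claim is simply an index into it.

```python
import zlib
from dataclasses import dataclass, field


class ContentRepository:
    """Append-only store; a claim is just an index into the list (toy model)."""

    def __init__(self):
        self._blobs = []

    def write(self, data: bytes) -> int:
        self._blobs.append(data)
        return len(self._blobs) - 1

    def read(self, claim: int) -> bytes:
        return self._blobs[claim]


@dataclass
class FlowFile:
    content_claim: int                       # pointer into the repository, not the bytes
    attributes: dict = field(default_factory=dict)


def compress(flowfile: FlowFile, repo: ContentRepository) -> None:
    """Copy-on-write: store the compressed bytes as a new entry, then repoint the FlowFile."""
    original = repo.read(flowfile.content_claim)
    flowfile.content_claim = repo.write(zlib.compress(original))


repo = ContentRepository()
payload = b"hello nifi " * 100
original_claim = repo.write(payload)
flowfile = FlowFile(content_claim=original_claim, attributes={"filename": "greeting.txt"})

compress(flowfile, repo)
print(flowfile.content_claim != original_claim)  # the FlowFile now points at new content
print(repo.read(original_claim) == payload)      # the original bytes are untouched
```

Because the FlowFile only carries a claim, repointing it is cheap regardless of how large the content is.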

Reliability
NiFi claims to be reliable, but how does that work in practice? The attributes of all the FlowFiles currently in use, as well as the reference to their content, are stored in the FlowFile Repository.

At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile Repository, in a write-ahead log, before it is performed.

For each FlowFile that currently exists in the system, the FlowFile Repository stores:

The FlowFile attributes
A pointer to the content of the FlowFile located in the Content Repository
The state of the FlowFile, for example, which queue the FlowFile belongs to at this instant.

The FlowFile Repository contains metadata about the files currently in the flow.

The FlowFile repository gives us the most current state of the flow; thus it’s a powerful tool to recover from an outage.

NiFi provides another tool to track the complete history of all the FlowFiles in the flow: the Provenance Repository.

Provenance Repository
Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at this point. The name for this snapshot in NiFi is a Provenance Event. The Provenance Repository records Provenance Events.

Provenance enables us to retrace the lineage of the data and build the full chain of custody for every piece of information processed in NiFi.

The Provenance Repository stores the metadata and context information of each FlowFile

On top of offering the complete lineage of the data, the Provenance Repository also lets you replay the data from any point in time.

Trace back the history of your data thanks to the Provenance Repository

Wait, what’s the difference between the FlowFile Repository and the Provenance Repository?

The idea behind the FlowFile Repository and the Provenance Repository is quite similar, but they don’t address the same issue.

The FlowFile repository is a log that contains only the latest state of the in-use FlowFiles in the system. It is the most recent picture of the flow and makes it possible to recover from an outage quickly.
The Provenance Repository, on the other hand, is more exhaustive since it tracks the complete life cycle of every FlowFile that has been in the flow.

The Provenance Repository adds a time dimension where the FlowFile Repository is one snapshot

While the FlowFile Repository gives you only the most recent picture of the system, the Provenance Repository gives you a collection of photos, like a video. You can rewind to any moment in the past, investigate the data, and replay operations from a given time. It provides a complete lineage of the data.
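The difference between the two repositories can be sketched with two plain Python containers: a dictionary that keeps only the latest state per FlowFile, and a list that keeps every event. This is a toy model under assumptions. The record_event helper, the identifiers, and the queue names are made up, and the event labels are only illustrative.

```python
flowfile_repository = {}    # FlowFile id -> latest known state only
provenance_repository = []  # append-only list of every event ever recorded


def record_event(flowfile_id, event_type, queue=None):
    """Append to the provenance history, then refresh the current snapshot."""
    provenance_repository.append({"id": flowfile_id, "event": event_type, "queue": queue})
    if event_type == "DROP":
        # The FlowFile left the flow, so it disappears from the current picture.
        flowfile_repository.pop(flowfile_id, None)
    else:
        flowfile_repository[flowfile_id] = {"queue": queue}


record_event("ff-1", "RECEIVE", queue="to-validate")
record_event("ff-1", "CONTENT_MODIFIED", queue="to-store")
record_event("ff-2", "RECEIVE", queue="to-validate")
record_event("ff-1", "DROP")

print(flowfile_repository)  # only ff-2 is still in the flow
history = [e["event"] for e in provenance_repository if e["id"] == "ff-1"]
print(history)              # the full life cycle of ff-1 is still available
```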

FlowFile Processor

A processor is a black box that performs an operation. Processors have access to the attributes and the content of the FlowFile to perform all kinds of actions. They enable you to perform many operations in data ingress, standard data transformation/validation tasks, and saving this data to various data sinks.

Three different kinds of processors

NiFi comes with many processors when you install it. If you don’t find the perfect one for your use case, it’s still possible to build your own processor. Writing custom processors is outside the scope of this blog post.

Processors are high-level abstractions that fulfill one task. This abstraction is very convenient because it shields the pipeline builder from the inherent difficulties of concurrent programming and the implementation of error handling mechanisms.

Processors expose an interface with multiple configuration settings to fine-tune their behavior.

Zoom on a NiFi processor for record validation (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ValidateRecord/index.html): the pipeline builder specifies the high-level configuration options, and the black box hides the implementation details.

The properties of those processors are the last link between NiFi and the business reality of your application requirements.

The devil is in the details, and pipeline builders spend most of their time fine-tuning those properties to match the expected behavior.

Scaling
For each processor, you can specify the number of concurrent tasks you want to run simultaneously. This way, the Flow Controller allocates more resources to this processor, increasing its throughput. Processors share threads: if one processor requests more threads, other processors have fewer threads available to execute. Details on how the Flow Controller allocates threads are available here.

Horizontal scaling. Another way to scale is to increase the number of nodes in your NiFi cluster. Clustering servers makes it possible to increase your processing capability using commodity hardware.

Process Group

This one is straightforward now that we’ve seen what processors are.

A bunch of processors put together with their connections can form a process group. You add an input port and an output port so it can receive and send data.

Building a new processor from three existing processors

Process groups are an easy way to create new processors based on existing ones.

Connections

Connections are the queues between processors. These queues allow processors to interact at differing rates. Connections can have different capacities, just as water pipes come in different sizes.

Various capacities for different connectors. Here we have capacity C1 > capacity C2

Because processors consume and produce data at different rates depending on the operations they perform, connections act as buffers of FlowFiles.

There is a limit on how much data can be in the connection. Similarly, when your water pipe is full, you can’t add more water, or it overflows.

In NiFi you can set limits on the number of FlowFiles and the size of their aggregated content going through the connections.

What happens when you send more data than the connection can handle?

If the number of FlowFiles or the quantity of data goes above the defined threshold, backpressure is applied. The Flow Controller won’t schedule the previous processor to run again until there is room in the queue.

Let’s say you have a limit of 10 000 FlowFiles between two processors. At some point, the connection has 7 000 elements in it. It is ok since the limit is 10 000. P1 can still send data through the connection to P2.

Two processors linked by a connector with its limit respected.

Now let’s say that processor one sends 4 000 new FlowFiles to the connection.
7 000 + 4 000 = 11 000 → We go above the connection threshold of 10 000 FlowFiles.

Processor P1 not scheduled until the connector goes back below its threshold.

The limits are soft limits, meaning they can be exceeded. However, once they are, the previous processor, P1, won’t be scheduled until the connector goes back below its threshold value of 10 000 FlowFiles.

Number of FlowFiles in the connector comes back below the threshold. The Flow Controller schedules the processor P1 for execution again.

This simplified example gives the big picture of how backpressure works.
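The scenario with P1 and P2 can be replayed with a few lines of Python. This is a minimal sketch, assuming that backpressure is checked only before the upstream processor runs and that it starts once the threshold is reached. The class and function names are hypothetical and do not mirror NiFi’s internals.

```python
from collections import deque


class Connection:
    """Queue between two processors with a soft limit on the number of FlowFiles."""

    def __init__(self, flowfile_threshold):
        self.flowfile_threshold = flowfile_threshold
        self.queue = deque()

    def backpressure_applied(self):
        # Assumption for this sketch: pressure starts once the threshold is reached.
        return len(self.queue) >= self.flowfile_threshold


def run_upstream(connection, batch):
    """Run P1 once if it may be scheduled; return whether it actually ran."""
    if connection.backpressure_applied():
        return False
    connection.queue.extend(batch)  # soft limit: the whole batch may overshoot
    return True


def run_downstream(connection, count):
    """Let P2 consume up to count FlowFiles from the queue."""
    for _ in range(min(count, len(connection.queue))):
        connection.queue.popleft()


conn = Connection(flowfile_threshold=10_000)
conn.queue.extend(range(7_000))                        # 7 000 FlowFiles already queued

print(run_upstream(conn, range(4_000)), len(conn.queue))  # P1 runs, queue overshoots the limit
print(run_upstream(conn, range(1)), len(conn.queue))      # P1 is not scheduled
run_downstream(conn, 2_000)                               # P2 drains part of the queue
print(run_upstream(conn, range(1)), len(conn.queue))      # P1 is scheduled again
```

The check happens before the processor runs, not per FlowFile, which is what makes the limit soft.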

You want to set up connection thresholds appropriate to the Volume and Velocity of the data to handle. Keep in mind the Four Vs.

The idea of exceeding a limit may sound odd. When the number of FlowFiles or the associated data go beyond the threshold, a swap mechanism is triggered.

Active queue and Swap in Nifi connectors

For another example on backpressure, this mail thread can help.

Prioritizing FlowFiles
The connectors in NiFi are highly configurable. You can choose how you prioritize FlowFiles in the queue to decide which one to process next.

Among the available possibilities there is, for example, the First In First Out (FIFO) order. However, you can even use an attribute of your choice from the FlowFile to prioritize incoming packets.
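Attribute-based prioritization with a FIFO tie-break can be sketched with Python’s heapq module. This is an illustration only: the class name, the "priority" attribute, and the rule that lower values go first are assumptions made for this example, not a description of NiFi’s built-in prioritizers.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Queue that hands out FlowFiles by a 'priority' attribute, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # increasing counter used as the FIFO tie-break

    def put(self, flowfile):
        # Assumption: lower numbers are processed first; a missing attribute means 0.
        priority = int(flowfile.get("priority", 0))
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def get(self):
        return heapq.heappop(self._heap)[2]


conn = PrioritizedConnection()
conn.put({"filename": "report.csv", "priority": "5"})
conn.put({"filename": "payment-1.json", "priority": "1"})
conn.put({"filename": "payment-2.json", "priority": "1"})

order = [conn.get()["filename"] for _ in range(3)]
print(order)  # both payments first, in arrival order, then the report
```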

Flow Controller

The Flow Controller is the glue that brings everything together. It allocates and manages threads for processors. It’s what executes the dataflow.

The Flow Controller coordinates the allocation of resources for processors.

Also, the Flow Controller makes it possible to add Controller Services.

Those services facilitate the management of shared resources like database connections or cloud services provider credentials. Controller services are daemons. They run in the background and provide configuration, resources, and parameters for the processors to execute.

For example, you may use an AWS credentials provider service to make it possible for your services to interact with S3 buckets without having to worry about the credentials at the processor level.

An AWS credentials service provides context to two processors

Just like with processors, a multitude of controller services is available out of the box.

You can check out this article for more content on the controller services.

Conclusion and call to action

In the course of this article, we discussed NiFi, an enterprise dataflow solution. You now have a strong understanding of what NiFi does and how you can leverage its data routing features for your applications.

If you’re reading this, congrats! You now know more about NiFi than 99.99% of the world’s population.

Practice makes perfect. You have now mastered all the concepts required to start building your own pipeline. Make it simple; make it work first.

Here is a list of exciting resources I compiled on top of my work experience to write this article.

Resources

The bigger picture

Because designing data pipelines in a complex ecosystem requires proficiency in multiple areas, I highly recommend the book Designing Data-Intensive Applications by Martin Kleppmann. It covers the fundamentals.

A cheat sheet with all the references quoted in Martin’s book is available on his Github repo.

This cheat sheet is a great place to start if you already know what kind of topic you’d like to study in-depth and you want to find quality materials.
