<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ apache - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ apache - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 09:05:48 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/apache/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Launch an EC2 Instance and Set Up a Web Server Using HTTPD ]]>
                </title>
                <description>
                    <![CDATA[ Hey there! Have you ever thought about creating your own web server on the cloud? Well, you’re in for a treat because in this article, we’re going to explore how you can launch an EC2 instance and use HTTPD to host a simple web server. Don’t worry – ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-launch-an-ec2-instance-and-a-web-server-using-httpd/</link>
                <guid isPermaLink="false">672a1e5f52317a5d102c0dd9</guid>
                
                    <category>
                        <![CDATA[ ec2 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kedar Makode ]]>
                </dc:creator>
                <pubDate>Tue, 05 Nov 2024 13:32:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730780706184/e2ac9a27-7221-47c6-a8ae-db2f62892036.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Hey there! Have you ever thought about creating your own web server on the cloud? Well, you’re in for a treat because in this article, we’re going to explore how you can launch an EC2 instance and use HTTPD to host a simple web server.</p>
<p>Don’t worry – it’s simpler than it sounds, and I promise to walk you through it step-by-step with a bit of fun along the way.</p>
<p>By the end of this guide, you’ll feel like a cloud wizard, casting spells that make servers appear out of thin air (well, out of Amazon’s data centers, but you get the point).</p>
<p>Ready? Let’s dive in!</p>
<h2 id="heading-table-of-content">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-ec2">What Is EC2?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-httpd">What is HTTPD?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-how-to-launch-your-ec2-instance">Step 1: How to Launch Your EC2 Instance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-how-to-connect-to-your-ec2-instance">Step 2: How to Connect to Your EC2 Instance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-how-to-install-and-start-httpd-apache-web-server">Step 3: How to Install and Start HTTPD (Apache Web Server)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-how-to-host-your-custom-web-page">Step 4: How to Host Your Custom Web Page</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-ec2">What Is EC2?</h2>
<p>Think of EC2 (Elastic Compute Cloud) as a hotel room in the cloud. Instead of booking a physical server to store your website, you’re renting one from Amazon’s magical cloud infrastructure. This room (or instance) comes with all the amenities you need to host a website. Today, we’ll install <strong>HTTPD</strong> (a web server software) in our “room” to make our website live. 🏨✨</p>
<h2 id="heading-what-is-httpd">What is HTTPD?</h2>
<ul>
<li><p>At its core, HTTPD stands for Hypertext Transfer Protocol Daemon. Let’s break that down:</p>
</li>
<li><p><strong>Hypertext Transfer Protocol (HTTP)</strong>: This is the standard protocol used on the web. When you type a URL into your browser or click a link, you’re using HTTP to tell the server, “Hey, send me this web page!”</p>
</li>
<li><p><strong>Daemon (D)</strong>: A daemon is just a fancy term for a background process that runs continuously on a server. In this case, the daemon is responsible for responding to requests from web browsers (like Chrome or Firefox) and sending back the appropriate content.</p>
</li>
<li><p>So, <strong>HTTPD</strong> is a program that listens for incoming HTTP requests (like when you visit a webpage) and serves back the data (HTML, CSS, images, and so on) needed to display that page.</p>
</li>
</ul>
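<p>To make the idea concrete, here is a minimal sketch in Python of a program that behaves like a tiny HTTP daemon: it listens on a port and answers each request with a page. It only illustrates the concept using Python's standard library, it is not how Apache HTTPD is implemented, and the names <code>HelloHandler</code> and <code>fetch_once</code> are made up for this example.</p>

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET request with a small HTML page."""

    def do_GET(self):
        body = b"<h1>Hello from a tiny HTTP daemon</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet instead of logging each request


def fetch_once():
    """Start the server on a free local port, request one page, then stop."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Play the role of the browser: send one GET request and read the reply.
        connection = http.client.HTTPConnection(
            "127.0.0.1", server.server_port, timeout=5
        )
        connection.request("GET", "/")
        response = connection.getresponse()
        status, body = response.status, response.read().decode("utf-8")
        connection.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```

<p>The server thread plays the part of the daemon, and the request inside <code>fetch_once</code> plays the part of the browser. Apache does the same job at a much larger scale, with configuration, logging, and security features on top.</p>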
<h4 id="heading-httpd-vs-apache2-different-names-same-game">HTTPD vs. Apache2: Different Names, Same Game</h4>
<p>Depending on your Linux distribution, you may encounter different names for the same basic software:</p>
<ul>
<li><p>On RPM-based distributions (like Red Hat, CentOS, or Fedora), it’s called httpd.</p>
</li>
<li><p>On Debian-based distributions (like Ubuntu or Debian itself), it’s referred to as apache2.</p>
</li>
</ul>
<p>Let’s look at the steps you can use to launch your EC2 instance, and how to set up a web server using HTTPD.</p>
<h2 id="heading-step-1-how-to-launch-your-ec2-instance">Step 1: How to Launch Your EC2 Instance</h2>
<p>First things first, let’s launch our EC2 instance. You’ll need an AWS account—signing up is free, and AWS offers a free tier, so this won’t cost you a dime for small-scale experiments.</p>
<p>Head over to the AWS Management Console and log in. From the search bar, type “EC2” and click on <strong>EC2 Dashboard</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730267447129/5460a622-b2de-456a-9fae-b757caf37eef.png" alt="EC2 Dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Create a new instance by clicking on the orange <strong>Launch Instance</strong> button.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730345828735/7f2df691-278c-4945-97a6-44e173819eb0.png" alt="Create Instance on AWS" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, choose the Amazon Machine Image (AMI) by selecting the Amazon Linux AMI, which is free-tier eligible and super reliable. Don’t forget to give your instance a unique name!</p>
<p>Adding a "Name" tag with a value like "MyFirstInstance" or "ProductionServer" helps you keep track of multiple instances while adding a personal touch to your cloud workspace.</p>
<p>Also, remember to check the default username for the AMI you select. Since you’ve chosen Amazon Linux, the default username is <strong>ec2-user</strong>. Keep this in mind for connecting to your instance later!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346031697/3c707686-c8f9-4cdf-aaec-c369722eaea0.png" alt="Amazon Machine Image (AMI) and Tags (Name)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346255437/478efdf2-70b3-46e0-a0ca-131757929a69.png" alt="Amazon Machine Image (AMI) Default Username" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Select an Instance Type</strong>: The t2.micro is your best buddy here. Like the AMI, it’s free-tier eligible and perfect for our needs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346372896/b902fd57-e7e3-4144-9186-832b590b3321.png" alt="Instance Type for EC2" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Key Pair for SSH Access</strong>: Here’s where it gets important: you need a <strong>.pem</strong> file to securely connect to your instance. This file holds the private half of your key pair and acts like the secret key to your cloud “hotel room,” allowing you to log in via SSH.</p>
<p>If you already have a <strong>.pem</strong> file for a previously created key pair, go ahead and choose that from the dropdown menu.</p>
<p>If you don’t have a <strong>.pem</strong> file, no worries! Create a new key pair by clicking <strong>Create New Key Pair</strong>, and download the <strong>.pem</strong> file to your computer. Make sure to store this file safely—you’ll need it to log in, and if you lose it, you won’t be able to access your EC2 instance!</p>
<p>Why is this file important? The <strong>.pem</strong> file is your private key. When you connect, SSH uses it to prove to the instance, which holds the matching public key, that you are the rightful owner. You won’t get access without it, just like how you can’t get into a hotel room without the key.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346428068/e8d1c913-af2f-40ad-8a80-b2b31af934f4.png" alt="Key Pair for AWS EC2" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Configure Security Group</strong>: AWS EC2 security groups are like virtual firewalls that control traffic in and out of your instance, ensuring only specific types of access. To allow web visitors, set up an HTTP rule on port 80, and for secure server logins, enable SSH on port 22 with restricted IPs.</p>
<p>You can reuse security groups across instances, making configuration easier and more consistent. Regularly review these settings to keep your instance secure and organized.</p>
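<p>If it helps to see the logic, the sketch below models inbound rules as plain Python data and checks whether a connection would be let through. It is a simplified illustration, not the AWS API: the <code>InboundRule</code> class, the <code>is_allowed</code> function, and the example addresses are all assumptions made for this example. Anything that no rule allows is denied, which matches how security groups treat inbound traffic.</p>

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class InboundRule:
    port: int
    cidr: str  # which source addresses the rule applies to


# Hypothetical rules mirroring the text: HTTP open to everyone,
# SSH limited to a single address (203.0.113.10 is a documentation-only IP).
RULES = [
    InboundRule(port=80, cidr="0.0.0.0/0"),
    InboundRule(port=22, cidr="203.0.113.10/32"),
]


def is_allowed(source_ip: str, port: int, rules=RULES) -> bool:
    """Return True if any rule allows this source address on this port."""
    source = ip_address(source_ip)
    return any(
        rule.port == port and source in ip_network(rule.cidr)
        for rule in rules
    )


if __name__ == "__main__":
    print(is_allowed("198.51.100.7", 80))  # any visitor may reach the website
    print(is_allowed("198.51.100.7", 22))  # but may not try to log in over SSH
    print(is_allowed("203.0.113.10", 22))  # only the listed address may
```

<p>Restricting port 22 to your own IP address in this way keeps strangers from even attempting an SSH login.</p>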
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346477838/1b296a9d-ab53-48f6-a92b-07057332eaed.png" alt="Security Group for AWS EC2" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Launch the instance</strong>: Boom! You’ve just launched your very own server in the cloud.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730346693723/9aa28c70-8732-4071-ae03-12d983c6cb15.png" alt="Launch AWS EC2 Instance" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Wait a minute or two for your instance to come online. Now that we have our EC2 instance running, let’s move on to the next step: connecting to it so we can set up our web server.</p>
<h2 id="heading-step-2-how-to-connect-to-your-ec2-instance">Step 2: How to Connect to Your EC2 Instance</h2>
<p>To connect, we’ll use the <strong>.pem</strong> file (key pair) we created earlier. If you’re on a Mac or Linux machine, this is super simple with SSH. For Windows folks, I recommend using <strong>MobaXterm</strong>—it’s a user-friendly terminal with SSH built-in.</p>
<p>If you’re new to connecting EC2 instances using MobaXterm, I’ve written a detailed guide in my previous blog post. You can check it out <a target="_blank" href="https://www.freecodecamp.org/news/connect-to-your-ec2-instance-using-mobaxterm/">here</a>, where I show how to set up and connect to an EC2 instance using MobaXterm.</p>
<p>For now, here’s a quick overview of the connection process using SSH:</p>
<pre><code class="lang-bash">ssh -i <span class="hljs-string">"your-key.pem"</span> ec2-user@your-ec2-public-ip
</code></pre>
<p>Replace <code>"your-key.pem"</code> with the path to your <strong>.pem</strong> file and <code>your-ec2-public-ip</code> with the public IP of your instance (you can find this in the EC2 dashboard).</p>
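<p>On macOS and Linux, the SSH client generally refuses to use a private key file that other users on your machine can read, which is why guides often suggest running <code>chmod 400</code> on the key first. The sketch below shows the same check and fix in Python. The function names are invented for this example, and it works on a throwaway file rather than a real key.</p>

```python
import os
import stat
import tempfile


def key_is_private(path: str) -> bool:
    """Return True if only the file's owner has any permission on it."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_key(path: str) -> None:
    """Make the file readable by its owner only (same effect as chmod 400)."""
    os.chmod(path, stat.S_IRUSR)


if __name__ == "__main__":
    # Demonstrate on a temporary file instead of a real .pem file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        demo_path = tmp.name
    os.chmod(demo_path, 0o644)        # readable by everyone: too open for a key
    print(key_is_private(demo_path))
    restrict_key(demo_path)           # owner read-only
    print(key_is_private(demo_path))
    os.remove(demo_path)
```

<p>If SSH complains that your key file is unprotected, tightening its permissions like this is usually the first thing to try.</p>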
<p>If you’ve connected successfully, congratulations! 🎉 You’re inside your cloud server.</p>
<h2 id="heading-step-3-how-to-install-and-start-httpd-apache-web-server">Step 3: How to Install and Start HTTPD (Apache Web Server)</h2>
<p>Alright, time to install our web server software (HTTPD)! We’ll be using Apache, one of the most popular web servers around. Don’t worry, you don’t need a degree in IT to get this working.</p>
<p>After you successfully connect to your EC2 instance (with SSH from your terminal or from MobaXterm), you should be all set to start the installation. You’re just a few commands away from having your web server up and running!</p>
<p>It’s always good practice to make sure your server is up to date. To update your server, run:</p>
<pre><code class="lang-bash">sudo dnf update -y
</code></pre>
<p>Next, we’ll install HTTPD (Apache):</p>
<pre><code class="lang-bash">sudo dnf install httpd -y
</code></pre>
<p>Then start the HTTPD service. Run this command to get the server running.</p>
<pre><code class="lang-bash">sudo systemctl start httpd
</code></pre>
<p>Next, enable it to start on boot so that every time your EC2 instance reboots, your web server comes back to life automatically.</p>
<pre><code class="lang-bash">sudo systemctl <span class="hljs-built_in">enable</span> httpd
</code></pre>
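<p>Time to test it out! Open your browser and type in your instance’s public IP. Use <code>http://</code> rather than <code>https://</code>, since the security group only opens port 80. If you see the Apache test page, give yourself a high-five. 🖐️ You’ve just launched a web server!</p>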
<h2 id="heading-step-4-how-to-host-your-custom-web-page">Step 4: How to Host Your Custom Web Page</h2>
<p>Now, let’s get creative! Instead of the default web server message, let’s host your very own custom web page in just one step. This will allow you to display a unique message on your site in no time.</p>
<p>Run the following command in your EC2 instance to create and display a simple, personalized web page:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># /var/www/html is owned by root, so run the redirect inside a root shell</span>
sudo sh -c <span class="hljs-string">'echo "Welcome to the Cloud! You’re now hosting your own custom web server using AWS EC2 and Apache!" &gt; /var/www/html/index.html'</span>
</code></pre>
<p><strong>What does this command do?</strong></p>
<ul>
<li><p>The <code>echo</code> command outputs the text: <code>"Welcome to the Cloud! You’re now hosting your own custom web server using AWS EC2 and Apache!"</code>.</p>
</li>
<li><p>The <code>&gt;</code> symbol redirects this output to a file.</p>
</li>
<li><p><code>/var/www/html/index.html</code> is the path to the file where the message is saved. This file is the homepage of your web server.</p>
</li>
</ul>
<p>By running this command, you're replacing the default Apache test page with your custom message.</p>
<p>Now, select your EC2 instance, and you’ll find its public IP address. Open your browser, enter that IP, refresh the page, and boom! Your custom message is live on the site. 🎉</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730347026257/8ae32095-27f2-401a-a812-12b1354c3a93.png" alt="EC2 Instance Public IP Address" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Feel free to modify the text to make it uniquely yours!</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>And there you have it – you’ve just launched an EC2 instance and set up a simple web server using HTTPD! With these steps, you’ve not only spun up a server in the cloud but also configured it to be accessible to the world. By following along, you’ve learned the essentials of creating instances, setting up security groups, connecting via SSH, and installing Apache to serve up web content.</p>
<p>Keep exploring EC2’s features, and don’t hesitate to test new configurations and ideas. Each step adds to your cloud skills, bringing you one step closer to mastering AWS. So keep building, experimenting, and, most importantly, enjoying the journey. Happy cloud computing!</p>
<p>You can follow me on:</p>
<ul>
<li><p><a target="_blank" href="https://twitter.com/Kedar__98">Twitter</a></p>
</li>
<li><p><a target="_blank" href="https://www.linkedin.com/in/kedar-makode-9833321ab/?originalSubdomain=in">LinkedIn</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Orchestrate an ETL Data Pipeline with Apache Airflow ]]>
                </title>
                <description>
                    <![CDATA[ By Aviator Ifeanyichukwu Data Orchestration involves using different tools and technologies together to extract, transform, and load (ETL) data from multiple sources into a central repository.  Data orchestration typically involves a combination of t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/orchestrate-an-etl-data-pipeline-with-apache-airflow/</link>
                <guid isPermaLink="false">66d45dd7052ad259f07e4a7d</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 01 Mar 2023 22:42:42 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/etl_image.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aviator Ifeanyichukwu</p>
<p>Data Orchestration involves using different tools and technologies together to extract, transform, and load (ETL) data from multiple sources into a central repository. </p>
<p>Data orchestration typically involves a combination of technologies such as data integration tools and data warehouses.</p>
<p>Apache Airflow is a tool for data orchestration.</p>
<p>With Airflow, data teams can schedule, monitor, and manage the entire data workflow. Airflow makes it easier for organizations to manage their data, automate their workflows, and gain valuable insights from their data.</p>
<p>In this guide, you will be writing an ETL data pipeline. It will download data from Twitter, transform the data into a CSV file, and load the data into a Postgres database, which will serve as a data warehouse.  </p>
<p>External users or applications will be able to connect to the database to build visualizations and make policy decisions.</p>
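<p>Before bringing in Airflow, it can help to see the three ETL stages as plain functions. The sketch below is only an illustration: the sample records are made up, and an in-memory SQLite database stands in for the Postgres warehouse so that the example runs with nothing but Python's standard library.</p>

```python
import csv
import io
import sqlite3

# Made-up sample records standing in for data pulled from an external source.
RAW_RECORDS = [
    {"username": " alice ", "text": "Hello   world"},
    {"username": "bob", "text": "ETL  pipelines"},
]


def extract():
    """Extract: return raw records (hard-coded here instead of a real download)."""
    return list(RAW_RECORDS)


def transform(records):
    """Transform: tidy up whitespace and serialise the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for record in records:
        username = record["username"].strip()
        text = " ".join(record["text"].split())
        writer.writerow([username, text])
    return buffer.getvalue()


def load(csv_text, connection):
    """Load: insert the CSV rows into a table and return the number of rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS tweets (username TEXT, text TEXT)"
    )
    rows = list(csv.reader(io.StringIO(csv_text)))
    connection.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    print(load(transform(extract()), conn))
    print(conn.execute("SELECT * FROM tweets").fetchall())
```

<p>The pipeline in this guide follows the same shape. Airflow's job is to run these stages on a schedule, in the right order, and to show you what happened.</p>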
<h3 id="heading-what-you-will-learn">What you will learn</h3>
<ol>
<li>How to extract data from Twitter</li>
<li>How to write a DAG script</li>
<li>How to load data into a database</li>
<li>How to use Airflow Operators</li>
</ol>
<h3 id="heading-what-you-need">What you need</h3>
<p>To follow along with this tutorial, you'll need the following:</p>
<ul>
<li>Apache Airflow installed on your machine</li>
<li>Airflow development environment up and running</li>
<li>An understanding of the building blocks of Apache Airflow (Tasks, Operators, etc)</li>
<li>An IDE of your choice. Mine is VS Code.</li>
</ul>
<p>Sounds interesting, yeah? Let’s begin.</p>
<h2 id="heading-how-to-get-the-data-from-twitter">How to Get the Data from Twitter</h2>
<p>Twitter is a social media platform where users gather to share information and discuss trending world events/topics. Tons of data is generated daily through this platform. This will be your data source.</p>
<p>To get data from Twitter, you would normally connect to its API, and numerous libraries make that easier. For this guide, we'll use snscrape, a library that scrapes public tweets without needing API credentials. You will also need Pandas, a Python library for data exploration and transformation.</p>
<h3 id="heading-installation">Installation</h3>
<p>Make sure your Airflow virtual environment is currently active.</p>
<pre><code class="lang-python">pip install snscrape pandas
</code></pre>
<p>Inside the Airflow dags folder, create two files: extract.py and transform.py.</p>
<p>extract.py:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> transform <span class="hljs-keyword">import</span> transform_data


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_data</span>():</span>
  <span class="hljs-comment"># list that collects one row per tweet</span>
  tweets_list = []

  <span class="hljs-comment"># scrape tweets and append them to the list</span>
  <span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'Chatham House since:2023-01-14'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">1000</span>:
      <span class="hljs-keyword">break</span>
    tweets_list.append([tweet.date, tweet.user.username, tweet.rawContent, 
                          tweet.sourceLabel,tweet.user.location
                          ])

  <span class="hljs-comment"># convert tweets into a dataframe</span>
  tweets_df = pd.DataFrame(tweets_list, columns=[<span class="hljs-string">'datetime'</span>, <span class="hljs-string">'username'</span>, <span class="hljs-string">'text'</span>, <span class="hljs-string">'source'</span>, <span class="hljs-string">'location'</span>])

  <span class="hljs-comment"># hand the dataframe over to the transform step</span>
  transform_data(tweets_df)
</code></pre>
<p>transform.py:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
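<span class="hljs-keyword">from</span> airflow.providers.postgres.hooks.postgres <span class="hljs-keyword">import</span> PostgresHook

<span class="hljs-comment"># Load clean data into postgres database</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_data_upload</span>(<span class="hljs-params">data</span>):</span>
  print(data.head())

  <span class="hljs-comment"># bulk_load() expects the path of a tab-delimited file, not CSV text,</span>
  <span class="hljs-comment"># so turn the dataframe into plain tuples and insert the rows directly</span>
  rows = list(data.itertuples(index=<span class="hljs-literal">False</span>, name=<span class="hljs-literal">None</span>))

  postgres_sql_upload = PostgresHook(postgres_conn_id=<span class="hljs-string">"postgres_connection"</span>)
  <span class="hljs-comment"># name the target columns so Postgres fills in the SERIAL id itself</span>
  postgres_sql_upload.insert_rows(
    <span class="hljs-string">'twitter_etl_table'</span>,
    rows,
    target_fields=[<span class="hljs-string">'datetime'</span>, <span class="hljs-string">'username'</span>, <span class="hljs-string">'text'</span>, <span class="hljs-string">'source'</span>, <span class="hljs-string">'location'</span>]
  )

  <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

<span class="hljs-comment">## perform data cleaning and transformation</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform_data</span>(<span class="hljs-params">tweets_df</span>):</span>
  <span class="hljs-comment"># info() prints its own summary of the columns and dtypes</span>
  tweets_df.info()
  <span class="hljs-comment">### Transformation happens here</span>

  <span class="hljs-comment"># load transformed data into database</span>
  task_data_upload(tweets_df)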
</code></pre>
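<p>The transformation step above is left as a placeholder. As one possible way to fill it in, the sketch below cleans the dataframe with a few common Pandas operations. The function name <code>clean_tweets</code> and the specific cleaning rules are assumptions made for this example, not part of the original pipeline, so adjust them to whatever your data needs.</p>

```python
import pandas as pd


def clean_tweets(tweets_df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the tweets dataframe (illustrative steps only)."""
    cleaned = tweets_df.copy()
    # Collapse runs of whitespace (including newlines) inside the tweet text.
    cleaned["text"] = (
        cleaned["text"].str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Give tweets without a location an explicit placeholder value.
    cleaned["location"] = (
        cleaned["location"].fillna("unknown").replace("", "unknown")
    )
    # Drop rows that repeat the same user and text, keeping the first one.
    cleaned = cleaned.drop_duplicates(subset=["username", "text"])
    return cleaned.reset_index(drop=True)


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "datetime": ["2023-01-15", "2023-01-15", "2023-01-16"],
            "username": ["a", "a", "b"],
            "text": ["hi\n there ", "hi there", "ok"],
            "source": ["x", "x", "y"],
            "location": [None, "Lagos", ""],
        }
    )
    print(clean_tweets(sample))
```

<p>A function like this could be called inside <code>transform_data</code> before the data is passed on to <code>task_data_upload</code>.</p>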
<h3 id="heading-the-database">The Database</h3>
<p>Airflow comes with a SQLite3 database, which it uses for its own metadata. To store the tweet data, you'll use PostgreSQL instead.</p>
<p>You should have PostgreSQL installed and running on your machine.</p>
<h3 id="heading-install-the-libraries">Install the libraries</h3>
<pre><code class="lang-python">pip install psycopg2
</code></pre>
<p>If this fails, try installing the binary version like this:</p>
<pre><code class="lang-python">pip install psycopg2-binary
</code></pre>
<p>Install the provider package for the Postgres database like this:</p>
<pre><code class="lang-python">pip install apache-airflow-providers-postgres
</code></pre>
<h2 id="heading-how-to-set-up-the-dag-script">How to Set Up the DAG Script</h2>
<p>Create a file named etl_pipeline.py inside the dags folder.</p>
<p>Start by importing the different airflow operators like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.empty <span class="hljs-keyword">import</span> EmptyOperator
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-keyword">with</span> DAG(
  <span class="hljs-string">'etl_twitter_pipeline'</span>,
  description=<span class="hljs-string">"A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow"</span>,
  start_date=datetime(year=<span class="hljs-number">2023</span>, month=<span class="hljs-number">2</span>, day=<span class="hljs-number">5</span>),
  schedule_interval=timedelta(minutes=<span class="hljs-number">2</span>)
) <span class="hljs-keyword">as</span> dag:

  start_pipeline = EmptyOperator(
    task_id=<span class="hljs-string">'start_pipeline'</span>,
  )

start_pipeline
</code></pre>
<p>With a dag_id named 'etl_twitter_pipeline', this dag is scheduled to run every two minutes, as defined by the schedule interval.</p>
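<p>One thing to keep in mind with a start date in the past and such a short interval: in Airflow 2.x, unless you pass <code>catchup=False</code> to the DAG, the scheduler will by default try to create a run for every interval that has already elapsed. The small sketch below is plain Python, not Airflow code, and the function name is made up for this example. It counts how many two-minute intervals pile up.</p>

```python
from datetime import datetime, timedelta


def elapsed_intervals(start_date: datetime, interval: timedelta, now: datetime) -> int:
    """Count the complete schedule intervals between start_date and now."""
    if now <= start_date:
        return 0
    return (now - start_date) // interval


if __name__ == "__main__":
    start = datetime(2023, 2, 5)
    every_two_minutes = timedelta(minutes=2)
    # One hour after the start date: 60 / 2 = 30 intervals.
    print(elapsed_intervals(start, every_two_minutes, datetime(2023, 2, 5, 1, 0)))
    # One full day after the start date: 1440 / 2 = 720 intervals.
    print(elapsed_intervals(start, every_two_minutes, datetime(2023, 2, 6)))
```

<p>If you switch the DAG on long after its start date, that backlog can be large, so consider setting <code>catchup=False</code> while you experiment.</p>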
<h3 id="heading-how-to-view-the-web-ui">How to View the Web UI</h3>
<p>Start the scheduler with this command:</p>
<pre><code class="lang-python">airflow scheduler
</code></pre>
<p>Then start the web server with this command: </p>
<pre><code class="lang-python">airflow webserver
</code></pre>
<p>Open the browser on localhost:8080 to view the UI.</p>
<p>Search for a dag named ‘etl_twitter_pipeline’, and click on the toggle icon on the left to start the dag.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/airflow_ui_1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow UI showing created dags</em></p>
<h2 id="heading-how-to-set-up-a-postgres-database-connection">How to Set Up a Postgres Database Connection</h2>
<p>You should already have apache-airflow-providers-postgres and psycopg2 or psycopg2-binary installed in your virtual environment.</p>
<p>From the UI, navigate to <em>Admin</em> -&gt; <em>Connections</em>. Click on the plus sign at the top left corner of your screen to add a new connection and specify the connection parameters. Set the connection ID to postgres_connection, since that is the name the code in this guide refers to. Click on test to verify the connection to the database server. Once completed, scroll to the bottom of the screen and click on <em>Save</em>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/postgres_connect-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>PostgreSQL database connection</em></p>
<p>Inside the Airflow directory created in the virtual environment, open the airflow.cfg file in your text editor, locate the variable named sql_alchemy_conn, and set the PostgreSQL connection string:</p>
<pre><code class="lang-python">sql_alchemy_conn = postgresql+psycopg2://postgres:<span class="hljs-number">1234</span>@localhost:<span class="hljs-number">5432</span>/test
</code></pre>
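<p>If the connection string looks cryptic, the sketch below takes it apart with Python's standard library so you can see which piece is which. The helper name <code>describe_connection</code> is made up for this example, and the values are the sample ones from the line above, so replace them with your own username, password, and database name.</p>

```python
from urllib.parse import urlsplit


def describe_connection(uri: str) -> dict:
    """Split a SQLAlchemy-style connection string into its named parts."""
    parts = urlsplit(uri)
    # The scheme has the form "dialect+driver", for example postgresql+psycopg2.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    sample = "postgresql+psycopg2://postgres:1234@localhost:5432/test"
    for name, value in describe_connection(sample).items():
        print(f"{name}: {value}")
```

<p>In other words, the sample string connects as the user postgres with the password 1234 to a database named test on the local machine, using the psycopg2 driver.</p>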
<p>The Airflow executor is currently set to SequentialExecutor. Change this to LocalExecutor:</p>
<pre><code class="lang-python">executor = LocalExecutor
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/executor.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow DAG Executor</em></p>
<p>The Airflow UI is currently cluttered with samples of example dags. In the airflow.cfg config file, find the load_examples variable, and set it to False.</p>
<pre><code class="lang-python">load_examples = <span class="hljs-literal">False</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/load_eg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Disable example dags</em></p>
<p>Restart the webserver, reload the web UI, and you should now have a clean UI:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/clean_dag.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow UI</em></p>
<h2 id="heading-how-to-use-the-postgres-operator">How to Use the Postgres Operator</h2>
<p>Start by importing the different Airflow operators. You'll also need to import the extract and transform Python files.</p>
<p>etl_pipeline.py</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.python <span class="hljs-keyword">import</span> PythonOperator
<span class="hljs-keyword">from</span> airflow.operators.empty <span class="hljs-keyword">import</span> EmptyOperator
<span class="hljs-keyword">from</span> airflow.providers.postgres.operators.postgres <span class="hljs-keyword">import</span> PostgresOperator

<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-keyword">from</span> extract <span class="hljs-keyword">import</span> extract_data



<span class="hljs-keyword">with</span> DAG(
  <span class="hljs-string">'etl_twitter_pipeline'</span>,
  description=<span class="hljs-string">"A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow"</span>,
  start_date=datetime(year=<span class="hljs-number">2023</span>, month=<span class="hljs-number">2</span>, day=<span class="hljs-number">5</span>),
  schedule_interval=timedelta(minutes=<span class="hljs-number">5</span>)
) <span class="hljs-keyword">as</span> dag:

  start_pipeline = EmptyOperator(
        task_id=<span class="hljs-string">'start_pipeline'</span>,
    )

  create_table = PostgresOperator(
    task_id=<span class="hljs-string">'create_table'</span>,
    postgres_conn_id=<span class="hljs-string">'postgres_connection'</span>,
    sql=<span class="hljs-string">'sql/create_table.sql'</span>
  )


  etl = PythonOperator(
    task_id = <span class="hljs-string">'extract_data'</span>,
    python_callable = extract_data
  )


  clean_table = PostgresOperator(
      task_id=<span class="hljs-string">'clean_sql_table'</span>,
      postgres_conn_id=<span class="hljs-string">'postgres_connection'</span>,
      sql=[<span class="hljs-string">"""TRUNCATE TABLE twitter_etl_table"""</span>]
  )

  end_pipeline = EmptyOperator(
      task_id=<span class="hljs-string">'end_pipeline'</span>,
  )
</code></pre>
<p>sql/create_table.sql</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> twitter_etl_table(
      <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
      datetime <span class="hljs-built_in">DATE</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
      username <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
      <span class="hljs-built_in">text</span> <span class="hljs-built_in">TEXT</span>,
      <span class="hljs-keyword">source</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>),
      location <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>)
    );
</code></pre>
<p>The create_table task makes a connection to postgres to create a table.</p>
<p>The ETL task makes a call to the extract_data() function which is where our ETL data processing takes place.</p>
<p>The clean_table task uses the PostgresOperator to truncate the table, removing its previous contents before new rows are inserted.</p>
<p>The end_pipeline task marks the end of the pipeline.</p>
<h3 id="heading-how-to-create-dependencies-between-tasks">How to Create Dependencies Between Tasks</h3>
<p>The last step is to create dependencies between the tasks so that Airflow knows the order in which to schedule them.</p>
<pre><code class="lang-python">start_pipeline &gt;&gt; create_table &gt;&gt; clean_table &gt;&gt; etl &gt;&gt; end_pipeline
</code></pre>
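<p>As a rough illustration of what this chain means for scheduling, the sketch below models the same five task names as a dependency graph with Python's standard-library <code>graphlib</code> module and resolves a run order. It is not how Airflow schedules tasks internally, and the <code>upstream</code> mapping and <code>run_order</code> function are hypothetical stand-ins for the DAG.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the DAG: each task maps to the set of tasks
# that must finish before it can start (its upstream tasks).
upstream = {
    "start_pipeline": set(),
    "create_table": {"start_pipeline"},
    "clean_table": {"create_table"},
    "etl": {"clean_table"},
    "end_pipeline": {"etl"},
}


def run_order(graph):
    """Return one valid execution order for the given dependency graph."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(", then ".join(run_order(upstream)))
```

<p>Because every task here has exactly one upstream task, only one order is possible. In a DAG with branches, several orders could be valid.</p>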
<h2 id="heading-how-to-test-the-workflow">How to Test the Workflow</h2>
<p>To start, click on the 'etl_twitter_pipeline' dag. Click on the graph view option, and you can now see the flow of your ETL pipeline and the dependencies between tasks.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screenshot-2023-02-27-at-17-04-55-etl_twitter_pipeline---Graph---Airflow.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow running data pipeline</em></p>
<p>And there you have it – your ETL data pipeline in Airflow. I hope you found it useful and yours is working properly.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Apache Airflow is an orchestration tool that makes it easy to schedule and monitor data pipelines. With your knowledge of Python, you can write DAG scripts to schedule and monitor your data pipeline.</p>
<p>In this guide, you learned how to set up an ETL pipeline using Airflow and also how to schedule and monitor the pipeline.</p>
<p>You also have seen the usage of some Airflow operators such as PythonOperator, PostgresOperator, and EmptyOperator.</p>
<p>I hope you learned something from this guide.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Configure a Laravel Project with a Custom Domain Name on Windows with XAMPP ]]>
                </title>
                <description>
                    <![CDATA[ By Abdulwahab Ashimi Laravel's simplicity and MVC architecture make it an ideal PHP framework for building web applications.  In this article, I will show you how to set up Laravel on your Windows machine and configure it to run on a custom domain na... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/configure-a-laravel-project-with-custom-domain-name/</link>
                <guid isPermaLink="false">66d45d974bc8f441cb6df807</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Laravel ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PHP ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 14 Feb 2023 02:58:05 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/cover--2-.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Abdulwahab Ashimi</p>
<p>Laravel's simplicity and MVC architecture make it an ideal PHP framework for building web applications. </p>
<p>In this article, I will show you how to set up Laravel on your Windows machine and configure it to run on a custom domain name.</p>
<p>This guide is best suited for a beginner trying to get Laravel up and running quickly and easily. But even as an advanced programmer, you'll likely find fresh insights into how you can simplify the process of configuring a Laravel project. So let's dive in!</p>
<h2 id="heading-how-to-install-and-start-xampp">How to Install and Start Xampp</h2>
<p>Xampp is an open-source tool that allows you to run an Apache server, MySQL database, and other tools from a single interface for development. </p>
<p>You can download and install Xampp from here: <a target="_blank" href="https://www.apachefriends.org/download.html">https://www.apachefriends.org/download.html</a>.</p>
<p>First, launch your Xampp Interface and start your Apache and MySQL Server.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/XAMPP-Control-Panel-v3.3.0-----Compiled_-Apr-6th-2021---2_8_2023-12_12_47-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>The Xampp Interface</em></p>
<p>Next, click on <code>Explorer</code> to launch your Xampp <code>htdocs</code> folder. Delete the files and folders inside the folder. Now you can set up your Laravel application.</p>
<h2 id="heading-how-to-set-up-laravel">How to Set Up Laravel</h2>
<p>Inside the <code>htdocs</code> folder, you can clone your existing Laravel application or set up a fresh installation using <code>composer create-project laravel/laravel example-app</code>. In this case, "example-app" is your project name but you can replace it with your preferred name for the project.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/htdocs-2_8_2023-12_25_22-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>The Laravel Directory Structure on htdocs</em></p>
<p>Open the htdocs folder in your preferred code editor. I will be using VSCode for my example.</p>
<p>Replace the <code>APP_URL</code> value in the <code>.env</code> file of your Laravel project with the custom domain name:</p>
<pre><code class="lang-env">APP_URL=http://project.test
</code></pre>
<p>You can replace "project.test" with your preferred test domain name.</p>
<h2 id="heading-how-to-configure-your-hosts-file">How to Configure Your Hosts File</h2>
<p>In your Windows file explorer, navigate to the "hosts" file located at <code>C:\Windows\System32\drivers\etc\hosts</code> and open it with VSCode (or whatever editor you're using). I'd advise that you use VSCode with admin privileges.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/1.-Host-File.png" alt="Image" width="600" height="400" loading="lazy">
<em>etc directory containing the hosts file and other files</em></p>
<p>Add the following line to the file:</p>
<pre><code><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span> project.test
</code></pre><p>This will map the hostname "project.test" to the local IP address "127.0.0.1".</p>
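<p>To make the mapping concrete, here is a minimal Python sketch of how a hosts file can be read as a lookup table from hostname to IP address. The <code>parse_hosts</code> function and the sample text are hypothetical, and real resolvers handle more cases than this.</p>

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname found in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        address, *names = entry.split()
        ipaddress.ip_address(address)  # raises ValueError on a malformed IP
        for name in names:
            # Assumption: the first entry for a hostname takes precedence.
            mapping.setdefault(name.lower(), address)
    return mapping


# Hypothetical sample content, not your real hosts file.
sample = """
# Sample hosts file content
127.0.0.1 localhost
127.0.0.1 project.test   # Laravel project
"""

if __name__ == "__main__":
    print(parse_hosts(sample)["project.test"])
```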
<p>Now, if you launch your Apache server and visit project.test in your browser, it loads an "Index of" directory listing for the project.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1675857587334/7a426d88-8963-4c75-a347-49054a33b8da.png" alt="Image" width="1600" height="644" loading="lazy">
<em>'Index of' listing of the Laravel directory in the browser</em></p>
<p>This happens because a Laravel application is served from its public folder. If you visit project.test/public in your browser, you will see the Laravel project. To avoid typing the extra path, you can configure the Apache root directory.</p>
<h2 id="heading-how-to-configure-your-apache-root-directory">How to Configure Your Apache Root Directory</h2>
<p>In your Windows file explorer, navigate to and open the "httpd.conf" file, which contains the Apache configuration. It's located at <code>C:\xampp\apache\conf\httpd.conf</code>. You should also use VSCode with admin privileges in this case.</p>
<p>Right below <code># Virtual hosts</code>, add the following:</p>
<pre><code class="lang-conf">&lt;VirtualHost *:80&gt;
    ServerName project.test
    DocumentRoot "C:/xampp/htdocs/project/public"
    &lt;Directory "C:/xampp/htdocs/project/public"&gt;
        Options Indexes FollowSymLinks Includes ExecCGI
        AllowOverride All
        Require all granted
    &lt;/Directory&gt;
&lt;/VirtualHost&gt;
</code></pre>
<p>Note: Replace <code>project.test</code> with your custom domain name and <code>C:/xampp/htdocs/project/public</code> with the path to your public folder.</p>
<p>Stop and restart the Apache server from your Xampp interface and try visiting "<a target="_blank" href="http://project.test"><strong>http://project.test</strong></a>" in your browser to see the Laravel project's homepage.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You can have multiple projects with their own custom domains by setting them up in different directories inside the htdocs directory and specifying their individual Apache configurations.</p>
<p>If this article was helpful to you, share it with friends or drop me a shout-out on Twitter <a target="_blank" href="https://twitter.com/Adebowale1st">@adebowale1st</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Install Apache Airflow on Windows without Docker ]]>
                </title>
                <description>
                    <![CDATA[ By Aviator Ifeanyichukwu Apache Airflow is a tool that helps you manage and schedule data pipelines. According to the documentation, it lets you "programmatically author, schedule, and monitor workflows."  Airflow is a crucial tool for data engineers... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/install-apache-airflow-on-windows-without-docker/</link>
                <guid isPermaLink="false">66d45dd5a326133d124409d5</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 02 Feb 2023 00:18:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/Airflow_Install.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aviator Ifeanyichukwu</p>
<p>Apache Airflow is a tool that helps you manage and schedule data pipelines. According to the <a target="_blank" href="https://airflow.apache.org/">documentation</a>, it lets you "programmatically author, schedule, and monitor workflows." </p>
<p>Airflow is a crucial tool for data engineers and scientists. In this article, I'll show you how to install it on Windows without Docker.</p>
<p>Although it's recommended to run Airflow with Docker, this method works for low-memory machines that are unable to run Docker. </p>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<p>This article assumes that you're familiar with using the command line and can set up your development environment as directed.</p>
<h3 id="heading-requirements">Requirements:</h3>
<p>You need Python 3.8 or higher, Windows 10 or higher, and the Windows Subsystem for Linux (WSL2) to follow this tutorial.</p>
<h3 id="heading-what-is-windows-subsystem-for-linux-wsl2">What is Windows Subsystem for Linux (WSL2)?</h3>
<p>WSL2 allows you to run Linux commands and programs on a Windows operating system. </p>
<p>It provides a Linux-compatible environment that runs natively on Windows, enabling users to use Linux command-line tools and utilities on a Windows machine.</p>
<p>You can read more <a target="_blank" href="https://www.freecodecamp.org/news/how-to-install-wsl2-windows-subsystem-for-linux-2-on-windows-10/">here to install WSL2</a> on your machine.</p>
<p>With Python and WSL2 installed and activated on your machine, launch the terminal by searching for Ubuntu from the start menu.</p>
<h2 id="heading-step-1-set-up-the-virtual-environment">Step 1: Set Up the Virtual Environment</h2>
<p>To work with Airflow on Windows, you need to set up a virtual environment. To do this, you'll need to install the virtualenv package. </p>
<p>Note: Make sure you are in your home directory by typing:</p>
<pre><code>cd ~
</code></pre><p>Then install the virtualenv package:</p>
<pre><code>pip install virtualenv
</code></pre><p>Create the virtual environment like this:</p>
<pre><code>virtualenv airflow_env
</code></pre><p>And then activate the environment:</p>
<pre><code>source airflow_env/bin/activate
</code></pre><h2 id="heading-step-2-set-up-the-airflow-directory">Step 2: Set Up the Airflow Directory</h2>
<p>Create a folder named airflow. Mine will be located at c/Users/[Username]. You can put yours wherever you prefer.</p>
<p>If you do not know how to navigate the terminal, you can follow the steps in the image below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/set-virtual_env-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create an Airflow directory from the terminal</em></p>
<p>Now that you have created this folder, you have to set it as an environment variable. Open the .bashrc file from the terminal with the command:</p>
<pre><code>nano ~/.bashrc
</code></pre><p>Then write the following:</p>
<pre><code>export AIRFLOW_HOME=<span class="hljs-regexp">/c/</span>Users/[YourUsername]/airflow
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2023/01/airflow_env_variable.png" alt="Image" width="600" height="400" loading="lazy">
<em>Setup Airflow directory path as an environment variable</em></p>
<p>Press Ctrl+S to save and Ctrl+X to exit the nano editor.</p>
<p>The path to the Airflow directory is now permanently saved as an environment variable. Any time you open a new terminal, you can navigate to the directory by typing:</p>
<pre><code>cd $AIRFLOW_HOME
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2023/01/airflow_home-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Navigate to Airflow directory using the environment variable</em></p>
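<p>If you want to check from Python which directory Airflow will treat as its home, a small sketch like the one below can help. The <code>airflow_home</code> function is a hypothetical helper. It relies on the fact that Airflow falls back to <code>~/airflow</code> when <code>AIRFLOW_HOME</code> is not set.</p>

```python
import os
from pathlib import Path


def airflow_home(environ=None):
    """Return the Airflow home directory implied by the environment."""
    env = os.environ if environ is None else environ
    # Airflow falls back to ~/airflow when AIRFLOW_HOME is not set.
    return Path(env.get("AIRFLOW_HOME", "~/airflow")).expanduser()


if __name__ == "__main__":
    print("Airflow home:", airflow_home())
```

<p>If this prints the default path instead of the folder you created, the variable is probably not visible to new processes, so check the line you added to <code>.bashrc</code> and open a fresh terminal.</p>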
<h2 id="heading-step-3-install-apache-airflow">Step 3: Install Apache Airflow</h2>
<p>With the virtual environment still active and the current directory pointing to the created Airflow folder, install Apache Airflow:</p>
<pre><code>pip install apache-airflow
</code></pre><p>Initialize the database: </p>
<pre><code>airflow db init
</code></pre><p>Create a folder named dags inside the airflow folder. This will be used to store all Airflow scripts.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/airflow_db_init-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>View files and folders generated by Airflow db init</em></p>
<h2 id="heading-step-4-create-an-airflow-user">Step 4: Create an Airflow User</h2>
<p>When Airflow is newly installed, you'll need to create a user. This user will be used to log in to the Airflow UI and perform some admin functions.</p>
<pre><code>airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email youremail@email.com
</code></pre><p>Check the created user: </p>
<pre><code>airflow users list
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2023/01/create-users.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create an Airflow user and list the created user</em></p>
<h2 id="heading-step-5-run-the-webserver">Step 5: Run the Webserver</h2>
<p>Run the scheduler with this command:</p>
<pre><code>airflow scheduler
</code></pre><p>Launch another terminal, activate the airflow virtual environment, cd to $AIRFLOW_HOME, and run the webserver:</p>
<pre><code>airflow webserver
</code></pre><p>If the default port 8080 is in use, change the port by typing: </p>
<pre><code>airflow webserver --port &lt;port number&gt;
</code></pre><p>Log in to the UI using the username created earlier with "airflow users create".</p>
<p>In the UI, you can view pre-created DAGs that come with Airflow by default.</p>
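<p>If you are not sure whether port 8080 is already taken before you start the webserver, you can probe it first. The sketch below is one simple way to do that with Python's standard library. The <code>port_is_free</code> function is a hypothetical helper and is not part of Airflow.</p>

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, which means
        # another process is already listening on that port.
        return probe.connect_ex((host, port)) != 0


if __name__ == "__main__":
    print("Port 8080 free:", port_is_free(8080))
```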
<h2 id="heading-how-to-create-the-first-dag">How to Create the First DAG</h2>
<p>A DAG (Directed Acyclic Graph) is defined in a Python script and organizes the tasks in a workflow along with their dependencies.</p>
<p>To create a DAG, navigate into the dags folder created inside the $AIRFLOW_HOME directory. Create a file named "hello_world_dag.py". Use VS Code if it's available.</p>
<p>Enter the code from the image below, and save it:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/01/first_dag.png" alt="Image" width="600" height="400" loading="lazy">
<em>Example DAG script in VS Code editor</em></p>
<p>Go to the Airflow UI and search for hello_world_dag. If it does not show up, try refreshing your browser.</p>
<p>That's it. This completes the installation of Apache Airflow on Windows.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>This guide covered how to install Apache Airflow on a Windows machine without Docker and how to write a DAG script. </p>
<p>I do hope the steps outlined above helped you install airflow on your Windows machine without Docker.</p>
<p>In subsequent articles, you will learn about Apache Airflow concepts and components. </p>
<p>Follow me on <a target="_blank" href="http://twitter.com/aviatorIfeanyi">Twitter</a> or <a target="_blank" href="https://www.linkedin.com/in/aviatorifeanyi/">LinkedIn</a> for more Analytics Engineering content.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Better Policy with Open Policy Agent and the Apache APISIX OPA Plugin ]]>
                </title>
                <description>
                    <![CDATA[ By Njoku Samson Ebere One common thing in every organisation is policy. Policies define how an organisation operates.  They are essential to the long-term success of an organisation. They preserve significant knowledge about how to comply with matter... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-open-policy-agent-and-apache-apisix-opa-plugin/</link>
                <guid isPermaLink="false">66d84faff6f7ca5a604624fa</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Back end development  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ backend ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Policy ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 24 Jan 2023 19:28:56 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/pexels-pixabay-357514.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Njoku Samson Ebere</p>
<p>One common thing in every organisation is <a target="_blank" href="https://www.openpolicyagent.org/docs/latest/philosophy/#policy">policy</a>. Policies define how an organisation operates. </p>
<p>They are essential to the long-term success of an organisation. They preserve significant knowledge about how to comply with matters such as legal requirements, work within technical constraints, and avoid repeating mistakes.   </p>
<p>Software follows the same pattern by adhering to rules that govern its behavior. These rules (or policies) may specify the application's environments, permitted network routes, allowed dependency versions, and when micro-services receive API requests. Usually, developers create them manually using documents like spreadsheets.   </p>
<p>The issue with this method is that it gradually becomes bulky. If each part of an application has its policy, things like authorization will be hard to manage across the whole application. There might also be the unnecessary repetition of policies across different parts of the application. </p>
<p>Aside from that, updating any policy will require the redeployment of the whole application. Fortunately, <strong>Open Policy Agent</strong> (OPA) offers a way to fix these issues.  </p>
<p>This article will explain what OPA is, how it works, what the OPA plugin entails, and how to use it.   </p>
<p>Let’s get started!</p>
<h2 id="heading-what-is-opa">What is OPA?</h2>
<p><a target="_blank" href="https://www.openpolicyagent.org/docs/latest/">OPA</a> is an open-source general-purpose policy engine. It can replace built-in policy function modules in software and help users decouple services from the policy engine.</p>
<p>OPA lets you build applications separately from their policies, so the same policies can be reused across many applications.   </p>
<p>The OPA policy handling method reduces complexity and gives more control to the application owner. OPA allows users to integrate it with other services, such as program libraries and HTTP APIs.</p>
<h2 id="heading-how-opa-works">How OPA Works</h2>
<p>OPA mediates between applications and policies to decide the rule to apply in handling a request. The image below describes its operation:</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_EFDBAAA4A6A8765E2C2CBACA1FE670A8A1A3C4F3B2852B5E7907B18C06560424_1662070285391_opa-service.svg" alt="Image" width="495.7270341207349" height="372.7821522309711" loading="lazy"></p>
<p>Here is a breakdown of the image above:</p>
<ol>
<li>A service (let’s say it is an authentication micro-service) receives a request (like a login request). For the service to decide how to handle the request, it needs to get the policy guiding authentication. That takes us to the next step.</li>
<li>The service sends a query (any JSON value) to OPA, asking which policy decision applies to the request it received.</li>
<li>OPA evaluates the query against the policies and data it has access to and reaches a decision.</li>
<li>Finally, OPA returns the policy decision (also any JSON value) to the service.</li>
</ol>
<p>That is a summary of how OPA works. You can imagine many services attached to OPA and OPA helping them decide how to handle requests or events instead of each service managing its policies. It provides a more robust system that is easy to maintain. </p>
<p><a target="_blank" href="https://dev.to/ebereplenty/introduction-to-apache-apisix-5b4">Apache APISIX</a> decided to integrate with OPA by providing the OPA plugin. That's what we'll discuss now.</p>
<h2 id="heading-apache-apisix-opa-plugin">Apache APISIX OPA Plugin</h2>
<p>The plugin allows <a target="_blank" href="https://apisix.apache.org/">Apache APISIX</a> users to conveniently introduce the policy capabilities provided by OPA when using Apache APISIX. It enables flexible authentication and access control features.</p>
<h3 id="heading-how-it-works">How It Works</h3>
<p>Apache APISIX OPA Plugin follows two main steps to carry out its task:</p>
<p>First, APISIX re-constructs any request data it receives into acceptable JSON data and makes a policy query to OPA with it. The query is usually referred to as an <strong>APISIX to OPA service</strong> request. See the following example:</p>
<pre><code>
{
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"http"</span>,
    <span class="hljs-string">"request"</span>: {
        <span class="hljs-string">"scheme"</span>: <span class="hljs-string">"http"</span>,
        <span class="hljs-string">"path"</span>: <span class="hljs-string">"\/get"</span>,
        <span class="hljs-string">"headers"</span>: {
            <span class="hljs-string">"user-agent"</span>: <span class="hljs-string">"curl\/7.68.0"</span>,
            <span class="hljs-string">"accept"</span>: <span class="hljs-string">"*\/*"</span>,
            <span class="hljs-string">"host"</span>: <span class="hljs-string">"127.0.0.1:9080"</span>
        },
        <span class="hljs-string">"query"</span>: {},
        <span class="hljs-string">"port"</span>: <span class="hljs-number">9080</span>,
        <span class="hljs-string">"method"</span>: <span class="hljs-string">"GET"</span>,
        <span class="hljs-string">"host"</span>: <span class="hljs-string">"127.0.0.1"</span>
    },
    <span class="hljs-string">"var"</span>: {
        <span class="hljs-string">"timestamp"</span>: <span class="hljs-number">1701234567</span>,
        <span class="hljs-string">"server_addr"</span>: <span class="hljs-string">"127.0.0.1"</span>,
        <span class="hljs-string">"server_port"</span>: <span class="hljs-string">"9080"</span>,
        <span class="hljs-string">"remote_port"</span>: <span class="hljs-string">"port"</span>,
        <span class="hljs-string">"remote_addr"</span>: <span class="hljs-string">"ip address"</span>
    },
    <span class="hljs-string">"route"</span>: {},
    <span class="hljs-string">"service"</span>: {},
    <span class="hljs-string">"consumer"</span>: {}
}
</code></pre><p>The JSON data above tells OPA that a user has made an HTTP request using the GET method via <code>127.0.0.1:9080/get</code> at <code>1701234567</code> timestamp (Wednesday, 29 November 2023 05:09:27).  </p>
<p>OPA now has to help Apache APISIX decide how to handle the request.</p>
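<p>The <code>timestamp</code> value in the request is a Unix timestamp in seconds. As a quick check, this short sketch converts it to a readable date. The <code>describe_timestamp</code> function is a hypothetical helper, and the conversion assumes UTC, which is what the date quoted above corresponds to.</p>

```python
from datetime import datetime, timezone


def describe_timestamp(epoch_seconds):
    """Format a Unix timestamp (in seconds) as a readable UTC date and time."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%A, %d %B %Y %H:%M:%S UTC")


if __name__ == "__main__":
    # prints "Wednesday, 29 November 2023 05:09:27 UTC"
    print(describe_timestamp(1701234567))
```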
<p>Next, OPA checks the policies and data available, compares them, and reaches the decision in JSON format below:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"result"</span>: {
        <span class="hljs-attr">"allow"</span>: <span class="hljs-literal">true</span>,
        <span class="hljs-attr">"reason"</span>: <span class="hljs-string">"test"</span>,
        <span class="hljs-attr">"headers"</span>: {
            <span class="hljs-attr">"an"</span>: <span class="hljs-string">"header"</span>
        },
        <span class="hljs-attr">"status_code"</span>: <span class="hljs-number">401</span>
    }
}
</code></pre>
<p>The policy decision above is an <strong>OPA service to APISIX</strong> response. It tells APISIX to accept the request due to the reason (test) given. When allow is false, Apache APISIX rejects it.  </p>
<p>The following is an explanation of some of the keys in the request and response above:</p>
<ul>
<li><code>type</code> indicates the request type (<code>http</code> or <code>stream</code>).</li>
<li><code>request</code> is used when the <code>type</code> is <code>http</code> and contains the basic request information like URL and headers.</li>
<li><code>var</code> holds the basic information about the requested connection (IP, port, server details, and request timestamp).</li>
<li><code>route</code>, <code>service</code>, and <code>consumer</code> contain the same data stored in APISIX. They require configuration for a user to see them after a transaction.</li>
<li><code>allow</code> is required and indicates whether the request is authorised to pass through APISIX.</li>
<li><code>reason</code>, <code>headers</code>, and <code>status_code</code> are optional and are returned when you configure a custom response.</li>
</ul>
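<p>To show how the response keys fit together, here is a minimal Python sketch of how a gateway could act on a decision like the one above. This is not the APISIX implementation. The <code>gateway_response</code> function is hypothetical, and the fallback status code of 403 is an assumption for cases where the policy does not return a <code>status_code</code>.</p>

```python
def gateway_response(decision):
    """Interpret an OPA decision document.

    Returns None when the request may pass through to the upstream,
    otherwise a (status, headers, body) tuple for the rejection.
    """
    result = decision.get("result", {})
    if result.get("allow") is True:
        return None  # allowed: forward the request upstream

    # Assumption: fall back to 403 when the policy gives no status_code.
    status = result.get("status_code", 403)
    headers = result.get("headers", {})
    body = result.get("reason", "")
    return status, headers, body


if __name__ == "__main__":
    denied = {"result": {"allow": False, "reason": "test", "status_code": 401}}
    print(gateway_response(denied))
```

<p>Notice that the optional keys only matter when <code>allow</code> is false. An allowed request is simply passed on.</p>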
<h3 id="heading-how-to-use-the-plugin">How to Use the Plugin</h3>
<p>This section will introduce you to some of the features of the plugin. You will see how to use Docker to build OPA services, create policy, create users’ data, create a custom route, test requests, and enable and disable the plugin.</p>
<h4 id="heading-how-to-use-dockerhttpswwwdockercom-to-build-opa-services">How to use <a target="_blank" href="https://www.docker.com/">docker</a> to build OPA services</h4>
<p>Use the command below to launch the OPA environment on port <code>8181</code>:</p>
<pre><code>docker run -d --name opa -p <span class="hljs-number">8181</span>:<span class="hljs-number">8181</span> openpolicyagent/opa:<span class="hljs-number">0.35</span><span class="hljs-number">.0</span> run -s
</code></pre><p>We will be using <a target="_blank" href="https://curl.se/">CURL</a> for the rest of this article. If you are new to it or you are coming from other programming languages, copy the requests or response code and <a target="_blank" href="https://curlconverter.com/">paste the code here</a> to convert it to your preferred language.</p>
<p>We will also stick to the <code>-H</code> and <code>-d</code> flags instead of <code>--header</code> and <code>--data-raw</code> respectively.</p>
<h4 id="heading-how-to-create-a-policy">How to create a policy</h4>
<p>Creating a policy follows the format below:</p>
<pre><code>curl -X PUT <span class="hljs-string">'127.0.0.1:8181/v1/policies/example1'</span> \
    -H <span class="hljs-string">'Content-Type: text/plain'</span> \
    -d <span class="hljs-string">'package example

import input.request

default allow = false

allow {
    # HTTP method must be GET
    request.method == "GET"
}'</span>
</code></pre><p>The code above came about through the following steps:</p>
<ul>
<li>State the route: 127.0.0.1:8181/v1/policies/example1.</li>
<li>Import Request: import input.request.</li>
<li>State that no request is allowed: default allow = false.</li>
<li>Specify what is permissible:</li>
</ul>
<pre><code>
allow {
    # HTTP method must be GET
    request.method == <span class="hljs-string">"GET"</span>
}
</code></pre><p>The code above states that the only acceptable HTTP method is GET. Every line in the body of the allow rule is a condition that must hold, apart from lines that begin with a #, which are comments.   </p>
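<p>Once the policy is loaded, you can ask OPA for a decision yourself. The sketch below builds such a query in Python, assuming OPA is running on port 8181 as in this article and that the policy's package is <code>example</code>, so the rule is reachable at <code>/v1/data/example/allow</code>. The function names are hypothetical, and <code>is_allowed</code> only works against a running OPA server.</p>

```python
import json
import urllib.request

OPA_URL = "http://127.0.0.1:8181"  # same port as the docker command above


def build_query(method, path="/get"):
    """Build a POST request asking OPA to evaluate example.allow."""
    # OPA's Data API expects the query input wrapped in an "input" key.
    payload = {"input": {"request": {"method": method, "path": path}}}
    return urllib.request.Request(
        OPA_URL + "/v1/data/example/allow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(method):
    """Send the query to a running OPA server and return its decision."""
    with urllib.request.urlopen(build_query(method), timeout=5) as response:
        # Treat a missing "result" as a denial.
        return json.load(response).get("result", False)


if __name__ == "__main__":
    query = build_query("GET")
    print(query.get_method(), query.full_url)
    print(query.data.decode("utf-8"))
```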
<p>You can add as many rules as you want based on the policies you have in mind. For example, the code below contains five rules that must be adhered to:</p>
<pre><code># Create policy
curl -X PUT <span class="hljs-string">'127.0.0.1:8181/v1/policies/example1'</span> \
    -H <span class="hljs-string">'Content-Type: text/plain'</span> \
    -d <span class="hljs-string">'package example

import input.request
import data.users

default allow = false

allow {
    # The request has a test-header header with the value only-for-test
    request.headers["test-header"] == "only-for-test"

    # The request method is GET
    request.method == "GET"

    # The request path starts with /get
    startswith(request.path, "/get")

    # GET parameter test exists and is not equal to abcd
    request.query["test"] != "abcd"

    # GET parameter user exists
    request.query["user"]
}'</span>
</code></pre><p>With the configuration we have made so far, everything will work fine. But what happens when our users get something wrong and an error they don’t understand is returned to them? They will become frustrated and left with a bad user experience. We can avoid that by adding a <strong>custom response.</strong>  </p>
<p>A custom response provides extra details (body, header, and status code) about the result of a transaction. Our request now becomes:</p>
<pre><code>
# Create policy
curl -X PUT <span class="hljs-string">'127.0.0.1:8181/v1/policies/example1'</span> \
    -H <span class="hljs-string">'Content-Type: text/plain'</span> \
    -d <span class="hljs-string">'package example

import input.request
import data.users

default allow = false

allow {
    # The request has a test-header header with the value only-for-test
    request.headers["test-header"] == "only-for-test"
    # The request method is GET
    request.method == "GET"
    # The request path starts with /get
    startswith(request.path, "/get")
    # GET parameter test exists and is not equal to abcd
    request.query["test"] != "abcd"
    # GET parameter user exists
    request.query["user"]
}

# custom response body (accepts a string or an object; an object is returned as JSON)
reason = users[request.query["user"]].reason {
    not allow
    request.query["user"]
}

# custom response header (The data of the object can be written in this way)
headers = users[request.query["user"]].headers {
    not allow
    request.query["user"]
}

# custom response status code
status_code = users[request.query["user"]].status_code {
    not allow
    request.query["user"]
}'</span>
</code></pre><p>When a user gets an error, it becomes easier to debug because the error comes with a <code>reason</code>, <code>headers</code> details, and <code>status_code</code>.</p>
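<p>If Rego is new to you, it may help to read the policy above as ordinary code. The Python sketch below mirrors its logic under simplifying assumptions: the <code>evaluate</code> function and the sample data are hypothetical, and Rego's handling of undefined values is approximated with dictionary lookups.</p>

```python
def evaluate(request, users):
    """Python mirror of the policy: returns what OPA would put under 'result'."""
    headers = request.get("headers", {})
    query = request.get("query", {})
    user = query.get("user")

    # All five conditions in the allow rule must hold.
    allow = (
        headers.get("test-header") == "only-for-test"
        and request.get("method") == "GET"
        and request.get("path", "").startswith("/get")
        and "test" in query and query["test"] != "abcd"
        and user is not None
    )
    decision = {"allow": allow}

    # The custom fields are only defined when the request is denied, a user
    # parameter is present, and that user has the matching key in the data.
    if not allow and user is not None:
        details = users.get(user, {})
        for key in ("reason", "headers", "status_code"):
            if key in details:
                decision[key] = details[key]
    return decision


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample_users = {"carla": {"reason": "Give you a string reason"}}
    denied = {"method": "POST", "path": "/get", "query": {"user": "carla"}}
    print(evaluate(denied, sample_users))
```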
<h4 id="heading-how-to-create-users-data">How to create users’ data</h4>
<p>The users' data is an object of objects, keyed by username. Each user's entry holds the custom response details (body, headers, and status code) that are returned when that user's request is rejected.</p>
<p>The code below is an example of users' data containing four users with different details:</p>
<pre><code># Create test user data
curl -X PUT <span class="hljs-string">'127.0.0.1:8181/v1/data/users'</span> \
    -H <span class="hljs-string">'Content-Type: text/plain'</span> \
    -d <span class="hljs-string">'{

    "alice": {
        "headers": {
            "Location": "http://example.com/auth"
        },
        "status_code": 302
    },

    "bob": {
        "headers": {
            "test": "abcd",
            "abce": "test"
        }
    },

    "carla": {
        "reason": "Give you a string reason"
    },

    "dylon": {
        "headers": {
            "Content-Type": "application/json"
        },
        "reason": {
            "code": 40001,
            "desc": "Give you a object reason"
        }
    }
}'</span>
</code></pre><p>Notice that each user’s custom details are optional and may differ for every user.</p>
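<p>To see how the policy and the users' data fit together, here is a rough Python model of the same decision logic. It is only an illustration of the rules above, not how OPA evaluates Rego internally. The names <code>USERS</code> and <code>evaluate</code> are made up for this sketch.</p>
<pre><code class="lang-python"># Same shape as the users' data uploaded to OPA above.
USERS = {
    "alice": {"headers": {"Location": "http://example.com/auth"}, "status_code": 302},
    "bob": {"headers": {"test": "abcd", "abce": "test"}},
    "carla": {"reason": "Give you a string reason"},
    "dylon": {
        "headers": {"Content-Type": "application/json"},
        "reason": {"code": 40001, "desc": "Give you a object reason"},
    },
}


def evaluate(request, users):
    """Return (allow, custom_response) following the rules of the example policy."""
    query = request.get("query", {})
    allow = (
        request.get("headers", {}).get("test-header") == "only-for-test"
        and request.get("method") == "GET"
        and request.get("path", "").startswith("/get")
        and "test" in query
        and query["test"] != "abcd"
        and "user" in query
    )
    if allow:
        return True, {}
    # The custom fields only apply to rejected requests, and each one is
    # optional: a field is returned only if the user's entry defines it.
    user = users.get(query.get("user"), {})
    fields = ("reason", "headers", "status_code")
    return False, {name: user[name] for name in fields if name in user}
</code></pre>
<p>A rejected request from <code>alice</code> yields a redirect status and a <code>Location</code> header, while one from <code>carla</code> yields only a string reason.</p>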
<h4 id="heading-how-to-create-a-custom-route-and-enable-the-plugin">How to create a custom route and enable the plugin</h4>
<p>The APISIX OPA plugin is configured on a route, so you can customize each route as you need. Start with a route like the one in the code below:</p>
<pre><code>curl -X PUT <span class="hljs-string">'http://127.0.0.1:9080/apisix/admin/routes/r1'</span> \
    -H <span class="hljs-string">'X-API-KEY: &lt;api-key&gt;'</span> \
    -H <span class="hljs-string">'Content-Type: application/json'</span> \
    -d <span class="hljs-string">'{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {},
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'</span>
</code></pre><p>For this to work, the plugin has to be enabled. Enter the needed configuration into the <code>plugins</code> object to turn it on. So we have:</p>
<pre><code>
curl -X PUT <span class="hljs-string">'http://127.0.0.1:9080/apisix/admin/routes/r1'</span> \
    -H <span class="hljs-string">'X-API-KEY: &lt;api-key&gt;'</span> \
    -H <span class="hljs-string">'Content-Type: application/json'</span> \
    -d <span class="hljs-string">'{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {
        "opa": {
            "host": "http://127.0.0.1:8181",
            "policy": "example1"
        }
    },
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'</span>
</code></pre><p>Now that the plugin is enabled, you can use your route as you see fit.</p>
<h4 id="heading-how-to-test-the-requests">How to test the requests</h4>
<p>So far, we have created a policy, the users’ data, and a custom route, and we have enabled the Apache APISIX OPA plugin. Let’s now test this setup and see the responses we get in different scenarios.</p>
<p>Here's a test for when a request is allowed:</p>
<p>Request:</p>
<pre><code>
curl -XGET <span class="hljs-string">'127.0.0.1:9080/get?test=abcd1&amp;user=dylon'</span> \
    --header <span class="hljs-string">'test-header: only-for-test'</span>
</code></pre><p>Response:</p>
<pre><code>{
    <span class="hljs-string">"args"</span>: {
        <span class="hljs-string">"test"</span>: <span class="hljs-string">"abcd1"</span>,
        <span class="hljs-string">"user"</span>: <span class="hljs-string">"dylon"</span>
    },
    <span class="hljs-string">"headers"</span>: {
        <span class="hljs-string">"Test-Header"</span>: <span class="hljs-string">"only-for-test"</span>,
        <span class="hljs-string">"with"</span>: <span class="hljs-string">"more"</span>
    },
    <span class="hljs-string">"origin"</span>: <span class="hljs-string">"127.0.0.1"</span>,
    <span class="hljs-string">"url"</span>: <span class="hljs-string">"http://127.0.0.1/get?test=abcd1&amp;user=dylon"</span>
}
</code></pre><p>Here's a test for when a request is rejected and the status code and response headers are rewritten:</p>
<p>Request:</p>
<pre><code>
curl -XGET <span class="hljs-string">'127.0.0.1:9080/get?test=abcd&amp;user=alice'</span> \
    --header <span class="hljs-string">'test-header: only-for-test'</span>
</code></pre><p>Response:</p>
<pre><code>
HTTP/<span class="hljs-number">1.1</span> <span class="hljs-number">302</span> Moved Temporarily
<span class="hljs-attr">Date</span>: Mon, <span class="hljs-number">20</span> Dec <span class="hljs-number">2021</span> <span class="hljs-number">09</span>:<span class="hljs-number">37</span>:<span class="hljs-number">35</span> GMT
Content-Type: text/html
Content-Length: <span class="hljs-number">142</span>
<span class="hljs-attr">Connection</span>: keep-alive
<span class="hljs-attr">Location</span>: http:<span class="hljs-comment">//example.com/auth</span>
Server: APISIX/<span class="hljs-number">2.11</span><span class="hljs-number">.0</span>
</code></pre><p>Here's a test for when a request is rejected and a custom response header is returned:</p>
<p>Request:</p>
<pre><code>
curl -XGET <span class="hljs-string">'127.0.0.1:9080/get?test=abcd&amp;user=bob'</span> \
    --header <span class="hljs-string">'test-header: only-for-test'</span>
</code></pre><p>Response:</p>
<pre><code>
HTTP/<span class="hljs-number">1.1</span> <span class="hljs-number">403</span> Forbidden
<span class="hljs-attr">Date</span>: Mon, <span class="hljs-number">20</span> Dec <span class="hljs-number">2021</span> <span class="hljs-number">09</span>:<span class="hljs-number">38</span>:<span class="hljs-number">27</span> GMT
Content-Type: text/html; charset=utf<span class="hljs-number">-8</span>
Content-Length: <span class="hljs-number">150</span>
<span class="hljs-attr">Connection</span>: keep-alive
<span class="hljs-attr">abce</span>: test
<span class="hljs-attr">test</span>: abcd
<span class="hljs-attr">Server</span>: APISIX/<span class="hljs-number">2.11</span><span class="hljs-number">.0</span>
</code></pre><p>Here's a test for when a request is rejected and a custom response (string) is returned:</p>
<p>Request:</p>
<pre><code>
curl -XGET <span class="hljs-string">'127.0.0.1:9080/get?test=abcd&amp;user=carla'</span> \
    --header <span class="hljs-string">'test-header: only-for-test'</span>
</code></pre><p>Response:</p>
<pre><code>
HTTP/<span class="hljs-number">1.1</span> <span class="hljs-number">403</span> Forbidden
<span class="hljs-attr">Date</span>: Mon, <span class="hljs-number">20</span> Dec <span class="hljs-number">2021</span> <span class="hljs-number">09</span>:<span class="hljs-number">38</span>:<span class="hljs-number">58</span> GMT
Content-Type: text/plain; charset=utf<span class="hljs-number">-8</span>
Transfer-Encoding: chunked
<span class="hljs-attr">Connection</span>: keep-alive
<span class="hljs-attr">Server</span>: APISIX/<span class="hljs-number">2.11</span><span class="hljs-number">.0</span>

Give you a string reason
</code></pre><p>And here's a test for when a request is rejected and a custom response (JSON) is returned:</p>
<p>Request:</p>
<pre><code>
curl -XGET <span class="hljs-string">'127.0.0.1:9080/get?test=abcd&amp;user=dylon'</span> \
    --header <span class="hljs-string">'test-header: only-for-test'</span>
</code></pre><p>Response:</p>
<pre><code>
HTTP/<span class="hljs-number">1.1</span> <span class="hljs-number">403</span> Forbidden
<span class="hljs-attr">Date</span>: Mon, <span class="hljs-number">20</span> Dec <span class="hljs-number">2021</span> <span class="hljs-number">09</span>:<span class="hljs-number">42</span>:<span class="hljs-number">12</span> GMT
Content-Type: application/json
Transfer-Encoding: chunked
<span class="hljs-attr">Connection</span>: keep-alive
<span class="hljs-attr">Server</span>: APISIX/<span class="hljs-number">2.11</span><span class="hljs-number">.0</span>

{<span class="hljs-string">"code"</span>:<span class="hljs-number">40001</span>,<span class="hljs-string">"desc"</span>:<span class="hljs-string">"Give you a object reason"</span>}
</code></pre><h4 id="heading-how-to-disable-the-plugin">How to disable the plugin</h4>
<p>To disable the APISIX OPA plugin, remove all the configurations we added when we set up a custom route and enabled the plugin. We now have:</p>
<pre><code>
curl -X PUT <span class="hljs-string">'http://127.0.0.1:9080/apisix/admin/routes/r1'</span> \
    -H <span class="hljs-string">'X-API-KEY: &lt;api-key&gt;'</span> \
    -H <span class="hljs-string">'Content-Type: application/json'</span> \
    -d <span class="hljs-string">'{
    "uri": "/*",
    "methods": [
        "GET",
        "POST",
        "PUT",
        "DELETE"
    ],
    "plugins": {},
    "upstream": {
        "nodes": {
            "httpbin.org:80": 1
        },
        "type": "roundrobin"
    }
}'</span>
</code></pre><p>With an empty <code>plugins</code> object, the OPA plugin is no longer active on the route. It is that easy because of Apache APISIX’s dynamic nature.</p>
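<p>The enabled and disabled route bodies differ only in the contents of <code>plugins</code>. The Python sketch below makes that explicit. The helper name <code>route_body</code> is hypothetical and not part of APISIX. It only builds the JSON body that the <code>curl</code> commands above send.</p>
<pre><code class="lang-python">import json


def route_body(enable_opa):
    """Build the route definition used in this article, with or without OPA."""
    plugins = {}
    if enable_opa:
        # Same plugin settings as in the "enable the plugin" step.
        plugins["opa"] = {"host": "http://127.0.0.1:8181", "policy": "example1"}
    return {
        "uri": "/*",
        "methods": ["GET", "POST", "PUT", "DELETE"],
        "plugins": plugins,
        "upstream": {"nodes": {"httpbin.org:80": 1}, "type": "roundrobin"},
    }


# The JSON string you would pass to curl with -d.
enabled_json = json.dumps(route_body(True), indent=4)
disabled_json = json.dumps(route_body(False), indent=4)
</code></pre>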
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>This article aimed to introduce you to the Apache APISIX OPA plugin and walk you through some of its features. </p>
<p>We began by looking at what OPA is and why APISIX adopted it by employing a plugin. Then we discussed how the plugin works and how we can use it.  </p>
<p>Apache APISIX currently has more than ten authentication and authorization-related plugins that support interfacing with mainstream authentication/authorization services in the industry.  </p>
<p>If you need to interface with other authentication authorities, you can visit <a target="_blank" href="https://github.com/apache/apisix/issues">Apache APISIX's GitHub</a> and leave your suggestions via an issue or subscribe to <a target="_blank" href="https://apisix.apache.org/zh/docs/general/subscribe-guide">Apache APISIX's mailing list</a> to express your ideas.  </p>
<p>I hope this article helps you understand how to use OPA in Apache APISIX so you can start adopting it yourself. I also encourage you to take the time to visit the <a target="_blank" href="https://apisix.apache.org/docs/apisix/plugins/opa/">Apache APISIX OPA plugin documentation</a> to see other use cases for the plugin. The more you practice with it, the better you get at using it.  </p>
<p>Happy Policy Making!  </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Apache Airflow to Schedule and Manage Workflows ]]>
                </title>
                <description>
                    <![CDATA[ Apache Airflow is an open-source workflow management system that makes it easy to write, schedule, and monitor workflows. A workflow is a sequence of operations, from start to finish. The workflows in Airflow are authored as Directed Acyclic Graphs (... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-apache-airflow-to-manage-workflows/</link>
                <guid isPermaLink="false">66d460f2a326133d12440a7c</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ workflow ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sameer Shukla ]]>
                </dc:creator>
                <pubDate>Fri, 13 May 2022 15:11:17 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/05/My-project--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Apache Airflow is an open-source workflow management system that makes it easy to write, schedule, and monitor workflows.</p>
<p>A workflow is a sequence of operations, from start to finish. The workflows in Airflow are authored as Directed Acyclic Graphs (DAGs) using standard Python programming.</p>
<p>You can configure when a DAG should start execution and when it should finish. You can also set up workflow monitoring through the very intuitive Airflow UI.</p>
<p>You can be up and running on Airflow in no time – it’s easy to use and you only need some basic Python knowledge. It's also completely open source.</p>
<p>Apache Airflow also has a helpful collection of operators that work easily with the Google Cloud, Azure, and AWS platforms.</p>
<p>In this article, we are going to cover:</p>
<ul>
<li><p>What are Directed Acyclic Graphs (DAGs)?</p>
</li>
<li><p>What are Operators?</p>
</li>
<li><p>How to Create your First DAG</p>
</li>
<li><p>A Use-Case for DAGs</p>
</li>
<li><p>How to Set Up Cloud Composer</p>
</li>
<li><p>How to Run the Pipeline on Composer</p>
</li>
</ul>
<h2 id="heading-what-are-directed-acyclic-graphs-or-dags">What are Directed Acyclic Graphs, or DAGs?</h2>
<p>DAGs, or Directed Acyclic Graphs, have nodes and edges. DAGs should not contain any loops and their edges should always be directed.</p>
<p>In short, a DAG is a data pipeline and each node in a DAG is a task. Some examples of tasks are downloading a file from GCS (Google Cloud Storage) to the local filesystem, applying business logic to a file using Pandas, querying a database, making a REST call, or uploading a file back to a GCS bucket.</p>
<h3 id="heading-visualizing-dags">Visualizing DAGs</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-47.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Correct DAG with no loops</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-48.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Incorrect DAG with Loop</em></p>
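<p>To make the "no loops" rule concrete, here is a small sketch that uses Python's standard <code>graphlib</code> module (Python 3.9 or later) rather than Airflow itself. The task names and the <code>execution_order</code> helper are made up for illustration.</p>
<pre><code class="lang-python">from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
pipeline = {
    "transform": {"download"},
    "upload": {"transform"},
    "notify": {"upload"},
}


def execution_order(graph):
    """Return a valid run order, or None if the graph contains a loop."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


# A valid DAG can be ordered from its first task to its last.
order = execution_order(pipeline)

# Making "download" depend on "notify" closes a loop, so no order exists.
looped = execution_order({**pipeline, "download": {"notify"}})
</code></pre>
<p>A graph with a loop has no valid starting task, which is why a workflow has to be acyclic before it can be scheduled.</p>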
<p>You can schedule DAGs in Airflow using the <code>schedule_interval</code> attribute. If you set it to <code>None</code>, the DAG is not scheduled and runs only when you trigger it manually, for example from the Airflow UI.</p>
<p>You can schedule the DAG to run hourly, daily, weekly, monthly, or yearly using the cron presets (@hourly, @daily, @weekly, @monthly, @yearly).</p>
<p>If you need a schedule that the presets don't cover, such as every 5 minutes, every day at 14:00, or every Thursday at 10:00am, use a cron expression instead. For example:</p>
<p>*/5 * * * * = Every 5 minutes</p>
<p>0 14 * * * = Every day at 14:00</p>
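<p>To show how the five fields of a cron expression (minute, hour, day of month, month, day of week) line up with a point in time, here is a deliberately simplified matcher. It is only a sketch: it handles <code>*</code>, <code>*/n</code>, and single numbers, and it is not how Airflow parses schedules. The function names are made up for this example.</p>
<pre><code class="lang-python">from datetime import datetime


def field_matches(field, value):
    """Match one cron field; supports '*', '*/n' and a single number only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value


def cron_matches(expression, moment):
    """Simplified check of a five-field cron expression against a datetime."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return all([
        field_matches(minute, moment.minute),
        field_matches(hour, moment.hour),
        field_matches(day, moment.day),
        field_matches(month, moment.month),
        field_matches(weekday, cron_weekday),
    ])


# 27 January 2022 was a Thursday, so "0 10 * * 4" matches at 10:00.
thursday_run = cron_matches("0 10 * * 4", datetime(2022, 1, 27, 10, 0))
</code></pre>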
<h2 id="heading-what-are-operators">What are Operators?</h2>
<p>A DAG consists of multiple tasks. You create tasks in a DAG using operators, which become the nodes of the graph.</p>
<p>There are various ready-to-use operators available in Airflow, such as:</p>
<ul>
<li><p>LocalFilesystemToGCSOperator – use it to upload a file from the local filesystem to a GCS bucket.</p>
</li>
<li><p>PythonOperator – use it to execute Python callables.</p>
</li>
<li><p>EmailOperator – use it to send email.</p>
</li>
<li><p>SimpleHttpOperator – use it to make an HTTP request.</p>
</li>
</ul>
<h2 id="heading-how-to-create-your-first-dag">How to Create Your First DAG</h2>
<p>The example DAG we are going to create consists of only one operator (the Python operator) which executes a Python function.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> airflow.operators.python_operator <span class="hljs-keyword">import</span> PythonOperator

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">message</span>():</span>
    print(<span class="hljs-string">"First DAG executed Successfully!!"</span>)

<span class="hljs-keyword">with</span> DAG(dag_id=<span class="hljs-string">"FirstDAG"</span>, start_date=datetime(<span class="hljs-number">2022</span>,<span class="hljs-number">1</span>,<span class="hljs-number">23</span>), schedule_interval=<span class="hljs-string">"@hourly"</span>,
         catchup=<span class="hljs-literal">False</span>) <span class="hljs-keyword">as</span> dag:

    task = PythonOperator(
        task_id=<span class="hljs-string">"task"</span>,
        python_callable=message)

task
</code></pre>
<p>The first step is to import the modules required for DAG development. The <code>with DAG</code> block defines the DAG, that is, the data pipeline, with basic parameters like <code>dag_id</code>, <code>start_date</code>, and <code>schedule_interval</code>.</p>
<p>The <code>schedule_interval</code> is configured as @hourly which indicates that the DAG will run every hour.</p>
<p>The task in the DAG is to print a message in the logs. We have used the PythonOperator here. This operator is used to execute any callable Python function.</p>
<p>Once the execution is complete, we should see the message “First DAG executed Successfully!!” in the logs. We are going to execute all our DAGs on GCP Cloud Composer.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-49.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Airflow UI</em></p>
<p>After successful execution, the message is printed on the logs:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-50.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Logs</em></p>
<h2 id="heading-a-use-case-for-dags">A Use-Case for DAGs</h2>
<p>The use-case we are going to cover in this article involves a three-step process.</p>
<p>In step one, we will upload a .csv file to an input GCS bucket. This file will be processed by a PythonOperator in the DAG. The function executed by the PythonOperator contains Pandas code, which shows how you can use Pandas to transform data in an Airflow data pipeline.</p>
<p>In step two, we'll upload the transformed .csv file to another GCS bucket. This task will be handled by the GCSToGCSOperator.</p>
<p>Step three is to send a status email indicating that the pipeline execution is complete. This will be handled by the EmailOperator.</p>
<p>In this use-case we will also cover how to notify the team via email if any step of the execution fails.</p>
<h2 id="heading-how-to-install-cloud-composer">How to Install Cloud Composer</h2>
<p>In GCP, Cloud Composer is a managed service built on Apache Airflow. Cloud Composer has default integration with other GCP Services such as GCS, BigQuery, Cloud Dataflow and so on.</p>
<p>First, we need to create the Cloud Composer Environment. So search for Cloud Composer on the search bar and click on "Create Environment" as shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-51.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Create Environment</em></p>
<p>In the Environments option, I am selecting the "Composer 1" option as we don’t need auto-scaling.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-54.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Once we select the type of composer we need, we'll need to do some basic configuration just like in any GCP managed service ("Instance Name", "Location", and so on).</p>
<p>The node count here should always be 3, as GCP will set up the 3 services needed for Airflow.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-56.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Once we're done with that, it'll set up an Airflow instance for us. To upload a DAG, we need to open the DAGs folder shown in the "DAGs folder" section.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-57.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Airflow Instance</em></p>
<p>If you go to the "Kubernetes Engine" section on GCP, you can see 3 services up and running:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-58.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Kubernetes Engine</em></p>
<p>All DAGs will reside in a bucket created by Airflow.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-59.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Airflow Instance bucket for DAGs</em></p>
<h2 id="heading-how-to-create-and-run-the-pipeline-on-composer">How to Create and Run the Pipeline on Composer</h2>
<p>In the pipeline, we have two buckets. The input_csv bucket contains the CSV file that requires some transformation, and the transformed_csv bucket is where the file will be uploaded once the transformation is done.</p>
<p>The entire pipeline code is the following:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-keyword">from</span> airflow.utils.email <span class="hljs-keyword">import</span> send_email
<span class="hljs-keyword">from</span> airflow.operators.python_operator <span class="hljs-keyword">import</span> PythonOperator
<span class="hljs-keyword">from</span> airflow.operators.email_operator <span class="hljs-keyword">import</span> EmailOperator
<span class="hljs-keyword">from</span> airflow.providers.google.cloud.transfers.gcs_to_gcs <span class="hljs-keyword">import</span> GCSToGCSOperator


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transformation</span>():</span>
    trainDetailsDF = pd.read_csv(<span class="hljs-string">'gs://input_csv/Event_File_03_16_2022.csv'</span>)
    print(trainDetailsDF.head())


<span class="hljs-keyword">with</span> DAG(
        dag_id=<span class="hljs-string">"pipeline_demo"</span>,
        schedule_interval=<span class="hljs-string">"@hourly"</span>,
        start_date=datetime(<span class="hljs-number">2022</span>, <span class="hljs-number">1</span>, <span class="hljs-number">23</span>),
        catchup=<span class="hljs-literal">False</span>
) <span class="hljs-keyword">as</span> dag:
    buisness_logic_task = PythonOperator(
        task_id=<span class="hljs-string">'ApplyBusinessLogic'</span>,
        python_callable=transformation,
        dag=dag)

    upload_task = GCSToGCSOperator(
        task_id=<span class="hljs-string">'upload_task'</span>,
        source_bucket=<span class="hljs-string">'input_csv'</span>,
        destination_bucket=<span class="hljs-string">'transformed_csv'</span>,
        source_object=<span class="hljs-string">'Event_File_03_16_2022.csv'</span>,
        move_object=<span class="hljs-literal">True</span>,
        dag=dag
    )

    email_task = EmailOperator(
        task_id=<span class="hljs-string">"SendStatusEmail"</span>,
        depends_on_past=<span class="hljs-literal">True</span>,
        to=<span class="hljs-string">'youremail'</span>,
        subject=<span class="hljs-string">'Pipeline Status!'</span>,
        html_content=<span class="hljs-string">'&lt;p&gt;Hi Everyone, Process completed Successfully!&lt;/p&gt;'</span>,
        dag=dag)

    buisness_logic_task &gt;&gt; upload_task &gt;&gt; email_task
</code></pre>
<p>In the first task, all we are doing is creating a DataFrame from the input file and printing the head elements. In the logs it looks like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-60.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>DataFrame Head</em></p>
<p>In the second task, GCSToGCSOperator, we have used the attribute move_object=True, which deletes the file from the source bucket once it has been copied to the destination bucket.</p>
<p>Once we upload the DAG file to the DAGs bucket, we can see that the DAG is being scheduled. The name of the DAG is "pipeline_demo".</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-61.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>DAGs</em></p>
<p>Note that you might encounter import errors after uploading or executing a DAG, something like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-62.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can install the missing packages through the "PYPI Packages" option in GCP. This will update the environment after a few minutes.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-63.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Updating environment with missing Packages</em></p>
<p>To open the Airflow UI, click on the "Airflow" link under Airflow webserver.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-64.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Airflow Instance, click Airflow link to Open UI</em></p>
<p>The Airflow UI looks like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-65.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Upon successful execution of the pipeline, here's what you should see:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-66.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To send an email if a task fails, you can use the <code>on_failure_callback</code> parameter like this:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify_email</span>(<span class="hljs-params">context</span>):</span>
    <span class="hljs-comment"># Airflow passes the task context dictionary to the callback</span>
    task_id = context[<span class="hljs-string">"task_instance"</span>].task_id
    title = <span class="hljs-string">"Airflow alert: {} Failed"</span>.format(task_id)
    body = <span class="hljs-string">"""
    Task Name: {} Failed.&lt;br&gt;
    """</span>.format(task_id)
    send_email(<span class="hljs-string">'youremail'</span>, title, body)


buisness_logic_task = PythonOperator(
    task_id=<span class="hljs-string">'ApplyBusinessLogic'</span>,
    python_callable=transformation,
    on_failure_callback=notify_email,
    dag=dag)
</code></pre>
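<p>Airflow calls a failure callback with a context dictionary that includes a <code>task_instance</code> entry. The sketch below shows how the alert text can be built from that entry, without needing an Airflow installation. The <code>build_alert</code> helper and the fake context are made up for this example, and the stand-in object models only <code>task_id</code>.</p>
<pre><code class="lang-python">from types import SimpleNamespace


def build_alert(context):
    """Build the subject and body of a failure email from the task context."""
    task_id = context["task_instance"].task_id
    subject = "Airflow alert: {} Failed".format(task_id)
    body = "Task Name: {} Failed.".format(task_id)
    return subject, body


# Stand-in for the context Airflow would pass to the callback.
fake_context = {"task_instance": SimpleNamespace(task_id="ApplyBusinessLogic")}
subject, body = build_alert(fake_context)
</code></pre>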
<p>We're configuring the notification emails on Cloud Composer through SendGrid. Also, once you are done with Cloud Composer, don't forget to delete the environment, as it cannot be stopped.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Apache Airflow is a fairly easy-to-use tool. There's also a lot of help now available on the internet and the community is growing.</p>
<p>GCP simplified working with Airflow a lot by creating a separate managed service for it.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Apache Cassandra Beginner Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ By Sebastian Sigl There are lots of data-storage options available today. You have to choose between managed or unmanaged, relational or NoSQL, write- or read-optimized, proprietary or open-source — and it doesn't end there. Once you begin your searc... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-apache-cassandra-beginner-tutorial/</link>
                <guid isPermaLink="false">66d461053bc3ab877dae2232</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ backend ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cassandra ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NoSQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jul 2021 13:13:02 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/07/cassandra-welcome.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Sebastian Sigl</p>
<p>There are lots of data-storage options available today. You have to choose between managed or unmanaged, relational or NoSQL, write- or read-optimized, proprietary or open-source — and it doesn't end there.</p>
<p>Once you begin your search, you will end up in the universe that is database marketing. All of the vendors will tell you why their database is fantastic. </p>
<p>Unfortunately, it's difficult to find out when not to use a specific database, because this is not an attractive selling point.</p>
<p>If you know what questions to ask, you will eventually understand all the essential properties of a given system. In the end, your choice will depend on your expertise and your requirements.</p>
<p>In this tutorial I will introduce you to Apache Cassandra, a distributed, horizontally scalable, open-source database. Or as Cassandra users like to describe Cassandra: "It's a database that puts you in the driver seat."</p>
<p>I will share the essential gotchas and provide references to documentation. I’ll also provide insights based on my experience of running Cassandra on a large scale at work, with executable examples wherever possible.</p>
<p>Here’s an overview of everything you'll learn:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-61.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Along the way, you will learn to ask fundamental questions that will help you choose a database that suits your needs. You'll also learn about other popular databases like Spanner, Cockroach, or FaunaDB, and how they can serve different use-cases.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-how-to-set-up-a-cassandra-cluster">How to Set Up a Cassandra Cluster</a></li>
<li><a class="post-section-overview" href="#heading-cassandra-architecture">Cassandra Architecture</a><ul>
<li><a class="post-section-overview" href="#heading-decentralization">Decentralization</a></li>
<li><a class="post-section-overview" href="#heading-every-node-is-a-coordinator">Every Node Is a Coordinator</a></li>
<li><a class="post-section-overview" href="#heading-data-partitioning">Data Partitioning</a></li>
<li><a class="post-section-overview" href="#heading-replication">Replication</a></li>
<li><a class="post-section-overview" href="#heading-consistency-level">Consistency Level</a></li>
<li><a class="post-section-overview" href="#heading-tune-for-consistency-by-setting-up-a-strong-consistency-application">Tune for Consistency by Setting up a Strong Consistency Application</a></li>
<li><a class="post-section-overview" href="#heading-tune-for-performance-by-using-eventual-consistency">Tune for Performance by Using Eventual Consistency</a></li>
<li><a class="post-section-overview" href="#heading-understanding-compaction">Understanding Compaction</a></li>
<li><a class="post-section-overview" href="#heading-presorting-data-on-cassandra-nodes">Presorting Data on Cassandra Nodes</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-data-modeling">Data Modeling</a><ul>
<li><a class="post-section-overview" href="#heading-keep-data-in-sync-using-batch-statements">Keep Data in Sync Using <code>BATCH</code> Statements</a></li>
<li><a class="post-section-overview" href="#heading-use-foreign-keys-instead-of-duplicating-data-in-cassandra">Use Foreign Keys Instead of Duplicating Data in Cassandra</a></li>
<li><a class="post-section-overview" href="#heading-indexes-in-cassandra">Indexes in Cassandra</a></li>
<li><a class="post-section-overview" href="#heading-materialized-views">Materialized Views</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-running-a-cluster">Running a Cluster</a><ul>
<li><a class="post-section-overview" href="#heading-fully-managed-cassandra">Fully Managed Cassandra</a></li>
<li><a class="post-section-overview" href="#heading-self-managed-cassandra">Self-Managed Cassandra</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-other-learnings">Other Learnings</a><ul>
<li><a class="post-section-overview" href="#heading-data-migrations">Data Migrations</a></li>
<li><a class="post-section-overview" href="#heading-tombstones">Tombstones</a></li>
<li><a class="post-section-overview" href="#heading-updates-are-just-inserts-and-vice-versa"><code>UPDATE</code>s Are Just <code>INSERT</code>s, and Vice Versa</a></li>
<li><a class="post-section-overview" href="#heading-lightweight-transactions">Lightweight Transactions</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
<li><a class="post-section-overview" href="#heading-references">References</a></li>
</ul>
<h2 id="heading-how-to-set-up-a-cassandra-cluster">How to Set Up a Cassandra Cluster</h2>
<p>To execute the examples of this tutorial, you'll need a running Cassandra cluster. You can get this up and running quickly by using <a target="_blank" href="https://docs.docker.com/get-docker/">Docker</a>.</p>
<blockquote>
<p><strong>Required Docker settings</strong>  </p>
<p>Your device should have a minimum of 8GB of memory and at least 8GB of free disk space. Your Docker settings should be updated to be able to use at least 6GB of memory, or better, 8GB.  </p>
<p>To apply these suggestions, open your Docker preferences, go to Resources, and increase your memory threshold.</p>
</blockquote>
<p>Cassandra is built for scale, and some features only work on a multi-node Cassandra cluster, so let’s start one locally.</p>
<p>For Linux and Mac, run the following commands:</p>
<pre><code class="lang-shell"># Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2 cassandra:3.7
INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)
</code></pre>
<p>For Windows, run the following commands in PowerShell:</p>
<pre><code class="lang-shell"># Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
$INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
$INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e "CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2" cassandra:3.7
$INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)
</code></pre>
<p>The startup process can take a few minutes.</p>
<p>You can verify if everything is done and ready by executing a Cassandra utility tool called <code>nodetool</code> via <code>docker exec</code> on a node:</p>
<pre><code class="lang-shell">$ docker exec cassandra-3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.0.3  112.69 KiB  256          68.7%             bb5ef231-0dd2-4762-a447-806a45f710ac  rack1
UN  172.17.0.2  107.96 KiB  256          68.3%             d7392374-8daa-4292-b724-cb790b0ee6ad  rack1
UN  172.17.0.4  93.93 KiB  256          63.0%             386d094f-5483-4945-a1a7-2bb3975d6167  rack1
</code></pre>
<p>UN means <strong>U</strong>p and <strong>N</strong>ormal. Here, all 3 nodes are running and healthy.</p>
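<p>If you would rather not guess how long to <code>sleep</code>, the same check can be scripted. The following is a minimal sketch in plain Python, not part of the original setup: the helper names (<code>count_up_normal</code>, <code>nodetool_status</code>, <code>wait_for_nodes</code>) are made up for illustration, the sample output is abbreviated, and its third node is shown as still joining (<code>UJ</code>). It assumes the container names used above.</p>

```python
import subprocess
import time

# Abbreviated sample of `nodetool status` output (host IDs shortened,
# third node shown as joining for illustration).
SAMPLE_OUTPUT = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID   Rack
UN  172.17.0.3  112.69 KiB  256     68.7%             bb5ef231  rack1
UN  172.17.0.2  107.96 KiB  256     68.3%             d7392374  rack1
UJ  172.17.0.4  93.93 KiB   256     63.0%             386d094f  rack1
"""


def count_up_normal(nodetool_output):
    """Count the lines whose first column is UN (Up and Normal)."""
    return sum(
        1
        for line in nodetool_output.splitlines()
        if line.split()[:1] == ["UN"]
    )


def nodetool_status(container="cassandra-1"):
    """Run `nodetool status` inside a container and return its text output."""
    result = subprocess.run(
        ["docker", "exec", container, "nodetool", "status"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def wait_for_nodes(expected, fetch_status=nodetool_status, attempts=30, delay=10):
    """Poll until at least `expected` nodes report UN, or give up."""
    for _ in range(attempts):
        if count_up_normal(fetch_status()) >= expected:
            return True
        time.sleep(delay)
    return False


print("Nodes up and normal in the sample:", count_up_normal(SAMPLE_OUTPUT))
```

<p>Calling <code>wait_for_nodes(3)</code> on your machine would poll the real cluster. The parsing is deliberately naive and only looks at the first column of each line.</p>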
<p>In this tutorial we will send lots of queries to Cassandra. I recommend starting a new shell and connecting to one node using <code>cqlsh</code>. Here's how to start a <code>cqlsh</code> shell in Docker:</p>
<pre><code class="lang-shell">$ docker exec -it cassandra-1 cqlsh

Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh&gt;
</code></pre>
<p>And to execute your first query:</p>
<pre><code class="lang-shell">cqlsh&gt; DESCRIBE keyspaces;

system_traces  system_schema  system_auth  system  system_distributed
</code></pre>
<p>The response shows all the existing keyspaces. Keyspaces group tables and are similar to a database in a traditional relational database system. In other systems, groups of certain items are also known as namespaces.</p>
<p>Before you begin creating tables and inserting data, first create a keyspace in your local datacenter, which should replicate data 3 times:</p>
<pre><code class="lang-shell">cqlsh&gt; CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };
</code></pre>
<p>This creates a keyspace with a replication factor of 3 using the <code>NetworkTopologyStrategy</code>. The strategy defines how data is replicated in different datacenters. It is the recommended strategy for all user-created keyspaces.</p>
<blockquote>
<p><strong>Why should you start with 3 nodes?</strong>  </p>
<p>It’s recommended to run at least 3 nodes. One reason is that strong consistency requires confirmation from at least 2 nodes. Another is availability: if 1 node goes down, your cluster stays available because the 2 remaining nodes are still up and running.  </p>
<p>You don’t need to fully understand this yet. After reading through the rest of this tutorial, things should be clearer.</p>
</blockquote>
<p>Now, all the nodes are up and healthy. You have a 3-node Cassandra setup listening on ports 9042, 9043, and 9044 for client requests. This is a realistic setup for a small cluster.  </p>
<p>In production, the instances would run on different machines to maximize performance. </p>
<p>Before you start creating tables, reading, and writing data, it's helpful to understand the basics of designing tables for scalability.  </p>
<p>In this tutorial, you will create tables with different settings for a to-do list application. If you want to get your hands dirty straight away, you can jump directly to the next <code>cqlsh</code> example.</p>
<h2 id="heading-cassandra-architecture">Cassandra Architecture</h2>
<p>Cassandra is a decentralized multi-node database that can physically span separate locations and uses replication and partitioning to scale reads and writes horizontally.</p>
<h3 id="heading-decentralization">Decentralization</h3>
<p>Cassandra is decentralized because no node is superior to other nodes, and every node acts in different roles as needed without any central controller. We'll get into examples of decentralization a bit later in this section.</p>
<p>Cassandra's decentralized property is what allows it to handle situations easily in case one node becomes unavailable or a new node is added.</p>
<h3 id="heading-every-node-is-a-coordinator">Every Node Is a Coordinator</h3>
<p>Data is replicated to different nodes. If certain data is requested, a request can be processed from any node.</p>
<p>The node that initially receives a request becomes the coordinator node for that request. If other nodes need to be checked to ensure consistency, the coordinator requests the required data from the replica nodes.</p>
<p>The coordinator can calculate which node contains the data using a so-called <a target="_blank" href="https://cassandra.apache.org/doc/latest/architecture/dynamo.html?highlight=consistency#dataset-partitioning-consistent-hashing">consistent hashing algorithm</a>.</p>
<p><img src="https://lh6.googleusercontent.com/uSbZsiHVeCQ4Vqm_ow9951lfr1a-ZBaNqJWc03rhCn_Wn85qTYVhU3E0pXIU3giWC1juYN2ro8BRejURNu9J4NHcsin2vae3TPLvdeniOur2h1KZgPzmOKPaZMZ6KnIfm6jp1see" alt="Image" width="1600" height="930" loading="lazy">
<em>Every node can be a coordinator</em></p>
<p>The coordinator is responsible for many things, such as request batching, repairing data, or retries for reads and writes.</p>
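<p>To make the consistent hashing idea above more concrete, here is a toy ring in plain Python. It is a sketch under loud assumptions, not Cassandra's implementation: it hashes with MD5 instead of Cassandra's partitioner, uses only 8 tokens per node, and the <code>Ring</code> class and its method names are invented for this example.</p>

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 here, purely illustrative)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, tokens_per_node=8):
        # Every node owns several positions on the ring.
        self._ring = sorted(
            (token(f"{node}-{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [position for position, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, partition_key, replication_factor):
        """Walk clockwise from the key's token and collect distinct nodes."""
        wanted = min(replication_factor, self._node_count)
        start = bisect.bisect_right(self._tokens, token(partition_key))
        found = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == wanted:
                break
        return found


ring = Ring(["cassandra-1", "cassandra-2", "cassandra-3"])
for key in ("US", "UK"):
    print(key, "->", ring.replicas(key, replication_factor=2))
```

<p>Because the mapping depends only on the key and the ring layout, any node can compute it locally. That is why any node can act as coordinator without asking a central controller.</p>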
<h3 id="heading-data-partitioning">Data Partitioning</h3>
<blockquote>
<p>“[Partitioning] is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store larger datasets and handle additional requests.”  </p>
<p><a target="_blank" href="https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6">How Sharding Works</a> by <a target="_blank" href="https://medium.com/@jeeyoungk">Jeeyoung Kim</a></p>
</blockquote>
<p>As with many other databases, you store data in Cassandra in a predefined schema. You need to define a table with columns and types for each column. </p>
<p>Additionally, you need to think about the primary key of your table. A primary key is mandatory and ensures data is uniquely identifiable by one or multiple columns. </p>
<p>The concept of primary keys is more complex in Cassandra than in traditional databases like MySQL. In Cassandra, the primary key consists of 2 parts: </p>
<ul>
<li>a mandatory partition key and</li>
<li>an optional set of clustering columns.</li>
</ul>
<p>You will learn more about the partition key and clustering columns in the data modeling section.</p>
<p>For now, let's focus on the partition key and its impact on data partitioning.</p>
<p>Consider the following table:</p>
<pre><code class="lang-shell">Table Users | Legend: p - Partition-Key, c - Clustering Column

country (p) | user_email (c)  | first_name | last_name | age
----------------------------------------------------------------
US          | john@email.com  | John       | Wick      | 55  
UK          | peter@email.com | Peter      | Clark     | 65  
UK          | bob@email.com   | Bob        | Sandler   | 23 
UK          | alice@email.com | Alice      | Brown     | 26
</code></pre>
<p>Together, the columns <code>user_email</code> and <code>country</code> make up the primary key.</p>
<p>The <code>country</code> column is the partition key (p). The <code>CREATE</code>-statement for the table looks like this:</p>
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
);
</code></pre>
<p>The first group of the primary key defines the partition key. All other elements of the primary key are clustering columns:</p>
<p><img src="https://lh4.googleusercontent.com/6WeEN0k3xnVfyOsFkZQctzCzUitUSPpM-kev6u5AvnzxCycPudQqfTX6XkiYwupwZ8XHCRJSwcGw1tB4BJe8qhZFybxshs1BZs6DlRg-Re0UCkyvS0oDRkUJhriqSYbjU7sdzMaK" alt="Image" width="1600" height="1087" loading="lazy"></p>
<p>Let’s fill the table with some data:</p>
<pre><code class="lang-shell">cqlsh&gt; 
INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('US', 'john@email.com', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',65);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',26);
</code></pre>
<p>If you’re used to designing traditional relational database tables as it’s taught in school or university, you might be surprised. Why would you use <code>country</code> as an essential part of the primary key? </p>
<p>This example will make sense after you understand the basics of partitioning in Cassandra.</p>
<p>Partitioning is the foundation for scalability, and it is based on the partition key. In this example, partitions are created based on <code>country</code>. All rows with the <code>country</code> <code>US</code> are placed in a partition. All other rows with the country <code>UK</code> will be stored in another partition. </p>
<p>In the context of partitioning, the words partition and shard can be used interchangeably.</p>
<p><img src="https://lh4.googleusercontent.com/_APEp3Q3ugdLt1SR53Dej2x5_zOd17QrDFoBzVw9EFx6a0buHe9-A6eBZSAPRlPx-nyd_qU9WpUBcQIxN8uQDSFA_D3hWsFVb5TagJu3Y0fyRdpV0zdBTp8xZE4QWHIgfUg58AZo" alt="Image" width="1600" height="730" loading="lazy"></p>
<p>Partitions are created and filled based on partition key values. They are used to distribute data to different nodes. By distributing data to other nodes, you get scalability. You read and write data to and from different nodes by their partition key. </p>
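<p>As a rough mental model, you can think of partitioning as grouping rows by the value of the partition key. The sketch below does exactly that in plain Python with the four example users. The <code>partition_rows</code> helper is hypothetical and leaves out hashing and node placement entirely.</p>

```python
from collections import defaultdict

# The example rows from the table above, reduced to two columns.
ROWS = [
    {"country": "US", "user_email": "john@email.com"},
    {"country": "UK", "user_email": "peter@email.com"},
    {"country": "UK", "user_email": "bob@email.com"},
    {"country": "UK", "user_email": "alice@email.com"},
]


def partition_rows(rows, partition_key):
    """Group rows by the value of the chosen partition key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row["user_email"])
    return dict(partitions)


by_country = partition_rows(ROWS, "country")
for country, emails in by_country.items():
    print(country, "->", emails)
```

<p>Passing a different column name as <code>partition_key</code> shows how the choice of partition key changes how many partitions exist and how large each one is.</p>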
<p>The distribution of data is a crucial point to understand when designing applications that store data based on partitions. It may take a while to get fully accustomed to this concept, especially if you are used to relational databases. </p>
<p>Instead of starting from normalized entities, think about how you read and write data and how the data should be partitioned to scale horizontally.</p>
<blockquote>
<p><strong>What does horizontal scaling mean?</strong>  </p>
<p>Horizontal scaling means you can increase throughput by adding more nodes. If your data is distributed to more servers, then more CPU, memory, and network capacity is available.</p>
</blockquote>
<p>You might ask, then why do you even need <code>user_email</code> in the primary key?</p>
<p>The answer is that the primary key defines what columns are used to identify rows. You need to add all columns that are required to identify a row uniquely to the primary key. Using only the country would not identify rows uniquely.</p>
<p>The partition key is vital to distribute data evenly between nodes and essential when reading the data. The previously defined schema is designed to be queried by <code>country</code> because <code>country</code> is the partition key. </p>
<p>A query that selects rows by <code>country</code> performs well:</p>
<pre><code class="lang-shell">cqlsh&gt; 
  SELECT * FROM learn_cassandra.users_by_country WHERE country='US';
</code></pre>
<p>In your <code>cqlsh</code> shell, you will send a request only to a single Cassandra node by default. This is called a consistency level of one, which enables excellent performance and scalability.</p>
<p>If you access Cassandra differently, the default consistency level might not be one.</p>
<blockquote>
<p><strong>What does consistency level of one mean?</strong>  </p>
<p>A consistency level of one means that only a single node is asked to return the data. With this approach, you will lose strong consistency guarantees and instead experience eventual consistency.  </p>
<p>We’ll dive deeper into consistency levels later on.</p>
</blockquote>
<p>Let's create another table. This one has a partition defined only by the <code>user_email</code> column:</p>
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.users_by_email (
    user_email text,
    country text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY (user_email)
);
</code></pre>
<p>Now let’s fill this table with some records:</p>
<pre><code class="lang-shell">cqlsh&gt; 
INSERT INTO learn_cassandra.users_by_email (user_email, country,first_name,last_name,age)
  VALUES('john@email.com', 'US', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('peter@email.com', 'UK', 'Peter','Clark',65); 

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('bob@email.com', 'UK', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('alice@email.com', 'UK', 'Alice','Brown',26);
</code></pre>
<p>This time, each row is put in its own partition.</p>
<p><img src="https://lh3.googleusercontent.com/idG07l3IB5r_XmkI2drNIpOkB9fAhq4N9VNi_yiI6pLZFgDrFUrXizLSpO41-2RYfb_pUHqGdY641SkpUhHwz9zgWb5tQRJnccAkv0fVy4gr2wAx4orr0FPa_IaMfhkp1bmDi_5q" alt="Image" width="1600" height="817" loading="lazy"></p>
<p>This is not bad, per se. If you want to optimize for getting data by <code>user_email</code> only, it's a good idea:</p>
<pre><code class="lang-shell">cqlsh&gt; 
  SELECT * FROM learn_cassandra.users_by_email WHERE user_email='alice@email.com';
</code></pre>
<p>If you set up your table with a partition key for <code>user_email</code> and want to get all users by <code>age</code>, you would need to get the data from all partitions because the partitions were created by <code>user_email</code>.</p>
<p>Talking to all nodes is expensive and can cause performance issues on a large cluster.</p>
<p>Cassandra tries to avoid harmful queries. If you want to filter by a column that is not a partition key, you need to tell Cassandra explicitly that you want to filter by a non-partition key column:</p>
<pre><code class="lang-shell">cqlsh&gt; 
SELECT * FROM learn_cassandra.users_by_email WHERE age=26 ALLOW FILTERING;
</code></pre>
<p>Without <code>ALLOW FILTERING</code>, Cassandra rejects the query so that you don't harm the cluster by accidentally running an expensive query. Queries without conditions (that is, without a <code>WHERE</code> clause) or with conditions that don’t use the partition key are costly and should be avoided to prevent performance bottlenecks.</p>
<p>But how do you get all the rows from the table in a scalable way?</p>
<p>If you can, partition by a value like <code>country</code>. If you know all the countries, you can then iterate over all available countries, send a query for each one, and collect the results in your application.</p>
<p>In terms of scalability, it’s worse to just select all rows, because when you use a table partitioned by <code>user_email</code>, all the data is collected in 1 request in a single coordinator.</p>
<p>This is OK as long as you have no performance issues.</p>
<p>By comparison, sending multiple requests by <code>country</code> distributes the effort to different coordinator nodes, which scales a lot better.</p>
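<p>The per-country approach can be sketched as follows. This is plain Python with an in-memory dictionary standing in for the table, so it runs without a cluster. <code>fetch_partition</code> is a hypothetical helper that marks the place where a real client would run <code>SELECT * FROM learn_cassandra.users_by_country WHERE country = ?</code>.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the users_by_country table, keyed by partition key.
FAKE_TABLE = {
    "US": [{"user_email": "john@email.com", "age": 55}],
    "UK": [
        {"user_email": "alice@email.com", "age": 26},
        {"user_email": "bob@email.com", "age": 23},
        {"user_email": "peter@email.com", "age": 65},
    ],
}


def fetch_partition(country):
    """Hypothetical helper: stands in for one single-partition query."""
    return [dict(row, country=country) for row in FAKE_TABLE.get(country, [])]


def fetch_all_users(countries, max_workers=4):
    """Send one query per country and merge the results in the application."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_country = pool.map(fetch_partition, countries)
        return [row for rows in per_country for row in rows]


users = fetch_all_users(["US", "UK"])
print(len(users), "users fetched")
```

<p>Each call touches exactly one partition, so against a real cluster the work would be spread over several coordinators instead of being collected by one.</p>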
<p>If you still need access to all of the data, there is an excellent <a target="_blank" href="https://github.com/datastax/spark-cassandra-connector">integration between Spark and Cassandra</a> that allows efficient reads and writes for massive datasets. The Spark connector for Cassandra groups your data by partition key and can execute queries very efficiently.</p>
<h3 id="heading-replication">Replication</h3>
<p>Scalability using partitioning alone is limited.</p>
<p>Consider a lot of write requests arriving for a single partition. All requests would be sent to a single node with technical limitations such as CPU, memory, and bandwidth. Additionally, you want to handle read and write requests if this node is not available.</p>
<p>That is where the concept of replication comes in. By duplicating data to different nodes, so-called replicas, you can serve more data simultaneously from other nodes to improve latency and throughput. It also enables your cluster to perform reads and writes in case a replica is not available.</p>
<p>In Cassandra, you need to define a replication factor for every keyspace. At the beginning of our example, you created a keyspace with a replication factor of 3 for our default datacenter:</p>
<pre><code class="lang-shell">cqlsh&gt; CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };
</code></pre>
<p>A replication factor of one means there’s only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved.</p>
<p>A replication factor of two means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.</p>
<p>As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.</p>
<p>Usually, it's recommended to use a replication factor of 3 for production use cases. It makes sure your data is very unlikely to get lost or become inaccessible because there are three copies available. Also, if data is not consistent between replicas at any point in time, you can ask what information state is held by the majority.</p>
<p>In your local cluster setup, the majority means 2 out of 3 replicas. This allows us to use some powerful query options that you will see in the next section.</p>
<h3 id="heading-consistency-level">Consistency Level</h3>
<p>Now that you know about partitioning and replication, you are ready to think about consistency levels. Cassandra has a truly outstanding feature called tunable consistency. </p>
<p>You can define the consistency level of your read and write queries. You can check the <a target="_blank" href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/dml/dmlConfigConsistency.html">Cassandra docs</a> for all available settings.</p>
<p>Let’s focus on the most popular settings and try to understand when to choose each consistency level.</p>
<p>Let’s assume you have 3 replicas defined.</p>
<p>The first question you need to answer is, do you need strong consistency?</p>
<blockquote>
<p><strong>What does strong consistency mean?</strong>  </p>
<p>In contrast to eventual consistency, strong consistency means only one state of your data can be observed at any time in any location.  </p>
<p>For example, when consistency is critical, like in a banking domain, you want to be sure that everything is correct. You would rather accept a decrease in availability and increase of latency to ensure correctness.</p>
</blockquote>
<p>It all comes down to the <a target="_blank" href="https://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a>. You cannot be both available and consistent when there are connection issues between the nodes of your cluster.  </p>
<p>Let's think through the following example:</p>
<p>You want to write a single value to a table. The data is replicated in 2 nodes, and the connection between the nodes is interrupted. First, a write-request is sent to node 1. Then, data is read from node 2.</p>
<p>How do you manage this situation?</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/07/image-62.png" alt="Image" width="600" height="400" loading="lazy"></p>
<ol>
<li>Should you reject writes on all nodes to ensure consistency? This sacrifices availability to preserve consistency and correctness.</li>
<li>Or should you accept the write on node 1 and keep serving reads from both nodes? This keeps the system available, but the answer depends on which node you read from, so you sacrifice consistency for availability.</li>
</ol>
<p>You can simplify the problem to make crucial decisions for your application: Do you want consistency or availability? </p>
<p>Another factor is latency. By talking to more nodes to ensure consistency, you need to wait longer to receive all nodes’ responses.</p>
<h3 id="heading-tune-for-consistency-by-setting-up-a-strong-consistency-application">Tune for Consistency by Setting up a Strong Consistency Application</h3>
<p>There is a very important formula that, if true, guarantees strong consistency:</p>
<pre><code>[read-consistency-level] + [write-consistency-level] &gt; [replication-factor]
</code></pre><blockquote>
<p><strong>What does consistency level mean?</strong>  </p>
<p>Consistency level means how many nodes need to acknowledge a read or a write query.</p>
</blockquote>
<p>You can shift read and write consistency levels in your favor while keeping strong consistency. Or you can give up strong consistency for better performance and settle for eventual consistency:</p>
<p><img src="https://lh4.googleusercontent.com/TTm1Mgq3koomlkP5QWTzfdGrFwcII88ltYepXg5dVeF1JKaCp1K22qJHfhZN_WuG6B-MV3sWw8wNpOv26PtmlUbYTL001HPDPcQnS0wwgkSR4QxmP32_inoYa3gDcb6oUsmGSLPv" alt="Image" width="1600" height="488" loading="lazy"></p>
<p>For a read-heavy system, it’s recommended to keep read consistency low because reads happen more often than writes. Let's say you have a replication factor of 3. The formula would look like this:</p>
<pre><code>1 + [write-consistency-level] &gt; 3
</code></pre><p>Therefore, the write consistency has to be set to 3 to have a strongly consistent system.</p>
<p>For a write-heavy system, you can do the same. Set the write consistency level to 1 and the read consistency level to 3.</p>
<p>You either check every node for a read to ensure all nodes have received the last updated state, or, for a write, you ensure that all nodes have written the update to their local storage. Both will make sure that data for reading and writing is correct.</p>
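<p>The formula is easy to check for every combination of levels. The following sketch is plain Python with made-up helper names (<code>quorum</code>, <code>is_strongly_consistent</code>). It assumes a replication factor of 3 and treats <code>QUORUM</code> as a majority of replicas, that is, 2 out of 3.</p>

```python
def quorum(replication_factor):
    """A majority of replicas, for example 2 out of 3."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_level, write_level, replication_factor):
    """Apply the rule: read level + write level > replication factor."""
    return read_level + write_level > replication_factor


RF = 3
LEVELS = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

for read_name, read_level in LEVELS.items():
    for write_name, write_level in LEVELS.items():
        strong = is_strongly_consistent(read_level, write_level, RF)
        verdict = "strong" if strong else "eventual"
        print(f"read={read_name:6} write={write_name:6} -> {verdict}")
```

<p>Besides the two extremes described above, the table also shows a middle ground: reading and writing with <code>QUORUM</code> satisfies the formula without requiring every replica to respond.</p>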
<p>This decision needs to be reflected in all the applications that access your Cassandra data because, on a query level, you need to set the required consistency level.</p>
<p>You set a replication factor of 3. Therefore, you can use a consistency level of <code>ALL</code> or <code>THREE</code>:</p>
<pre><code class="lang-shell">cqlsh&gt; 
   CONSISTENCY ALL;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';
</code></pre>
<p>If just one of your applications violates the required consistency strategy, you quickly risk either losing consistency or putting more pressure on the cluster than required.</p>
<h3 id="heading-tune-for-performance-by-using-eventual-consistency">Tune for Performance by Using Eventual Consistency</h3>
<p>If you don't need to be strongly consistent, you can reduce the consistency level for queries to 1 to gain performance:</p>
<pre><code class="lang-shell">cqlsh&gt; 
   CONSISTENCY ONE;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';
</code></pre>
<p>Eventually, the data will be spread to all replicas and this will ensure <em>eventual</em> consistency. How fast data will be made consistent depends on different mechanics that sync data between nodes.</p>
<p>Various features can be tuned in Cassandra, like read-repairs and external processes that repair data continuously.</p>
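<p>To see why a low consistency level can return stale data, consider this toy model in plain Python. It is a deliberate simplification and not how Cassandra works internally: it models the worst case, where a write has reached only the replicas that acknowledged it and a read happens to contact the replicas that lag behind. All names are invented for the example.</p>

```python
class Replica:
    def __init__(self):
        self.value = None
        self.timestamp = 0


def write(replicas, value, timestamp, acks):
    """Worst case: only the first `acks` replicas have the write so far."""
    for replica in replicas[:acks]:
        replica.value = value
        replica.timestamp = timestamp


def read(replicas, asked):
    """Worst case: contact the last `asked` replicas, newest timestamp wins."""
    contacted = replicas[-asked:]
    return max(contacted, key=lambda replica: replica.timestamp).value


ONE, QUORUM, ALL = 1, 2, 3  # for a replication factor of 3

replicas = [Replica() for _ in range(3)]
write(replicas, "v1", timestamp=1, acks=ALL)
write(replicas, "v2", timestamp=2, acks=ONE)

print("read with ONE:", read(replicas, ONE))
print("read with ALL:", read(replicas, ALL))
```

<p>The read with <code>ONE</code> can still see the old value, while the read with <code>ALL</code> always includes the replica that holds the newest write. Once background mechanisms have copied the write to the remaining replicas, every read returns the same value.</p>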
<h3 id="heading-optimize-data-storage-for-reading-or-writing">Optimize Data Storage for Reading or Writing</h3>
<p>Writes are cheaper than reads in Cassandra due to its storage engine. Writing data essentially means appending the mutation to a so-called commit-log and updating an in-memory table (the memtable).</p>
<p>Commit-logs are append-only logs of all mutations local to a Cassandra node and reduce the required I/O to a minimum.</p>
<p>Reading is more expensive, because it might require checking different disk locations until all the query data is eventually found. </p>
<p>But this does not mean Cassandra is terrible at reading. Instead, Cassandra's storage engine can be tuned for reading performance or writing performance.</p>
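<p>The difference in cost between writes and reads can be illustrated with a heavily simplified model in plain Python. The <code>ToyNode</code> class is an invention for this example. It ignores real-world details such as indexes, bloom filters, and caches, and only shows that a write is an append while a read may have to look in several places.</p>

```python
class ToyNode:
    """Simplified storage model: commit log, memtable, immutable files."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every mutation
        self.memtable = {}     # recent writes, kept in memory
        self.sstables = []     # flushed, immutable snapshots (oldest first)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a new immutable file and start fresh."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        """Return (value, places_checked), checking the newest data first."""
        checked = 1
        if key in self.memtable:
            return self.memtable[key], checked
        for sstable in reversed(self.sstables):
            checked += 1
            if key in sstable:
                return sstable[key], checked
        return None, checked


node = ToyNode()
for i in range(6):
    node.write(f"user-{i}", i)

value, places = node.read("user-0")
print(f"user-0 = {value}, found after checking {places} places")
```

<p>The older a value is, the more places the read has to check in this model. Reducing that number is one of the things compaction is for.</p>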
<h3 id="heading-understanding-compaction">Understanding Compaction</h3>
<p>For every write operation, data is written to disk to provide durability. This means that if something goes wrong, like a power outage, data is not lost.</p>
<p>The foundation for storing data is the so-called <a target="_blank" href="https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html">SSTables</a>. SSTables are immutable data files Cassandra uses to persist data on disk.</p>
<p>You can set various strategies for a table that define how data should be merged and compacted. These strategies affect read and write performance:</p>
<ul>
<li><code>SizeTieredCompactionStrategy</code> is the default, and is especially performant if you have more writes than reads.</li>
<li><code>LeveledCompactionStrategy</code> optimizes for reads over writes. This optimization can be costly and needs to be tried out in production carefully.</li>
<li><code>TimeWindowCompactionStrategy</code> is for time-series data.</li>
</ul>
<p>By default, tables use the <code>SizeTieredCompactionStrategy</code>:</p>
<pre><code class="lang-shell">cqlsh&gt; 
   DESCRIBE TABLE learn_cassandra.users_by_country;

CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
</code></pre>
<p>Although you can alter the compaction strategy of an existing table, I would not suggest doing so, because all Cassandra nodes start this migration simultaneously. This will lead to significant performance issues in a production system.</p>
<p>Instead, define the compaction strategy explicitly when you create a new table:</p>
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
) WITH
  compaction = { 'class' :  'LeveledCompactionStrategy'  };
</code></pre>
<p>Let’s check the result:</p>
<pre><code class="lang-shell">cqlsh&gt; 
   DESCRIBE TABLE learn_cassandra.users_by_country_with_leveled_compaction;

CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
</code></pre>
<p>The strategies define when and how compaction is executed. Compaction means rearranging data on disk to remove old data and keep performance as good as possible when more data needs to be stored.</p>
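<p>At its core, compaction merges several immutable files into fewer ones and drops values that have been overwritten. The sketch below shows that idea in plain Python with dictionaries standing in for SSTables. The <code>compact</code> function is hypothetical and ignores what distinguishes the strategies, namely when and which files get merged.</p>

```python
def compact(sstables):
    """Merge SSTables (oldest first) into one; the newest value per key wins."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)  # later files overwrite older values
    return [dict(sorted(merged.items()))]


old_files = [
    {"alice": "v1", "bob": "v1"},  # oldest
    {"alice": "v2"},               # alice was updated later
    {"carol": "v1"},               # newest
]

new_files = compact(old_files)
print(len(old_files), "files before,", len(new_files), "file after")
print(new_files[0])
```

<p>After the merge, a read has one file to check instead of three, and the outdated value for <code>alice</code> no longer takes up space.</p>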
<p>Check out the excellent <a target="_blank" href="https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/dbIntHowDataMaintain.html#dbIntHowDataMaintain__dml_types_of_compaction">DataStax documentation about compaction</a> for details. There may even be better strategies in the future for the performance of your use-case.</p>
<h3 id="heading-presorting-data-on-cassandra-nodes">Presorting Data on Cassandra Nodes</h3>
<p>A table always requires a primary key. A primary key consists of 2 parts:</p>
<ul>
<li>One or more columns as the partition key and</li>
<li>Zero or more clustering columns for nesting rows of the data.</li>
</ul>
<p>All columns of the partition key together are used to identify partitions. All primary key columns, meaning partition key and clustering columns, identify a specific row within a partition.</p>
<p>Cassandra stores the rows of a partition on disk sorted by their clustering columns. If you choose the clustering columns and their order to match your queries, the data comes back already sorted. This is defined on the table level and avoids having to sort data in the client applications that query Cassandra.</p>
<p>In our <code>users_by_country</code> table, you can define <code>age</code> as another clustering column to sort stored data:</p>
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.users_by_country_sorted_by_age_asc (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), age, user_email)
) WITH CLUSTERING ORDER BY (age ASC);
</code></pre>
<p>Let’s add the same data again:</p>
<pre><code class="lang-shell">cqlsh&gt; 
INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('US','john@email.com', 'John','Wick',10);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',30);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',20);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',40);
</code></pre>
<p>And get the data by country:</p>
<pre><code class="lang-shell">cqlsh&gt; 
      SELECT * FROM learn_cassandra.users_by_country_sorted_by_age_asc WHERE country='UK';

 country | age | user_email      | first_name | last_name
---------+-----+-----------------+------------+-----------
      UK |  20 | bob@email.com   |        Bob |   Sandler
      UK |  30 | peter@email.com |      Peter |     Clark
      UK |  40 | alice@email.com |      Alice |     Brown

(3 rows)
</code></pre>
<p>In this example, the clustering columns are <code>age</code> and <code>user_email</code>. So the data is first sorted by age and then by <code>user_email</code>. At its core, Cassandra is still like a key-value store. Therefore, you can only query the table by:</p>
<ul>
<li><code>country</code></li>
<li><code>country</code> and <code>age</code></li>
<li><code>country</code>, <code>age</code>, and <code>user_email</code></li>
</ul>
<p>But never by <code>country</code> and <code>user_email</code>.</p>
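<p>The rule behind this list can be sketched in a few lines of Python. This is only a toy model of the restriction for equality filters, not Cassandra's query planner, and the function name <code>is_queryable</code> is made up for illustration: a query must name the full partition key and may then add clustering columns only in their declared order, without skipping one.</p>
<pre><code class="lang-python"># Toy model: which equality filters does this primary key allow?
PARTITION_KEY = ["country"]
CLUSTERING_COLUMNS = ["age", "user_email"]


def is_queryable(filter_columns):
    """Return True if the filtered columns form a valid key prefix."""
    columns = set(filter_columns)
    # Every partition key column is mandatory.
    if not set(PARTITION_KEY).issubset(columns):
        return False
    remaining = columns - set(PARTITION_KEY)
    # Clustering columns must be used in order, without gaps.
    prefix_length = len(remaining)
    return remaining == set(CLUSTERING_COLUMNS[:prefix_length])


print(is_queryable(["country"]))                       # True
print(is_queryable(["country", "age"]))                # True
print(is_queryable(["country", "age", "user_email"]))  # True
print(is_queryable(["country", "user_email"]))         # False
</code></pre>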
<p>After learning about partitioning, replication and consistency levels, let's head into data modeling and have more fun with the Cassandra cluster.</p>
<h2 id="heading-data-modeling">Data Modeling</h2>
<p>You've already learned a lot about the fundamentals of Cassandra.</p>
<p>Let's put your knowledge into practice and design a to-do list application that receives many more reads than writes.</p>
<p>The best approach is to analyze some user stories you want to fulfill with your table design:</p>
<ol>
<li>As a user, I want to create a to-do element   </li>
</ol>
<p>Note: This is only about creating data. For now, you can delay some decisions because you want to focus on how data is read.</p>
<ol start="2">
<li>As a user, I want to list all my to-do elements in ascending order  </li>
</ol>
<p>First, you need to query by <code>user_email</code>. Create a table called <code>todo_by_user_email</code>.</p>
<p>You need 1 table that contains all the information of a to-do element of a user. Data should be partitioned by <code>user_email</code> for efficient reads and writes by <code>user_email</code>.</p>
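<p>Also, the to-dos should be displayed sorted by creation time, which means using the creation date as a clustering column. The table below uses <code>DESC</code>, so the newest records come first; use <code>ASC</code> instead if the oldest records should come first. The <code>creation_date</code> also ensures uniqueness, as long as a user never creates two to-dos in the same millisecond:</p>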
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.todo_by_user_email (
    user_email text,
    name text,
    creation_date timestamp,
    PRIMARY KEY ((user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };
</code></pre>
<ol start="3">
<li>As a user, I want to share a to-do element with another user</li>
</ol>
<p>To display all the to-dos shared with a user, you need a table called <code>todos_shared_by_target_user_email</code> that is partitioned by the target user.</p>
<p>The table contains the to-do name to display it.</p>
<p>But the user also wants to see the to-dos they shared with other users. This is another table, <code>todos_shared_by_source_user_email</code>.</p>
<p>Each table uses the <code>user_email</code> column that its use-case filters by as the partition key to allow efficient queries. Also, <code>creation_date</code> is added as a clustering column for sorting and uniqueness:</p>
<pre><code class="lang-shell">cqlsh&gt; 
CREATE TABLE learn_cassandra.todos_shared_by_target_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((target_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

CREATE TABLE learn_cassandra.todos_shared_by_source_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((source_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };
</code></pre>
<p>This type of modeling is different from thinking about foreign keys and primary keys that you might know from traditional databases. In the beginning, it's all about defining tables and thinking about which values you want to filter by and which you need to display.</p>
<p>You need to set a partition key to ensure the data is organized for efficient read and write operations. Also, you need to set clustering columns to ensure uniqueness, sort order, and optional query parameters.</p>
<h3 id="heading-keep-data-in-sync-using-batch-statements">Keep Data in Sync Using <code>BATCH</code> Statements</h3>
<p>Due to the duplication, you need to take care to keep data consistent. In Cassandra, you can do that by using <code>BATCH</code> statements that give you an all-at-once guarantee, also called atomicity.</p>
<p>This might sound like a lot of work, and yes, it is a lot of work! If you have a table schema with many relationships, you will have more work compared to a normalized table schema.</p>
<blockquote>
<p><strong>What is a normalized table schema?</strong>  </p>
<p>A normalized table schema is optimized to contain no duplications. Instead, data is referenced by ID and needs to be joined later.  </p>
<p>In Cassandra, you try to avoid normalized tables. It is not even possible to write a query that contains a join.</p>
</blockquote>
<p>Batch statements are cheap on a single partition, but dangerous when you execute them on different partitions, because:</p>
<ul>
<li>Mutations are not applied to all partitions at the same time, so there is no isolation</li>
<li>It is expensive for the coordinator node, because it has to talk to multiple nodes and keep a batch log so that the batch can be replayed if something goes wrong</li>
<li>There is a batch size limit of 50 KB to avoid overloading the coordinator. This limit can be increased, but this is not recommended</li>
</ul>
<p>In general, batches are costly.</p>
<p>There are other ways to apply changes eventually. If you need to execute such writes very often, consider using async queries with a proper retry mechanism instead.</p>
<p>Depending on how you access Cassandra, the driver might already offer retry capabilities.</p>
<p>Still, this approach requires thinking about what will happen if a query is never executed. If every query really needs to be executed eventually, how can you make sure that it does not get lost if your service goes down?</p>
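<p>A retry mechanism of the kind mentioned above could look roughly like the following Python sketch. It uses no Cassandra driver: <code>execute_with_retry</code> and <code>flaky_write</code> are hypothetical names, and a <code>ConnectionError</code> stands in for whatever error a real driver would raise. Note that the pending retries live only in memory, so they are lost if the service goes down, which is exactly the open question raised above.</p>
<pre><code class="lang-python">import time


def execute_with_retry(operation, max_attempts=3, base_delay_seconds=0.01):
    """Call operation until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt != max_attempts:
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                time.sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise last_error


calls = {"count": 0}


def flaky_write():
    """Hypothetical write that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] in (1, 2):
        raise ConnectionError("node unavailable")
    return "applied"


print(execute_with_retry(flaky_write))  # applied
print(calls["count"])                   # 3
</code></pre>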
<p>The topic itself needs much more time to explain, and might be the main topic of another Cassandra tutorial.</p>
<p>The key takeaways are:</p>
<ul>
<li>Single partition batches are cheap and should be used</li>
<li>Batches that include different partitions are expensive, and if there are a lot of reads/writes, this might be the reason why a Cassandra cluster is exhausted.  </li>
</ul>
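<p>To judge which kind of batch you are about to send, it helps to count the partitions it touches. The following Python sketch is a simplified model with a made-up helper, <code>partitions_touched</code>. It treats every distinct pair of table and partition key value as one partition, because partitions are always scoped to a table.</p>
<pre><code class="lang-python"># Toy model: count how many partitions a batch touches.
def partitions_touched(statements):
    """statements is a list of (table, partition_key_value) pairs."""
    return len(set(statements))


single = [
    ("todo_by_user_email", "alice@email.com"),
    ("todo_by_user_email", "alice@email.com"),
]
multi = [
    ("todo_by_user_email", "alice@email.com"),
    ("todos_shared_by_target_user_email", "bob@email.com"),
    ("todos_shared_by_source_user_email", "alice@email.com"),
]

for name, batch in (("single", single), ("multi", multi)):
    count = partitions_touched(batch)
    kind = "single-partition" if count == 1 else "multi-partition"
    print(name, count, kind)
</code></pre>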
<p>Let’s create a <code>BATCH</code> statement that contains a to-do element that is shared with a user:</p>
<pre><code class="lang-shell">cqlsh&gt; 

BEGIN BATCH
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('alice@email.com', toTimestamp(now()), 'My first todo entry');

  INSERT INTO learn_cassandra.todos_shared_by_target_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com',toTimestamp(now()), 'My first todo entry');

  INSERT INTO learn_cassandra.todos_shared_by_source_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com', toTimestamp(now()), 'My first todo entry');

APPLY BATCH;
</code></pre>
<p>Let’s look into one of the tables:</p>
<pre><code class="lang-shell">cqlsh&gt;          
 SELECT * FROM learn_cassandra.todos_shared_by_target_user_email WHERE target_user_email='bob@email.com';

 target_user_email | creation_date  | name                | source_user_email
-------------------+----------------+---------------------+-------------------
     bob@email.com | 2021-05-24 ... | My first todo entry |   alice@email.com
</code></pre>
<p>All the data exists and can be accessed in a performant way using all the defined tables.</p>
<h3 id="heading-use-foreign-keys-instead-of-duplicating-data-in-cassandra">Use Foreign Keys Instead of Duplicating Data in Cassandra</h3>
<p>You might consider using foreign keys instead of duplicating data.</p>
<p>In relational databases, a foreign key is an ID reference to an entity that is stored in another table. The database guarantees that the referenced ID exists.</p>
<p>In Cassandra, this might feel good because you have less duplicated data. At this point, think again about why you use Cassandra. Usually, the answer is high traffic and scalability.</p>
<p>Cassandra can scale enormously and comes with top performance when used correctly.</p>
<p>Normalizing tables is against a lot of principles in Cassandra. You can reference data by ID, but keep in mind this means you need to join the data yourself. This also means reading and writing data to multiple partitions at once.</p>
<p>Cassandra is built for scale. If you start normalizing your schema to reduce duplication, then you sacrifice horizontal scalability.</p>
<p>If you still want to use foreign keys instead of data duplication, you might want to use another database. But, everything comes with trade-offs.</p>
<p>Instead of using Cassandra, you could use a database that sacrifices some performance and availability in exchange for stronger consistency guarantees. In cases like this, I can recommend Cloud Spanner or CockroachDB as a scalable relational database.</p>
<h3 id="heading-indexes-in-cassandra">Indexes in Cassandra</h3>
<p>There are index-like features in Cassandra that can reduce the number of tables you need to maintain on your own. One feature is called secondary indexes.</p>
<p>I cannot recommend them because they only operate locally to a node.</p>
<p>If you query by a column other than the partition key, the coordinator doesn’t know which nodes contain the data. A query that uses a secondary index therefore has to talk to all nodes.</p>
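<p>The difference in fan-out can be illustrated with a small Python sketch. It is a deliberately simplified model: the three node names are invented, replication is ignored, and <code>owner</code> is only a stand-in for Cassandra's real partitioner.</p>
<pre><code class="lang-python"># Toy cluster: each partition key is owned by exactly one node.
NODES = ["node-1", "node-2", "node-3"]


def owner(partition_key):
    """Pick a node from the key (stand-in for the real partitioner)."""
    return NODES[sum(partition_key.encode()) % len(NODES)]


def nodes_for_query(partition_key=None):
    """Nodes the coordinator has to contact for a query."""
    if partition_key is None:
        # Only a locally indexed column is known: ask every node.
        return list(NODES)
    return [owner(partition_key)]


print(len(nodes_for_query(partition_key="UK")))  # 1
print(len(nodes_for_query()))                    # 3
</code></pre>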
<h3 id="heading-materialized-views">Materialized Views</h3>
<p>Materialized views were designed with scalability in mind.</p>
<p>They make it easier to duplicate tables with different partition keys so you can query data by different column combinations. They also simplify the process of creating a new table and ensuring data integrity for mutations.</p>
<p>There is one main restriction: the materialized view's primary key must contain all primary key columns of the source table, plus optionally one other column.</p>
<p>The columns that act as partition keys can be different.</p>
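<p>The primary key restriction can be expressed as a small check. The Python sketch below only models the rule as stated here; <code>is_valid_view_key</code> is a hypothetical helper, and the example view key that adds <code>name</code> is an assumption made for illustration.</p>
<pre><code class="lang-python">def is_valid_view_key(source_primary_key, view_primary_key):
    """Check the primary key rule for a materialized view."""
    source = set(source_primary_key)
    view = set(view_primary_key)
    # Every source primary key column must reappear in the view key.
    if not source.issubset(view):
        return False
    # At most one extra column may join the view key.
    return len(view - source) in (0, 1)


source_key = ["user_email", "creation_date"]
print(is_valid_view_key(source_key, ["name", "user_email", "creation_date"]))  # True
print(is_valid_view_key(source_key, ["name", "creation_date"]))                # False
</code></pre>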
<h2 id="heading-running-a-cluster">Running a Cluster</h2>
<p>Running a Cassandra cluster can be intense. It contains your business-critical data and is usually under heavy pressure.</p>
<p>I won't go into details because I am more a Cassandra user than an expert in cluster maintenance. Still, I want to share my knowledge.</p>
<h3 id="heading-fully-managed-cassandra">Fully Managed Cassandra</h3>
<p>DataStax offers a fully managed Cassandra product called <a target="_blank" href="https://www.datastax.com/products/datastax-astra">Astra</a>. They promise a lot:</p>
<blockquote>
<ul>
<li>Start in minutes with a free tier, no credit card needed.  </li>
<li>Eliminate the overhead to install, operate, and scale Cassandra clusters.  </li>
<li>Build faster with REST, GraphQL, CQL, and JSON/Document APIs.  </li>
<li>Built on open-source Apache Cassandra™, used by the best of the internet.  </li>
<li>Scale elastically — apps are viral ready from Day 1.  </li>
<li>Deploy multi-cloud, multi-tenant or dedicated clusters on AWS, Azure, or GCP.  </li>
<li>Ensure enterprise-level reliability, security, and management.  </li>
</ul>
<p>Quoted from the <a target="_blank" href="https://www.datastax.com/products/datastax-astra">Astra docs</a></p>
</blockquote>
<p>I have no experience with their offering. But I would give it a try! Their <a target="_blank" href="https://www.datastax.com/products/datastax-astra/pricing">pricing</a> sounds reasonable.</p>
<h3 id="heading-self-managed-cassandra">Self-Managed Cassandra</h3>
<p>Cassandra is built with Java. So knowing the basics of running JVM applications is very beneficial.</p>
<p>If you run Kubernetes, then definitely check out <a target="_blank" href="https://k8ssandra.io/">K8ssandra</a>. It bundles all the helpful tools around Cassandra like:</p>
<ul>
<li><a target="_blank" href="https://stargate.io/">Stargate.io</a> for REST, GraphQL, and API documentation</li>
<li><a target="_blank" href="http://cassandra-reaper.io/">Reaper</a> for easier repair management</li>
<li><a target="_blank" href="https://github.com/spotify/cassandra-medusa">Medusa</a> for backups</li>
<li><a target="_blank" href="https://github.com/datastax/metric-collector-for-apache-cassandra">Metrics collector</a> for monitoring</li>
<li><a target="_blank" href="https://docs.k8ssandra.io/tasks/connect/ingress/">Traefik</a> for ingress</li>
</ul>
<p>This stack of tools is fully open source and can be used without any additional monetary costs.</p>
<p>For developers, there is one very beneficial tool called <a target="_blank" href="https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsNodetool.html">nodetool</a>. It can inspect and provide insights into how many nodes are up, what size certain tables have, how many SSTables and tombstones exist. Nodetool can also repair your data to enforce eventual consistency.</p>
<h2 id="heading-other-learnings">Other Learnings</h2>
<p>Even after years of using Cassandra, there are still things to learn that let you use it more efficiently. In this section, I want to share various topics that you will run into eventually.</p>
<h3 id="heading-data-migrations">Data Migrations</h3>
<p>If you have worked with other databases before, you might know database migration tools like Flyway or Liquibase. Since version 4.0 RC-1, there is basic <a target="_blank" href="https://docs.liquibase.com/workflows/database-setup-tutorials/cassandra.html">Liquibase support</a>.</p>
<p>Additionally, the community worked on something similar with <a target="_blank" href="https://github.com/patka/cassandra-migration">Cassandra-migration</a>. It already supports advanced features such as leader election, for when multiple services start at the same time.</p>
<p>Any type of export and import can be done using <a target="_blank" href="https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/dsbulkCmd.html">DSBulk</a> that allows loading and unloading data from and to Cassandra in CSV and JSON formats.</p>
<h3 id="heading-tombstones">Tombstones</h3>
<p>Cassandra is a multi-node cluster that contains replicated data on different nodes. Therefore, a delete cannot simply remove a particular record.</p>
<p>For a delete operation, a new entry is added to the commit-log like for any other insert and update mutation. These deletes are called tombstones, and they flag a specific value for deletion.</p>
<p>Tombstones exist only on disk and can be analyzed and traced as described in this blog post: <a target="_blank" href="https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html">About Deletes and Tombstones in Cassandra</a>.</p>
<p>In Cassandra, you can set a time to live (TTL) on inserted data. After that time has passed, the record is automatically deleted. When you set a TTL, a tombstone is created with a date in the future.</p>
<p>A regular delete query works the same way, except that the timestamp of the tombstone is set to the moment the delete is executed.</p>
<p>Let’s create a tombstone by setting a TTL in seconds, which basically functions as a delayed delete:</p>
<pre><code class="lang-shell">cqlsh&gt;     
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'This entry should be removed soon') USING TTL 60;
</code></pre>
<p>And the data is stored like regular data:</p>
<pre><code class="lang-shell">cqlsh&gt;      
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 user_email     | creation_date | name
----------------+---------------+-----------------------------------
 john@email.com | 2021-05-30... | This entry should be removed soon

(1 rows)
</code></pre>
<p>You can also read the TTL from the database for a given column:</p>
<pre><code class="lang-shell">cqlsh&gt; 
 SELECT TTL(name) FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 ttl(name)
-----------
        43

(1 rows)
</code></pre>
<p>After 60 seconds, the row is gone.</p>
<pre><code class="lang-shell">cqlsh&gt;  
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                  

 user_email | creation_date | name
------------+---------------+------

(0 rows)
</code></pre>
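<p>The countdown that <code>TTL(name)</code> reports can be modelled in a few lines of Python. This is only a sketch of the arithmetic, with a hypothetical function <code>remaining_ttl</code> and times given as plain seconds. It is not how Cassandra stores expiration internally.</p>
<pre><code class="lang-python">def remaining_ttl(written_at, ttl_seconds, now):
    """Seconds left before the cell expires, never negative."""
    return max(0, written_at + ttl_seconds - now)


# Written at second 0 with USING TTL 60, then read at different times.
print(remaining_ttl(written_at=0, ttl_seconds=60, now=17))  # 43
print(remaining_ttl(written_at=0, ttl_seconds=60, now=60))  # 0
print(remaining_ttl(written_at=0, ttl_seconds=60, now=90))  # 0
</code></pre>
<p>Once the remaining time reaches 0, the cell no longer shows up in query results.</p>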
<p>Setting a TTL is one of many ways to create tombstones.</p>
<p>Unfortunately, there are also others.</p>
<p>For example, when you insert a null value, a tombstone is created for the given cell. And as mentioned for delete requests, different types of tombstones are stored. </p>
<p>By default, data that is marked by a tombstone is freed by a compaction once 10 days have passed. This period can be configured, and reduced, using the <code>gc_grace_seconds</code> table option.</p>
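<p>The 10 days correspond to the value <code>gc_grace_seconds = 864000</code> shown in the <code>DESCRIBE TABLE</code> output earlier in this article. As a quick sanity check, the following Python snippet converts the value and computes the earliest moment at which a compaction may drop a tombstone. The deletion time is an arbitrary example value.</p>
<pre><code class="lang-python">from datetime import datetime, timedelta, timezone

GC_GRACE_SECONDS = 864000  # value from the DESCRIBE TABLE output

# Arbitrary example: the moment a delete was executed.
deleted_at = datetime(2021, 5, 30, 12, 0, tzinfo=timezone.utc)
purgeable_at = deleted_at + timedelta(seconds=GC_GRACE_SECONDS)

print(GC_GRACE_SECONDS // 86400)  # 10
print(purgeable_at.isoformat())   # 2021-06-09T12:00:00+00:00
</code></pre>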
<blockquote>
<p><strong>When is a compaction executed?</strong>  </p>
<p>When a compaction is executed depends mainly on the selected strategy. In general, a compaction takes existing <code>SSTables</code> and creates new <code>SSTables</code> out of them.  </p>
<p>The most common executions are:  </p>
<ul>
<li>Automatically, when data is inserted and the strategy's conditions for a compaction are met</li>
<li>A manually executed major compaction using nodetool</li>
</ul>
</blockquote>
<p>Tombstones are sometimes created unintentionally, for the following reasons:</p>
<ul>
<li><strong>Null values</strong> mark values to be deleted and are stored as tombstones. This can be avoided by either replacing null with a static value, or not setting the value at all if the value is null</li>
<li><strong>Empty lists and sets</strong> are similar to null for Cassandra and create a tombstone, so don’t insert them if they’re empty. Take care to avoid null pointer exceptions when storing and retrieving data in your application</li>
<li><strong>Updated lists and sets</strong> create tombstones. If you update an entity and the list or set does not change, it still creates a tombstone to empty the list and set the same values. Therefore, only update necessary fields to avoid issues. The good thing is, they are compacted due to the new values</li>
</ul>
<p>If you have many tombstones, you might run into another Cassandra issue that prevents a query from being executed.</p>
<p>This happens when the <code>tombstone_failure_threshold</code> is reached, which is set by default to 100,000 tombstones. This means that, when a query has iterated over more than 100,000 tombstones, it will be aborted.</p>
<p>The issue here is, once a query stops executing, it’s not easy to tidy things up because Cassandra will stop even when you execute a delete, as it has reached the tombstone limit.</p>
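<p>The effect of the threshold can be sketched with a toy scan in Python. Everything here is a simplification: <code>None</code> stands in for a tombstone, the exception class is made up for this example, and the threshold is set to 2 instead of 100,000 to keep the demo short.</p>
<pre><code class="lang-python">class TombstoneOverwhelmingError(Exception):
    """Raised by this toy model when a scan sees too many tombstones."""


def scan(cells, failure_threshold):
    """Return live values; abort once the threshold is exceeded."""
    live_values = []
    tombstones_seen = 0
    for value in cells:
        if value is None:  # None stands in for a tombstone
            tombstones_seen += 1
            if tombstones_seen == failure_threshold + 1:
                raise TombstoneOverwhelmingError(
                    f"scanned more than {failure_threshold} tombstones"
                )
        else:
            live_values.append(value)
    return live_values


print(scan(["a", None, "b"], failure_threshold=2))  # ['a', 'b']
try:
    scan([None, None, None, "c"], failure_threshold=2)
except TombstoneOverwhelmingError as error:
    print("aborted:", error)  # aborted: scanned more than 2 tombstones
</code></pre>
<p>The live value <code>"c"</code> in the second scan is never returned, because the query is aborted before it gets there.</p>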
<p>Usually you would never have that many tombstones. But mistakes happen, and you should take care to avoid this case.</p>
<p>There is a handy <a target="_blank" href="https://cassandra.apache.org/doc/latest/operating/metrics.html">operation metric</a> that you should observe called <code>TombstoneScannedHistogram</code> to avoid unexpected issues in production.</p>
<h3 id="heading-updates-are-just-inserts-and-vice-versa"><code>UPDATE</code>s Are Just <code>INSERT</code>s, and Vice Versa</h3>
<p>In Cassandra, everything is append-only. There is no difference between an update and insert.</p>
<p>You already learned that a primary key defines the uniqueness of a row. If there is no entry yet, a new row will appear, and if there is already an entry, the entry will be updated. It does not matter whether you execute an update or an insert query.</p>
<p>In our example, the primary key consists of <code>user_email</code> and <code>creation_date</code>, which together define the uniqueness of a record.</p>
<p>Let’s insert a new record:</p>
<pre><code class="lang-shell">cqlsh&gt;      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email, creation_date, name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query');
</code></pre>
<p>And execute an update with a new <code>creation_date</code>:</p>
<pre><code class="lang-shell">cqlsh&gt;    
  UPDATE learn_cassandra.todo_by_user_email SET 
    name = 'Update query'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:10:19.622+0000';
</code></pre>
<p>2 new rows appear in our table:</p>
<pre><code class="lang-shell">cqlsh&gt;    
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                                                                                            

 user_email     | creation_date                   | name
----------------+---------------------------------+--------------
 john@email.com | 2021-03-14 16:10:19.622000+0000 | Update query
 john@email.com | 2021-03-14 16:07:19.622000+0000 | Insert query

(2 rows)
</code></pre>
<p>So you inserted a row using an update, and you can also use an insert to update:</p>
<pre><code class="lang-shell">cqlsh&gt;       
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query updated');
</code></pre>
<p>Let’s check our updated row:</p>
<pre><code class="lang-shell">cqlsh&gt;   
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 user_email     | creation_date            | name
----------------+--------------------------+----------------------
 john@email.com | 2021-03-14 16:10:19.62   |         Update query
 john@email.com | 2021-03-14 16:07:19.62   | Insert query updated


(2 rows)
</code></pre>
<p>So <code>UPDATE</code> and <code>INSERT</code> are technically the same. Don’t think that an <code>INSERT</code> fails if there is already a row with the same primary key.</p>
<p>The same applies to an <code>UPDATE</code> — it will be executed, even if the row doesn’t exist.</p>
<p>The reason is that, by design, Cassandra rarely reads before writing to keep performance high. The only exceptions are described in the next section about lightweight transactions.</p>
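<p>A dictionary keyed by the primary key is a reasonable mental model for this behavior. The Python sketch below is a toy and not how Cassandra stores data: both operations are the same blind write, and the last write for a key wins.</p>
<pre><code class="lang-python"># Toy table: rows keyed by the full primary key.
table = {}


def write(user_email, creation_date, name):
    """Stand-in for both INSERT and UPDATE: the last write wins."""
    table[(user_email, creation_date)] = {"name": name}


insert = write  # in this model an INSERT is just a write
update = write  # ... and so is an UPDATE

insert("john@email.com", "2021-03-14 16:07:19", "Insert query")
update("john@email.com", "2021-03-14 16:10:19", "Update query")
insert("john@email.com", "2021-03-14 16:07:19", "Insert query updated")

print(len(table))  # 2
print(table[("john@email.com", "2021-03-14 16:07:19")]["name"])  # Insert query updated
</code></pre>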
<p>But there are restrictions on what you can do with an update or an insert:</p>
<ul>
<li>Counters can only be changed with <code>UPDATE</code>, not with <code>INSERT</code></li>
<li><code>IF NOT EXISTS</code> can only be used in combination with an <code>INSERT</code></li>
<li><code>IF EXISTS</code> can only be used in combination with an <code>UPDATE</code></li>
</ul>
<p>You will learn more about conditions in queries within the next section.</p>
<h3 id="heading-lightweight-transactions">Lightweight Transactions</h3>
<p>You can use conditions in queries using a feature called lightweight transactions (LWTs), which execute a read to check a certain condition before executing the write.</p>
<p>Let’s only update if an entry already exists, by using <code>IF EXISTS</code>:</p>
<pre><code class="lang-shell">cqlsh&gt;     
  UPDATE learn_cassandra.todo_by_user_email SET
    name = 'Update query with LWT'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:07:19.622+0000' IF EXISTS;

 [applied]
-----------
      True
</code></pre>
<p>The same works for an insert query using <code>IF NOT EXISTS</code>:</p>
<pre><code class="lang-shell">cqlsh&gt;      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'Yet another entry') IF NOT EXISTS;

 [applied]
-----------
      True
</code></pre>
<p>Those executions are expensive compared to simple <code>UPDATE</code> and <code>INSERT</code> queries. Still, if it’s business-critical, they are an excellent way to achieve transactional safety.</p>
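<p>The extra cost comes from the read that precedes each write. The following Python sketch shows that read-then-write shape in a single process. The two helper functions are hypothetical, and the sketch leaves out the coordination between replicas that a real lightweight transaction needs.</p>
<pre><code class="lang-python">table = {}


def insert_if_not_exists(key, row):
    """Toy IF NOT EXISTS: read first, write only when the key is free."""
    if key in table:
        return False
    table[key] = row
    return True


def update_if_exists(key, changes):
    """Toy IF EXISTS: read first, write only when the row is present."""
    if key not in table:
        return False
    table[key].update(changes)
    return True


key = ("john@email.com", "2021-03-14 16:07:19")
print(update_if_exists(key, {"name": "too early"}))         # False
print(insert_if_not_exists(key, {"name": "Insert query"}))  # True
print(insert_if_not_exists(key, {"name": "duplicate"}))     # False
print(update_if_exists(key, {"name": "Update query"}))      # True
</code></pre>
<p>The returned boolean plays the role of the <code>[applied]</code> column in the cqlsh output above.</p>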
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope you enjoyed the article.</p>
<p>If you liked it and feel the need to give me a round of applause, or just want to get in touch, <a target="_blank" href="https://twitter.com/sesigl">follow me on Twitter</a>.</p>
<p>I work at eBay Kleinanzeigen, one of the world’s biggest classified companies. By the way, <a target="_blank" href="https://jobs.ebayclassifiedsgroup.com/ebay-kleinanzeigen">we are hiring</a>!</p>
<p>Special thanks go to <a target="_blank" href="https://twitter.com/infotexture">Roger Sheen</a>, <a target="_blank" href="https://twitter.com/michaeldlfx">Michael de la Fontaine</a>, <a target="_blank" href="https://twitter.com/donut1987">Christian Baer</a>, <a target="_blank" href="https://twitter.com/thomasuebel">Thomas Uebel</a> and Swen Fuhrmann for excellent feedback and proof-reading.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><a target="_blank" href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeReplication.html">Cassandra docs about replication factor</a></li>
<li><a target="_blank" href="https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshConsistency.html?hl=consistency%2Clevel">Cassandra docs about consistency</a></li>
<li><a target="_blank" href="https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/dbIntHowDataMaintain.html#dbIntHowDataMaintain__dml_types_of_compaction">Compaction strategy overview</a></li>
<li><a target="_blank" href="https://www.slideshare.net/DataStax/the-missing-manual-for-leveled-compaction-strategy-wei-deng-datastax-cassandra-summit-2016">Details on Leveled Compaction Strategy</a> (<a target="_blank" href="https://www.youtube.com/watch?v=-5sNVvL8RwI">video</a>)</li>
<li><a target="_blank" href="https://www.datastax.com/blog/materialized-view-performance-cassandra-3x">How materialized views work</a></li>
<li><a target="_blank" href="https://issues.apache.org/jira/browse/CASSANDRA-15071?jql=status%20%3D%20Open%20AND%20priority%20in%20(Blocker%2C%20Urgent%2C%20Critical%2C%20High)%20AND%20text%20~%20%22materialized%20views%22">Known bugs with materialized views</a></li>
<li><a target="_blank" href="https://gist.github.com/irajhedayati/e5efba87c59d6bfca9550a039e84169b">Start multi-node cassandra base</a></li>
<li><a target="_blank" href="https://cassandra.apache.org/doc/latest/operating/metrics.html">Cassandra operation metrics</a></li>
<li><a target="_blank" href="https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/dbIntAboutDeletes.html">How is data deleted in Cassandra</a></li>
<li><a target="_blank" href="https://youtu.be/a84-UOGZiEg">How the Spark Cassandra connector works</a></li>
<li><a target="_blank" href="https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6">How sharding works</a></li>
<li><a target="_blank" href="https://www.baeldung.com/java-uuid">UUIDs in Java</a></li>
<li><a target="_blank" href="https://en.wikipedia.org/wiki/Universally_unique_identifier">Definition and history of UUIDs</a></li>
<li><a target="_blank" href="https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html">Deletes and Tombstones in Cassandra</a></li>
<li><a target="_blank" href="https://www.datastax.com/blog/basic-rules-cassandra-data-modeling">Basic rules of Cassandra modeling</a></li>
<li><a target="_blank" href="https://www.datastax.com/dev">DataStax</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Apache Flink Batch Example in Java ]]>
                </title>
                <description>
                    <![CDATA[ Flink Batch Example JAVA Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Prerequisites Unix-like environment (Linux, Mac OS X, Cygwin) git Maven (we recommend version 3.0.4) Java 7 ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/apache-flink-batch-example-in-java/</link>
                <guid isPermaLink="false">66c344d7790a62b5fbf7b884</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Java ]]>
                    </category>
                
                    <category>
                        <![CDATA[ toothbrush ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sun, 09 Feb 2020 23:27:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9ca8740569d1a4ca3379.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <h2 id="heading-flink-batch-example-java"><strong>Flink Batch Example JAVA</strong></h2>
<p>Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.</p>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<ul>
<li>Unix-like environment (Linux, Mac OS X, Cygwin)</li>
<li>git</li>
<li>Maven (we recommend version 3.0.4)</li>
<li>Java 7 or 8</li>
<li>IntelliJ IDEA or Eclipse IDE</li>
</ul>
<pre><code class="lang-text">git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests # this will take up to 10 minutes
</code></pre>
<h2 id="heading-datasets">Datasets</h2>
<p>For the batch processing data, we’ll use the MovieLens <a target="_blank" href="http://files.grouplens.org/datasets/movielens/ml-latest-small.zip">datasets</a>. This example uses movies.csv and ratings.csv. Create a new Java project and put both files in a folder under the project’s base directory; the code below expects them in ml-latest-small/.</p>
<h2 id="heading-example">Example</h2>
<p>We’ll compute the average rating per movie genre across the entire dataset.</p>
<h3 id="heading-environment-and-datasets">Environment and datasets</h3>
<p>First, create a new Java file. I’m going to name it AverageRating.java.</p>
<p>The first thing we’ll do is create the execution environment and load the CSV files into datasets, like this:</p>
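<pre><code class="lang-text">import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

// The statements below go inside a method, for example main(String[] args) throws Exception.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// movies.csv columns: movieId, title, genres
DataSet&lt;Tuple3&lt;Long, String, String&gt;&gt; movies = env.readCsvFile("ml-latest-small/movies.csv")
  .ignoreFirstLine()          // skip the header row
  .parseQuotedStrings('"')    // titles that contain commas are quoted
  .ignoreInvalidLines()       // drop rows that fail to parse
  .types(Long.class, String.class, String.class);

// ratings.csv columns: userId, movieId, rating, timestamp
DataSet&lt;Tuple2&lt;Long, Double&gt;&gt; ratings = env.readCsvFile("ml-latest-small/ratings.csv")
  .ignoreFirstLine()
  .includeFields(false, true, true, false)  // keep only movieId and rating
  .types(Long.class, Double.class);
</code></pre>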
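<p>Here we build a dataset of Tuple3&lt;Long, String, String&gt; (movie ID, title, genres) for the movies, skipping the header line, handling quoted strings, and ignoring invalid lines. We also build a dataset of Tuple2&lt;Long, Double&gt; (movie ID, rating) for the ratings, skipping the header line and keeping only the two columns we need.</p>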
<h3 id="heading-flink-processing">Flink Processing</h3>
<p>Now we’ll process the datasets with Flink. The result will be a List of Tuple2&lt;String, Double&gt;, where the String holds the genre and the Double holds the average rating.</p>
<p>First, we join the ratings dataset with the movies dataset on the movie ID, which is the first field of each tuple. The join produces a new tuple with the movie name, genre, and score; for movies that list several genres separated by |, only the first one is kept. Next, we group these tuples by genre and sum the scores within each group. Finally, we divide each sum by the number of ratings in the group to get the average.</p>
<pre><code class="lang-text">import java.util.List;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.types.DoubleValue;
import org.apache.flink.types.StringValue;
import org.apache.flink.util.Collector;

List&lt;Tuple2&lt;String, Double&gt;&gt; distribution = movies.join(ratings)
  .where(0)   // movie ID in movies
  .equalTo(0) // movie ID in ratings
  .with(new JoinFunction&lt;Tuple3&lt;Long, String, String&gt;, Tuple2&lt;Long, Double&gt;, Tuple3&lt;StringValue, StringValue, DoubleValue&gt;&gt;() {
    // Mutable value objects, reused for every joined pair instead of allocating new ones.
    private StringValue name = new StringValue();
    private StringValue genre = new StringValue();
    private DoubleValue score = new DoubleValue();
    private Tuple3&lt;StringValue, StringValue, DoubleValue&gt; result = new Tuple3&lt;&gt;(name, genre, score);

    @Override
    public Tuple3&lt;StringValue, StringValue, DoubleValue&gt; join(Tuple3&lt;Long, String, String&gt; movie, Tuple2&lt;Long, Double&gt; rating) throws Exception {
      name.setValue(movie.f1);
      genre.setValue(movie.f2.split("\\|")[0]); // keep only the first listed genre
      score.setValue(rating.f1);
      return result;
    }
  })
  .groupBy(1) // group by genre
  .reduceGroup(new GroupReduceFunction&lt;Tuple3&lt;StringValue, StringValue, DoubleValue&gt;, Tuple2&lt;String, Double&gt;&gt;() {
    @Override
    public void reduce(Iterable&lt;Tuple3&lt;StringValue, StringValue, DoubleValue&gt;&gt; iterable, Collector&lt;Tuple2&lt;String, Double&gt;&gt; collector) throws Exception {
      StringValue genre = null;
      int count = 0;
      double totalScore = 0;
      for (Tuple3&lt;StringValue, StringValue, DoubleValue&gt; movie : iterable) {
        genre = movie.f1;
        totalScore += movie.f2.getValue();
        count++;
      }

      // Emit one (genre, average rating) pair per group.
      collector.collect(new Tuple2&lt;&gt;(genre.getValue(), totalScore / count));
    }
  })
  .collect(); // triggers execution and returns the result to the client
</code></pre>
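<p>To sanity-check the join-and-average logic without running Flink, the same computation can be sketched with plain Java collections. This is only an illustration under assumptions: the class and method names (GenreAverageSketch, averageByGenre) are hypothetical, and the in-memory map and array stand in for the two CSV files.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java stand-in for the Flink job above; not part of the Flink API.
class GenreAverageSketch {

  // genresByMovie: movieId -> pipe-separated genres, as in movies.csv
  // ratings: rows of {movieId, rating}, as selected from ratings.csv
  static Map<String, Double> averageByGenre(Map<Long, String> genresByMovie, double[][] ratings) {
    Map<String, double[]> totals = new TreeMap<>(); // genre -> {sum, count}
    for (double[] row : ratings) {
      String genres = genresByMovie.get((long) row[0]);
      if (genres == null) {
        continue; // inner join: skip ratings without a matching movie
      }
      String genre = genres.split("\\|")[0]; // keep only the first genre, like the job above
      double[] acc = totals.computeIfAbsent(genre, g -> new double[2]);
      acc[0] += row[1];
      acc[1] += 1;
    }
    Map<String, Double> averages = new TreeMap<>();
    for (Map.Entry<String, double[]> entry : totals.entrySet()) {
      averages.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
    }
    return averages;
  }

  public static void main(String[] args) {
    Map<Long, String> movies = new LinkedHashMap<>();
    movies.put(1L, "Adventure|Animation");
    movies.put(2L, "Comedy");
    double[][] ratings = { {1, 4.0}, {1, 5.0}, {2, 3.0} };
    System.out.println(averageByGenre(movies, ratings));
  }
}
```

<p>Comparing the output of a small sample like this with what the Flink job returns for the same rows is a quick way to confirm that the grouping key and the division are doing what you expect.</p>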
<p>With this, you’ll have a working Flink batch processing application. Enjoy!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Apache Nifi works — surf on your dataflow, don’t drown in it ]]>
                </title>
                <description>
                    <![CDATA[ By François Paupier Introduction That’s a crazy flow of water. Just like your application deals with a crazy stream of data. Routing data from one storage to another, applying validation rules and addressing questions of data governance, reliability ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/nifi-surf-on-your-dataflow-4f3343c50aa2/</link>
                <guid isPermaLink="false">66c35bf76f7f70d92b594d41</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Productivity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 03 May 2019 15:42:14 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*cAhBbxvhy-AOtmml" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By François Paupier</p>
<h3 id="heading-introduction">Introduction</h3>
<p>That’s a crazy flow of water. Just like your application deals with a crazy stream of data. Routing data from one storage system to another, applying validation rules, and addressing questions of data governance and reliability in a Big Data ecosystem are hard to get right if you do it all by yourself.</p>
<p>Good news: you don’t have to build your dataflow solution from scratch. Apache NiFi has got your back!</p>
<p>At the end of this article, you’ll be a NiFi expert — ready to build your data pipeline.</p>
<h4 id="heading-what-i-will-cover-in-this-article">What I will cover in this article:</h4>
<ul>
<li>What Apache NiFi is, in which situations you should use it, and which key concepts you need to understand.</li>
</ul>
<h4 id="heading-what-i-wont-cover">What I won’t cover:</h4>
<ul>
<li>Installation, deployment, monitoring, security, and administration of a NiFi cluster.</li>
</ul>
<p>For your convenience, here is the table of contents. Feel free to go straight to wherever your curiosity takes you. If you’re a NiFi first-timer, I recommend going through this article in the indicated order.</p>
<h4 id="heading-table-of-content">Table of Contents</h4>
<ul>
<li>I — <a class="post-section-overview" href="#741e">What is Apache NiFi?</a>  </li>
<li><a class="post-section-overview" href="#9421">Defining NiFi</a>   </li>
<li><a class="post-section-overview" href="#6cf2">Why use NiFi?</a></li>
<li>II — <a class="post-section-overview" href="#b75e">Apache Nifi under the microscope</a>  </li>
<li><a class="post-section-overview" href="#61bd">FlowFile</a>   </li>
<li><a class="post-section-overview" href="#d187">Processor</a>  </li>
<li><a class="post-section-overview" href="#924a">Process Group</a>  </li>
<li><a class="post-section-overview" href="#af10">Connection</a>  </li>
<li><a class="post-section-overview" href="#8ca0">Flow Controller</a></li>
<li><a class="post-section-overview" href="#812c">Conclusion and call to action</a></li>
</ul>
<h3 id="heading-what-is-apache-nifi">What is Apache NiFi?</h3>
<p>On the <a target="_blank" href="https://nifi.apache.org/index.html">website</a> of the Apache NiFi project, you can find the following definition:</p>
<blockquote>
<p>An easy to use, powerful, and reliable system to process and distribute data.</p>
</blockquote>
<p>Let’s analyze the keywords there.</p>
<h4 id="heading-defining-nifi">Defining NiFi</h4>
<p><strong>Process and distribute data</strong><br>That’s the gist of NiFi. It moves data between systems and gives you tools to process this data.</p>
<p>NiFi can deal with a great variety of data sources and formats. You take data in from one source, transform it, and push it to a different data sink.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/oizS79jFx3hHFoRF7DfXvQya-hmSTbdlUbc1" alt="Image" width="800" height="485" loading="lazy">
<em>Ten-thousand-foot view of Apache NiFi: NiFi pulls data from multiple data sources, then enriches and transforms it to populate a key-value store.</em></p>
<p><strong>Easy to use</strong><br>Processors (<em>the boxes</em>) linked by connectors (<em>the arrows</em>) create a flow. NiFi offers a <a target="_blank" href="https://www.wikiwand.com/en/Flow-based_programming">flow-based programming</a> experience.</p>
<p>NiFi makes it possible to understand, at a glance, a set of dataflow operations that would take hundreds of lines of source code to implement.</p>
<p>Consider the pipeline below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/SDRmBt5o7tQkjmIn5iObqW6-spFw-NFEzaH4" alt="Image" width="800" height="493" loading="lazy">
<em>An overly minimalist data pipeline</em></p>
<p>To translate the data flow above into NiFi, you go to the NiFi graphical user interface, drag and drop three components onto the canvas, and that’s it. It takes two minutes to build.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/phn6Q-c9SkDImkbUt6FVHkuiojIRTiBuuuzJ" alt="Image" width="800" height="367" loading="lazy">
<em>A simple validation data flow as seen through the NiFi canvas</em></p>
<p>Now, if you write code to do the same thing, it’s likely to take several hundred lines to achieve a similar result.</p>
<p>You don’t capture the essence of the pipeline through code as you do with a flow-based approach. NiFi is a more expressive way to build a data pipeline; it’s <em>designed to do that</em>.</p>
<p><strong>Powerful</strong><br>NiFi provides <a target="_blank" href="https://www.nifi.rocks/apache-nifi-processors/">many processors</a> out of the box (293 in NiFi 1.9.2). You’re on the shoulders of a giant. Those standard processors handle the vast majority of use cases you may encounter.</p>
<p>NiFi is highly concurrent, yet its internals encapsulate the associated complexity. Processors offer you a high-level abstraction that hides the inherent complexity of parallel programming. Processors run simultaneously, and you can spawn multiple threads of a processor to cope with the load.</p>
<p>Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently shields the pipeline builder from the complexities of concurrency.</p>
<p><strong>Reliable</strong><br>The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models like <a target="_blank" href="http://sosp.org/2001/papers/welsh.pdf">SEDA</a>.</p>
<p>For a dataflow system, one of the main topics to address is <a target="_blank" href="https://whatis.techtarget.com/definition/reliability">reliability</a>. You want to be sure that data sent somewhere is effectively received.</p>
<p>NiFi achieves a high level of reliability through multiple mechanisms that keep track of the state of the system at any point in time. Those mechanisms are configurable so you can make the appropriate <a target="_blank" href="http://apache-nifi-users-list.2361937.n4.nabble.com/template/NamlServlet.jtp?macro=print_post&amp;node=1532">tradeoffs</a> between latency and throughput required by your applications.</p>
<p>NiFi tracks the history of each piece of data with its lineage and provenance features. It makes it possible to know what transformation happens on each piece of information.</p>
<p>The data lineage solution proposed by Apache NiFi proves to be an excellent tool for auditing a data pipeline. Data lineage features are essential to bolster confidence in big data and AI systems in a context where transnational actors such as the European Union propose <a target="_blank" href="https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines/1#privacy">guidelines</a> to support accurate data processing.</p>
<h4 id="heading-why-using-nifi">Why use NiFi?</h4>
<p>First, I want to make it clear I’m not here to evangelize NiFi. My goal is to give you enough elements so you can make an informed decision on the best way to build your data pipeline.</p>
<p>It’s useful to keep in mind the <a target="_blank" href="https://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data/">four Vs</a> of big data when sizing your solution.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9ct69RlHZVlEOBUUQXce2dQSUUyuQHlsycq2" alt="Image" width="800" height="488" loading="lazy">
<em>The four Vs of Big Data</em></p>
<ul>
<li><strong>Volume</strong>: At what scale do you operate? In order of magnitude, are you closer to a few gigabytes or hundreds of petabytes?</li>
<li><strong>Variety</strong>: How many data sources do you have? Is your data structured? If yes, does the schema vary often?</li>
<li><strong>Velocity</strong>: What is the frequency of the events you process? Is it credit card payments? Is it a daily performance report sent by an IoT device?</li>
<li><strong>Veracity</strong>: Can you trust the data? Alternatively, do you need to apply multiple cleaning operations before manipulating it?</li>
</ul>
<p>NiFi seamlessly ingests data from multiple data sources and provides mechanisms to handle different schemas in the data. Thus, it shines when there is a high <strong>variety</strong> in the data.</p>
<p>NiFi is particularly valuable if data is of <strong>low veracity</strong>, since it provides multiple processors to clean and format the data.</p>
<p>With its configuration options, NiFi can address a broad range of volume/velocity situations.</p>
<h4 id="heading-an-increasing-list-of-applications-for-data-routing-solutions">An increasing list of applications for data routing solutions</h4>
<p>New regulations, the rise of the Internet of Things and the flow of data it generates emphasize the relevance of tools such as Apache NiFi.</p>
<ul>
<li>Microservices are trendy. In those loosely coupled services, the <a target="_blank" href="https://auth0.com/blog/introduction-to-microservices-part-4-dependencies/">data is the contract</a> between the services. NiFi is a robust way to route data between those services.</li>
<li>The Internet of Things brings a multitude of data to the cloud. Ingesting and validating data from the edge to the cloud poses a lot of new challenges that NiFi can efficiently address (primarily through <a target="_blank" href="https://nifi.apache.org/minifi/index.html">MiNiFi</a>, the NiFi subproject for edge devices).</li>
<li>New <a target="_blank" href="https://ec.europa.eu/futurium/en/ai-alliance-consultation/best-practices">guidelines</a> and regulations are put in place to readjust the Big Data economy. In this context of increasing monitoring, it is vital for businesses to have a clear overview of their data pipeline. NiFi data lineage, for example, can be helpful on the path towards compliance with regulations.</li>
</ul>
<h4 id="heading-bridge-the-gap-between-big-data-experts-and-the-others">Bridge the gap between big data experts and the others</h4>
<p>As you can see from the user interface, a dataflow expressed in NiFi is an excellent way to communicate about your data pipeline. It can help members of your organization become more knowledgeable about what’s going on in the data pipeline.</p>
<ul>
<li>An analyst is asking why this data arrives here in this form? Sit together and walk through the flow. In five minutes, you can give someone a strong understanding of the Extract, Transform, and Load (<em>ETL</em>) pipeline.</li>
<li>You want feedback from your peers on a new <a target="_blank" href="https://community.hortonworks.com/questions/77336/nifi-best-practices-for-error-handling.html">error handling flow</a> you created? NiFi makes it a design decision to consider error paths as likely as valid outcomes. Expect the flow review to be shorter than a traditional code review.</li>
</ul>
<h4 id="heading-should-you-use-it-yes-no-maybe">Should you use it? Yes, No, Maybe?</h4>
<p>NiFi brands itself as easy to use. Still, it is an enterprise dataflow platform. It offers a complete set of features from which you may only need a reduced subset. Adding a new tool to the stack is not benign.</p>
<p>If you are starting from scratch and manage a small amount of data from trusted data sources, you may be better off setting up your own Extract, Transform, and Load (<em>ETL</em>) pipeline. Maybe a <a target="_blank" href="https://martin.kleppmann.com/2015/06/02/change-capture-at-berlin-buzzwords.html">change data capture</a> from a database and some data preparation scripts are all you need.</p>
<p>On the other hand, if you work in an environment with existing big data solutions in use (be it for <a target="_blank" href="https://fr.hortonworks.com/apache/hdfs/">storage</a>, <a target="_blank" href="https://spark.apache.org/">processing</a> or <a target="_blank" href="https://kafka.apache.org/">messaging</a>), NiFi integrates well with them and is more likely to be a quick win. You can leverage the out-of-the-box connectors to those other Big Data solutions.</p>
<p>It’s easy to be hyped by new solutions. List your requirements and <strong>choose the solution that answers your needs as simply as possible</strong>.</p>
<p>Now that we have seen the big picture of Apache NiFi, let’s look at its key concepts and dissect its internals.</p>
<h3 id="heading-apache-nifi-under-the-microscope">Apache NiFi under the microscope</h3>
<p>“NiFi is boxes-and-arrows programming” may be OK for communicating the big picture. However, if you have to operate NiFi, you may want to understand a bit more about how it works.</p>
<p>In this second part, I explain the critical concepts of Apache NiFi with diagrams. This black box won’t be a black box to you afterward.</p>
<h4 id="heading-unboxing-apache-nifi">Unboxing Apache NiFi</h4>
<p>When you start NiFi, you land on its web interface. The web UI is the blueprint on which you design and control your data pipeline.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7RJGNI9l458xNVh4-2Y3rm0Jt0iKLUWgAMVJ" alt="Image" width="800" height="433" loading="lazy">
<em>Apache NiFi user interface: build your pipeline by dragging and dropping components onto the interface</em></p>
<p>In NiFi, you assemble <em>processors</em> linked together by <em>connections</em>. In the sample dataflow introduced previously, there are three processors.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/2BFY2i1FOdRL91iGXkagqlQ3zNacNMFrDkZF" alt="Image" width="800" height="367" loading="lazy">
<em>Three processors linked together by two queues</em></p>
<p>The NiFi canvas is the workspace in which the pipeline builder operates.</p>
<h4 id="heading-making-sense-of-nifi-terminology">Making sense of NiFi terminology</h4>
<p>To express your dataflow in NiFi, you must first master its language. No worries, a few terms are enough to grasp the concepts behind it.</p>
<p>The black boxes are called <em>processors,</em> and they exchange chunks of information named <em>FlowFiles</em> through queues that are named <em>connections</em>. Finally, the <em>Flow Controller</em> is responsible for managing the resources between those components.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9F1Zm6QjmGg2HghZODu-E7c3-d9BcUxzLxuw" alt="Image" width="800" height="364" loading="lazy">
<em>Processor, FlowFile, Connection, and the Flow Controller: four essential concepts in NiFi</em></p>
<p>Let’s take a look at how this works under the hood.</p>
<h4 id="heading-flowfile">FlowFile</h4>
<p>In NiFi, the <strong>FlowFile</strong> is the information packet moving through the processors of the pipeline.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IpdEyfHPnkqw-LLhHIcxHb7whRqxmsWg3unl" alt="Image" width="800" height="591" loading="lazy">
<em>Anatomy of a FlowFile — It contains attributes of the data as well as a reference to the associated data</em></p>
<p>A FlowFile comes in two parts:</p>
<ul>
<li><strong>Attributes</strong>, which are key/value pairs. For example, the file name, file path, and a unique identifier are standard attributes.</li>
<li><strong>Content</strong>, a reference to the stream of bytes that composes the FlowFile content.</li>
</ul>
<p>The FlowFile does not contain the data itself. That would severely limit the throughput of the pipeline.</p>
<p>Instead, a FlowFile holds a pointer that references data stored at some place in the local storage. This place is called the <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#content-repository">Content Repository</a><em>.</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/YI-YbbYlradJJNETarUQDJgNeHrZOilsDt4E" alt="Image" width="800" height="482" loading="lazy">
<em>The Content Repository stores the content of the FlowFile</em></p>
<p>To access the content, the FlowFile <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#deeper-view-content-claim">claims</a> the resource from the Content Repository. The latter keeps track of the exact disk offset of the content and streams it back to the FlowFile.</p>
<p><strong>Not all processors need to access the content of the FlowFile</strong> to perform their operations. For example, aggregating the content of two FlowFiles doesn’t require loading their content into memory.</p>
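<p>The split between attributes and a content reference can be sketched in a few lines of Java. This is a deliberately simplified model under assumptions: the classes below (FlowFileSketch, ContentClaim, FlowFile) are hypothetical and are not NiFi’s real classes, and a byte array stands in for the Content Repository.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a FlowFile; these are not NiFi's real classes.
class FlowFileSketch {

  // A pointer into the content store: where the bytes start and how many there are.
  static final class ContentClaim {
    final int offset;
    final int length;

    ContentClaim(int offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  // A FlowFile carries attributes plus a claim, never the bytes themselves.
  static final class FlowFile {
    final Map<String, String> attributes;
    final ContentClaim claim;

    FlowFile(Map<String, String> attributes, ContentClaim claim) {
      this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
      this.claim = claim;
    }
  }

  // Routing on an attribute only reads metadata; the content is never loaded.
  static String route(FlowFile flowFile) {
    String name = flowFile.attributes.getOrDefault("filename", "");
    return name.endsWith(".csv") ? "csv" : "other";
  }

  // Only processors that need the bytes resolve the claim against the store.
  static byte[] read(byte[] contentStore, ContentClaim claim) {
    byte[] out = new byte[claim.length];
    System.arraycopy(contentStore, claim.offset, out, 0, claim.length);
    return out;
  }

  public static void main(String[] args) {
    byte[] store = "id,name\n1,nifi\n".getBytes(StandardCharsets.UTF_8);
    Map<String, String> attrs = new HashMap<>();
    attrs.put("filename", "users.csv");
    FlowFile flowFile = new FlowFile(attrs, new ContentClaim(0, store.length));
    System.out.println(route(flowFile));
    System.out.println(new String(read(store, flowFile.claim), StandardCharsets.UTF_8));
  }
}
```

<p>Passing a FlowFile from one step to the next only moves a small object around, which is why keeping the bytes out of it helps throughput.</p>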
<p>When a processor modifies the content of a FlowFile, the previous data is kept. NiFi uses <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#copy-on-write">copy-on-write</a>: it writes the modified content to a new location instead of overwriting it. The original information is left intact in the Content Repository.</p>
<p><strong>Example</strong><br>Consider a processor that compresses the content of a FlowFile. The original content remains in the Content Repository, and a new entry is created for the compressed content.</p>
<p>The Content Repository finally returns the reference to the compressed content. The FlowFile is updated to point to the compressed data.</p>
<p>The drawing below sums up the example with a processor that compresses the content of FlowFiles.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/3EOfYKGRFYXePKqfELvAfh5Ds2n02yqH4OPE" alt="Image" width="800" height="477" loading="lazy">
<em>Copy-on-write in NiFi — The original content is still present in the repository after a FlowFile modification.</em></p>
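<p>The compression example can be sketched as follows. The assumptions are loud ones: CopyOnWriteSketch is a hypothetical in-memory store, references are plain list indexes, and GZIP stands in for whatever compression a real processor would apply. It is not how NiFi implements its Content Repository.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical append-only content store illustrating copy-on-write; not NiFi's implementation.
class CopyOnWriteSketch {

  private final List<byte[]> entries = new ArrayList<>();

  // Stores content as a new entry and returns its reference (an index here).
  int write(byte[] content) {
    entries.add(content.clone());
    return entries.size() - 1;
  }

  byte[] read(int ref) {
    return entries.get(ref).clone();
  }

  // "Modifying" content never overwrites: the compressed bytes go to a new entry.
  int compress(int ref) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(entries.get(ref));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return write(buffer.toByteArray());
  }

  public static void main(String[] args) {
    CopyOnWriteSketch repo = new CopyOnWriteSketch();
    int original = repo.write("hello hello hello hello".getBytes(StandardCharsets.UTF_8));
    int compressed = repo.compress(original); // the FlowFile would now point here
    System.out.println("original ref=" + original + ", compressed ref=" + compressed);
    System.out.println(new String(repo.read(original), StandardCharsets.UTF_8)); // still readable
  }
}
```

<p>Because the original entry is never touched, anything that still holds the old reference keeps reading the uncompressed bytes.</p>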
<p><strong>Reliability</strong><br>NiFi claims to be reliable, but how does that work in practice? The attributes of all the FlowFiles currently in use, as well as the reference to their content, are stored in the <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#flowfile-repository">FlowFile Repository.</a></p>
<p>At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile Repository, in a <a target="_blank" href="https://en.wikipedia.org/wiki/Write-ahead_logging">write-ahead log</a>, before it is performed.</p>
<p>For each FlowFile that currently exists in the system, the FlowFile repository stores:</p>
<ul>
<li>The FlowFile attributes</li>
<li>A pointer to the content of the FlowFile located in the Content Repository</li>
<li>The state of the FlowFile. For example: to which queue does the FlowFile belong at this instant.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/SUxFXGFyO5SGAez3bfU8danIRdGW-Mqm447x" alt="Image" width="800" height="393" loading="lazy">
<em>The FlowFile Repository contains metadata about the files currently in the flow.</em></p>
<p>The FlowFile repository gives us the most current state of the flow; thus it’s a powerful tool to recover from an outage.</p>
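<p>The write-ahead idea can be illustrated with a small sketch. It rests on assumptions: WriteAheadLogSketch is a hypothetical class, an in-memory list stands in for the durable log on disk, and the only state tracked is the queue each FlowFile sits in. NiFi’s actual FlowFile Repository is more involved.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical write-ahead log for FlowFile state; a sketch, not NiFi's FlowFile Repository.
class WriteAheadLogSketch {

  // Stands in for a durable, append-only file on disk.
  private final List<String[]> log = new ArrayList<>();
  // In-memory view: FlowFile id -> queue it currently sits in.
  private final Map<String, String> currentQueue = new HashMap<>();

  // Record the intended change first, then apply it.
  void moveToQueue(String flowFileId, String queue) {
    log.add(new String[] {flowFileId, queue}); // 1. append to the log
    currentQueue.put(flowFileId, queue);       // 2. perform the change
  }

  List<String[]> durableLog() {
    return log;
  }

  // After a crash the in-memory view is gone; replaying the log rebuilds it.
  static Map<String, String> recover(List<String[]> durableLog) {
    Map<String, String> rebuilt = new HashMap<>();
    for (String[] entry : durableLog) {
      rebuilt.put(entry[0], entry[1]); // later entries win, so only the latest state remains
    }
    return rebuilt;
  }

  public static void main(String[] args) {
    WriteAheadLogSketch repo = new WriteAheadLogSketch();
    repo.moveToQueue("ff-1", "validate");
    repo.moveToQueue("ff-2", "validate");
    repo.moveToQueue("ff-1", "store");
    System.out.println(recover(repo.durableLog()));
  }
}
```

<p>Writing to the log before applying the change means a crash between the two steps can lose an in-memory update but never a recorded one.</p>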
<p>NiFi provides another tool to track the complete history of all the FlowFiles in the flow: the Provenance Repository.</p>
<p><strong>Provenance Repository</strong><br>Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at this point. The name for this snapshot in NiFi is a <em>Provenance Event</em>. The <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#provenance-repository">Provenance Repository</a> records Provenance Events.</p>
<p>Provenance enables us to retrace the lineage of the data and build the full chain of custody for every piece of information processed in NiFi.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/2-hoPUXtfTmAm4GzXmHS9l7NBiO5rvVaOqnt" alt="Image" width="800" height="497" loading="lazy">
<em>The Provenance Repository stores the metadata and context information of each FlowFile</em></p>
<p>On top of offering the complete lineage of the data, the Provenance Repository also offers to replay the data from any point in time.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rudXk2KywkeoBRQfmIIXTgzYfxdySlEGdLrB" alt="Image" width="800" height="877" loading="lazy">
<em>Trace back the history of your data thanks to the Provenance Repository</em></p>
<p>Wait, what’s the difference between the FlowFile Repository and the Provenance Repository?</p>
<p>The idea behind the FlowFile Repository and the Provenance Repository is quite similar, but they don’t address the same issue.</p>
<ul>
<li>The FlowFile repository is a log that contains only the latest state of the in-use FlowFiles in the system. It is the most recent picture of the flow and makes it possible to recover from an outage quickly.</li>
<li>The Provenance Repository, on the other hand, is more exhaustive since it tracks the complete life cycle of every FlowFile that has been in the flow.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/gKcfJu7dHmXo7oRscnS1ZPXS1Hsu5LggJO4B" alt="Image" width="800" height="488" loading="lazy">
<em>The Provenance Repository adds a time dimension where the FlowFile Repository is one snapshot</em></p>
<p>Where the FlowFile Repository gives you only the most recent picture of the system, the Provenance Repository gives you a collection of photos, in other words <em>a video</em>. You can rewind to any moment in the past, investigate the data, and replay operations from a given time. It provides a complete lineage of the data.</p>
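<p>To make the contrast concrete, here is a minimal sketch of a provenance-style log that keeps every event instead of only the latest state. Everything in it is an assumption made for illustration: ProvenanceSketch and its Event fields are hypothetical and do not mirror NiFi’s real event schema, and content references are plain integers.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical provenance log; the fields are illustrative, not NiFi's real event schema.
class ProvenanceSketch {

  static final class Event {
    final String flowFileId;
    final String type;    // for example "RECEIVE", "CONTENT_MODIFIED", "SEND"
    final int contentRef; // reference to the content as it was at this step

    Event(String flowFileId, String type, int contentRef) {
      this.flowFileId = flowFileId;
      this.type = type;
      this.contentRef = contentRef;
    }
  }

  // Append-only: events are never updated or removed in this sketch.
  private final List<Event> events = new ArrayList<>();

  void record(String flowFileId, String type, int contentRef) {
    events.add(new Event(flowFileId, type, contentRef));
  }

  // Lineage: every event recorded for one FlowFile, oldest first.
  List<Event> lineage(String flowFileId) {
    List<Event> result = new ArrayList<>();
    for (Event event : events) {
      if (event.flowFileId.equals(flowFileId)) {
        result.add(event);
      }
    }
    return result;
  }

  // Replay support: the content reference a FlowFile had at a given step of its history.
  int contentRefAt(String flowFileId, int step) {
    return lineage(flowFileId).get(step).contentRef;
  }

  public static void main(String[] args) {
    ProvenanceSketch provenance = new ProvenanceSketch();
    provenance.record("ff-1", "RECEIVE", 0);
    provenance.record("ff-1", "CONTENT_MODIFIED", 1);
    provenance.record("ff-1", "SEND", 1);
    for (Event event : provenance.lineage("ff-1")) {
      System.out.println(event.type + " -> content " + event.contentRef);
    }
  }
}
```

<p>Replaying from a past step only works if the content behind the old reference still exists, which is where copy-on-write in the Content Repository comes in.</p>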
<h4 id="heading-flowfile-processor">FlowFile Processor</h4>
<p>A <strong>processor</strong> is a black box that performs an operation. Processors have access to the attributes and the content of the FlowFile to perform all kinds of actions. They enable you to ingest data, perform standard data transformation and validation tasks, and save the data to various data sinks.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/8jBgnXwT8nBsYVkjUuIlec9CZRG0GXkRARff" alt="Image" width="800" height="485" loading="lazy">
<em>Three different kinds of processors</em></p>
<p>NiFi comes with many processors when you install it. If you don’t find the perfect one for your use case, it’s still possible to build your own processor. <a target="_blank" href="https://community.hortonworks.com/articles/4318/build-custom-nifi-processor.html">Writing custom processors</a> is outside the scope of this blog post.</p>
<p>Processors are high-level abstractions that fulfill one task. This abstraction is very convenient because it shields the pipeline builder from the inherent difficulties of concurrent programming and the implementation of error handling mechanisms.</p>
<p>Processors expose an interface with multiple configuration settings to fine-tune their behavior.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1DnkzZiW9KPhpovkcRIxrU6PI4yPIMhJCf53" alt="Image" width="800" height="467" loading="lazy">
<em>Zoom on a NiFi Processor for <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ValidateRecord/index.html">record validation</a>: the pipeline builder specifies the high-level configuration options and the black box hides the implementation details.</em></p>
<p>The properties of those processors are the last link between NiFi and the business reality of your application requirements.</p>
<p>The devil is in the details, and pipeline builders spend most of their time fine-tuning those properties to match the expected behavior.</p>
<p><strong>Scaling</strong><br>For each processor, you can specify the number of concurrent tasks you want to run simultaneously. This way, the <em>Flow Controller</em> allocates more resources to this processor, increasing its throughput. Processors share threads. If one processor requests more threads, other processors have fewer threads available to execute. Details on how the Flow Controller allocates threads are available <a target="_blank" href="https://community.hortonworks.com/articles/221808/understanding-nifi-max-thread-pools-and-processor.html">here</a>.</p>
<p><strong>Horizontal scaling.</strong> Another way to scale is to increase the number of nodes in your NiFi cluster. <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering">Clustering</a> servers makes it possible to increase your processing capability using commodity hardware.</p>
<h4 id="heading-process-group">Process Group</h4>
<p>This one is straightforward now that we’ve seen what processors are.</p>
<p>A bunch of processors put together with their connections can form a process group. You add an input port and an output port so it can receive and send data.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/o97mtJQX9Lv2qGbgy8NBP8C2r2jNfQnQRg3I" alt="Image" width="800" height="603" loading="lazy">
<em>Building a new processor from three existing processors</em></p>
<p>Process groups are an easy way to create new processors based on existing ones.</p>
<h4 id="heading-connections">Connections</h4>
<p>Connections are the queues between processors. These queues allow processors to interact at differing rates. Connections can have different capacities, just as water pipes come in different sizes.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/8iRHt6Xy7l2S8OWCfZyPEYAKmNXqOhCGqQ5h" alt="Image" width="800" height="230" loading="lazy">
<em>Various capacities for different connectors. Here we have capacity C1 &gt; capacity C2</em></p>
<p>Because processors consume and produce data at different rates depending on the operations they perform, connections act as buffers of FlowFiles.</p>
<p>There is a limit on how much data can be in a connection. Similarly, when your water pipe is full, you can’t add more water, or it overflows.</p>
<p>In NiFi you can set limits on the number of FlowFiles and the size of their aggregated content going through the connections.</p>
<p><strong>What happens when you send more data than the connection can handle?</strong></p>
<p>If the number of FlowFiles or the quantity of data goes above the defined threshold, <em>backpressure</em> is applied. The Flow Controller won’t schedule the previous processor to run again until there is room in the queue.</p>
<p>Let’s say you have a limit of 10 000 FlowFiles between two processors. At some point, the connection has 7 000 elements in it. It is ok since the limit is 10 000. <em>P1</em> can still send data through the connection to <em>P2</em>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ZpaLFmUmNG2L16aBV7Kjk9ADhs8CCBc39Fzr" alt="Image" width="800" height="220" loading="lazy">
<em>Two processors linked by a connector with its limit respected.</em></p>
<p>Now let’s say that processor <em>P1</em> sends 4 000 new FlowFiles to the connection.<br>7 000 + 4 000 = 11 000 → We go above the connection threshold of 10 000 FlowFiles.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/wucrVEx2N8vgxf8e9Ss9gzR3C1PcOvo9uVDN" alt="Image" width="800" height="218" loading="lazy">
<em>Processor P1 not scheduled until the connector goes back below its threshold.</em></p>
<p>The limits are <em>soft limits</em>, meaning they can be exceeded. However, once they are, the previous processor, <em>P1</em>, won’t be scheduled until the connector goes back below its threshold value of 10 000 FlowFiles.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/KKOd45PcA8yEav1p593VDtzf2MCX5Fc8g2pG" alt="Image" width="800" height="231" loading="lazy">
<em>Number of FlowFiles in the connector comes back below the threshold. The Flow Controller schedules the processor P1 for execution again.</em></p>
<p>This simplified example gives the big picture of how <a target="_blank" href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a> works.</p>
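<p>The P1/P2 scenario above can be sketched in a few lines of Python. This is only an illustrative model, not NiFi code: the <code>Connection</code> class, its <code>object_threshold</code> parameter, and the <code>can_schedule_upstream</code> method are names invented for this example. It shows what a soft limit means: a batch that pushes the queue past the threshold is still accepted, but the upstream processor is not scheduled again until the queue drains below the threshold.</p>

```python
from collections import deque


class Connection:
    """Toy model of a queue between two processors with a soft limit."""

    def __init__(self, object_threshold):
        self.object_threshold = object_threshold
        self.queue = deque()

    def can_schedule_upstream(self):
        # Backpressure: the upstream processor only runs while the queue
        # holds fewer FlowFiles than the threshold.
        return self.object_threshold > len(self.queue)

    def enqueue(self, flowfiles):
        # Soft limit: a batch from an already-running processor is accepted
        # even if it pushes the queue past the threshold.
        self.queue.extend(flowfiles)

    def dequeue(self, count):
        # The downstream processor consumes up to `count` FlowFiles.
        taken = min(count, len(self.queue))
        return [self.queue.popleft() for _ in range(taken)]


connection = Connection(object_threshold=10_000)
connection.enqueue(range(7_000))
print(connection.can_schedule_upstream())  # True: 7 000 is under the limit

connection.enqueue(range(4_000))
print(len(connection.queue))               # 11000: the limit was exceeded
print(connection.can_schedule_upstream())  # False: P1 has to wait

connection.dequeue(2_000)
print(connection.can_schedule_upstream())  # True: 9 000 is under the limit
```

<p>The sketch only tracks the number of FlowFiles. A second check on the total content size could be added in the same way.</p>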
<p>You want to set up connection thresholds appropriate to the Volume and Velocity of the data to handle. <em>Keep in mind the Four Vs</em>.</p>
<p>The idea of exceeding a limit may sound odd. When the number of FlowFiles or the amount of associated data goes beyond the threshold, a <a target="_blank" href="https://community.hortonworks.com/articles/184990/dissecting-the-nifi-connection-heap-usage-and-perf.html">swap mechanism</a> is triggered.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0Qf2xfUhSaq43Ma5pWYkgVnqBAWkSvu1gVlV" alt="Image" width="800" height="517" loading="lazy">
<em>Active queue and Swap in NiFi connectors</em></p>
<p>For another example of backpressure, <a target="_blank" href="http://mail-archives.apache.org/mod_mbox/nifi-users/201604.mbox/%3CBLU436-SMTP24995D5F6EDF5985AADFE23CE680@phx.gbl%3E">this mail thread</a> can help.</p>
<p><strong>Prioritizing FlowFiles</strong><br>The connectors in NiFi are highly configurable. You can choose <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#prioritization">how you prioritize</a> FlowFiles in the queue to decide which one to process next.</p>
<p>Among the available options is, for example, the First In First Out order (<em>FIFO</em>). You can also use an attribute of your choice from the FlowFile to prioritize incoming FlowFiles.</p>
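<p>As a rough illustration of attribute-based prioritization, here is a small Python sketch built on a heap. It is not NiFi code: the <code>PrioritizedConnection</code> class, the dictionary layout of a FlowFile, and the rule that a lower value is processed first are all assumptions made for this example. FlowFiles with the same attribute value fall back to arrival order, which is FIFO.</p>

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy queue that hands out FlowFiles by an attribute, then by arrival order."""

    def __init__(self, attribute):
        self.attribute = attribute
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves FIFO order

    def enqueue(self, flowfile):
        # Assumption: a lower attribute value means "process sooner".
        priority = int(flowfile["attributes"][self.attribute])
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def dequeue(self):
        _, _, flowfile = heapq.heappop(self._heap)
        return flowfile


connection = PrioritizedConnection(attribute="priority")
connection.enqueue({"name": "a.csv", "attributes": {"priority": "5"}})
connection.enqueue({"name": "b.csv", "attributes": {"priority": "1"}})
connection.enqueue({"name": "c.csv", "attributes": {"priority": "5"}})

order = [connection.dequeue()["name"] for _ in range(3)]
print(order)  # ['b.csv', 'a.csv', 'c.csv']
```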
<h4 id="heading-flow-controller">Flow Controller</h4>
<p>The Flow Controller is the glue that brings everything together. It allocates and manages threads for processors. It’s what executes the dataflow.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/XrTQX8uhG36C9plkkVd-BtbBe3hn5JEpNi8N" alt="Image" width="800" height="420" loading="lazy">
<em>The Flow Controller coordinates the allocation of resources for processors.</em></p>
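<p>One way to picture this role is a scheduler that owns a fixed-size thread pool and lends threads to processors. The Python sketch below is a loose analogy only. The <code>FlowController</code> class, its <code>max_threads</code> parameter, and the two lambda "processors" are hypothetical and do not reflect how NiFi is implemented.</p>

```python
from concurrent.futures import ThreadPoolExecutor


class FlowController:
    """Toy scheduler: runs processors on a shared, fixed-size thread pool."""

    def __init__(self, max_threads):
        self._pool = ThreadPoolExecutor(max_workers=max_threads)

    def run_once(self, processors, flowfile):
        # Submit one task per processor; the pool decides which thread runs it.
        futures = {name: self._pool.submit(task, flowfile)
                   for name, task in processors.items()}
        return {name: future.result() for name, future in futures.items()}

    def shutdown(self):
        self._pool.shutdown(wait=True)


processors = {
    "uppercase": lambda content: content.upper(),
    "count_bytes": lambda content: len(content.encode("utf-8")),
}

controller = FlowController(max_threads=2)
results = controller.run_once(processors, "hello nifi")
controller.shutdown()
print(results)  # {'uppercase': 'HELLO NIFI', 'count_bytes': 10}
```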
<p>Also, the Flow Controller makes it possible to add Controller Services.</p>
<p>Those services facilitate the management of shared resources like database connections or cloud services provider credentials. Controller services are <a target="_blank" href="http://www.linfo.org/daemon.html">daemons</a>. They run in the background and provide configuration, resources, and parameters for the processors to execute.</p>
<p>For example, you may use an <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/nifi-docs/components/org.apache.nifi/nifi-aws-nar/1.9.0/org.apache.nifi.processors.aws.credentials.provider.service.AWSCredentialsProviderControllerService/index.html">AWS credentials provider service</a> to make it possible for your processors to interact with S3 buckets without having to worry about the credentials at the processor level.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/myXlwFSHLAuCL2di582ctwMQ9ulz-SpS7Lcu" alt="Image" width="800" height="505" loading="lazy">
<em>An AWS credentials service provides context to two processors</em></p>
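<p>The idea of one shared service feeding several processors can be sketched as follows. This is a simplified Python model under stated assumptions, not the NiFi API: <code>CredentialsService</code>, <code>S3Processor</code>, and the placeholder key values are invented for illustration, and no real AWS call is made.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CredentialsService:
    """Toy controller service: holds credentials in one place."""
    access_key: str
    secret_key: str

    def get_credentials(self):
        return {"access_key": self.access_key, "secret_key": self.secret_key}


class S3Processor:
    """Toy processor that asks the shared service for credentials when it runs."""

    def __init__(self, name, bucket, credentials_service):
        self.name = name
        self.bucket = bucket
        self.credentials_service = credentials_service

    def describe(self):
        # No real AWS call is made; this only shows where credentials come from.
        key = self.credentials_service.get_credentials()["access_key"]
        return f"{self.name} -> s3://{self.bucket} as {key}"


# Placeholder values, not real credentials.
service = CredentialsService(access_key="EXAMPLE_KEY", secret_key="EXAMPLE_SECRET")
fetch = S3Processor("fetch", "raw-data", service)
put = S3Processor("put", "clean-data", service)

print(fetch.describe())  # fetch -> s3://raw-data as EXAMPLE_KEY
print(put.describe())    # put -> s3://clean-data as EXAMPLE_KEY
```

<p>Because both processors hold a reference to the same service, rotating the credentials means changing them in one place only.</p>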
<p>Just like with processors, a <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/components/nifi-docs/">multitude of controller services</a> is available out of the box.</p>
<p>You can check out <a target="_blank" href="https://community.hortonworks.com/articles/90259/understanding-controller-service-availability-in-a.html">this article</a> for more content on the controller services.</p>
<h3 id="heading-conclusion-and-call-to-action">Conclusion and call to action</h3>
<p>In the course of this article, we discussed NiFi, an enterprise dataflow solution. You now have a strong understanding of what NiFi does and how you can leverage its data routing features for your applications.</p>
<p>If you’re reading this, congrats! You now know more about NiFi than 99.99% of the world’s population.</p>
<p>Practice makes perfect. You now know all the concepts required to start building your own pipeline. <strong>Make it simple; make it work first.</strong></p>
<p>Here is a list of resources that I drew on, in addition to my work experience, to write this article.</p>
<h4 id="heading-resources">Resources</h4>
<h4 id="heading-the-bigger-picture">The bigger picture</h4>
<p>Because designing data pipelines in a complex ecosystem requires proficiency in multiple areas, I highly recommend the book <a target="_blank" href="https://dataintensive.net/"><em>Designing Data-Intensive Applications</em></a> by Martin Kleppmann. It covers the fundamentals.</p>
<ul>
<li>A cheat sheet with all the references quoted in Martin’s book is available on his <a target="_blank" href="https://github.com/ept/ddia-references">GitHub repo</a>.</li>
</ul>
<p>This cheat sheet is a great place to start if you already know what kind of topic you’d like to study in-depth and you want to find quality materials.</p>
<h4 id="heading-alternatives-to-apache-nifi">Alternatives to Apache NiFi</h4>
<p>Other dataflow solutions exist.</p>
<p>Open source:</p>
<ul>
<li><a target="_blank" href="https://streamsets.com/">Streamsets</a> is similar to NiFi; a good comparison is available on <a target="_blank" href="https://statsbot.co/blog/open-source-etl/">this blog</a></li>
</ul>
<p>Most existing cloud providers offer dataflow solutions. Those solutions integrate easily with the other products you use from the same provider. At the same time, they tie you firmly to a particular vendor.</p>
<ul>
<li><a target="_blank" href="https://azure.microsoft.com/en-us/services/data-factory/">Azure Data Factory</a>, a Microsoft solution</li>
<li>IBM has its <a target="_blank" href="https://www.ibm.com/us-en/marketplace/datastage">InfoSphere DataStage</a></li>
<li>Amazon offers a tool named <a target="_blank" href="https://docs.aws.amazon.com/en_us/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html">Data Pipeline</a></li>
<li>Google offers its <a target="_blank" href="https://cloud.google.com/dataflow/">Dataflow</a></li>
<li>Alibaba cloud introduces a service <a target="_blank" href="https://www.alibabacloud.com/help/doc-detail/30256.htm?spm=a2c63.p38356.b99.2.d115c242ZFQbSN">DataWorks</a> with similar features</li>
</ul>
<h4 id="heading-nifi-related-resources">NiFi related resources</h4>
<ul>
<li>The official <a target="_blank" href="https://nifi.apache.org/docs.html">NiFi documentation</a> and especially the <a target="_blank" href="https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html">NiFi In-depth</a> section are gold mines.</li>
<li>Subscribing to the NiFi users mailing list is also a great way to stay informed. For example, <a target="_blank" href="http://mail-archives.apache.org/mod_mbox/nifi-users/201604.mbox/%3CBLU436-SMTP24995D5F6EDF5985AADFE23CE680@phx.gbl%3E">this conversation</a> explains backpressure.</li>
<li>Hortonworks, a big data solutions provider, has a community website full of engaging resources and <em>how-tos</em> for Apache NiFi.<br>- <a target="_blank" href="https://community.hortonworks.com/articles/184990/dissecting-the-nifi-connection-heap-usage-and-perf.html">This article</a> goes in depth about connectors, heap usage, and backpressure.<br>- <a target="_blank" href="https://community.hortonworks.com/articles/135337/nifi-sizing-guide-deployment-best-practices.html">This one</a> shares sizing best practices for deploying a NiFi cluster.</li>
<li>The <a target="_blank" href="https://blogs.apache.org/nifi/">NiFi blog</a> distills a lot of insights into NiFi usage patterns as well as tips on how to build pipelines.</li>
<li><a target="_blank" href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/StoreInLibrary.html">Claim Check pattern</a> explained</li>
<li>The theory behind Apache NiFi is not new. SEDA, which is referenced in the NiFi documentation, is extremely relevant:<br>Matt Welsh. Berkeley. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services [online]. Retrieved: 21 Apr 2019, from <a target="_blank" href="http://www.mdw.la/papers/seda-sosp01.pdf">http://www.mdw.la/papers/seda-sosp01.pdf</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
