<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ hadoop - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ hadoop - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 25 May 2026 22:38:27 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/hadoop/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use Google Dataproc – Example with PySpark and Jupyter Notebook ]]>
                </title>
                <description>
                    <![CDATA[ In this article, I'll explain what Dataproc is and how it works. Dataproc is a Google Cloud Platform managed service for Spark and Hadoop which helps you with Big Data Processing, ETL, and Machine Learning. It provides a Hadoop cluster and supports H... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-google-dataproc/</link>
                <guid isPermaLink="false">66d460f6d1ffc3d3eb89de60</guid>
                
                    <category>
                        <![CDATA[ Google Cloud Platform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hadoop ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ spark ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sameer Shukla ]]>
                </dc:creator>
                <pubDate>Tue, 03 May 2022 15:14:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/05/My-project.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, I'll explain what Dataproc is and how it works.</p>
<p>Dataproc is a managed Google Cloud Platform service for Spark and Hadoop that helps you with big data processing, ETL, and machine learning. It provides a Hadoop cluster and supports Hadoop ecosystem tools like Flink, Hive, Presto, Pig, and Spark.</p>
<p>Dataproc clusters can auto-scale, and the service manages logging, monitoring, cluster creation, and job orchestration for you. You'll need to provision the cluster manually, but once it is provisioned you can submit jobs to Spark, Flink, Presto, and Hadoop.</p>
<p>Dataproc has implicit integration with other GCP products like Compute Engine, Cloud Storage, Bigtable, BigQuery, Cloud Monitoring, and so on. The jobs supported by Dataproc are MapReduce, Spark, PySpark, SparkSQL, SparkR, Hive and Pig.</p>
<p>Apart from that, Dataproc allows native integration with Jupyter Notebooks as well, which we'll cover later in this article.</p>
<p>In the article, we are going to cover:</p>
<ol>
<li><p>Dataproc cluster types and how to set Dataproc up</p>
</li>
<li><p>How to submit a PySpark job to Dataproc</p>
</li>
<li><p>How to create a Notebook instance and execute PySpark jobs through Jupyter Notebook.</p>
</li>
</ol>
<h2 id="heading-how-to-create-a-dataproc-cluster">How to Create a Dataproc Cluster</h2>
<p>Dataproc has three cluster types:</p>
<ol>
<li><p>Standard</p>
</li>
<li><p>Single Node</p>
</li>
<li><p>High Availability</p>
</li>
</ol>
<p>The Standard cluster consists of 1 master and N worker nodes. The Single Node cluster has only 1 master and 0 worker nodes. For production purposes, you should use the High Availability cluster, which has 3 master and N worker nodes.</p>
<p>For our learning purposes, a Single Node cluster, which has only 1 master node, is sufficient.</p>
<p>Creating Dataproc clusters in GCP is straightforward. First, we'll need to enable Dataproc, and then we'll be able to create the cluster.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-185.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Start Dataproc cluster creation</em></p>
<p>When you click "Create Cluster", GCP gives you the option to select Cluster Type, Name of Cluster, Location, Auto-Scaling Options, and more.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-199.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Parameters required for Cluster</em></p>
<p>Since we've selected the Single Node Cluster option, this means that auto-scaling is disabled as the cluster consists of only 1 master node.</p>
<p>The Configure Nodes option allows us to select the type of machine family like Compute Optimized, GPU and General-Purpose.</p>
<p>In this tutorial, we'll be using the General-Purpose machine option. Through this, you can select Machine Type, Primary Disk Size, and Disk-Type options.</p>
<p>The Machine Type we're going to select is n1-standard-2, which has 2 vCPUs and 7.5 GB of memory. The Primary Disk size is 100 GB, which is sufficient for our demo purposes here.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-200.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Master Node Configuration</em></p>
<p>We've selected the cluster type of Single Node, which is why the configuration consists only of a master node. If you select any other Cluster Type, then you'll also need to configure the worker nodes in addition to the master node.</p>
<p>From the Customise Cluster option, select the default network configuration:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-201.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Use the "Scheduled Deletion" option if you want the cluster to be deleted automatically at a specified future time (for example, after a few minutes, hours, or days).</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/5_ml_resize_x2_colored_toned_light_ai-1.jpg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Scheduled Deletion Setting</em></p>
<p>Here, we've set "Timeout" to be 2 hours, so the cluster will be automatically deleted after 2 hours.</p>
<p>We'll use the default security option which is a Google-managed encryption key. When you click "Create", it'll start creating the cluster.</p>
<p>You can also create the cluster using the ‘gcloud’ command, which you'll find under the ‘EQUIVALENT COMMAND LINE’ option as shown in the image below.</p>
<p>And you can create a cluster using a POST request which you'll find in the ‘Equivalent REST’ option.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-203.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>gcloud and REST option for Cluster creation</em></p>
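<p>As a rough sketch of what the equivalent command line could look like for the Single Node setup described above, the Python snippet below assembles a <code>gcloud dataproc clusters create</code> command. The cluster name matches the one used later in this article, while the region and the mapping of the 2-hour timeout to <code>--max-age</code> are assumptions. If your console setting is an idle timeout, <code>--max-idle</code> would be the closer match.</p>
<pre><code class="lang-python">import shlex


def build_create_cluster_command(cluster_name, region, machine_type="n1-standard-2",
                                 boot_disk_size="100GB", max_age="2h"):
    """Return the gcloud arguments for creating a Single Node Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--single-node",                              # 1 master, 0 workers
        f"--master-machine-type={machine_type}",
        f"--master-boot-disk-size={boot_disk_size}",
        f"--max-age={max_age}",                       # assumed scheduled-deletion setting
    ]


# The region is an assumption; use the one you selected in the console.
create_command = build_create_cluster_command("first-data-proc-cluster", "us-central1")
print(shlex.join(create_command))
</code></pre>
<p>The snippet only prints the command so you can compare it with what the console shows before running anything.</p>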
<p>After a few minutes, the cluster with 1 master node will be ready for use.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-204.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Cluster Up and Running</em></p>
<p>You can find details about the VM instances if you click on "Cluster Name":</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-205.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-206.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-submit-a-pyspark-job">How to Submit a PySpark Job</h2>
<p>Let’s briefly look at how a PySpark job works before submitting one to Dataproc. It’s a simple job that identifies the distinct elements in a list containing duplicates.</p>
<pre><code class="lang-python"><span class="hljs-comment">#! /usr/bin/python</span>

<span class="hljs-keyword">import</span> pyspark

<span class="hljs-comment">#Create List</span>
numbers = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>]

<span class="hljs-comment">#SparkContext</span>
sc = pyspark.SparkContext()

<span class="hljs-comment"># Creating RDD using parallelize method of SparkContext</span>
rdd = sc.parallelize(numbers)

<span class="hljs-comment">#Returning distinct elements from RDD</span>
distinct_numbers = rdd.distinct().collect()

<span class="hljs-comment">#Print</span>
print(<span class="hljs-string">'Distinct Numbers:'</span>, distinct_numbers)
</code></pre>
<p>Upload the .py file to a GCS bucket. We'll need its location when configuring the PySpark job.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-21.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Job GCS Location</em></p>
<p>Submitting jobs in Dataproc is straightforward. You just need to select the “Submit Job” option:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-209.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Job Submission</em></p>
<p>To submit a job, you'll need to provide the Job ID (the name of the job), the region, the cluster name (here, "first-data-proc-cluster"), and the job type, which is PySpark.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-223.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Parameters required for Job Submission</em></p>
<p>You can get the Python file location from the GCS bucket where the Python file is uploaded. You'll find it under gsutil URI.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-24.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>No additional parameters are required, so we can now submit the job:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-224.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>After execution, you should be able to find the distinct numbers in the logs:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-213.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Logs</em></p>
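<p>If you prefer the command line over the console form, the same submission can be sketched with <code>gcloud dataproc jobs submit pyspark</code>. The snippet below only builds and prints the command. The bucket path, file name, and region are hypothetical placeholders, so replace them with the gsutil URI and region you actually used.</p>
<pre><code class="lang-python">import shlex


def build_submit_command(script_uri, cluster_name, region):
    """Return the gcloud arguments that submit a PySpark script to a cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster_name}",
        f"--region={region}",
    ]


submit_command = build_submit_command(
    "gs://example-bucket/distinct_numbers.py",  # hypothetical gsutil URI
    "first-data-proc-cluster",
    "us-central1",                              # assumed region
)
print(shlex.join(submit_command))
</code></pre>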
<h2 id="heading-how-to-create-a-jupyter-notebook-instance">How to Create a Jupyter Notebook Instance</h2>
<p>You can associate a notebook instance with Dataproc Hub. To do that, GCP provisions a cluster for each Notebook Instance. We can execute PySpark and SparkR types of jobs from the notebook.</p>
<p>To create a notebook, use the "Workbench" option like below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-26.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Make sure you go through the usual configurations like Notebook Name, Region, Environment (Dataproc Hub), and Machine Configuration (we're using 2 vCPUs with 7.5 GB RAM). We're using the default Network settings, and in the Permission section, select the "Service account" option.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-225.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Parameters required for Notebook Cluster Creation</em></p>
<p>Click Create:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-216.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Notebook Cluster Up &amp; Running</em></p>
<p>The "OPEN JUPYTERLAB" option allows users to specify the cluster options and zone for their notebook.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-226.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-227.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Once the provisioning is completed, the Notebook gives you a few kernel options:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/image-27.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Click on PySpark which will allow you to execute jobs through the Notebook.</p>
<p>A SparkContext instance will already be available, so you don't need to explicitly create SparkContext. Apart from that, the program remains the same.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/image-220.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Code snapshot on Notebook</em></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Working on Spark and Hadoop becomes much easier when you're using GCP Dataproc. The best part is that you can create a notebook cluster which makes development simpler.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A Quick Overview of the Apache Hadoop Framework ]]>
                </title>
                <description>
                    <![CDATA[ Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stu... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-quick-overview-of-the-apache-hadoop-framework/</link>
                <guid isPermaLink="false">66c3431d4f1fc448a3678f81</guid>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hadoop ]]>
                    </category>
                
                    <category>
                        <![CDATA[ toothbrush ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sat, 01 Feb 2020 00:00:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9d24740569d1a4ca3622.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/04/1200px-Hadoop_logo_new.svg.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-what-is-apache-hadoop">What is Apache Hadoop?</h2>
<blockquote>
<p>The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.  </p>
<p>Source: <a target="_blank" href="https://hadoop.apache.org/">Apache Hadoop</a></p>
</blockquote>
<p>In 2003, Google released their paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled “MapReduce: Simplified Data Processing on Large Clusters.” At the time, Doug was working at Yahoo. These papers were the inspiration for his open-source project Apache Nutch. In 2006, the project components then known as Hadoop moved out of Apache Nutch and were released.</p>
<h2 id="heading-why-is-hadoop-useful">Why is Hadoop useful?</h2>
<p>Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:</p>
<ul>
<li>Metadata from phone usage</li>
<li>Website logs</li>
<li>Credit card purchase transactions</li>
<li>Social media posts</li>
<li>Videos</li>
<li>Information gathered from medical devices</li>
</ul>
<p>“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.</p>
<p>At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.</p>
<h3 id="heading-core-hadoop"><strong>Core Hadoop</strong></h3>
<p>Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.</p>
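<p>To make the idea of processing chunks in parallel more concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the thread pool stands in for cluster nodes, and the sample lines and function names are made up for illustration. It follows the usual map, shuffle, and reduce phases.</p>
<pre><code class="lang-python">from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data big clusters", "data moves to clusters", "big ideas"]
chunks = [[line] for line in lines]  # one chunk per line, for illustration

# Each chunk is mapped independently, so the work can run side by side.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_chunk, chunks))

word_counts = reduce_counts(shuffle(mapped))
print(word_counts)
</code></pre>
<p>Because each chunk is mapped without looking at the others, adding more workers lets more chunks be processed at the same time.</p>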
<p>HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.</p>
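<p>The sketch below illustrates the chunking and replication idea with a toy placement plan. The 128 MB block size and replication factor of 3 match common HDFS defaults, but the round-robin placement and node names are simplifying assumptions. Real HDFS uses a rack-aware placement policy.</p>
<pre><code class="lang-python">import math


def plan_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Map each block index to the list of nodes that hold a copy of it."""
    block_count = math.ceil(file_size_mb / block_size_mb)
    copies = min(replication, len(nodes))  # cannot place more copies than nodes
    return {
        block: [nodes[(block + offset) % len(nodes)] for offset in range(copies)]
        for block in range(block_count)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical servers
block_plan = plan_blocks(300, nodes)               # a 300 MB file

for block, holders in block_plan.items():
    print(block, holders)
</code></pre>
<p>In this plan, losing any single node still leaves two copies of every block on the remaining nodes.</p>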
<h3 id="heading-hadoop-ecosystem"><strong>Hadoop Ecosystem</strong></h3>
<p>Many other software packages exist to complement Hadoop. These programs comprise the Hadoop Ecosystem. Some programs make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use.</p>
<p>The Hadoop Ecosystem includes:</p>
<ul>
<li>Apache Hive</li>
<li>Apache Pig</li>
<li>Apache HBase</li>
<li>Apache Phoenix</li>
<li>Apache Spark</li>
<li>Apache ZooKeeper</li>
<li>Cloudera Impala</li>
<li>Apache Flume</li>
<li>Apache Sqoop</li>
<li>Apache Oozie</li>
</ul>
<h2 id="heading-more-information">More Information:</h2>
<ul>
<li><a target="_blank" href="http://hadoop.apache.org/">Apache Hadoop</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ An in-depth introduction to SQOOP architecture ]]>
                </title>
                <description>
                    <![CDATA[ By Jayvardhan Reddy Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data-stores such as relational databases, and vice-versa. _Image Credits: [hdfstutorial.com](https://www.h... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/an-in-depth-introduction-to-sqoop-architecture-ad4ae0532583/</link>
                <guid isPermaLink="false">66c343e4ccd54aa295e92c8a</guid>
                
                    <category>
                        <![CDATA[ architecture ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hadoop ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 26 Feb 2019 17:53:46 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*3aWPwVLlbZ8sq4aboE_CQw.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jayvardhan Reddy</p>
<p><strong>Apache Sqoop</strong> is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data-stores such as relational databases, and vice-versa.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/yA5Wt8JEHKyIDA-bK2ehlgYGN03XXPgKFmdz" alt="Image" width="585" height="271" loading="lazy">
<em>Image Credits: <a target="_blank" href="https://www.hdfstutorial.com/sqoop-architecture/">hdfstutorial.com</a></em></p>
<p>In this post, I'll explain how the architecture works when you execute a Sqoop command. I’ll cover details such as jar generation via Codegen, execution of the MapReduce job, and the various stages involved in running a Sqoop import/export command.</p>
<h3 id="heading-codegen"><strong>Codegen</strong></h3>
<p>Understanding Codegen is essential, because internally it converts our Sqoop job into a jar consisting of several Java classes, such as a POJO, an ORM class, and a class that implements DBWritable and extends SqoopRecord, to read and write data between relational databases and Hadoop.</p>
<p>You can run Codegen explicitly, as shown below, to inspect the classes in the jar.</p>
<pre><code>sqoop codegen \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products
</code></pre><p>The output jar will be written to your local file system. You will get a jar file, a Java source file, and the compiled .class files:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9584V87MKrbpriG-qjgfkRmU95O4DqMWBMpw" alt="Image" width="800" height="146" loading="lazy"></p>
<p>Let us see a snippet of the code that will be generated.</p>
<p>ORM class for the table ‘products’ (the object-relational model generated for mapping):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R-YKp7vBHJyG0U9fkHmUrbDnie-O2-3G8PwV" alt="Image" width="800" height="229" loading="lazy"></p>
<p>Setter and getter methods to set and get values:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/do50-AIfmsBnWa0UmDElJu8lfJMl0lgATWtk" alt="Image" width="734" height="196" loading="lazy"></p>
<p>Internally, it uses a JDBC PreparedStatement to write records to the database and a ResultSet to read records from the database.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dp1Oud1aHWZ6zPFRmDhTf5p4KUuPrYIUg4K6" alt="Image" width="800" height="251" loading="lazy"></p>
<h3 id="heading-sqoop-import"><strong>Sqoop Import</strong></h3>
<p>It is used to import data from traditional relational databases into Hadoop.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1fuDCRMH99ZB3HA496qQycCcGQFzs7c6kdNl" alt="Image" width="535" height="346" loading="lazy">
<em>Image Credits: <a target="_blank" href="https://www.dummies.com/programming/big-data/hadoop/hadoop-for-dummies-cheat-sheet/">dummies.com</a></em></p>
<p>Let’s see a sample snippet for the same.</p>
<pre><code>sqoop import \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products \
  --warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir \
  --delete-target-dir
</code></pre><p>The following steps take place internally during the execution of Sqoop.</p>
<p><strong>Step 1</strong>: Read data from MySQL in streaming fashion. It does various operations before writing the data into HDFS.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/FSvnl854UDyo8C9QKIOO0aMLNBw5uWed-KEJ" alt="Image" width="687" height="24" loading="lazy"></p>
<p>As part of this process, it first generates code (typical MapReduce code), which is plain Java code. It then uses this Java code to run the import.</p>
<ul>
<li>Generate the code. (Hadoop MR)</li>
<li>Compile the code and generate the Jar file.</li>
<li>Submit the Jar file and perform the import operations</li>
</ul>
<p>During the import, it has to make certain decisions as to how to divide the data into multiple threads so that Sqoop import can be scaled.</p>
<p><strong>Step 2</strong>: Understand the structure of the data and perform CodeGen</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/iw8VSQhwmd4uvmqwN0MxCG15xrFBPGyRZdXy" alt="Image" width="800" height="45" loading="lazy"></p>
<p>Using the above SQL statement, it will fetch one record along with the column names. Using this information, it will extract the column metadata, such as names and data types.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rEfjXBnXyMjmyvtcIub-cxby3LS31vpFCFyt" alt="Image" width="476" height="418" loading="lazy">
<em>Image Credits: <a target="_blank" href="http://www.cs.tut.fi/~aaltone3/kurssit/hadoop/Sqoop_pdf.pdf">cs.tut.fi</a></em></p>
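<p>To illustrate the idea of learning a table's structure from a single fetched record, here is a small sketch that uses Python's built-in <code>sqlite3</code> module as a stand-in for MySQL. The table layout and sample row are made up for illustration, and Sqoop itself does this through JDBC metadata rather than the calls shown here.</p>
<pre><code class="lang-python">import sqlite3

# In-memory database standing in for the source RDBMS (assumption for this sketch).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_price REAL)"
)
connection.execute("INSERT INTO products VALUES (1, 'Sample product', 59.98)")

# Fetch a single record; the cursor description carries the column names.
cursor = connection.execute("SELECT * FROM products LIMIT 1")
column_names = [column[0] for column in cursor.description]

# SQLite exposes declared types through PRAGMA table_info; other databases
# report them through their own driver metadata.
column_types = {row[1]: row[2] for row in connection.execute("PRAGMA table_info(products)")}

print(column_names)
print(column_types)
</code></pre>
<p>With the column names and types in hand, a code generator has what it needs to emit one field per column.</p>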
<p><strong>Step 3</strong>: Create the java file, compile it and generate a jar file</p>
<p>As part of code generation, Sqoop needs to understand the structure of the data and apply that structure to the incoming data internally, to make sure the data is copied correctly to the target. Each table has one Java file describing the structure of its data.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IVi4qXeQV0wHLso3jw-YSNt4Qdq1jz5WOSSQ" alt="Image" width="800" height="39" loading="lazy"></p>
<p>This jar file will be injected into Sqoop binaries to apply the structure to incoming data.</p>
<p><strong>Step 4</strong>: Delete the target directory if it already exists.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/PnNCCNdcFYG63ckjOdNQz9sLwHwp-xQhA6mh" alt="Image" width="800" height="22" loading="lazy"></p>
<p><strong>Step 5</strong>: Import the data</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/L0xKeU6eZzzFNXq9GTUizLA9daPHdicLyKcm" alt="Image" width="800" height="60" loading="lazy"></p>
<p>Here, it connects to a resource manager, gets the resource, and starts the application master.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0k04I6Df7Ox1UGcxyOEqh-WENTYZtboAPfAH" alt="Image" width="719" height="22" loading="lazy"></p>
<p>To distribute the data evenly among the map tasks, it internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of that key in the table. It then divides the range between the minimum and maximum by the number of mappers and assigns one slice to each mapper.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ixpOtqkpYybBmnTLp1o9vsvkG5Z22ybCYWMB" alt="Image" width="800" height="92" loading="lazy"></p>
<p>It uses 4 mappers by default:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/SvulfY8XlKP3-Th9pY7nLI0RBaWZs4spjFWv" alt="Image" width="800" height="284" loading="lazy"></p>
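<p>The splitting logic can be sketched roughly as follows. This is a simplified illustration and not Sqoop's actual splitter: the key range (1 to 1345), the column name <code>product_id</code>, and the <code>BETWEEN</code> form of the conditions are assumptions chosen to keep the example readable.</p>
<pre><code class="lang-python">def split_ranges(min_key, max_key, num_mappers=4):
    """Divide the inclusive key range into contiguous, non-overlapping slices."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_mappers)
    # The first `extra` mappers take one additional key each.
    sizes = [base + 1] * extra + [base] * (num_mappers - extra)

    ranges = []
    start = min_key
    for size in sizes:
        if size == 0:
            continue  # more mappers than keys
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Pretend the boundary query returned MIN = 1 and MAX = 1345 (hypothetical values).
ranges = split_ranges(1, 1345)
conditions = [f"WHERE product_id BETWEEN {low} AND {high}" for low, high in ranges]

for condition in conditions:
    print(condition)
</code></pre>
<p>With these inputs, the four mappers receive the ranges 1 to 337, 338 to 673, 674 to 1009, and 1010 to 1345, so no two mappers read the same row.</p>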
<p>It executes these jobs on different executors as shown below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4doX1MPcDsGOirBF0qyTlaEZCUEvZfiCqg1w" alt="Image" width="800" height="140" loading="lazy"></p>
<p>The default number of mappers can be changed by setting the following parameter:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/J4gGRZO4nsjSvBqfH8yopHaCgYyodWutLGLl" alt="Image" width="739" height="26" loading="lazy"></p>
<p>So in our case, it uses 4 threads. Each thread processes a mutually exclusive subset, that is, each thread processes different data from the others.</p>
<p>The different values are shown below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/bRNZNgynB99qUWQVotlG0PCM7UYUYMzatqE1" alt="Image" width="800" height="338" loading="lazy"></p>
<p>Operations that are performed on each executor node:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Q6V3RYKFJ56mlEPTX5VTPGQqBYWpdlSGBoXW" alt="Image" width="800" height="153" loading="lazy"></p>
<p>If you perform a Sqoop Hive import, one extra step takes place as part of the execution.</p>
<p><strong>Step 6</strong>: Copy data to the Hive table</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/TRcmgwhHAQy2SutU-R13R53ejFPJ2j2JsB7R" alt="Image" width="800" height="126" loading="lazy"></p>
<h3 id="heading-sqoop-export"><strong>Sqoop Export</strong></h3>
<p>This is used to export data from Hadoop into traditional relational databases.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/s1lKtokuWsuEqmHsb92--czqDaFuQKd8Dvtm" alt="Image" width="800" height="337" loading="lazy">
<em>Image Credits: <a target="_blank" href="https://www.slideshare.net/gharriso/from-oracle-to-hadoop-with-sqoop-and-other-tools">slideshare.net</a></em></p>
<p>Let’s see a sample snippet for the same:</p>
<pre><code>sqoop export \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_export \
  --username retail_user \
  --password ******* \
  --table product_sqoop_exp \
  --export-dir /user/jvanchir/sqoop_prac/import_table_dir/products
</code></pre><p>On executing the above command, execution steps similar to steps 1–4 of the Sqoop import take place, but the source data is read from the file system (that is, HDFS). Here, the data is divided using boundaries based on block size, and this is taken care of internally by Sqoop.</p>
<p>The processing splits are done as shown below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/pFCifYgZx8KRMRCVxfJdk7HOigxDTZOX5UQz" alt="Image" width="800" height="67" loading="lazy"></p>
<p>After connecting to the database to which the records are to be exported, it issues JDBC insert statements to store the data read from HDFS in the database, as shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dWB1TZmEH07zlJ3TOKnVZm1dvzVvbEKOIa5c" alt="Image" width="737" height="26" loading="lazy"></p>
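<p>As a rough analogy for this step, the sketch below reads comma-delimited records (standing in for lines of an HDFS file) and writes them to a table with a parameterized insert. It uses Python's built-in <code>sqlite3</code> module instead of MySQL and JDBC, and the column layout and sample records are hypothetical.</p>
<pre><code class="lang-python">import sqlite3

# Lines as they might appear in the exported HDFS files (made-up sample data).
hdfs_lines = ["1,Laptop,899.99", "2,Mouse,19.50", "3,Monitor,149.00"]

# In-memory database standing in for the target RDBMS.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE product_sqoop_exp (product_id INTEGER, name TEXT, price REAL)"
)

# Parse each delimited line into a tuple of column values.
records = [tuple(line.split(",")) for line in hdfs_lines]

# One parameterized statement is reused for every record, similar in spirit
# to a prepared insert statement.
connection.executemany("INSERT INTO product_sqoop_exp VALUES (?, ?, ?)", records)
connection.commit()

exported_rows = connection.execute("SELECT COUNT(*) FROM product_sqoop_exp").fetchone()[0]
print(exported_rows)
</code></pre>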
<p>Now that we have seen how Sqoop works internally, you can determine the flow of execution from jar generation to execution of a MapReduce task on the submission of a Sqoop job.</p>
<p><strong>Note:</strong> The commands executed for this post are available in my <a target="_blank" href="https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture">GitHub</a> repository.</p>
<p>Similarly, you can also read more here:</p>
<ul>
<li><a target="_blank" href="https://medium.com/plumbersofdatascience/hive-architecture-in-depth-ba44e8946cbc">Hive Architecture in Depth</a> with <strong>code</strong>.</li>
<li><a target="_blank" href="https://medium.com/plumbersofdatascience/hdfs-architecture-in-depth-1edb822b95fa">HDFS Architecture in Depth</a> with <strong>code</strong>.</li>
</ul>
<p>If you would like to, you can connect with me on LinkedIn - <a target="_blank" href="https://www.linkedin.com/in/jayvardhan-reddy-vanchireddy">Jayvardhan Reddy</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
