hadoop - freeCodeCamp.org

How to Use Google Dataproc – Example with PySpark and Jupyter Notebook

Sameer Shukla — Tue, 03 May 2022 15:14:31 +0000

In this article, I'll explain what Dataproc is and how it works.

Dataproc is a managed Google Cloud Platform service for Spark and Hadoop that helps you with big data processing, ETL, and machine learning. It provides a Hadoop cluster and supports Hadoop ecosystem tools like Flink, Hive, Presto, Pig, and Spark.

Dataproc clusters can auto-scale, and the service manages logging, monitoring, cluster creation (with the configuration of your choice), and job orchestration. You'll need to provision the cluster manually, but once it's provisioned you can submit jobs to Spark, Flink, Presto, and Hadoop.

Dataproc has implicit integration with other GCP products like Compute Engine, Cloud Storage, Bigtable, BigQuery, Cloud Monitoring, and so on. The jobs supported by Dataproc are MapReduce, Spark, PySpark, SparkSQL, SparkR, Hive and Pig.

Apart from that, Dataproc allows native integration with Jupyter Notebooks as well, which we'll cover later in this article.

In the article, we are going to cover:

Dataproc cluster types and how to set Dataproc up
How to submit a PySpark job to Dataproc
How to create a Notebook instance and execute PySpark jobs through Jupyter Notebook.

How to Create a Dataproc Cluster

Dataproc has three cluster types:

Standard
Single Node
High Availability

The Standard cluster can consist of 1 master and N worker nodes. The Single Node has only 1 master and 0 worker nodes. For production purposes, you should use the High Availability cluster which has 3 master and N worker nodes.

For our learning purposes, a Single Node cluster, which has only 1 master node, is sufficient.

Creating Dataproc clusters in GCP is straightforward. First, we'll need to enable Dataproc, and then we'll be able to create the cluster.

Start Dataproc cluster creation

When you click "Create Cluster", GCP gives you the option to select Cluster Type, Name of Cluster, Location, Auto-Scaling Options, and more.

Parameters required for Cluster

Since we've selected the Single Node Cluster option, this means that auto-scaling is disabled as the cluster consists of only 1 master node.

The Configure Nodes option allows us to select the type of machine family like Compute Optimized, GPU and General-Purpose.

In this tutorial, we'll be using the General-Purpose machine option. Through this, you can select Machine Type, Primary Disk Size, and Disk-Type options.

The Machine Type we're going to select is n1-standard-2, which has 2 vCPUs and 7.5 GB of memory. The Primary Disk size is 100 GB, which is sufficient for our demo purposes here.

Master Node Configuration

We've selected the cluster type of Single Node, which is why the configuration consists only of a master node. If you select any other Cluster Type, then you'll also need to configure the master node and worker nodes.

From the Customise Cluster option, select the default network configuration:

Use the "Scheduled Deletion" option if you know the cluster won't be needed after a certain point in time (for example, after a few minutes, hours, or days).

Schedule Deleting Setting

Here, we've set "Timeout" to be 2 hours, so the cluster will be automatically deleted after 2 hours.

We'll use the default security option which is a Google-managed encryption key. When you click "Create", it'll start creating the cluster.

You can also create the cluster with the ‘gcloud’ command, which you'll find under the ‘EQUIVALENT COMMAND LINE’ option, as shown in the image below.

And you can create a cluster using a POST request which you'll find in the ‘Equivalent REST’ option.

gcloud and REST option for Cluster creation
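As a rough illustration of what the equivalent command line contains, here is a minimal Python sketch that assembles a `gcloud dataproc clusters create` call from the settings chosen in this walkthrough. The region `us-central1` is an assumed placeholder, and mapping the console's two-hour "Timeout" to `--max-age` is also an assumption; the command the console generates for your project may include other flags.

```python
import shlex


def build_create_cluster_command(name, region, machine_type, disk_size_gb, max_age=None):
    """Assemble a `gcloud dataproc clusters create` call for a single-node cluster."""
    command = [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        "--single-node",  # 1 master node, 0 worker nodes
        f"--master-machine-type={machine_type}",
        f"--master-boot-disk-size={disk_size_gb}GB",
    ]
    if max_age is not None:
        # Assumed counterpart of the console's scheduled-deletion timeout
        command.append(f"--max-age={max_age}")
    return command


# The region is an assumed placeholder; the other values come from the walkthrough.
create_command = build_create_cluster_command(
    name="first-data-proc-cluster",
    region="us-central1",
    machine_type="n1-standard-2",
    disk_size_gb=100,
    max_age="2h",
)

# Print the command as a single shell-ready line
print(shlex.join(create_command))
```

Building the command as a list keeps each flag separate, so it could be passed to `subprocess.run` without shell quoting issues.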

After a few minutes, the cluster with 1 master node will be ready for use.

Cluster Up and Running

You can find details about the VM instances if you click on "Cluster Name":

How to Submit a PySpark Job

Let’s briefly understand how a PySpark job works before submitting one to Dataproc. The job below is a simple one: it finds the distinct elements of a list that contains duplicates.

#!/usr/bin/python

import pyspark

# Input list containing duplicate values
numbers = [1, 2, 1, 2, 3, 4, 4, 6]

# SparkContext is the entry point for talking to the Spark cluster
sc = pyspark.SparkContext()

# Create an RDD from the list using the parallelize method of SparkContext
rdd = sc.parallelize(numbers)

# Drop duplicates and return the distinct elements to the driver
distinct_numbers = rdd.distinct().collect()

print('Distinct Numbers:', distinct_numbers)
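To know what to expect in the job output, it can help to compute the same result in plain Python first. The sketch below does not use Spark at all. Note that `distinct()` does not guarantee any particular order in the collected list, so sorting both sides is the safer way to compare.

```python
numbers = [1, 2, 1, 2, 3, 4, 4, 6]

# dict.fromkeys keeps the first occurrence of each value, in input order
distinct_in_order = list(dict.fromkeys(numbers))
print(distinct_in_order)  # [1, 2, 3, 4, 6]

# A sorted copy gives a stable value to compare against the Spark job's output
expected = sorted(set(numbers))
print(expected)  # [1, 2, 3, 4, 6]
```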

Upload the .py file to the GCS bucket, and we'll need its reference while configuring the PySpark Job.

Job GCS Location

Submitting jobs in Dataproc is straightforward. You just need to select the “Submit Job” option:

Job Submission

To submit a job, you'll need to provide the Job ID (the name of the job), the region, the cluster name (in our case, "first-data-proc-cluster"), and the job type, which is PySpark here.

Parameters required for Job Submission

You can get the Python file location from the GCS bucket where the Python file is uploaded. You'll find it listed as the gsutil URI.
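The same submission can also be expressed on the command line. The following Python sketch assembles a `gcloud dataproc jobs submit pyspark` call; the bucket path `gs://example-bucket/distinct_numbers.py` and the region are hypothetical placeholders, so substitute your own gsutil URI and region.

```python
import shlex


def build_submit_pyspark_command(py_file_uri, cluster, region):
    """Assemble a `gcloud dataproc jobs submit pyspark` call."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", py_file_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


submit_command = build_submit_pyspark_command(
    py_file_uri="gs://example-bucket/distinct_numbers.py",  # hypothetical gsutil URI
    cluster="first-data-proc-cluster",
    region="us-central1",  # assumed region
)

# Print the command as a single shell-ready line
print(shlex.join(submit_command))
```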

No other additional parameters are required, and we can now submit the job:

After execution, you should be able to find the distinct numbers in the logs:

Logs

How to Create a Jupyter Notebook Instance

You can associate a notebook instance with Dataproc Hub. To do that, GCP provisions a cluster for each Notebook Instance. We can execute PySpark and SparkR types of jobs from the notebook.

To create a notebook, use the "Workbench" option like below:

Make sure you go through the usual configurations like Notebook Name, Region, Environment (Dataproc Hub), and Machine Configuration (we're using 2 vCPUs with 7.5 GB RAM). We're using the default Network settings, and in the Permission section, select the "Service account" option.

Parameters required for Notebook Cluster Creation

Click Create:

Notebook Cluster Up & Running

The "OPEN JUPYTERLAB" option allows users to specify the cluster options and zone for their notebook.

Once the provisioning is completed, the Notebook gives you a few kernel options:

Click on PySpark which will allow you to execute jobs through the Notebook.

A SparkContext instance will already be available, so you don't need to explicitly create SparkContext. Apart from that, the program remains the same.

Code snapshot on Notebook
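A notebook cell for the same job might look like the sketch below. It assumes the PySpark kernel exposes the ready-made context under the name `sc`; the helper name `distinct_numbers` is introduced here for illustration only. The guard simply skips the call when the code runs somewhere without a context.

```python
def distinct_numbers(spark_context, values):
    """Return the distinct values using an already-created SparkContext."""
    rdd = spark_context.parallelize(values)
    # Sorting only makes the printed result stable across runs
    return sorted(rdd.distinct().collect())


numbers = [1, 2, 1, 2, 3, 4, 4, 6]

# In the PySpark kernel `sc` is assumed to exist already, so no SparkContext is created here.
if "sc" in globals():
    print("Distinct Numbers:", distinct_numbers(sc, numbers))
```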

Conclusion

Working on Spark and Hadoop becomes much easier when you're using GCP Dataproc. The best part is that you can create a notebook cluster which makes development simpler.

A Quick Overview of the Apache Hadoop Framework

freeCodeCamp — Sat, 01 Feb 2020 00:00:00 +0000

Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Source: Apache Hadoop

In 2003, Google released a paper on the Google File System (GFS). It described a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled “MapReduce: Simplified Data Processing on Large Clusters.” At the time, Doug was working on the open-source web search project Nutch, and these papers inspired the distributed storage and processing components he built for it. In 2006, the year Doug joined Yahoo, those components were moved out of Apache Nutch and released as Hadoop.

Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:

Metadata from phone usage
Website logs
Credit card purchase transactions
Social media posts
Videos
Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.

At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.

Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.
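To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using word counting as the example. It is plain Python rather than Hadoop code, and the sample chunks are made up; in a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["big data big clusters", "data moves to clusters", "big jobs"]

# Each chunk is independent, so the map calls could run in parallel; here they run in turn.
word_counts = reduce_phase(shuffle(map(map_phase, chunks)))
print(word_counts)
```

Because no map call depends on another chunk, adding machines lets more chunks be processed at the same time.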

HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.
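The sketch below illustrates the chunking and replication idea in plain Python. The 128 MB block size and replication factor of 3 are Hadoop's defaults in recent releases; the server names are hypothetical, and the round-robin placement is a deliberate simplification, not the rack-aware policy HDFS actually uses.

```python
import math

BLOCK_SIZE_MB = 128      # Hadoop's default block size in recent releases
REPLICATION_FACTOR = 3   # Hadoop's default replication factor


def place_blocks(file_size_mb, servers, block_size_mb=BLOCK_SIZE_MB, replicas=REPLICATION_FACTOR):
    """Split a file into blocks and assign each block to `replicas` distinct servers."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    block_count = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(block_count):
        # Round-robin placement: a simplification of HDFS's real placement policy
        placement[block] = [servers[(block + i) % len(servers)] for i in range(replicas)]
    return placement


servers = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical server names
placement = place_blocks(file_size_mb=500, servers=servers)
for block, holders in placement.items():
    print(f"block {block}: {holders}")
```

With three copies of every block on different servers, losing any single server still leaves two copies of each block available.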

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs make up the Hadoop Ecosystem. Some programs make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use.

The Hadoop Ecosystem includes:

Apache Hive
Apache Pig
Apache HBase
Apache Phoenix
Apache Spark
Apache ZooKeeper
Cloudera Impala
Apache Flume
Apache Sqoop
Apache Oozie

More Information:

Apache Hadoop

An in-depth introduction to SQOOP architecture

freeCodeCamp — Tue, 26 Feb 2019 17:53:46 +0000

By Jayvardhan Reddy

Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data-stores such as relational databases, and vice-versa.

Image Credits: hdfstutorial.com (https://www.hdfstutorial.com/sqoop-architecture/)

As part of this blog, I will be explaining how the architecture works on executing a Sqoop command. I’ll cover details such as the jar generation via Codegen, execution of MapReduce job, and the various stages involved in running a Sqoop import/export command.

Codegen

Understanding Codegen is essential because, internally, it converts our Sqoop job into a jar containing several Java classes, such as POJO and ORM classes and a class that implements DBWritable and extends SqoopRecord, which read and write data between relational databases and Hadoop.

You can run Codegen explicitly, as shown below, to check the classes included in the jar.

sqoop codegen \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products

The output jar is written to your local file system. You will get a jar file, a Java source file, and the .class files compiled from it:

Let us see a snippet of the code that will be generated.

ORM class for table ‘products’ (the object-relational model generated for mapping):

Setter & Getter methods to get values:

Internally, the generated class uses a JDBC PreparedStatement to write records to the relational database and a ResultSet to read records from it.
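The generated class is Java, but the read and write roles can be illustrated with a small Python analogue. In this sketch, `sqlite3` stands in for the MySQL database, `read_fields` plays the part of reading one row from a ResultSet, and `write` plays the part of binding values to a PreparedStatement. The three columns and the sample row are hypothetical, chosen only for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ProductRecord:
    """Python stand-in for a generated record class; the columns are hypothetical."""
    product_id: int
    product_name: str
    product_price: float

    @classmethod
    def read_fields(cls, row):
        # Counterpart of populating the object from one ResultSet row
        return cls(row[0], row[1], row[2])

    def write(self, cursor):
        # Counterpart of binding the fields to a parameterized statement
        cursor.execute(
            "INSERT INTO products (product_id, product_name, product_price) VALUES (?, ?, ?)",
            (self.product_id, self.product_name, self.product_price),
        )


connection = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_price REAL)"
)

# Write one record, then read it back through the same class
ProductRecord(1, "Soccer Ball", 19.99).write(cursor)
row = cursor.execute("SELECT product_id, product_name, product_price FROM products").fetchone()
record = ProductRecord.read_fields(row)
print(record)
```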

Sqoop Import

It is used to import data from traditional relational databases into Hadoop.

Image Credits: dummies.com (https://www.dummies.com/programming/big-data/hadoop/hadoop-for-dummies-cheat-sheet/)

Let’s see a sample snippet for the same.

sqoop import \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products \
  --warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir \
  --delete-target-dir

The following steps take place internally when Sqoop executes the command.

Step 1: Read data from MySQL in streaming fashion. It does various operations before writing the data into HDFS.

As part of this process, Sqoop first generates code (typical MapReduce code, which is simply Java code). It then uses this Java code to run the import.

Generate the code. (Hadoop MR)
Compile the code and generate the Jar file.
Submit the Jar file and perform the import operations

During the import, it has to make certain decisions as to how to divide the data into multiple threads so that Sqoop import can be scaled.

Step 2: Understand the structure of the data and perform CodeGen

Sqoop runs a SQL statement that fetches a single record along with the column names. Using this information, it extracts the column metadata, such as names and data types.
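The idea of learning a table's structure from a single fetched row can be sketched in Python with `sqlite3` standing in for MySQL. The table layout and sample row below are hypothetical, and the query is an illustration of the technique rather than the exact statement Sqoop issues.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory stand-in for the source database
connection.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_price REAL)"
)
connection.execute("INSERT INTO products VALUES (1, 'Soccer Ball', 19.99)")

# Fetch a single row; the cursor description carries the column names
cursor = connection.execute("SELECT * FROM products LIMIT 1")
column_names = [column[0] for column in cursor.description]
sample_row = cursor.fetchone()

# The declared column types come from the table's metadata (name is field 1, type is field 2)
declared_types = {row[1]: row[2] for row in connection.execute("PRAGMA table_info(products)")}

print(column_names)
print(declared_types)
```

With the column names and types in hand, a code generator has what it needs to emit one typed field per column.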

Image Credits: cs.tut.fi (http://www.cs.tut.fi/~aaltone3/kurssit/hadoop/Sqoop_pdf.pdf)

Step 3: Create the java file, compile it and generate a jar file

As part of code generation, Sqoop needs to understand the structure of the data, and it applies that structure to the incoming data internally to make sure the data is copied correctly to the target. Each table has its own Java file describing the structure of its data.

This jar file will be injected into Sqoop binaries to apply the structure to incoming data.

Step 4: Delete the target directory if it already exists.

Step 5: Import the data

Here, it connects to a resource manager, gets the resource, and starts the application master.

To distribute the data evenly among the map tasks, Sqoop internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of that key in the table. It then divides the range between the minimum and maximum by the number of mappers and assigns one sub-range to each mapper.
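The arithmetic behind the split can be sketched as follows. This is a simplified illustration for an integer key, not Sqoop's actual splitter code; the boundary values 1 and 1345 and the column name `product_id` are hypothetical. The mapper count corresponds to Sqoop's `--num-mappers` (`-m`) option.

```python
def compute_splits(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into contiguous sub-ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    splits = []
    lower = min_id
    for mapper in range(num_mappers):
        # The first `extra` mappers take one additional key each
        size = base + (1 if mapper < extra else 0)
        if size == 0:
            continue  # fewer keys than mappers, so this mapper gets nothing
        upper = lower + size - 1
        splits.append((lower, upper))
        lower = upper + 1
    return splits


# Hypothetical boundary-query result: minimum key 1, maximum key 1345
for lower, upper in compute_splits(1, 1345):
    print(f"WHERE product_id >= {lower} AND product_id <= {upper}")
```

The sub-ranges are equal in key width, not necessarily in row count, so a table with large gaps in its key values can still produce uneven work per mapper.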

It uses 4 mappers by default:

It executes these jobs on different executors as shown below:

The default number of mappers can be changed by setting the following parameter:

So in our case, it uses 4 threads. Each thread processes a mutually exclusive subset, that is, each thread processes different data from the others.

To see the different values, check out the below:

Operations that are performed on each executor node:

If you perform a Sqoop Hive import, one extra step takes place as part of the execution.

Step 6: Copy data to the Hive table

Sqoop Export

This is used to export data from Hadoop into traditional relational databases.

Image Credits: slideshare.net (https://www.slideshare.net/gharriso/from-oracle-to-hadoop-with-sqoop-and-other-tools)

Let’s see a sample snippet for the same:

sqoop export \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_export \
  --username retail_user \
  --password ******* \
  --table product_sqoop_exp \
  --export-dir /user/jvanchir/sqoop_prac/import_table_dir/products

On executing the above command, execution steps 1–4 take place much as they do for a Sqoop import, but the source data is read from the file system (that is, HDFS). Here, Sqoop divides the data into splits based on block size, and it handles this internally.

The processing splits are done as shown below:

After connecting to the database to which the records are to be exported, Sqoop reads the data from HDFS and issues JDBC insert statements to store it in the database, as shown below.
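The read-then-insert flow can be sketched in Python, with a list of comma-delimited lines standing in for a file under the export directory and `sqlite3` standing in for the target MySQL database. The sample rows and column layout are hypothetical; only the table name `product_sqoop_exp` comes from the command above.

```python
import sqlite3

# Hypothetical comma-delimited records, standing in for a file in the export directory
hdfs_lines = ["1,Soccer Ball,19.99", "2,Running Shoes,59.5"]

connection = sqlite3.connect(":memory:")  # in-memory stand-in for the target database
connection.execute(
    "CREATE TABLE product_sqoop_exp (product_id INTEGER, product_name TEXT, product_price REAL)"
)


def parse_line(line):
    """Turn one delimited line into a typed row for the insert statement."""
    product_id, name, price = line.split(",")
    return int(product_id), name, float(price)


rows = [parse_line(line) for line in hdfs_lines]

# One parameterized INSERT is reused for every row
connection.executemany("INSERT INTO product_sqoop_exp VALUES (?, ?, ?)", rows)
connection.commit()

exported = connection.execute("SELECT COUNT(*) FROM product_sqoop_exp").fetchone()[0]
print(exported)
```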

Now that we have seen how Sqoop works internally, you can determine the flow of execution from jar generation to execution of a MapReduce task on the submission of a Sqoop job.

Note: The commands that were executed related to this post are added as part of my GIT account.

Similarly, you can also read more here:

Hive Architecture in Depth with code.
HDFS Architecture in Depth with code.

If you would like to, you can connect with me on LinkedIn - Jayvardhan Reddy.

If you enjoyed reading this article, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response.