Azure Databricks is a cloud-based data analytics platform hosted on Microsoft Azure. It helps you analyze data using Apache Spark and allows developers to create data apps more quickly. This in turn unlocks insights from all your data and helps you build Artificial Intelligence solutions.

Azure Databricks fuses the scalability and security of Microsoft's Azure platform with the power of Databricks as an end-to-end Apache Spark platform.

In this tutorial, you will learn how to get started with the platform in Microsoft Azure and see how to perform data interactions including reading, writing, and analyzing datasets.

By the end of this tutorial, you will be able to use Azure Databricks to read multiple file types, both with and without a schema.

Prerequisites

You will need a valid and active Microsoft Azure account.

  • Free Azure Trial: With this option, you will start with $200 Azure credit and will have 30 days to use it in addition to free services.
  • Azure for Students: This offer is available for students only. With this option, you will start with $100 Azure credit with no credit card required. You'll get access to popular services for free whilst you have your credit.

How to Create Your Databricks Workspace

To use Azure Databricks, you must set up an Azure Databricks workspace in your Azure subscription. To do this, navigate to the Azure portal. This will work provided you've created a valid and active Microsoft Azure account.

image-137
The Microsoft Azure Home Page

Once there, click the Create a resource button.

In the search box on the Create a resource page, search for Azure Databricks and select the Azure Databricks option.

image-138
The Microsoft Azure page showing the list of popular resources

Open the Azure Databricks tab and create an instance.

image-140
The Azure Databricks pane

Click the blue Create button (indicated by the arrow) to create an instance.

Then enter the project details before clicking the Review + create button.

image-142
The Azure Databricks configuration page

Note: Your Subscription option will likely differ from mine. It depends on the Azure subscriptions available on your account.

The Resource group is a container that holds related Azure resources. You can create a new one or use an existing one.

The Workspace name must be filled in with a globally unique name. Mine is named salim-freeCodeCamp-databricks1.

The Region option should be set to the location closest to you. A region is a set of physical data centers in a particular geographic area. Since I am based in Lagos, Nigeria, I selected South Africa North.

For the Pricing Tier option, select the Standard option that includes Apache Spark with Azure AD.

At this point, click the Review + create button. The validation process usually takes about three minutes.

When the validation and deployment processes are completed for the workspace, launch the workspace using the Launch Workspace button that appears.

image-144
The home page for the created instance of Azure Databricks - salim-freeCodeCamp-databricks

Click the button and you will automatically be signed in with Azure Active Directory single sign-on.

image-145
Signing into the workspace of the integration of Microsoft Azure and Databricks

The Microsoft Azure Databricks home page will come up in a new tab.

image-146
The Microsoft Azure Databricks home page

Create a cluster using the Create a cluster option on the left of the page.

Upon clicking that button, a list of your available clusters will come up. If, like me, you have not created any yet, the list will be empty.

image-147
The list of clusters in the Azure Databricks workspace, which is empty so far

Create a new cluster using the Create Cluster button.

image-148
Set the configurations for the Azure Databricks cluster

Select the Single node option (instead of the default Multi node option) and leave the other settings at their defaults. Then click the Create Cluster button at the bottom of the page. Creating the cluster will take a few minutes.

Note: If your dataset is large, you can explore the Multi node option instead.

After you've created the cluster, import some ready-to-use notebooks by navigating to Workspace > Users > your_account on the left taskbar.

Right-click and select the Import option on the dropdown menu.

image-150
The import button will be used to import the dataset to be used

Once you click the Import button, select the URL option and paste the following URL:

https://github.com/salimcodes/microsoft-learning-paths-databricks-notebooks/blob/master/data-engineering/DBC/03-Reading-and-writing-data-in-Azure-Databricks.dbc

image-151
The Databricks archive (DBC) file named 03-Reading-and-writing-data-in-Azure-Databricks.dbc will be used
image-152
You will see the list of files in the 03-Reading-and-writing-data-in-Azure-Databricks.dbc archive

The image above shows what the workspace will look like after importing the file. With that, you have created your Databricks workspace and imported the notebooks.

How to Read the Data in CSV Format

Open the file named Reading Data - CSV.

Upon opening the file, you will see the notebook shown below:

image-153
You will see that the cluster created earlier has not been attached.

In the top left corner, change the dropdown that initially shows Detached to your cluster's name. Mine is named Salim Oyinlola's freeCodeCamp Cluster.

image-154
The cluster created earlier is now attached to the Python notebook

With your cluster attached, you will then run all the cells one after the other.

image-261
Running the first cell of the Python notebook initializes the classroom variables and functions, mounts the dataset, and creates a user-specific database

At its core, the notebook simply reads the data in CSV format. It then adds an option telling the reader that the data contains a header row and that the header should be used to determine the column names.

You can also add an option that tells the reader to infer each column's data types (also known as a schema).
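As a rough sketch, the core of the notebook looks something like this in PySpark (the path here is a placeholder; the actual notebook mounts its own dataset, and spark is predefined in a Databricks notebook):

```python
# Minimal sketch of the notebook's core read; the path is a placeholder.
csv_path = "/mnt/training/example.csv"

df = (spark.read
      .option("header", "true")       # first row holds the column names
      .option("inferSchema", "true")  # scan the data to infer each column's type
      .csv(csv_path))

df.printSchema()
```

Keep in mind that inferSchema triggers an extra pass over the data, which is why supplying a schema yourself can be faster on large files.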

It is important to note that data can be read in other formats such as JSON (with or without a schema), Parquet, and tables and views. To achieve this, you can simply run the respective notebook for each format.
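For instance, a minimal sketch of reading JSON with a user-defined schema might look like this (the schema, field names, and path are invented purely for illustration):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema and path, purely for illustration.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Supplying the schema up front skips the extra inference pass.
json_df = spark.read.schema(schema).json("/mnt/training/example.json")
```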

How to Write Data into a Parquet File

Just as there are many ways to read data, there are many ways to write data. In this notebook, we'll take a quick look at how to write data back out to Parquet files.

Apache Parquet is a columnar storage file format used across the Hadoop ecosystem (for example, by Spark and Hive). The format is cross-platform and language independent, and it stores data in a columnar layout using a binary representation.

Parquet files, which store large datasets efficiently, have the extension .parquet.

Like what you did when reading data, you will also run the cells one after the other.

image-275
The cell to write data into a parquet file

Integral to writing to a Parquet file is creating a DataFrame, which you will do by running this cell.
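In spirit, that cell builds the DataFrame by reading the source file, roughly like this (the path and the tab separator are assumptions based on the .tsv input described below):

```python
# Rough sketch: read the tab-separated source file into a DataFrame.
# The path is a placeholder for the notebook's mounted dataset.
tsv_path = "/mnt/training/example.tsv"

df = (spark.read
      .option("sep", "\t")        # the source file is tab-separated
      .option("header", "true")   # use the first row for column names
      .csv(tsv_path))
```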

image-276
This cell shows that the existing files are being overwritten

The .mode("overwrite") method shown below means that, when writing the DataFrame to Parquet files, you replace any existing files.
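In PySpark terms, the write looks roughly like this (the output path and compression codec are assumptions, not the notebook's exact values):

```python
# Sketch of writing the DataFrame out as Parquet, replacing prior output.
out_path = "/mnt/training/output.parquet"

(df.write
   .option("compression", "snappy")  # assumed codec; snappy is a common choice
   .mode("overwrite")                # replace any files already at the path
   .parquet(out_path))

# Read the files back to verify the round trip.
parquet_df = spark.read.parquet(out_path)
```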

image-277
The file has been written and saved in an output location.

At its core, the notebook reads a .tsv file (the same file read in the CSV notebook) and writes it back out as a Parquet file.

How to Delete the Azure Databricks Instance (Optional)

Finally, the Azure resources that you created in this tutorial can incur ongoing costs. To avoid such costs, it is important to delete the resource or resource group that contains all those resources. You can do that by using the Azure portal.

  • Navigate to the Azure portal.
  • Navigate to the resource group that contains your Azure Databricks instance.
  • Select Delete resource group.
  • Type the name of the resource group in the confirmation text box.
  • Select Delete.

Conclusion

In this tutorial, you have learned the basics about reading and writing data in Azure Databricks.

You now know what Azure Databricks is, how to set it up, how to read CSV and Parquet files, and how to write Parquet files to the Databricks File System (DBFS) with compression options.

Finally, if you enjoyed this article and want to see more, I share my writing on Twitter.

Thank you for reading :)