big data - freeCodeCamp.org

How to Read and Write Deeply Partitioned Files Using Apache Spark

Arun Shanmugam Kumar — Sun, 31 Aug 2025 21:23:23 +0000

If you use Apache Spark to write your data pipeline, you might need to export or copy data from a source to a destination while preserving the partition folders between the two.

When I researched online on how to do this in Spark, I found very few tutorials giving an end-to-end solution that worked – especially when the partitions are deeply nested and you don't know beforehand the values these folder names will take (for example year=*/month=*/day=*/hour=*/*.csv).

In this tutorial, I have provided one such implementation using Spark.

Prerequisite

To follow along with this tutorial, you need a basic understanding of distributed computing using frameworks like Hadoop and Spark, as well as of code written in object-oriented languages like Scala or Java. The code was tested with the dependencies below:

Scala 2.12+
Java 17 (earlier versions might work)
sbt

Setup

I’m assuming you have partition folders that are created at the source with the below pattern (which is a standard partition column involving date-time):

year/month/day/hour

Crucially, as I mentioned above, I’m assuming that you don’t know the full name of the folders – except that they have some constant prefix pattern in them.

False Starts

If you think of using the recursiveFileLookup and pathGlobFilter options for both reading and writing, it doesn't quite work, as these options are only available on the read API.
If you think of parameterizing the reading and writing over every possible year/month/day/hour combination, skipping the export when the corresponding partition folder is not found, it might work but won't be very efficient.

My Solution

After some trial and error, and some searching on Stack Overflow and in the Spark documentation, I hit upon the idea of combining input_file_name() and regexp_extract() with the partitionBy() API on the write side to achieve the end goal. You can find Scala-based sample code below:

package main.scala.blog

/**
*  Spark batch example code to read from a partitioned folder and write
*  to a partitioned folder without knowing the datetime values in advance.
*/

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

object PartitionedReaderWriter {

    def main(args: Array[String]): Unit = {
        // 1. Create (or reuse) the Spark session.
        val spark = SparkSession
                    .builder
                    .appName("PartitionedReaderWriterApp")
                    .getOrCreate()

        val sourceBasePath = "data/partitioned_files_source/user"
        // 2. Recursively read every CSV file under the source base path.
        val sourceDf = spark.read
                            .format("csv")
                            .schema("State STRING, Color STRING, Count INT")
                            .option("header", "true")
                            .option("pathGlobFilter", "*.csv")
                            .option("recursiveFileLookup", "true")
                            .load(sourceBasePath)

        val destinationBasePath = "data/partitioned_files_destination/user"
        // 3. Recover the partition values from the path of the file each row came from.
        val writeDf = sourceDf
                        .withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
                        .withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
                        .withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1))
                        .withColumn("hour", regexp_extract(input_file_name(), "hour=(\\d{2})", 1))

        // 4. Write the data back out, recreating the partition folders at the destination.
        writeDf.write
                .format("csv")
                .option("header", "true")
                .mode("overwrite")
                .partitionBy("year", "month", "day", "hour")
                .save(destinationBasePath)

        // 5. Stop the Spark session.
        spark.stop()
    }
}

Here’s what’s going on in the above code:

1. Inside the main method, you begin by adding the Spark initialization code that creates a Spark session.
2. You read the data from sourceBasePath using the spark.read API with the format set to csv (you can also optionally provide the schema). The recursiveFileLookup and pathGlobFilter options are needed to read recursively through the nested folders and to match any CSV file, respectively.
3. The next section contains the core logic: you use input_file_name() to get the full path of the file each row came from and regexp_extract() to extract year, month, day, and hour from the corresponding subfolders in that path, storing them as auxiliary columns on the dataframe.
4. Finally, you write the dataframe using the csv format again and, crucially, use partitionBy to specify the previously created auxiliary columns as partition columns. Then you save the dataframe to destinationBasePath.
5. After the copy is done, you stop the Spark session by calling the stop() API.
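
If you want to check the regular expressions before running the Spark job, here is a minimal sketch in plain Java (no Spark involved) that applies the same four patterns to a single path string. The class name, the helper methods, and the sample path are assumptions made up for illustration. The empty-string fallback mirrors what regexp_extract() returns when nothing matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of the Spark job above.
final class PartitionPathParser {

    // Same partition columns as the Spark job.
    private static final String[] COLUMNS = {"year", "month", "day", "hour"};

    // Returns the first capture group, or "" when the pattern does not match.
    static String extract(String path, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(path);
        return matcher.find() ? matcher.group(1) : "";
    }

    // Builds the column -> value map for one file path.
    static Map<String, String> partitionValues(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String column : COLUMNS) {
            int digits = column.equals("year") ? 4 : 2;
            values.put(column, extract(path, column + "=(\\d{" + digits + "})"));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical file path in the layout this tutorial assumes.
        String path = "file:///data/partitioned_files_source/user/"
                + "year=2025/month=08/day=31/hour=21/part-0000.csv";
        // prints {year=2025, month=08, day=31, hour=21}
        System.out.println(partitionValues(path));
    }
}
```

A path that lacks one of the folders yields an empty string for that column, so it is worth checking a few real paths from your source before relying on the patterns.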

Conclusion

In this article, I have shown you how to export or copy deeply nested data files from a source to a destination using Apache Spark in an efficient way. I hope you find it useful!

You can read my other articles at https://www.beyonddream.me.

Data-Driven Reality – Exploring the Power of AI, ML, Virtual and Augmented Reality

David Clinton — Tue, 21 Feb 2023 22:57:20 +0000

By now it's no secret that digital data is generated by the truckload and that it can be worth its weight in gold.

But that knowledge isn't half as important as understanding how you can tame the data beast and then wring out every drop of its value.

Naturally, creative and resourceful people in one place or another are always finding new processes and applications that'll make better use of their data. So we'll explore some of today's dominant data utilization trends and leave predicting tomorrow's technology for the pundits.

This chapter was taken from my book, Keeping Up: Backgrounders to All the Big Technology Trends You Can't Afford to Ignore. If you'd prefer to watch this chapter as a video, feel free to follow along here:

What Exactly Is Data?

Before priming our understanding of what's available to help us work productively with data, it's a good idea to first define exactly what data is.

Sure, we saw plenty of great individual examples in my previous article on Managing Data Storage, including the huge volumes of performance and status information produced by digital components of complex systems like cars. But that's not the same as a definition.

So then let's define it. Data, for our purposes, is any digital information that is generated by, or used for, your compute operations. That will include log messages produced by a compute device, weather information relayed through remote sensors to a server, digital imaging files (like CT, tomography and ultrasound scans), and the numbers you enter into a spreadsheet. And everything in between.

Which brings us to big data – another one of those buzz phrases that get thrown around a lot, often without accompanying context or clarity.

At first glance, you'd probably figure that big data describes data sets that come in volumes higher than traditional data software and hardware solutions are capable of handling.

Indeed, your way of figuring it would be largely correct. Although we could add one or two secondary characteristics. The complexity of a data set, for instance, is also something that could force you to consider big data solutions. And sets of data that must be consumed and analyzed while in motion (streaming data) are also often better addressed using big data tools.

It's worth mentioning that big data workloads will often seek to solve large scale predictive analytics or behavior analytics problems. Such problems are common within domains like healthcare, Internet of Things (IoT), and information technology.

With that out of the way, we can now get to work understanding how – and why – all that data is being used.

Virtual Reality and Augmented Reality

What? Plain old reality suddenly not good enough for you?

Well yes, in some cases, plain old reality really isn't good enough. At least if you have a strong interest in engaging in experiences that are difficult or impossible under normal conditions.

A virtual reality (VR) device lets you immerse yourself in a non-existent environment.

The most common examples of currently available VR technology feature some kind of headset that projects visual images in front of your eyes while tracking your head movements and, in some cases, the way you're moving other parts of your body. The visual images will adapt to your physical movements, giving you the sensation that you're actually within and manipulating the virtual projection.

VR has potential applications in educational, healthcare, research, and military fields. The ability to simulate distant, prohibitively expensive, or theoretical environments can make training more realistic and immediate than would be otherwise possible.

VR technologies have been arriving – and then disappearing – for decades already. For the most part, they've focused on providing immersive gaming and entertainment environments. But they've never really caught on in a big way beyond the niche product level.

This might be partly due to high prices, and because some people experienced forms of motion sickness and disorientation.

But maybe – just maybe – (insert the current year here) will finally be the year VR hits the big time.

But where VR can leverage data in a really meaningful way is when, rather than blocking out your physical surroundings, the virtual environment is overlaid on top of your actual field of vision.

Imagine you're a technician working on electrical switching hardware under a sidewalk. You're wearing goggles that let you see the equipment in front of you, but that also project text and icons clearly identifying labels for each part and that show you where a replacement part should go and how it's connected. This is augmented reality.

I'm sure you can easily imagine how powerful this kind of dynamic display could be in the right conditions.

Surgeons are able to access a patient's history or even consult relevant medical literature without having to divert their eyes from the operation. Military pilots can similarly enjoy "heads up" displays that show them timely status reports describing their own aircraft and broader air traffic conditions without distraction.

Artificial Intelligence and Machine Learning

As a rule, computers are even better at performing dull, repetitive tasks over and over again than bored teenagers pretending to do homework. And they make less noise in the process.

The trick with computers is to cleverly string lots of dull, repetitive tasks together so that they can approximate intelligent and useful behavior.

The prize at the end of that road is called automation. Or, in other words, a state where computers can be confidently left alone to perform complex and useful tasks without supervision.

In many ways, we've been living in an age of sophisticated computer automation for decades. Domains as diverse as security monitoring, urban traffic control, book manufacturing, and heavy industry are already being handled with little or no human supervision.

But artificial intelligence (AI) seeks to go beyond relatively simple repetition to train computers to think for themselves – and thereby efficiently solve far more difficult problems.

Great idea. Somewhat harder to achieve in the real world.

What can AI actually do?

Understanding how effective AI can be will depend on what you expect it to do. Can you design software to search for and flag a handful of suspicious financial transactions from among the millions of credit card transactions a large bank processes? Yes. (Although I'm not quite sure that's truly AI at work and not just automation.)

Can you deploy "intelligent" chatbots on your website to help customers solve their problems without needing actual (and expensive) human interaction? Yes. In fact, I just had a surprisingly effective conversation with my mobile phone carrier's chatbot that did quickly solve my problem.

Can the first stages of a rocket you've just used to launch a payload into space use AI to guide it to a safe landing on a moving platform in the middle of the ocean? If you'd asked me, I'd have said it's impossible. But SpaceX went ahead anyway and did it multiple times. Good thing they didn't ask me.

But can AI reliably make strategic decisions that intelligently account for all the many moving parts and complexity that exist in your industry? Can an AI-powered machine pass the Turing test (where a human evaluator is unable to be sure whether the machine is also human)? Perhaps not just yet. And perhaps never.

One tool used in many AI processes is the neural network. The original neural network consists of the many neurons that carry information about the state of a biological environment to the brain.

Artificial and virtual neural networks are systems for assessing, processing, and responding to the large physical or virtual data sets that feed AI-controlled systems.

Such data can come from cameras or other physical sensors, or from multiple data sources. The processed data can sometimes be used for predictive modeling, where the likelihoods of future outcomes are compared.

Exciting stuff, to be sure. But the tools used for some of the most significant accomplishments attributed to artificial intelligence aren't actually artificial. Nor did they necessarily require all that much intelligence.

For example, Amazon Mechanical Turk (MTurk) is a service that connects client companies with remote freelancing "human intelligence" workers. The workers will, for what usually amounts to dreadfully low pay, perform "mechanical" tasks like labelling the content of hundreds or thousands of images. The labelling will cover areas like "is the subject a male or female?" or "is the subject a car or a bus?"

It could be that, over time, services like Mechanical Turk will become less important as improving AI methodologies might one day completely replace the human element for this kind of work. But in the meantime, MTurk and its competitors are still steaming along at full speed, churning out millions of units of "artificial" artificial intelligence.

One methodology that can help reduce reliance on human intervention is machine learning (ML).

How can machine learning help?

ML works by leveraging various kinds of manual assistance to help achieve greater task automation. An ML system can hopefully "learn" how to manage our tasks by being exposed to existing training data. Only once the system has demonstrated sufficient skill at solving the problems you have for it will it be let loose on "real world" data.

These are some common approaches to training your ML system:

Supervised learning lets the ML software read data sets that include both "problems" (images, for example) and their "solutions" (full labels). By seeing enough of the provided examples, the system should be able to apply its experience to similar problems that arrive without solutions.
Unsupervised learning simply throws raw data without any associated solutions at the system. The goal is for the software to recognize enough patterns in the data to allow it to solve the problems on its own.
Reinforcement learning learns from interactions with its environment. Ideally, the software recognizes and understands positive results and evolves its methodology to reliably and consistently produce similar results.
Deep learning algorithms apply multiple layers of analysis to transform the raw target data. The chain of transformations from input to output is known as the credit assignment path (CAP), and deep learning systems are those with a substantial CAP depth.

AI in general, and ML in particular, are effective at building tools for tasks like autonomous driving, drug discovery, email filtering, and speech recognition, and for deriving sentiment analysis from massive data sets made up of human communications.

What AI and ML share in common with all the other technologies like virtual reality and augmented reality that we've discussed here – and in that other "How to Manage Data Storage" article – is the need to control and make better sense of the endless streams of information our digital products keep generating. The better we get at this kind of control, the more value we'll get from our data.

YouTube videos of all ten chapters from this book are available here. Lots more tech goodness - in the form of books, courses, and articles - can be had here. And consider taking my AWS, security, and container technology courses here.

The Apache Kafka Handbook – How to Get Started Using Kafka

Gerard Hynes — Fri, 03 Feb 2023 23:48:22 +0000

Apache Kafka is an open-source event streaming platform that can transport huge volumes of data at very low latency.

Companies like LinkedIn, Uber, and Netflix use Kafka to process trillions of events and petabytes of data each day.

Kafka was originally developed at LinkedIn, to help handle their real-time data feeds. It's now maintained by the Apache Software Foundation, and is widely adopted in industry (being used by 80% of Fortune 100 companies).

Why Should You Learn Apache Kafka?

Kafka lets you:

Publish and subscribe to streams of events
Store streams of events in the same order they happened
Process streams of events in real time

The main thing Kafka does is help you efficiently connect diverse data sources with the many different systems that might need to use that data.

Kafka helps you connect data sources to the systems using that data

Some of the things you can use Kafka for include:

Personalizing recommendations for customers
Notifying passengers of flight delays
Payment processing in banking
Online fraud detection
Managing inventory and supply chains
Tracking order shipments
Collecting telemetry data from Internet of Things (IoT) devices

What all these uses have in common is that they need to take in and process data in real time, often at huge scales. This is something Kafka excels at. To give one example, Pinterest uses Kafka to handle up to 40 million events per second.

Kafka is distributed, which means it runs as a cluster of nodes spread across multiple servers. It's also replicated, meaning that data is copied in multiple locations to protect it from a single point of failure. This makes Kafka both scalable and fault-tolerant.

Kafka is also fast. It's optimized for high throughput, making effective use of disk storage and batched network requests.

This article will:

Introduce you to the core concepts behind Kafka
Show you how to install Kafka on your own computer
Get you started with the Kafka Command Line Interface (CLI)
Help you build a simple Java application that produces and consumes events via Kafka

Things the article won't cover:

More advanced Kafka topics, such as security, performance, and monitoring
Deploying a Kafka cluster to a server
Using managed Kafka services like Amazon MSK or Confluent Cloud

1. Event Streaming and Event-Driven Architectures
2. Core Kafka Concepts
   a. Event Messages in Kafka
   b. Topics in Kafka
   c. Partitions in Kafka
   d. Offsets in Kafka
   e. Brokers in Kafka
   f. Replication in Kafka
   g. Producers in Kafka
   h. Consumers in Kafka
   i. Consumer Groups in Kafka
   j. Kafka Zookeeper
3. How to Install Kafka on Your Computer
4. How to Start Zookeeper and Kafka
5. The Kafka CLI
   a. How to List Topics
   b. How to Create a Topic
   c. How to Describe Topics
   d. How to Partition a Topic
   e. How to Set a Replication Factor
   f. How to Delete a Topic
   g. How to use kafka-console-producer
   h. How to use kafka-console-consumer
   i. How to use kafka-consumer-groups
6. How to Build a Kafka Client App with Java
   a. How to Set Up the Project
   b. How to Install the Dependencies
   c. How to Create a Kafka Producer
   d. How to Send Multiple Messages and Use Callbacks
   e. How to Create a Kafka Consumer
   f. How to Shut Down the Consumer
7. Where to Take it From Here

Before we dive into Kafka, we need some context on event streaming and event-driven architectures.

Event Streaming and Event-Driven Architectures

An event is a record that something happened, as well as information about what happened. For example: a customer placed an order, a bank approved a transaction, inventory management updated stock levels.

Events can trigger one or more processes to respond to them. For example: sending an email receipt, transmitting funds to an account, updating a real-time dashboard.

Event streaming is the process of capturing events in real-time from sources (such as web applications, databases, or sensors) to create streams of events. These streams are potentially unending sequences of records.

The event stream can be stored, processed, and sent to different destinations, also called sinks. The destinations that consume the streams could be other applications, databases, or data pipelines for further processing.

As applications have become more complex, often being broken up into different microservices distributed across multiple data centers, many organizations have adopted an event-driven architecture for their applications.

This means that instead of parts of your application directly asking each other for updates about what happened, they each publish events to event streams. Other parts of the application continuously subscribe to these streams and only act when they receive an event that they are interested in.

This architecture helps ensure that if part of your application goes down, other parts won't also fail. Additionally, you can add new features by adding new subscribers to the event stream, without having to rewrite the existing codebase.
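
To make the publish/subscribe idea concrete, here is a minimal in-memory sketch in plain Java. It is not Kafka: the class name, the event strings, and the direct in-process delivery are all assumptions chosen to keep the example short. What it shows is that the publisher only appends events to a stream, while each subscriber decides for itself which events to act on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy event stream; a stand-in for illustration only.
final class EventStreamSketch {
    private final List<String> events = new ArrayList<>();            // the stored stream
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    // The publisher only appends to the stream; it knows nothing about who reacts.
    void publish(String event) {
        events.add(event);
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }

    int size() {
        return events.size();
    }

    public static void main(String[] args) {
        EventStreamSketch orders = new EventStreamSketch();
        // This subscriber only cares about placed orders.
        orders.subscribe(e -> {
            if (e.startsWith("order_placed")) {
                System.out.println("email receipt for " + e);
            }
        });
        // A second subscriber can be added later without touching the publisher.
        orders.subscribe(e -> System.out.println("dashboard saw " + e));
        orders.publish("order_placed:42");
        orders.publish("stock_updated:7");
    }
}
```

Adding the dashboard subscriber required no change to the code that publishes events, which is the property the paragraph above describes.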

Core Kafka Concepts

Kafka has become one of the most popular ways to implement event streaming and event-driven architectures. But it does have a bit of a learning curve and you need to understand several concepts before you can make effective use of it.

These core concepts are:

event messages
topics
partitions
offsets
brokers
replication
producers
consumers
consumer groups
Zookeeper

Event Messages in Kafka

When you write data to Kafka, or read data from it, you do this in the form of messages. You'll also see them called events or records.

A message consists of:

a key
a value
a timestamp
a compression type
headers for metadata (optional)
partition and offset id (once the message is written to a topic)

A Kafka message consisting of key, value, timestamp, compression type, and headers

Every event in Kafka is, at its simplest, a key-value pair. These are serialized into binary, since Kafka itself handles arrays of bytes rather than complex language-specific objects.

Keys are usually strings or integers and aren't unique for every message. Instead, they point to a particular entity in the system, such as a specific user, order, or device. Keys can be null, but when they are included they are used to decide which partition of a topic a message is written to (more on partitions below).

The message value contains details about the event that happened. This could be as simple as a string or as complex as an object with many nested properties. Values can be null, but usually aren't.

By default, the timestamp records when the message was created. You can overwrite this if your event actually occurred earlier and you want to record that time instead.

Messages are usually small (less than 1 MB) and sent in a standard data format, such as JSON, Avro, or Protobuf. Even so, they can be compressed to save on data. The compression type can be set to gzip, lz4, snappy, zstd, or none.

Events can also optionally have headers, which are key-value pairs of strings containing metadata, such as where the event originated from or where you want it routed to.

Once a message is sent into a Kafka topic, it also receives a partition number and offset id (more about these later).

Topics in Kafka

Kafka stores messages in a topic, an ordered sequence of events, also called an event log.

A Kafka topic containing messages, each with a unique offset

Different topics are identified by their names and will store different kinds of events. For example a social media application might have posts, likes, and comments topics to record every time a user creates a post, likes a post, or leaves a comment.

Multiple applications can write to and read from the same topic. An application might also read messages from one topic, filter or transform the data, and then write the result to another topic.

One important feature of topics is that they are append-only. When you write a message to a topic, it's added to the end of the log. Events in a topic are immutable. Once they're written to a topic, you can't change them.

A Producer writing events to topics and a Consumer reading events from topics

Unlike with messaging queues, reading an event from a topic doesn't delete it. Events can be read as often as needed, perhaps several times by multiple different applications.

Topics are also durable, holding onto messages for a specific period (by default 7 days) by saving them to physical storage on disk.

You can configure topics so that messages expire after a certain amount of time, or when a certain amount of storage is exceeded. You can even store messages indefinitely as long as you can pay for the storage costs.

Partitions in Kafka

In order to help Kafka to scale, topics can be divided into partitions. This breaks up the event log into multiple logs, each of which lives on a separate node in the Kafka cluster. This means that the work of writing and storing messages can be spread across multiple machines.

When you create a topic, you specify the number of partitions it has. The partitions are themselves numbered, starting at 0. When a new event is written to a topic, it's appended to one of the topic's partitions.

A topic divided into three partitions

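If messages have no key, they will be spread across partitions. Older Kafka clients do this in a round robin manner (partition 0, then partition 1, then partition 2, and so on), while newer clients (2.4 and later) fill a batch for one partition before moving on to the next. Either way, all partitions get a roughly even share of the data over time, but there's no guarantee about the ordering of messages.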

Messages that have the same key will always be sent to the same partition, and in the same order, as long as the number of partitions in the topic doesn't change. The key is run through a hashing function which turns it into an integer. This output is then used to select a partition.

Messages without keys are sent across partitions, while messages with the same keys are sent to the same partition

Messages within each partition are guaranteed to be ordered. For example, all messages with the same customer_id as their key will be sent to the same partition in the order in which Kafka received them.
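
Here is a rough sketch of how a key can be turned into a partition number. Kafka's default partitioner hashes the serialized key with murmur2. To stay dependency-free, this sketch swaps in Arrays.hashCode as a stand-in, so the partition numbers it produces will not match the ones a real Kafka producer would pick. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: same shape as key-based partitioning, different hash function.
final class KeyPartitionerSketch {

    static int partitionFor(String key, int numPartitions) {
        // Keys are serialized to bytes before hashing.
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash (Kafka itself uses murmur2 here).
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the result is never negative, then pick a partition.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // The two "customer-1" messages always land on the same partition.
        for (String key : new String[] {"customer-1", "customer-2", "customer-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Because the partition is the hash modulo the partition count, changing the number of partitions can send an existing key to a different partition.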

Offsets in Kafka

Each message in a partition gets an id that is an incrementing integer, called an offset. Offsets start at 0 and are incremented every time Kafka writes a message to a partition. This means that each message in a given partition has a unique offset.

Offsets are unique within a partition but not between partitions

Offsets are not reused, even when older messages get deleted. They continue to increment, giving each new message in the partition a unique id.

When data is read from a partition, it is read in order from the lowest existing offset upwards. We'll see more about offsets when we cover Kafka consumers.
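
The behavior of offsets can be modeled with a small in-memory log. This is a toy model under simplifying assumptions (a single partition, messages as plain strings, made-up class and method names), not how Kafka stores data on disk. It shows the three properties described above: offsets increase by one per message, they are not reused after old messages are deleted, and reads proceed from the lowest existing offset upwards.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition's log.
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();
    private long startOffset = 0; // offset of the oldest message still retained

    // Appends to the end of the log and returns the offset assigned to the message.
    long append(String message) {
        messages.add(message);
        return startOffset + messages.size() - 1;
    }

    // Drops the oldest messages, as retention would. Remaining offsets are not renumbered.
    void deleteOldest(int count) {
        int n = Math.min(count, messages.size());
        messages.subList(0, n).clear();
        startOffset += n;
    }

    long lowestOffset() {
        return startOffset;
    }

    // Reads in order from the requested offset, or from the lowest existing one if it is gone.
    List<String> readFrom(long offset) {
        int index = (int) Math.max(0, offset - startOffset);
        index = Math.min(index, messages.size());
        return new ArrayList<>(messages.subList(index, messages.size()));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a"); // offset 0
        log.append("b"); // offset 1
        log.append("c"); // offset 2
        log.deleteOldest(2);
        System.out.println(log.append("d"));  // prints 3
        System.out.println(log.readFrom(0));  // prints [c, d]
    }
}
```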

Brokers in Kafka

A single "server" running Kafka is called a broker. In reality, this might be a Docker container running in a virtual machine. But it can be a helpful mental image to think of brokers as individual servers.

A Kafka cluster made up of three brokers

Multiple brokers working together make up a Kafka cluster. There might be a handful of brokers in a cluster, or more than 100. When a client application connects to one broker, it is sent details of every broker in the cluster, so it can then connect to whichever brokers it needs.

By running as a cluster, Kafka becomes more scalable and fault-tolerant. If one broker fails, the others will take over its work to ensure there is no downtime or data loss.

Each broker manages a set of partitions and handles requests to write data to or read data from these partitions. Partitions for a given topic will be spread evenly across the brokers in a cluster to help with load balancing. Brokers also manage replicating partitions to keep their data backed up.

Partitions spread across brokers

Replication in Kafka

To protect against data loss if a broker fails, Kafka writes the same data to copies of a partition on multiple brokers. This is called replication.

The main copy of a partition is called the leader, while the replicas are called followers.

The data from the leader partition is copied to follower partitions on different brokers

When a topic is created, you set a replication factor for it. This controls how many replicas get written to. A replication factor of three is common, meaning data gets written to one leader and replicated to two followers. So even if two brokers failed, your data would still be safe.

Whenever you write messages to a partition, you're writing to the leader partition. Kafka then automatically copies these messages to the followers. As such, the logs on the followers will have the same messages and offsets as on the leader.

Followers that are up to date with the leader are called In-Sync Replicas (ISRs). Kafka considers a message to be committed once a minimum number of replicas have saved it to their logs. You can configure this to get higher throughput at the expense of less certainty that a message has been backed up.
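
The arithmetic behind these settings is simple enough to sketch. The helper names below are hypothetical. The second helper assumes the producer setting acks=all together with the min.insync.replicas setting, which are real Kafka settings but are not named in this article. Under that assumption, a write is only accepted while enough in-sync replicas (counting the leader) are available.

```java
// Back-of-the-envelope helpers; not part of any Kafka API.
final class ReplicationSketch {

    // With N copies of a partition, the data survives as long as one copy is left.
    static int brokerFailuresTolerated(int replicationFactor) {
        return replicationFactor - 1;
    }

    // Assumes acks=all: the write is accepted only if the in-sync replica count
    // (leader included) is at least min.insync.replicas.
    static boolean writeAccepted(int inSyncReplicas, int minInSyncReplicas) {
        return inSyncReplicas >= minInSyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(brokerFailuresTolerated(3)); // prints 2
        System.out.println(writeAccepted(3, 2));        // prints true
        System.out.println(writeAccepted(1, 2));        // prints false
    }
}
```

A lower minimum lets writes continue when replicas fall behind, at the cost of less certainty that each message has been backed up.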

Producers in Kafka

Producers are client applications that write events to Kafka topics. These apps aren't themselves part of Kafka – you write them.

Usually you will use a library to help manage writing events to Kafka. There is an official client library for Java as well as dozens of community-supported libraries for languages such as Scala, JavaScript, Go, Rust, Python, C#, and C++.

A Producer application writing to multiple topics

Producers are totally decoupled from consumers, which read from Kafka. They don't know about each other and their speed doesn't affect each other. Producers aren't affected if consumers fail, and the same is true for consumers.

If you need to, you could write an application that writes certain events to Kafka and reads other events from Kafka, making it both a producer and a consumer.

Producers take a key-value pair, generate a Kafka message, and then serialize it into binary for transmission across the network. You can adjust the configuration of producers to batch messages together based on their size or some fixed time limit to optimize writing messages to the Kafka brokers.
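
As a rough illustration, producer configuration is usually passed in as a set of key-value properties. The sketch below only builds that set with java.util.Properties and does not create a producer, so it runs without the Kafka client library. The property names are standard Kafka producer settings, but the broker address and every value here are assumptions picked for the example, not recommendations.

```java
import java.util.Properties;

// Builds example producer settings; the values are illustrative assumptions.
final class ProducerConfigSketch {

    static Properties producerProperties() {
        Properties props = new Properties();
        // Assumed address of a broker running locally.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Serializers turn the key and value into bytes for the network.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Batch by size: upper bound, in bytes, for one batch of messages per partition.
        props.setProperty("batch.size", "32768");
        // Batch by time: wait up to this many milliseconds for a batch to fill.
        props.setProperty("linger.ms", "20");
        // Compress batches before sending them.
        props.setProperty("compression.type", "snappy");
        // Wait for the in-sync replicas to acknowledge each write.
        props.setProperty("acks", "all");
        return props;
    }

    public static void main(String[] args) {
        producerProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Larger batches and a longer wait generally trade a little latency for higher throughput.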

It's the producer that decides which partition of a topic to send each message to. Again, messages without keys will be distributed evenly among partitions, while messages with keys are all sent to the same partition.

Consumers in Kafka

Consumers are client applications that read messages from topics in a Kafka cluster. Like with producers, you write these applications yourself and can make use of client libraries to support the programming language your application is built with.

A Consumer reading messages from multiple topics

Consumers can read from one or more partitions within a topic, and from one or more topics. Messages are read in order within a partition, from the lowest available offset to the highest. But if a consumer reads data from several partitions in the same topic, the message order between these partitions is not guaranteed.

For example, a consumer might read messages from partition 0, then partition 2, then partition 1, then back to partition 0. The messages from partition 0 will be read in order, but there might be messages from the other partitions mixed among them.

It's important to remember that reading a message does not delete it. The message is still available to be read by any other consumer that needs to access it. It's normal for multiple consumers to read from the same topic if they each have uses for the data in it.

By default, when a consumer starts up it will read from the current offset in a partition. But consumers can also be configured to go back and read from the oldest existing offset.

Consumers deserialize messages, converting them from binary into a collection of key-value pairs that your application can then work with. The format of a message should not change during a topic's lifetime or your producers and consumers won't be able to serialize and deserialize it correctly.
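As a rough illustration of what serializing and deserializing a string involves, the sketch below converts a string to bytes and back using UTF-8, which is the encoding Kafka's string serializer and deserializer use by default. The class and method names here are hypothetical and only mirror the idea, not the Kafka API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a string round trip through bytes.
class StringSerdeSketch {
    // Producer side: turn the value into bytes for the network.
    static byte[] serialize(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Consumer side: turn the bytes back into a value.
    static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C9owyn"; // "Éowyn", written with an escape to avoid source-encoding issues
        byte[] wire = serialize(original);
        System.out.println(wire.length);                         // prints 6 (the accented letter takes 2 bytes)
        System.out.println(deserialize(wire).equals(original));  // prints true
    }
}
```

The round trip only works because both sides agree on the format, which is why changing a topic's message format partway through its life causes problems.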

One thing to be aware of is that consumers request messages from Kafka; Kafka doesn't push messages to them. This protects consumers from becoming overwhelmed if Kafka is handling a high volume of messages. If you want to scale consumers, you can run multiple instances of a consumer together in a consumer group.

Consumer Groups in Kafka

An application that reads from Kafka can create multiple instances of the same consumer to split up the work of reading from different partitions in a topic. These consumers work together as a consumer group.

When you create a consumer, you can assign it a group id. All consumers in a group will have the same group id.

You can usefully create as many consumer instances in a group as there are partitions in a topic. So if you have a topic with 5 partitions, you can create up to 5 instances of the same consumer in a consumer group. If you ever have more consumers in a group than partitions, the extra consumers will remain idle.

Consumers in a consumer group reading messages from a topic's partitions

If you add another consumer instance to a consumer group, Kafka will automatically redistribute the partitions among the consumers in a process called rebalancing.

Each partition is only assigned to one consumer in a group, but a consumer can read from multiple partitions. Also, multiple different consumer groups (meaning different applications) can read from the same topic at the same time.
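The rules above (one owner per partition, idle consumers when there are too many, and reassignment when membership changes) can be sketched with a toy assignment function. This is a simplification under stated assumptions: `GroupAssignmentSketch` and `assign` are hypothetical names, the function simply deals partitions out in turn, and Kafka's real assignment strategies are more sophisticated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of partition assignment within one consumer group.
class GroupAssignmentSketch {
    // Deal partitions out to consumers in turn; each partition gets exactly one owner.
    // Assumes the consumer list is not empty.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share five partitions.
        System.out.println(assign(5, List.of("c1", "c2")));       // prints {c1=[0, 2, 4], c2=[1, 3]}
        // A third consumer joins, so the assignment is recomputed (a "rebalance").
        System.out.println(assign(5, List.of("c1", "c2", "c3"))); // prints {c1=[0, 3], c2=[1, 4], c3=[2]}
        // More consumers than partitions: the extra consumer gets nothing and sits idle.
        System.out.println(assign(2, List.of("c1", "c2", "c3"))); // prints {c1=[0], c2=[1], c3=[]}
    }
}
```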

Kafka brokers use an internal topic called __consumer_offsets to keep track of which messages a specific consumer group has successfully processed.

As a consumer reads from a partition, it regularly saves the offset it has read up to and sends this data to the broker it is reading from. This is called the consumer offset and is handled automatically by most client libraries.

A Consumer committing the offsets it has read up to

If a consumer crashes, the consumer offset helps the remaining consumers to know where to start from when they take over reading from the partition.

The same thing happens if a new consumer is added to the group. The consumer group rebalances, the new consumer is assigned a partition, and it picks up reading from the consumer offset of that partition.
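Conceptually, the committed offsets behave like a small lookup table keyed by consumer group, topic, and partition. The sketch below models that table in memory to show why a replacement consumer can carry on where the previous one stopped. All names in it (`OffsetStoreSketch`, `my_group`, `my_topic`) are hypothetical, and the real data lives in the __consumer_offsets topic on the brokers rather than in a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical in-memory model of committed consumer offsets.
class OffsetStoreSketch {
    private final Map<String, Long> committed = new HashMap<>();

    private static String key(String group, String topic, int partition) {
        return group + "/" + topic + "/" + partition;
    }

    // Record the offset of the next message this group should read from the partition.
    void commit(String group, String topic, int partition, long nextOffset) {
        committed.put(key(group, topic, partition), nextOffset);
    }

    // Look up where a consumer in this group should resume, if anything was committed.
    OptionalLong committedOffset(String group, String topic, int partition) {
        Long offset = committed.get(key(group, topic, partition));
        return offset == null ? OptionalLong.empty() : OptionalLong.of(offset);
    }

    public static void main(String[] args) {
        OffsetStoreSketch store = new OffsetStoreSketch();
        // A consumer in my_group has processed messages 0..41 of partition 0.
        store.commit("my_group", "my_topic", 0, 42L);
        // That consumer crashes; whichever consumer takes over the partition resumes at 42.
        System.out.println(store.committedOffset("my_group", "my_topic", 0).getAsLong());     // prints 42
        // A different group has its own, independent position in the same partition.
        System.out.println(store.committedOffset("other_group", "my_topic", 0).isPresent()); // prints false
    }
}
```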

Kafka Zookeeper

One other topic that we briefly need to cover here is how Kafka clusters are managed. Currently this is usually done using Zookeeper, a service for managing and synchronizing distributed systems. Like Kafka, it's maintained by the Apache Software Foundation.

Kafka uses Zookeeper to manage the brokers in a cluster, and requires Zookeeper even if you're running a Kafka cluster with only one broker.

Recently, a proposal has been accepted to remove Zookeeper and have Kafka manage itself (KIP-500), but this is not yet widely used in production.

Zookeeper keeps track of things like:

Which brokers are part of a Kafka cluster
Which broker is the leader for a given partition
How topics are configured, such as the number of partitions and the location of replicas
Consumer groups and their members
Access Control Lists – who is allowed to write to and read from each topic

A Zookeeper ensemble managing the brokers in a Kafka cluster

Zookeeper itself runs as a cluster called an ensemble. This means that Zookeeper can keep working even if one node in the cluster fails. New data gets written to the ensemble's leader and replicated to the followers. Your Kafka brokers can read this data from any of the Zookeeper nodes in the ensemble.

Now that you understand the main concepts behind Kafka, let's get some hands-on practice working with Kafka.

You're going to install Kafka on your own computer, practice interacting with Kafka brokers from the command line, and then build a simple producer and consumer application with Java.

How to Install Kafka on Your Computer

At the time of writing this guide, the latest stable version of Kafka is 3.3.1. Check kafka.apache.org/downloads to see if there is a more recent stable version. If there is, you can replace "3.3.1" with the latest stable version in all of the following instructions.

Install Kafka on macOS

If you're using macOS, I recommend using Homebrew to install Kafka. It will make sure you have Java installed before it installs Kafka.

If you don't already have Homebrew installed, install it by following the instructions at brew.sh.

Next, run brew install kafka in a terminal. This will install Kafka's binaries at /usr/local/bin.

Finally, run kafka-topics --version in a terminal and you should see 3.3.1. If you do, you're all set.

To make it easier to work with Kafka, you can add Kafka to the PATH environment variable. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username:

PATH="$PATH:/Users/USERNAME/kafka_2.13-3.3.1/bin"

You'll need to close your terminal for this change to take effect.

Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.

Install Kafka on Windows (WSL2) and Linux

Kafka isn't natively supported on Windows, so you will need to use either WSL2 or Docker. I'm going to show you WSL2 since it's the same steps as Linux.

To set up WSL2 on Windows, follow the instructions in the official docs.

From here on, the instructions are the same for both WSL2 and Linux.

First, install Java 11 by running the following commands:

wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add - 

sudo add-apt-repository 'deb https://apt.corretto.aws stable main'

sudo apt-get update; sudo apt-get install -y java-11-amazon-corretto-jdk

Once this has finished, run java -version and you should see something like:

openjdk version "11.0.17" 2022-10-18 LTS
OpenJDK Runtime Environment Corretto-11.0.17.8.1 (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.17.8.1 (build 11.0.17+8-LTS, mixed mode)

From your home directory, download Kafka with the following command:

wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz

The 2.13 means it is using version 2.13 of Scala, while 3.3.1 refers to the Kafka version.

Extract the contents of the download with:

tar xzf kafka_2.13-3.3.1.tgz

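If you run ls, you'll now see kafka_2.13-3.3.1 in your home directory.

To make it easier to work with Kafka, you can add Kafka to the PATH environment variable. Open your ~/.bashrc (if using Bash) or ~/.zshrc (if using Zsh) and add the following line, replacing USERNAME with your username:

PATH="$PATH:/home/USERNAME/kafka_2.13-3.3.1/bin"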

You'll need to close your terminal for this change to take effect.

Now, if you run echo $PATH you should see that the Kafka bin directory has been added to your path.

Run kafka-topics.sh --version in a terminal and you should see 3.3.1. If you do, you're all set.

How to Start Zookeeper and Kafka

Since Kafka uses Zookeeper to manage clusters, you need to start Zookeeper before you start Kafka.

How to Start Kafka on macOS

In one terminal window, start Zookeeper with:

/usr/local/bin/zookeeper-server-start /usr/local/etc/zookeeper/zoo.cfg

In another terminal window, start Kafka with:

/usr/local/bin/kafka-server-start /usr/local/etc/kafka/server.properties

While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.

How to Start Kafka on Windows (WSL2) and Linux

In one terminal window, start Zookeeper with:

~/kafka_2.13-3.3.1/bin/zookeeper-server-start.sh ~/kafka_2.13-3.3.1/config/zookeeper.properties

In another terminal window, start Kafka with:

~/kafka_2.13-3.3.1/bin/kafka-server-start.sh ~/kafka_2.13-3.3.1/config/server.properties

While using Kafka, you need to keep both these terminal windows open. Closing them will shut down Kafka.

Now that you have Kafka installed and running on your machine, it's time to get some hands-on practice.

The Kafka CLI

When you install Kafka, it comes with a Command Line Interface (CLI) that lets you create and manage topics, as well as produce and consume events.

First, make sure Zookeeper and Kafka are running in two terminal windows.

In a third terminal window, run kafka-topics.sh (on WSL2 or Linux) or kafka-topics (on macOS) to make sure the CLI is working. You'll see a list of all the options you can pass to the CLI.

kafka-topics options

Note: When working with the Kafka CLI, the command will be kafka-topics.sh on WSL2 and Linux. It will be kafka-topics.sh on macOS if you directly installed the Kafka binaries and kafka-topics if you used Homebrew. So if you're using Homebrew, remove the .sh extension from the example commands in this section.

How to List Topics

To see the topics available on the Kafka broker on your local machine, use:

kafka-topics.sh --bootstrap-server localhost:9092 --list

This means "Connect to the Kafka broker running on localhost:9092 and list all topics there". --bootstrap-server refers to the Kafka broker you are trying to connect to and localhost:9092 is the address (host and port) it's running at. You won't see any output since you haven't created any topics yet.

How to Create a Topic

To create a topic (with the default replication factor and number of partitions), use the --create and --topic options and pass them a topic name:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_first_topic

If you use an _ or . in your topic name, you will see the following warning:

WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.

Since Kafka could confuse my.first.topic with my_first_topic, it's best to only use either underscores or periods when naming topics.

How to Describe Topics

To describe the topics on a broker, use the --describe option:

kafka-topics.sh --bootstrap-server localhost:9092 --describe

This will print the details of all the topics on this broker, including the number of partitions and their replication factor. By default, these will both be set to 1.

If you add the --topic option and the name of a topic, it will describe only that topic:

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my_first_topic

How to Partition a Topic

To create a topic with multiple partitions, use the --partitions option and pass it a number:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_second_topic --partitions 3

How to Set a Replication Factor

To create a topic with a replication factor higher than the default, use the --replication-factor option and pass it a number:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_third_topic --partitions 3 --replication-factor 2

You should get the following error:

ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 2 larger than available brokers: 1.

Since you're only running one Kafka broker on your machine, you can't set a replication factor higher than one. If you were running a cluster with multiple brokers, you could set a replication factor as high as the total number of brokers.

How to Delete a Topic

To delete a topic, use the --delete option and specify a topic with the --topic option:

kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic my_first_topic

You won't get any output to say the topic was deleted but you can check using --list or --describe.

How to Use `kafka-console-producer`

You can produce messages to a topic from the command line using kafka-console-producer.

Run kafka-console-producer.sh to see the options you can pass to it.

kafka-console-producer options

To create a producer connected to a specific topic, run:

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME

Let's produce messages to the my_first_topic topic.

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_first_topic

Your prompt will change and you will be able to type text. Press enter to send that message. You can keep sending messages until you press ctrl + c.

Sending messages using kafka-console-producer

If you produce messages to a topic that doesn't exist, you'll get a warning, but the topic will be created and the messages will still get sent. It's better to create a topic in advance, however, so you can specify partitions and replication.

By default, the messages sent from kafka-console-producer have their keys set to null, and so they will be evenly distributed to all partitions.

You can set a key by using the --property option to set parse.key to be true and providing a key separator, such as :

For example, we can create a books topic and use the books' genre as a key.

kafka-topics.sh --bootstrap-server localhost:9092 --topic books --create

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic books --property parse.key=true --property key.separator=:

Now you can enter keys and values in the format key:value. Anything to the left of the key separator will be interpreted as a message key, anything to the right as a message value.

science_fiction:All Systems Red
fantasy:Uprooted
horror:Mexican Gothic
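The way a line such as science_fiction:All Systems Red is divided can be sketched as splitting on the first occurrence of the separator. The helper below is a hypothetical illustration, not the console producer's code. In particular, treating a line without a separator as a value with no key is an assumption made for this sketch, and the real tool may handle that case differently.

```java
// Hypothetical illustration of splitting a "key:value" line at the first separator.
class KeyedLineSketch {
    // Returns a two-element array: {key, value}. The key is null if no separator is found.
    static String[] parse(String line, String separator) {
        int index = line.indexOf(separator);
        if (index < 0) {
            return new String[] { null, line };
        }
        String key = line.substring(0, index);
        String value = line.substring(index + separator.length());
        return new String[] { key, value };
    }

    public static void main(String[] args) {
        String[] parsed = parse("science_fiction:All Systems Red", ":");
        System.out.println(parsed[0]); // prints science_fiction
        System.out.println(parsed[1]); // prints All Systems Red
    }
}
```

Because only the first separator is used in this sketch, a value may itself contain the separator character, but a key may not.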

Producing messages with keys and values

Now that you've produced messages to a topic from the command line, it's time to consume those messages from the command line.

How to Use `kafka-console-consumer`

You can consume messages from a topic from the command line using kafka-console-consumer.

Run kafka-console-consumer.sh to see the options you can pass to it.

kafka-console-consumer options

To create a consumer, run:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TOPIC_NAME

When you start a consumer, by default it will read messages as they are written to the end of the topic. It won't read messages that were previously sent to the topic.

If you want to read the messages you already sent to a topic, use the --from-beginning option to read from the beginning of the topic:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning

The messages might appear "out of order". Remember, messages are ordered within a partition but ordering can't be guaranteed between partitions. If you don't set a key, they will be sent round robin between partitions and ordering isn't guaranteed.

You can display additional information about messages, such as their key and timestamp, by using the --property option and setting print properties such as print.key and print.timestamp to true.

Use the --formatter option to set the message formatter and the --property option to select which message properties to print.

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_first_topic --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true

Consuming messages from a topic

We get the messages' timestamp, key, and value. Since we didn't assign any keys when we sent these messages to my_first_topic, their key is null.

How to Use `kafka-consumer-groups`

You can run consumers in a consumer group using the Kafka CLI. To view the documentation for this, run:

kafka-consumer-groups.sh

kafka-consumer-groups options

First, create a topic with three partitions. Each consumer in a group will consume from one partition. If there are more consumers than partitions, any extra consumers will be idle.

kafka-topics.sh --bootstrap-server localhost:9092 --topic fantasy_novels --create --partitions 3

You add a consumer to a group when you create it using the --group option. If you run the same command multiple times with the same group name, each new consumer will be added to the group.

To create the first consumer in your consumer group, run:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --group fantasy_consumer_group

Next, open two new terminal windows and run the same command again to add a second and third consumer to the consumer group.

Three consumers running in a consumer group

In a different terminal window, create a producer and send a few messages with keys to the topic.

Note: Since Kafka 2.4, Kafka will send messages in batches to one "sticky" partition for better performance. In order to demonstrate messages being sent round robin between partitions (without sending a large volume of messages), we can set the partitioner to RoundRobinPartitioner.

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic fantasy_novels --property parse.key=true --property key.separator=: --producer-property partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner

tolkien:The Lord of the Rings
le_guin:A Wizard of Earthsea
leckie:The Raven Tower
de_bodard:The House of Shattered Wings
okorafor:Who Fears Death
liu:The Grace of Kings

Messages spread between consumers in a consumer group

If you stop one of the consumers, the consumer group will rebalance and future messages will be sent to the remaining consumers.

Now that you have some experience working with Kafka from the command line, the next step is to build a small application that connects to Kafka.

How to Build a Kafka Client App with Java

We're going to build a simple Java app that both produces messages to and consumes messages from Kafka. For this we'll use the official Kafka Java client.

If at any point you get stuck, the full code for this project is available on GitHub.

Preliminaries

First of all, make sure you have Java (at least JDK 11) and Kafka installed.

We're going to send messages about characters from The Lord of the Rings. So let's create a topic for these messages with three partitions.

From the command line, run:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic lotr_characters --partitions 3

How to Set Up the Project

I recommend using IntelliJ for Java projects, so go ahead and install the Community Edition if you don't already have it. You can download it from jetbrains.com/idea

In IntelliJ, select File, New, and Project.

Give your project a name and select a location for it on your computer. Make sure you have selected Java as the language, Maven as the build system, and that the JDK is at least Java 11. Then click Create.

Setting up a Maven project in IntelliJ

Note: If you're on Windows, IntelliJ can't use a JDK installed on WSL. To install Java on the Windows side of things, go to docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list and download the Windows installer. Follow the installation steps, open a command prompt, and run java -version. You should see something like:

openjdk version "11.0.18" 2023-01-17 LTS
OpenJDK Runtime Environment Corretto-11.0.18.10.1 (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.18.10.1 (build 11.0.18+10-LTS, mixed mode)

Once your Maven project finishes setting up, run the Main class to see "Hello world!" and make sure everything worked.

How to Install the Dependencies

Next, we're going to install our dependencies. Open up pom.xml and inside the <project> element, create a <dependencies> element.

We're going to use the Java Kafka client for interacting with Kafka and SLF4J for logging, so add the following inside your <dependencies> element:

  
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.3.1</version>
</dependency>

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.6</version>
</dependency>

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>2.0.6</version>
</dependency>

The package names and version numbers might be red, meaning you haven't downloaded them yet. If this happens, click on View, Tool Windows, and Maven to open the Maven menu. Click on the Reload All Maven Projects icon and Maven will install these dependencies.

Reloading Maven dependencies in IntelliJ

Create a HelloKafka class in the same directory as your Main class and give it the following contents:

package org.example;

import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  

public class HelloKafka {  
    private static final Logger log = LoggerFactory.getLogger(HelloKafka.class);  

    public static void main(String[] args) {  
        log.info("Hello Kafka");  
    }  
}

To make sure your dependencies are installed, run this class and you should see [main] INFO org.example.HelloKafka - Hello Kafka printed to the IntelliJ console.

How to Create a Kafka Producer

Next, we're going to create a Producer class. You can call this whatever you want as long as it doesn't clash with another class. So don't use KafkaProducer as you'll need that class in a minute.

package org.example;  

import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  

public class Producer {  
    private static final Logger log = LoggerFactory.getLogger(Producer.class);

    public static void main(String[] args) {  
        log.info("This class will produce messages to Kafka");  
    }  
}

All of our Kafka-specific code is going to go inside this class's main() method.

The first thing we need to do is configure a few properties for the producer. Add the following inside the main() method:

Properties properties = new Properties(); 

properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

Properties stores a set of properties as pairs of strings. The ones we're using are:

ProducerConfig.BOOTSTRAP_SERVERS_CONFIG which specifies the address (host and port) to use to access the Kafka cluster
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG which specifies the serializer to use for message keys
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG which specifies the serializer to use for message values

We're going to connect to our local Kafka cluster running on localhost:9092, and use the StringSerializer since both our keys and values will be strings.
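It can help to see that these constants are just names for plain configuration strings. The sketch below builds an equivalent Properties object using the string values that, to my knowledge, the three ProducerConfig constants and StringSerializer.class.getName() resolve to. The class name `ProducerPropertiesSketch` is hypothetical, and using the constants as shown above remains the safer choice because the compiler catches typos.

```java
import java.util.Properties;

// Hypothetical illustration: the same producer settings written as plain strings.
class ProducerPropertiesSketch {
    static Properties producerProperties(String bootstrapServers) {
        Properties properties = new Properties();
        // Equivalent of ProducerConfig.BOOTSTRAP_SERVERS_CONFIG
        properties.setProperty("bootstrap.servers", bootstrapServers);
        // Equivalent of ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG
        properties.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Equivalent of ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG
        properties.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = producerProperties("localhost:9092");
        System.out.println(properties.getProperty("bootstrap.servers")); // prints localhost:9092
        System.out.println(properties.getProperty("key.serializer"));
        System.out.println(properties.getProperty("value.serializer"));
    }
}
```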

Now we can create our producer and pass it the configuration properties.

KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

To send a message, we need to create a ProducerRecord and pass it to our producer. ProducerRecord contains a topic name and a value, and optionally a key and partition number.

We're going to create the ProducerRecord with the topic to use, the message's key, and the message's value.

ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");

We can now use the producer's send() method to send the message to Kafka.

producer.send(producerRecord);

Finally, we need to call the close() method to stop the producer. This method handles any messages currently being processed by send() and then closes the producer.

producer.close();

Now it's time to run our producer. Make sure you have Zookeeper and Kafka running. Then run the main() method of the Producer class.

Sending a message from a producer in a Java Kafka client app

Note: On Windows, your producer might not be able to connect to a Kafka broker running on WSL. To fix this, you're going to need to do the following:

In a WSL terminal, navigate to Kafka's config folder: cd ~/kafka_2.13-3.3.1/config/
Open server.properties, for example with Nano: nano server.properties
Uncomment #listeners=PLAINTEXT://:9092
Replace it with listeners=PLAINTEXT://[::1]:9092
In your Producer class, replace "localhost:9092" with "[::1]:9092"

[::1], or 0:0:0:0:0:0:0:1, refers to the loopback address (or localhost) in IPv6. This is equivalent to 127.0.0.1 in IPv4.

If you change listeners, when you try to access the Kafka broker from the command line you'll also have to use the new IP address, so use --bootstrap-server [::1]:9092 instead of --bootstrap-server localhost:9092 and it should work.

We can now check that Producer worked by using kafka-console-consumer in another terminal window to read from the lotr_characters topic and see the message printed to the console.

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic lotr_characters --from-beginning

kafka-console-consumer reading the message sent by the producer in our Java app

How to Send Multiple Messages and Use Callbacks

So far we're only sending one message. If we update Producer to send multiple messages, we'll be able to see how keys are used to divide messages between partitions. We can also take this opportunity to use a callback to view the sent message's metadata.

To do this, we're going to loop over a collection of characters to generate our messages.

So replace this:

ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", "hobbits", "Bilbo");

producer.send(producerRecord);

with this:

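// Map each character to their group. The character names are the map keys
// because map keys must be unique; the group becomes the Kafka message key.
HashMap<String, String> characters = new HashMap<>();
characters.put("Frodo", "hobbits");
characters.put("Sam", "hobbits");
characters.put("Galadriel", "elves");
characters.put("Arwen", "elves");
characters.put("Éowyn", "humans");
characters.put("Faramir", "humans");

for (HashMap.Entry<String, String> character : characters.entrySet()) {
    // Message key = group (the entry's value), message value = character name (the entry's key)
    ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getValue(), character.getKey());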

    producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {  
        if (err == null) {  
            log.info("Message received. \n" +  
                    "topic [" + recordMetadata.topic() + "]\n" +  
                    "partition [" + recordMetadata.partition() + "]\n" +  
                    "offset [" + recordMetadata.offset() + "]\n" +  
                    "timestamp [" + recordMetadata.timestamp() + "]");  
        } else {  
            log.error("An error occurred while producing messages", err);  
        }  
    });  
}

Here, we're iterating over the collection, creating a ProducerRecord for each entry, and passing the record to send(). Behind the scenes, Kafka will batch these messages together to make fewer network requests. send() can also take a callback as a second argument. We're going to pass it a lambda which will run code when the send() request completes.

If the request completed successfully, we get back a RecordMetadata object with metadata about the message, which we can use to see things such as the partition and offset the message ended up in.

If we get back an exception, we could handle it by retrying to send the message, or alerting our application. In this case, we're just going to log the exception.

Run the main() method of the Producer class and you should see the message metadata get logged.

The full code for the Producer class should now be:

package org.example;  

import org.apache.kafka.clients.producer.KafkaProducer;  
import org.apache.kafka.clients.producer.ProducerConfig;  
import org.apache.kafka.clients.producer.ProducerRecord;  
import org.apache.kafka.clients.producer.RecordMetadata;  
import org.apache.kafka.common.serialization.StringSerializer;  
import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  

import java.util.HashMap;  
import java.util.Properties;  

public class Producer {  
    private static final Logger log = LoggerFactory.getLogger(Producer.class);  

    public static void main(String[] args) {  
        log.info("This class produces messages to Kafka");  

        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); 
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());  

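        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // Map each character to their group. The character names are the map keys
        // because map keys must be unique; the group becomes the Kafka message key.
        HashMap<String, String> characters = new HashMap<>();
        characters.put("Frodo", "hobbits");
        characters.put("Sam", "hobbits");
        characters.put("Galadriel", "elves");
        characters.put("Arwen", "elves");
        characters.put("Éowyn", "humans");
        characters.put("Faramir", "humans");

        for (HashMap.Entry<String, String> character : characters.entrySet()) {
            // Message key = group (the entry's value), message value = character name (the entry's key)
            ProducerRecord<String, String> producerRecord = new ProducerRecord<>("lotr_characters", character.getValue(), character.getKey());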

            producer.send(producerRecord, (RecordMetadata recordMetadata, Exception err) -> {  
                if (err == null) {  
                    log.info("Message received. \n" +  
                            "topic [" + recordMetadata.topic() + "]\n" +  
                            "partition [" + recordMetadata.partition() + "]\n" +  
                            "offset [" + recordMetadata.offset() + "]\n" +  
                            "timestamp [" + recordMetadata.timestamp() + "]");  
                } else {  
                    log.error("An error occurred while producing messages", err);  
                }  
            });  
        }
        producer.close();  
    }  
}

Next, we're going to create a consumer to read these messages from Kafka.

How to Create a Kafka Consumer

First, create a Consumer class. Again, you can call it whatever you want, but don't call it KafkaConsumer as you will need that class in a moment.

All the Kafka-specific code will go in Consumer's main() method.

package org.example;  

import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  

public class Consumer {  
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);  

    public static void main(String[] args) {  
        log.info("This class consumes messages from Kafka");  
    }  
}

Next, configure the consumer properties.

Properties properties = new Properties();  
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());  
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());  
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");  
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

Just like with Producer, these properties are a set of string pairs. The ones we're using are:

ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG which specifies the address (host and port) to use to access the Kafka cluster
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG which specifies the deserializer to use for message keys
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG which specifies the deserializer to use for message values
ConsumerConfig.GROUP_ID_CONFIG which specifies the consumer group this consumer belongs to
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG which specifies the offset to start reading from

We're connecting to the Kafka cluster on localhost:9092, using string deserializers since our keys and values are strings, setting a group id for our consumer, and telling the consumer to read from the start of the topic.
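One detail worth spelling out is that the offset reset setting only matters when the consumer group has no committed offset for a partition. The sketch below models that decision. The class `OffsetResetSketch` and its method are hypothetical, the offsets passed in are made-up numbers, and the real setting also accepts "none", which this sketch simply rejects.

```java
import java.util.OptionalLong;

// Hypothetical illustration of how a starting offset is chosen for a partition.
class OffsetResetSketch {
    static long startingOffset(OptionalLong committed, String autoOffsetReset,
                               long earliestOffset, long latestOffset) {
        // A committed offset always wins; the reset policy is only a fallback.
        if (committed.isPresent()) {
            return committed.getAsLong();
        }
        switch (autoOffsetReset) {
            case "earliest":
                return earliestOffset; // start from the oldest message still available
            case "latest":
                return latestOffset;   // start from the end and only read new messages
            default:
                throw new IllegalArgumentException("Unsupported policy in this sketch: " + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // First run of a new group: nothing committed, so "earliest" starts at the beginning.
        System.out.println(startingOffset(OptionalLong.empty(), "earliest", 0L, 6L)); // prints 0
        // Same situation with "latest": previously sent messages are skipped.
        System.out.println(startingOffset(OptionalLong.empty(), "latest", 0L, 6L));   // prints 6
        // Later run of the same group: it resumes from its committed offset instead.
        System.out.println(startingOffset(OptionalLong.of(4L), "earliest", 0L, 6L));  // prints 4
    }
}
```

This is why running the same consumer a second time with the same group id would not be expected to print the old messages again.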

Note: If you're running the consumer on Windows and accessing a Kafka broker running on WSL, you'll need to change "localhost:9092" to "[::1]:9092" or "[0:0:0:0:0:0:0:1]:9092", like you did in Producer.

Next, we create a KafkaConsumer and pass it the configuration properties.

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

We need to tell the consumer which topic, or topics, to subscribe to. The subscribe() method takes in a collection of one or more strings, naming the topics you want to read from. Remember, consumers can subscribe to more than one topic at the same time. For this example, we'll use one topic, the lotr_characters topic.

String topic = "lotr_characters";  

consumer.subscribe(Arrays.asList(topic));

The consumer is now ready to start reading messages from the topic. It does this by regularly polling for new messages.

We'll use a while loop to repeatedly call the poll() method to check for new messages.

poll() takes in a duration that sets the maximum time it will wait for new messages before returning. It returns the messages it fetched as an iterable called ConsumerRecords, which is empty if nothing arrived in time. We can then iterate over ConsumerRecords and do something with each individual ConsumerRecord.

In a real-world application, we would process this data or send it to some further destination, like a database or data pipeline. Here, we're just going to log the key, value, partition, and offset for each message we receive.

while (true) {
    ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));

    for (ConsumerRecord<String, String> message : messages) {
        log.info("key [" + message.key() + "] value [" + message.value() + "]");
        log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
    }
}

Now it's time to run our consumer. Make sure you have Zookeeper and Kafka running. Run the Consumer class and you'll see the messages that Producer previously sent to the lotr_characters topic in Kafka.

The Kafka client app consuming messages that were previously produced to Kafka

How to Shut Down the Consumer

Right now, our consumer is running in an infinite loop and polling for new messages every 100 ms. This isn't a problem, but we should add safeguards to handle shutting down the consumer if an exception occurs.

We're going to wrap our code in a try-catch-finally block. If an exception occurs, we can handle it in the catch block.

The finally block will then call the consumer's close() method. This will close the socket the consumer is using, commit the offsets it has processed, and trigger a consumer group rebalance so any other consumers in the group can take over reading the partitions this consumer was handling.

try {
    // subscribe to topic(s)
    String topic = "lotr_characters";
    consumer.subscribe(Arrays.asList(topic));

    while (true) {
        // poll for new messages
        ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));

        // handle message contents
        for (ConsumerRecord<String, String> message : messages) {
            log.info("key [" + message.key() + "] value [" + message.value() + "]");
            log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
        }
    }
} catch (Exception err) {
    // catch and handle exceptions
    log.error("Error: ", err);
} finally {
    // close consumer and commit offsets
    consumer.close();
    log.info("The consumer is now closed");
}

Consumer will continuously poll its assigned topics for new messages and shut down safely if it experiences an exception.
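To see the try-catch-finally behaviour in isolation, without a running broker, here is a small stand-in sketch. FakeConsumer is hypothetical and only mimics the poll()/close() shape of the real consumer; it deliberately fails on its third poll so that the catch and finally blocks are exercised.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the tutorial's consumer; FakeConsumer and its methods are
// hypothetical and exist only to show the control flow.
class ShutdownSketch {

    static class FakeConsumer implements AutoCloseable {
        private int polls = 0;
        boolean closed = false;

        // Returns one message per call and fails on the third call to simulate an error.
        List<String> poll() {
            polls++;
            if (polls == 3) {
                throw new IllegalStateException("simulated broker error");
            }
            List<String> batch = new ArrayList<>();
            batch.add("message-" + polls);
            return batch;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Runs the poll loop and reports what was handled before the loop ended.
    static List<String> run(FakeConsumer consumer) {
        List<String> processed = new ArrayList<>();
        try {
            while (true) {
                processed.addAll(consumer.poll());
            }
        } catch (IllegalStateException err) {
            processed.add("error: " + err.getMessage());
        } finally {
            // Runs whether the loop ended normally or through an exception.
            consumer.close();
        }
        return processed;
    }

    public static void main(String[] args) {
        FakeConsumer consumer = new FakeConsumer();
        System.out.println(run(consumer));
        System.out.println("closed = " + consumer.closed);
    }
}
```

The two messages polled before the failure are still handled, the error is recorded, and close() is called exactly as it would be in the finally block of the real Consumer class.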

The full code for the Consumer class should now be:

package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class Consumer {
    private static final Logger log = LoggerFactory.getLogger(Consumer.class);

    public static void main(String[] args) {
        log.info("This class consumes messages from Kafka");

        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "lotr_consumer_group");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

        try {
            String topic = "lotr_characters";
            consumer.subscribe(Arrays.asList(topic));

            while (true) {
                ConsumerRecords<String, String> messages = consumer.poll(Duration.ofMillis(100));

                for (ConsumerRecord<String, String> message : messages) {
                    log.info("key [" + message.key() + "] value [" + message.value() + "]");
                    log.info("partition [" + message.partition() + "] offset [" + message.offset() + "]");
                }
            }
        } catch (Exception err) {
            log.error("Error: ", err);
        } finally {
            consumer.close();
            log.info("The consumer is now closed");
        }
    }
}

You now have a basic Java application that can send messages to and read messages from Kafka. If you got stuck at any point, the full code is available on GitHub.

Where to Take it from Here

Congratulations on making it this far. You've learned:

the main concepts behind Kafka
how to communicate with Kafka from the command line
how to build a Java app that produces to and consumes from Kafka

There's plenty more to learn about Kafka, whether that's Kafka Connect for connecting Kafka to common data systems or the Kafka Streams API for processing and transforming your data.

Some resources you might find useful as you continue your journey with Kafka are:

I hope this guide has been helpful and made you excited to learn more about Kafka, event streaming, and real-time data processing.

How to Use Object Storage for Data Parallelization and Experimentation

Ry Vee — Mon, 27 Sep 2021 14:09:57 +0000

By using big data, companies can learn a lot about how their businesses are performing. Analytics on sales, churn rates, and other basic metrics are available in almost real time as data comes in.

Then there are more complex analyses that you'll need to do. At times relationships between two seemingly unrelated data sets can provide surprising insights and unveil important opportunities for the organization.

Data scientists and engineers are continuing to improve how they break down and work on data. Experimentation entails discovering the right correlations among data points.

This means they also need to do some sort of parallelization of such data and resulting models. Parallelization simply means that the same data set is being operated upon in many different ways without damaging the integrity of the original data.

In this article we are going to talk about how you can make sure you're doing such experimentation and parallel processing efficiently and that it provides the maximum insights. We will be tackling different concepts related to data storage and data versioning.

Block Storage vs Object Storage

For the uninitiated, we first must understand the difference between block and object storage and why the latter is the better option when dealing with data experimentation.

_Image source_

What is Block Storage?

It is called “block storage” (often delivered over a storage area network, or SAN) because each dataset (in the form of files) is split into blocks that are stored on disks.

A classic example of block storage is the file system on your personal computer. For enterprise-level use-cases, it is scaled through a network of hard drives connected through fiber optic cables.

There are a few disadvantages to using block storage. First, if a sector (or a block) becomes corrupted, it can damage the files. Another problem is the lack of scalability (expanding the network of fiber optic cables is costly).

What is Object Storage?

In object storage, data is stored as objects. Each object contains the actual data, called the blob, a unique identifier (UUID), and metadata, which contains information about the object (such as timestamp, version, and author).

Object storage makes it cost-effective to scale your data store—you don’t need complex hardware for this. It also makes data retrieval faster as each object can be retrieved through its UUID.

This is in contrast to block storage, where each data location needs to be identified before the actual information can be retrieved.

One disadvantage of using object storage is that an object can only be written once and cannot be updated in place: changing it means writing a new object. But this isn’t really a disadvantage, as we will see further on in this article.
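The object model described in this section can be sketched in a few lines of Java. This is an in-memory illustration only, not how any particular object store is implemented: WriteOnceStore, StoredObject, and the metadata keys are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// In-memory illustration only; WriteOnceStore and StoredObject are hypothetical names.
class WriteOnceStore {

    // One stored object: the blob plus descriptive metadata.
    static final class StoredObject {
        final byte[] blob;
        final Map<String, String> metadata;

        StoredObject(byte[] blob, Map<String, String> metadata) {
            this.blob = blob.clone();
            this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
        }
    }

    private final Map<UUID, StoredObject> objects = new HashMap<>();

    // Stores a new object and returns the identifier used to fetch it later.
    // There is deliberately no update method: every write creates a new object.
    UUID put(byte[] blob, Map<String, String> metadata) {
        UUID id = UUID.randomUUID();
        objects.put(id, new StoredObject(blob, metadata));
        return id;
    }

    // Looks an object up directly by its identifier.
    Optional<StoredObject> get(UUID id) {
        return Optional.ofNullable(objects.get(id));
    }

    public static void main(String[] args) {
        WriteOnceStore store = new WriteOnceStore();
        UUID v1 = store.put("sales,100".getBytes(StandardCharsets.UTF_8),
                Collections.singletonMap("version", "1"));
        // An "update" is a second object; the first one is left untouched.
        UUID v2 = store.put("sales,120".getBytes(StandardCharsets.UTF_8),
                Collections.singletonMap("version", "2"));
        System.out.println(new String(store.get(v1).get().blob, StandardCharsets.UTF_8));
        System.out.println(new String(store.get(v2).get().blob, StandardCharsets.UTF_8));
    }
}
```

Because the store exposes only put() and get(), an earlier version of the data stays readable under its original identifier even after a newer version is written.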

What Problems Does Object Storage Solve?

As we have already seen, data retrieval can be incredibly fast with object storage (no matter the size of the data store). But when it comes to data experimentation and data parallelization, object storage shines the brightest.

As mentioned before, you can't overwrite any data already stored as an object. This ensures object storage is protected from unwanted (or unauthorized) data destruction or updating. That’s great to know if you do a lot of data processing where accidental corruption of information could happen.

Object storage also doesn’t require data to be structured. Companies produce and consume tremendous amounts of information every moment, and non-structured data (such as PDFs, videos, and images) is often hard to process into useful forms (such as analytics or dashboards).

With object storage, this is now possible. You can now use non-structured data to develop machine learning models.

With object storage, it’s possible to have different versions of the same blob (with different metadata). Just as there is Git for code version control, we can have similar ways of managing different versions of the same data.

This brings us to the concept of data lakes.

What are Data Lakes?

Data lakes are central repositories of data that don’t care which format such data is in.

Companies produce and consume tremendous amounts of data. Such data traditionally sits in silos because they belong to different departments or are in different forms (for example, videos aren’t stored in the same directory as the data in the MySQL database).

With data lakes, any department in the enterprise can store information without the need to pre-process it. Likewise, any data can be retrieved and analyzed by anybody from any department.

Data lakes are important because they make data analytics extremely fast and convenient.

How Data Experimentation and Parallelization Work with Object Storage

As with developing software, working with data requires us to utilize tools that can aid us in our workflow. A powerful open source tool for experimenting with data and performing parallelization (that is, working on the same data to create different sets of machine learning models) is LakeFS.

LakeFS is an open source platform that provides Git-like capabilities when working with data. This means you can create branches (allowing you to experiment with data) and commit versions of data (and data models).

Why is this Git-like feature important?

First, you need to make sure that your data lake is ACID compliant. This means that your data changes can happen in isolation (in branches). Thus, the integrity of the data is maintained in the master branch (until such changes are ready to be merged).
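The idea of isolating changes in a branch can be illustrated with a toy model. The sketch below is not LakeFS code and does not use its API: BranchSketch and its methods are hypothetical, and the merge is deliberately naive (the source branch simply wins).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of branch isolation; this is not LakeFS code and the names are hypothetical.
class BranchSketch {

    // branch name -> (path -> object id)
    private final Map<String, Map<String, String>> branches = new HashMap<>();

    BranchSketch() {
        branches.put("main", new HashMap<>());
    }

    // A new branch starts as a copy of the source branch's pointers, not of the data itself.
    void createBranch(String name, String source) {
        branches.put(name, new HashMap<>(branches.get(source)));
    }

    void write(String branch, String path, String objectId) {
        branches.get(branch).put(path, objectId);
    }

    String read(String branch, String path) {
        return branches.get(branch).get(path);
    }

    // Naive merge: every pointer on the source branch replaces the one on the target.
    void merge(String source, String target) {
        branches.get(target).putAll(branches.get(source));
    }

    public static void main(String[] args) {
        BranchSketch repo = new BranchSketch();
        repo.write("main", "sales.csv", "object-1");
        repo.createBranch("experiment", "main");
        repo.write("experiment", "sales.csv", "object-2");
        System.out.println(repo.read("main", "sales.csv")); // still object-1
        repo.merge("experiment", "main");
        System.out.println(repo.read("main", "sales.csv")); // now object-2
    }
}
```

Until the merge happens, readers of the main branch keep seeing the original object, which is the isolation property the paragraph above relies on.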

Another important feature of LakeFS is continuous integration of data (again, much like in software development). Enterprises need to incorporate new data quickly and without being disrupted. Therefore, this ability to have a CI/CD workflow is invaluable.

So, let’s see how we can get started with using LakeFS with our object storage experimentation and parallelization.

How to Install LakeFS

Locally you can install LakeFS by running the following command in your terminal:*

Code source

_*This is assuming you have Docker and Docker-Compose installed in your system. If you don’t have Docker and Docker-Compose, you may try other installation methods here._

Now visit http://127.0.0.1:8000/setup in your browser to verify you have installed it correctly.

How to Create a Repository in LakeFS

Once you’ve verified that LakeFS is installed correctly, go ahead and create an admin user.

_Image source_

Click on the login link and log in as an administrator.

On the page to which you get redirected, click on Create Repository. A popup will appear:

_Image source_

Congratulations! You now have your first repository. This is the main “bucket” in which you are going to store your data.

Next, we’ll start adding some data.

How to Add Data to your LakeFS Repository

Visit here to install AWS CLI.

With the credentials created during the admin-user creation phase, configure a new connection profile:

Code source

To test if the connection is working, run the following:

Code source

Now, to copy files into the main branch:

Code source

Just note that we need to prefix the path with the name of the branch we want to use.

Now, we will see the file we’ve added in the UI:

_Image source_

Next, we will need to know how to commit and create branches. To do that, we will need to install the LakeFS CLI.

How to Install the LakeFS CLI

You need to first download the binary file here.

Again, we need to use the earlier created admin credentials:

Code source

Here are some of the commands we can run to try it out:

Code source

You can find all the other commands, such as branch creation, and so on, online.

There you have it! Now, you can work with your data any way you like. Experiment without guilt and create multiple versions of your data models.

In Closing

In this article, we covered a bit of ground. We learned about the different kinds of data storage mechanisms and why object storage has an edge when it comes to data experimentation and parallelization.

Next, we looked into data lakes and LakeFS, which is a powerful tool for working with data.

At first, it might seem a daunting task. But, as we’ve shown here, with the right set of tools and knowledge, there’s a lot you can accomplish.

A Quick Overview of the Apache Hadoop Framework

freeCodeCamp — Sat, 01 Feb 2020 00:00:00 +0000

Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.

What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Source: Apache Hadoop

In 2003 Google released their paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled “MapReduce: Simplified Data Processing on Large Clusters.” At the time, Doug was working at Yahoo. These papers were the inspiration for his open-source project Apache Nutch. In 2006, the project components then known as Hadoop moved out of Apache Nutch and were released.

Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:

Metadata from phone usage
Website logs
Credit card purchase transactions
Social media posts
Videos
Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.

At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.

Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.
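As a rough single-machine analogy of the map and reduce steps, the sketch below counts words across several chunks of text using Java parallel streams. It does not use Hadoop at all; WordCountSketch and the sample chunks are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Single-machine illustration of the map and reduce steps; it does not use Hadoop.
class WordCountSketch {

    static Map<String, Long> countWords(List<String> chunks) {
        return chunks.parallelStream()
                // map step: turn each chunk into a stream of lowercase words
                .flatMap(chunk -> Arrays.stream(chunk.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // reduce step: group identical words and count them
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("the ring the shire", "the fellowship", "ring bearer");
        System.out.println(countWords(chunks));
    }
}
```

Each chunk can be mapped independently, which is what lets the work be spread over many workers; only the final counting step needs to bring results for the same word together.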

HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.
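The chunk-and-replicate idea can be sketched as follows. This is a toy placement model under simple assumptions (round-robin placement over numbered nodes); real HDFS placement is more involved, and BlockPlacementSketch, place(), and the sizes in main() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy placement model; real HDFS placement is more involved than round-robin.
class BlockPlacementSketch {

    // Returns, for each block of the file, the indexes of the nodes holding a copy of it.
    static List<List<Integer>> place(long fileSize, long blockSize, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        long blockCount = (fileSize + blockSize - 1) / blockSize; // round up
        List<List<Integer>> placement = new ArrayList<>();
        for (long block = 0; block < blockCount; block++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Consecutive nodes, so the copies of one block never share a node.
                replicas.add((int) ((block + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Example values: a file of size 300, blocks of size 128, 4 nodes, 3 copies per block.
        System.out.println(place(300L, 128L, 4, 3));
    }
}
```

With three copies of every block on different nodes, losing a single node in this model still leaves two copies of each block available.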

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs comprise the Hadoop Ecosystem. Some programs make it easier to load data into the Hadoop cluster, while others make Hadoop easier to use.

The Hadoop Ecosystem includes:

Apache Hive
Apache Pig
Apache HBase
Apache Phoenix
Apache Spark
Apache ZooKeeper
Cloudera Impala
Apache Flume
Apache Sqoop
Apache Oozie

More Information:

Apache Hadoop

I ranked every Intro to Data Science course on the internet, based on thousands of data points

freeCodeCamp — Fri, 27 Dec 2019 03:24:00 +0000

By David Venturi

A year ago, I dropped out of one of the best computer science programs in Canada. I started creating my own data science master’s program using online resources. I realized that I could learn everything I needed through edX, Coursera, and Udacity instead. And I could learn it faster, more efficiently, and for a fraction of the cost.

I’m almost finished now. I’ve taken many data science-related courses and audited portions of many more. I know the options out there, and what skills are needed for learners preparing for a data analyst or data scientist role. A few months ago, I started creating a review-driven guide that recommends the best courses for each subject within data science.

For the first guide in the series, I recommended a few coding classes for the beginner data scientist. Then it was statistics and probability classes.

Now onto introductions to data science.

(Don’t worry if you’re unsure of what an intro to data science course entails. I’ll explain shortly.)

For this guide, I spent 10+ hours trying to identify every online intro to data science course offered as of January 2017, extracting key bits of information from their syllabi and reviews, and compiling their ratings. For this task, I turned to none other than the open source Class Central community and its database of thousands of course ratings and reviews.

_Class Central’s [homepage](https://www.class-central.com/)._

Since 2011, Class Central founder Dhawal Shah has kept a closer eye on online courses than arguably anyone else in the world. Dhawal personally helped me assemble this list of resources.

How we picked courses to consider

Each course must fit three criteria:

It must teach the data science process. More on that soon.
It must be on-demand or offered every few months.
It must be an interactive online course, so no books or read-only tutorials. Though these are viable ways to learn, this guide focuses on courses.

We believe we covered every notable course that fits the above criteria. Since there are seemingly hundreds of courses on Udemy, we chose to consider the most-reviewed and highest-rated ones only. There’s always a chance that we missed something, though. So please let us know in the comments section if we left a good course out.

How we evaluated courses

We compiled average rating and number of reviews from Class Central and other review sites to calculate a weighted average rating for each course. We read text reviews and used this feedback to supplement the numerical ratings.
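One plausible way to combine ratings from several review sites is to weight each site's average by its number of reviews. The sketch below assumes that definition; the guide does not spell out its exact formula, and the WeightedRating class and the numbers in main() are made up for illustration.

```java
// Assumes "weighted" means weighted by review count; the guide does not state its formula.
class WeightedRating {

    // ratings[i] is one site's average rating, reviewCounts[i] the number of reviews behind it.
    static double weightedAverage(double[] ratings, int[] reviewCounts) {
        if (ratings.length != reviewCounts.length) {
            throw new IllegalArgumentException("ratings and reviewCounts must have the same length");
        }
        double weightedSum = 0.0;
        long totalReviews = 0;
        for (int i = 0; i < ratings.length; i++) {
            weightedSum += ratings[i] * reviewCounts[i];
            totalReviews += reviewCounts[i];
        }
        if (totalReviews == 0) {
            throw new IllegalArgumentException("no reviews to average");
        }
        return weightedSum / totalReviews;
    }

    public static void main(String[] args) {
        // Made-up numbers: one site with 90 reviews averaging 4.6, another with 10 averaging 4.0.
        double[] ratings = {4.6, 4.0};
        int[] reviewCounts = {90, 10};
        System.out.printf("%.2f%n", weightedAverage(ratings, reviewCounts));
    }
}
```

Under this weighting, a site with many reviews pulls the combined rating toward its own average, so a handful of reviews elsewhere cannot swing the result much.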

We made subjective syllabus judgment calls based on two factors:

Coverage of the data science process. Does the course brush over or skip certain subjects? Does it cover certain subjects in too much detail? See the next section for what this process entails.
Usage of common data science tools. Is the course taught using popular programming languages like Python and/or R? These aren’t necessary, but helpful in most cases so slight preference is given to these courses.

Python and R are the two most popular programming languages used in data science.

What is the data science process?

What is data science? What does a data scientist do? These are the types of fundamental questions that an intro to data science course should answer. The following infographic from Harvard professors Joe Blitzstein and Hanspeter Pfister outlines a typical data science process, which will help us answer these questions.

_Visualization from [Opera Solutions](http://blog.operasolutions.com/bid/384900/what-is-data-science)._

Our goal with this introduction to data science course is to become familiar with the data science process. We don’t want too in-depth coverage of specific aspects of the process, hence the “intro to” portion of the title.

For each aspect, the ideal course explains key concepts within the framework of the process, introduces common tools, and provides a few examples (preferably hands-on).

We’re only looking for an introduction. This guide therefore won’t include full specializations or programs like Johns Hopkins University’s [Data Science Specialization](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=479491.1&type=10&tmpid=18061&u1=cc-medium-career-guide-intro-to-data-science &RD_PARM1=https%3A%2F%2Fwww.coursera.org%2Fspecializations%2Fjhu-data-science%2F) on Coursera or Udacity’s Data Analyst Nanodegree. These compilations of courses elude the purpose of this series: to find the best individual courses for each subject to comprise a data science education. The final three guides in this series of articles will cover each aspect of the data science process in detail.

Basic coding, stats, and probability experience required

Several courses listed below require basic programming, statistics, and probability experience. This requirement is understandable given that the new content is reasonably advanced, and that these subjects often have several courses dedicated to them.

This experience can be acquired through our recommendations in the first two articles (programming, statistics) in this Data Science Career Guide.

Our pick for the best intro to data science course is…

Data Science A-Z™: Real-Life Data Science Exercises Included (Kirill Eremenko/Udemy)

Kirill Eremenko’s Data Science A-Z™ on Udemy is the clear winner in terms of breadth and depth of coverage of the data science process of the 20+ courses that qualified. It has a 4.5-star weighted average rating over 3,071 reviews, which places it among the highest rated and most reviewed courses of the ones considered.

It outlines the full process and provides real-life examples. At 21 hours of content, it is a good length. Reviewers love the instructor’s delivery and the organization of the content. The price varies depending on Udemy discounts, which are frequent, so you may be able to purchase access for as little as $10.

Though it doesn’t check our “usage of common data science tools” box, the non-Python/R tool choices (gretl, Tableau, Excel) are used effectively in context. Eremenko mentions the following when explaining the gretl choice (gretl is a statistical software package), though it applies to all of the tools he uses (emphasis mine):

In gretl, we will be able to do the same modeling just like in R and Python but we won’t have to code. That’s the big deal here. Some of you may already know R very well, but some may not know it at all. My goal is to show you how to build a robust model and give you a framework that you can apply in any tool you choose. gretl will help us avoid getting bogged down in our coding.

One prominent reviewer noted the following:

Kirill is the best teacher I’ve found online. He uses real life examples and explains common problems so that you get a deeper understanding of the coursework. He also provides a lot of insight as to what it means to be a data scientist from working with insufficient data all the way to presenting your work to C-class management. I highly recommend this course for beginner students to intermediate data analysts!

A great Python-focused introduction

Intro to Data Analysis (Udacity)

Udacity’s Intro to Data Analysis is a relatively new offering that is part of Udacity’s popular Data Analyst Nanodegree. It covers the data science process clearly and cohesively using Python, though it lacks a bit in the modeling aspect. The estimated timeline is 36 hours (six hours per week over six weeks), though it is shorter in my experience. It has a 5-star weighted average rating over two reviews. It is free.

The videos are well-produced and the instructor (Caroline Buckey) is clear and personable. Lots of programming quizzes reinforce the concepts learned in the videos. Students will leave the course confident in their new and/or improved NumPy and Pandas skills (these are popular Python libraries). The final project, which is graded and reviewed in the Nanodegree but not in the free individual course, can be a nice addition to a portfolio.

An impressive offering with no review data

Data Science Fundamentals (Big Data University)

Data Science Fundamentals is a four-course series provided by IBM’s Big Data University. It includes courses titled Data Science 101, Data Science Methodology, Data Science Hands-on with Open Source Tools, and R 101.

It covers the full data science process and introduces Python, R, and several other open-source tools. The courses have tremendous production value. 13–18 hours of effort is estimated, depending on if you take the “R 101” course at the end, which isn’t necessary for the purpose of this guide. Unfortunately, it has no review data on the major review sites that we used for this analysis, so we can’t recommend it over the above two options yet. It is free.

The competition

Our #1 pick had a weighted average rating of 4.5 out of 5 stars over 3,068 reviews. Let’s look at the other alternatives, sorted by descending rating. Below you’ll find several R-focused courses, if you are set on an introduction in that language.

Python for Data Science and Machine Learning Bootcamp (Jose Portilla/Udemy): Full process coverage with a tool-heavy focus (Python). Less process-driven and more of a very detailed intro to Python. Amazing course, though not ideal for the scope of this guide. It, like Jose’s R course below, can double as both intros to Python/R and intros to data science. 21.5 hours of content. It has a 4.7-star weighted average rating over 1,644 reviews. Cost varies depending on Udemy discounts, which are frequent.
Data Science and Machine Learning Bootcamp with R (Jose Portilla/Udemy): Full process coverage with a tool-heavy focus (R). Less process-driven and more of a very detailed intro to R. Amazing course, though not ideal for the scope of this guide. It, like Jose’s Python course above, can double as both intros to Python/R and intros to data science. 18 hours of content. It has a 4.6-star weighted average rating over 847 reviews. Cost varies depending on Udemy discounts, which are frequent.

_Jose Portilla has two Data Science and Machine Learning Bootcamps on Udemy: one for [Python](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=323058.1&type=10&tmpid=14538&RD_PARM1=https%3A%2F%2Fwww.udemy.com%2Fpython-for-data-science-and-machine-learning-bootcamp%2F%26u1%3Dcc-medium-career-guide-intro-to-data-science) and one for R._

Data Science and Machine Learning with Python — Hands On! (Frank Kane/Udemy): Partial process coverage. Focuses on statistics and machine learning. Decent length (nine hours of content). Uses Python. It has a 4.5-star weighted average rating over 3,104 reviews. Cost varies depending on Udemy discounts, which are frequent.

Introduction to Data Science (Data Hawk Tech/Udemy): Full process coverage, though limited depth of coverage. Quite short (three hours of content). Briefly covers both R and Python. It has a 4.4-star weighted average rating over 62 reviews. Cost varies depending on Udemy discounts, which are frequent.

Applied Data Science: An Introduction (Syracuse University/Open Education by Blackboard): Full process coverage, though not evenly spread. Heavily focuses on basic statistics and R. Too applied and not enough process focus for the purpose of this guide. Online course experience feels disjointed. It has a 4.33-star weighted average rating over 6 reviews. Free.

Introduction To Data Science (Nina Zumel & John Mount/Udemy): Partial process coverage only, though good depth in the data preparation and modeling aspects. Okay length (six hours of content). Uses R. It has a 4.3-star weighted average rating over 101 reviews. Cost varies depending on Udemy discounts, which are frequent.

Applied Data Science with Python (V2 Maestros/Udemy): Full process coverage with good depth of coverage for each aspect of the process. Decent length (8.5 hours of content). Uses Python. It has a 4.3-star weighted average rating over 92 reviews. Cost varies depending on Udemy discounts, which are frequent.

_V2 Maestros has two versions of their “Applied Data Science” course: one for [Python](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=323058.1&type=10&tmpid=14538&RD_PARM1=https%3A%2F%2Fwww.udemy.com%2Fapplied-data-science-with-python%2F%26u1%3Dcc-medium-career-guide-intro-to-data-science) and one for R._

Want to be a Data Scientist? (V2 Maestros/Udemy): Full process coverage, though limited depth of coverage. Quite short (3 hours of content). Limited tool coverage. It has a 4.3-star weighted average rating over 790 reviews. Cost varies depending on Udemy discounts, which are frequent.

Data to Insight: an Introduction to Data Analysis (University of Auckland/FutureLearn): Breadth of coverage unclear. Claims to focus on data exploration, discovery, and visualization. Not offered on demand. 24 hours of content (three hours per week over eight weeks). It has a 4-star weighted average rating over 2 reviews. Free with paid certificate available.

Data Science Orientation (Microsoft/edX): Partial process coverage (lacks modeling aspect). Uses Excel, which makes sense given it is a Microsoft-branded course. 12–24 hours of content (two-four hours per week over six weeks). It has a 3.95-star weighted average rating over 40 reviews. Free with Verified Certificate available for $25.

Data Science Essentials (Microsoft/edX): Full process coverage with good depth of coverage for each aspect. Covers R, Python, and Azure ML (a Microsoft machine learning platform). Several 1-star reviews citing tool choice (Azure ML) and the instructor’s poor delivery. 18–24 hours of content (three-four hours per week over six weeks). It has a 3.81-star weighted average rating over 67 reviews. Free with Verified Certificate available for $49.

_The above two courses are from Microsoft’s [Professional Program Certificate in Data Science](http://www.awin1.com/awclick.php?gid=295463&mid=6798&awinaffid=301045&linkid=599979&clickref=&p=https%3A%2F%2Fwww.edx.org%2Fmicrosoft-professional-program-certficate-data-science) on edX._

Applied Data Science with R (V2 Maestros/Udemy): The R companion to V2 Maestros’ Python course above. Full process coverage with good depth of coverage for each aspect of the process. Decent length (11 hours of content). Uses R. It has a 3.8-star weighted average rating over 212 reviews. Cost varies depending on Udemy discounts, which are frequent.
Intro to Data Science (Udacity): Partial process coverage, though good depth for the topics covered. Lacks the exploration aspect, though Udacity has a great, full course on exploratory data analysis (EDA). Claims to be 48 hours in length (six hours per week over eight weeks), but is shorter in my experience. Some reviews think the set-up to the advanced content is lacking. Feels disorganized. Uses Python. It has a 3.61-star weighted average rating over 18 reviews. Free.
Introduction to Data Science in Python (University of Michigan/Coursera): Partial process coverage. No modeling or visualization, though courses #2 and #3 in the Applied Data Science with Python Specialization cover these aspects. Taking all three courses would be too in depth for the purpose of this guide. Uses Python. Four weeks in length. It has a 3.6-star weighted average rating over 15 reviews. Free and paid options available.

_The University of Michigan teaches the [Applied Data Science with Python Specialization](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=451430.1&type=10&tmpid=18061&u1=cc-medium-career-guide-intro-to-data-science&RD_PARM1=https%3A%2F%2Fwww.coursera.org%2Fspecializations%2Fdata-science-python) on Coursera._

Data-driven Decision Making (PwC/Coursera): Partial coverage (lacks modeling) with a business focus. Introduces many tools, including R, Python, Excel, SAS, and Tableau. Four weeks in length. It has a 3.5-star weighted average rating over 2 reviews. Free and paid options available.
A Crash Course in Data Science (Johns Hopkins University/Coursera): An extremely brief overview of the full process. Too brief for the purpose of this series. Two hours in length. It has a 3.4-star weighted average rating over 19 reviews. Free and paid options available.
[The Data Scientist’s Toolbox](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=451430.1&u1=cc-medium-career-guide-intro-to-data-science&type=10&tmpid=18061&RD_PARM1=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fdata-scientists-tools) (Johns Hopkins University/Coursera): An extremely brief overview of the full process. More of a set-up course for Johns Hopkins University’s [Data Science Specialization](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=479491.1&type=10&tmpid=18061&u1=cc-medium-career-guide-intro-to-data-science&RD_PARM1=https%3A%2F%2Fwww.coursera.org%2Fspecializations%2Fjhu-data-science%2F). Claims to have 4–16 hours of content (one-four hours per week over four weeks), though one reviewer noted it could be completed in two hours. It has a 3.22-star weighted average rating over 182 reviews. Free and paid options available.
[Data Management and Visualization](http://click.linksynergy.com/fs-bin/click?id=SAyYsTvLiGQ&subid=&offerid=451430.1&type=10&tmpid=18061&u1=cc-medium-career-guide-intro-to-data-science&RD_PARM1=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fdata-visualization) (Wesleyan University/Coursera): Partial process coverage (lacks modeling). Four weeks in length. Good production value. Uses Python and SAS. It has a 2.67-star weighted average rating over 6 reviews. Free and paid options available.

The following courses had no reviews as of January 2017.

CS109 Data Science (Harvard University): Full process coverage in great depth (probably too in depth for the purpose of this series). A full 12-week undergraduate course. Course navigation is difficult since the course is not designed for online consumption. Actual Harvard lectures are filmed. The above data science process infographic originates from this course. Uses Python. No review data. Free.

_The featured viz on Harvard CS109’s [homepage](http://cs109.github.io/2015/)._

Introduction to Data Analytics for Business (University of Colorado Boulder/Coursera): Partial process coverage (lacks modeling and visualization aspects) with a focus on business. The data science process is disguised as the “Information-Action Value chain” in their lectures. Four weeks in length. Describes several tools, though only covers SQL in any depth. No review data. Free and paid options available.
Introduction to Data Science (Lynda): Full process coverage, though limited depth of coverage. Quite short (three hours of content). Introduces both R and Python. No review data. Cost depends on Lynda subscription.

Wrapping it Up

This is the third of a six-piece series that covers the best online courses for launching yourself into the data science field. We covered programming in the first article and statistics and probability in the second article. The remainder of the series will cover other data science core competencies: data visualization and machine learning.

If you want to learn Data Science, start with one of these programming classes

If you want to learn Data Science, take a few of these statistics classes

The final piece will be a summary of those articles, plus the best online courses for other key topics such as data wrangling, databases, and even software engineering.

If you’re looking for a complete list of Data Science online courses, you can find them on Class Central’s Data Science and Big Data subject page.

If you enjoyed reading this, check out some of Class Central’s other pieces:

Here are 250 Ivy League courses you can take online right now for free
250 MOOCs from Brown, Columbia, Cornell, Dartmouth, Harvard, Penn, Princeton, and Yale.

The 50 best free online university courses according to data
When I launched Class Central back in November 2011, there were around 18 or so free online courses, and almost all of…

If you have suggestions for courses I missed, let me know in the responses!


This is a condensed version of my original article published on Class Central, where I’ve included further course descriptions, syllabi, and multiple reviews.

Powerful tools for Elasticsearch data visualization & analysis

freeCodeCamp — Tue, 13 Aug 2019 17:00:00 +0000

By Veronika Rovnik

The goal is to turn data into information, and information into insight.

―Carly Fiorina

About Kibana

Kibana is a piece of data visualization software that provides a browser-based interface for exploring Elasticsearch data and navigating the Elastic Stack — a collection of open-source products (Elasticsearch, Logstash, Beats, and others).

While Logstash and Beats deliver data to Elasticsearch, Kibana opens the window into the Elastic Stack, allowing you to track the health of your cluster, perform log and time-series analysis, detect anomalies in the data with unsupervised machine learning, discover relationships using graphs and, most importantly, extract insights from the Elasticsearch data with visualizations that can be combined together in a custom interactive dashboard.

Today I’d like to show you how to create a stunning dashboard and a tabular report based on the Elasticsearch data.

Roll up your sleeves and let’s start!

Where to start

The Home page is the place where everything starts.

Here you can decide which actions to take next. The available functionality can be divided into two logical sections:

Visualizing and exploring the data. Here you can create a new dashboard, visualization or presentation, build a machine learning model, analyze relationships in your data using graphs, and more.
Managing the Elastic Stack: configure your spaces, analyze logs of an application, configure security settings, etc.

We’ll focus on the process of creating visualizations and adding them to the dashboard.

How to create a dashboard in Kibana

Let me give you a feel for how easy it is to set up a rich dashboard and start reporting.

The first essential step to take is to import your data into Kibana. Multiple options for adding data are at your disposal — you can choose the one that works best for you:

For demonstration purposes, I’ve selected the sample data.

To design your first data visualizations and combine them into the dashboard, open the Visualize page. Here you can create, modify and view the existing visualizations.

What will strike you at once is the abundance of visualization types you can choose from.

After you’ve selected the one you need, choose an index pattern as a source so as to inform Kibana about your index. Let’s choose kibana_sample_data_flights and start creating a horizontal bar chart.

Now you can apply a metric aggregation for the Y-axis and a bucket aggregation for the X-axis. Here is a list of all available aggregations for charts.
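Kibana builds the underlying request for you, but it can help to see how a bucket aggregation and a metric aggregation fit together. The sketch below writes a search body in the Elasticsearch query DSL as a Python dictionary. The field names `Carrier` and `AvgTicketPrice` are assumptions about the flights sample data, and the aggregation names `per_carrier` and `avg_price` are made up for illustration; this is not the exact request Kibana sends.

```python
import json

# Assumed field names from the flights sample data set; adjust to your index.
BUCKET_FIELD = "Carrier"          # X-axis: one bucket (bar) per carrier
METRIC_FIELD = "AvgTicketPrice"   # Y-axis: metric computed inside each bucket

search_body = {
    "size": 0,  # return only aggregation results, no individual documents
    "aggs": {
        "per_carrier": {                       # hypothetical aggregation name
            "terms": {"field": BUCKET_FIELD, "size": 10},   # bucket aggregation
            "aggs": {
                "avg_price": {"avg": {"field": METRIC_FIELD}},  # metric aggregation
            },
        }
    },
}

print(json.dumps(search_body, indent=2))
```

The `terms` aggregation splits documents into buckets, and the nested `avg` aggregation computes one number per bucket, which is what each bar in the chart represents.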

Creating a horizontal bar chart in Kibana

Optionally, you can customize the colors of the visualization.

Filtering is another mighty feature of Elasticsearch and Kibana. It provides a way to visualize only a selected subset of documents.

See how you can apply filters to the fields based on logical conditions:

As you see, Kibana provides a straightforward way of filtering the data via the comfy interface. Along with that, you can choose how to filter the data — either by using the Kibana Query Language (a simplified query syntax) or Lucene.

To allow end-users to filter the data interactively, you can add control widgets — special elements of the dashboard which allow filtering the data simply by clicking them.

Another feature I’d like to highlight is the advanced filtering by dates and the ability to set time intervals for refreshing the data in the dashboard.

The good thing is that visualizations are reusable. After creating one, you can save it, add it to a dashboard at any time, and share it with your colleagues, provided they have access to your Kibana instance.

Saving a visualization in Kibana

After arranging all the visualization elements on a single page, you can export the final dashboard to PNG or PDF format. This is what makes the dashboards portable — it’s easy to share them across departments in no time.

Let’s look at an example of the dashboard you can create:

Interacting with the dashboard in Kibana

To my mind, the principal features which make each dashboard special are interactivity and expressiveness. With them, you can communicate business metrics efficiently.

Personal impression

The visualizations in Kibana perform the tasks they are designed for well. What is more, they are eye-catching and you can tailor them to your own design ideas. The entire process of creating a dashboard in Kibana is meant to be fast and efficient, and it is, thanks to Kibana’s user-friendly and intuitive interface.

On the other hand, I’ve felt that some functionality is missing here.

When working with data, one of the effective exploratory techniques you can apply is slicing and dicing your data before getting to know which aspects of the data to pay attention to. To my mind, the data table widget isn’t the best option — it presents the data in a flat table which doesn’t support a multi-dimensional view of the data. But playing with data should be done interactively and fast.
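To make the difference between a flat table and a multi-dimensional view concrete, here is a minimal Python sketch of slicing the same records two ways. The records and field names (`carrier`, `dest`, `price`) are invented for illustration and the helper `pivot_mean` is hypothetical; it is not how Kibana or any plugin implements pivoting.

```python
from collections import defaultdict

# Hypothetical flat records, the shape a plain data table would show.
flights = [
    {"carrier": "A", "dest": "Rome", "price": 100.0},
    {"carrier": "A", "dest": "Oslo", "price": 300.0},
    {"carrier": "B", "dest": "Rome", "price": 200.0},
    {"carrier": "A", "dest": "Rome", "price": 200.0},
]


def pivot_mean(rows, row_key, col_key, value_key):
    """Group rows by (row_key, col_key) and average value_key in each cell."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_key], r[col_key])].append(r[value_key])
    table = defaultdict(dict)
    for (rk, ck), values in cells.items():
        table[rk][ck] = sum(values) / len(values)
    return {rk: dict(cols) for rk, cols in table.items()}


# The same data, sliced along two different dimensions.
by_carrier = pivot_mean(flights, "carrier", "dest", "price")
by_dest = pivot_mean(flights, "dest", "carrier", "price")
print(by_carrier)
print(by_dest)
```

Swapping the row and column keys changes the slice without touching the data, which is the kind of interactive exploration a flat table makes awkward.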

And this is where a pivot table control comes into play. After searching for available solutions, my choice fell on one open-source plugin called Flexmonster. It handles connecting to the Elasticsearch index and allows creating tabular reports based on the data from its documents. Along with that, integrating with Kibana is smooth — the only thing required to get started is to install a plugin by running one line of code in the command line. You can find more details on GitHub. Before using it, I recommend making sure that your Kibana and Elasticsearch instances are of the same version.

Once you’ve set up the tool, you are ready to use all available features to find in-depth insights.

Features for analytics and reporting

Flexmonster Pivot provides fast access to the most essential reporting functionality. Its toolbar allows connecting to the data source, loading previously saved reports, and exporting reports to PDF, Excel, HTML, CSV, and images. I also managed to switch quickly between two different modes: the grid and the charts. Cell formatting options include conditional and number formatting. The field list deserves particular attention: here you can assign hierarchies to rows, columns, measures, and report filters. There is also a search input field, which is helpful if the index has a long list of fields.

One of the features I’d like to highlight is the ability to drag and drop the hierarchies right on the grid. Thereby, you can change the slice completely via the UI.

Another one is the drill-through feature, which shows which records lie behind the aggregated values.

Working with a pivot table

Let me show you how to create a report based on the Elasticsearch data:

While testing the tool, I’ve managed to aggregate and filter the data, sort the values on the grid and save the results to continue working with the report later. Plus, exporting works well — it’s easy to share the reports with teammates.

Bringing it all together

Today I’ve covered the benefits Kibana provides for visualization of Elasticsearch data. You’ve seen how dashboards can empower the analysis process.

To my mind, a pivot table is a good tool which enables you to benefit from exploring data before teasing out the answers to complex questions.

Flexmonster nicely complements the available functionality of Kibana - the reports you are creating with it are insightful, customizable and can be easily shared across departments.

Working together, both tools have all the potential to boost your storytelling.

I encourage you to give such a combination a try.

What’s next?

How to protect your information with Local Sheriff

freeCodeCamp — Wed, 01 May 2019 18:12:00 +0000

By Konark Modi

Watching them watching us

What is a TellTale URL ?

A URL is the most commonly tracked piece of information. The innocent choice to structure a URL based on page content can make it easier to learn a user’s browsing history, address, health information or other sensitive details. Such URLs either contain sensitive information themselves or lead to a page which contains sensitive information.

We call such URLs TellTale URLs.

Let’s take a look at some examples of such URLs.

EXAMPLE #1:

Website: donate.mozilla.org (Fixed)

After you have finished the payment process on donate.mozilla.org, you are redirected to a “thank you” page. If you look carefully at the URL shown in the below screenshot, it contains some private information like email, country, amount, payment method.

PII in URL on donate.mozilla.org

Now because this page loads some resources from third-parties and the URL is not sanitised, the same information is also shared with those third-parties via referrer and as a value inside payload sent to the third-parties.

URL with PII shared when fonts being loaded from Google Apis.

In this particular case, there were 7 third-parties with whom this information was shared.

Mozilla was prompt to fix these issues. More details can be found here: https://bugzilla.mozilla.org/show_bug.cgi?id=1516699

EXAMPLE #2:

Website: trainline.eu, JustFly.com (Last checked: Aug’18)

Once you finish a purchase like train tickets / flight tickets, you receive an email which has a link to manage your booking. Most of the time, when you click on the link, you are shown the booking details - without having to enter any more details like booking code, username/password.

This means that the URL itself contains some token which is unique to the user and provides access to the user’s booking.

It so happens that these URLs are also shared with third-parties, giving these third-parties highly sensitive data and access to your bookings.

JustFly.com leaking bookingID to 10 third-party domains

trainline.eu sharing booking token with 17 third-party domains.

URL with token being shared via Ref and inside the payload.

EXAMPLE #3:

Website: foodora.de, grubhub.com (Last checked: Aug’18)

One of the pre-requisites to order food online is entering the address where you want the food to be delivered.

Some popular food delivery websites convert the address to precise latitude-longitude values and add them to the URL.

The URL is also shared with third-parties, potentially leaking where the user lives.

Foodora leaking address details to 15 third-party domains.

To be clear, it’s not just these websites that suffer from such leaks. This problem exists everywhere - it’s a default situation, not a rarity. We’ve seen it with Lufthansa, Spotify, Flixbus, Emirates, and even with medical providers.

Risks of TellTale URLs:

Websites are carelessly leaking sensitive information to a plethora of third-parties.
Most often without users’ consent.
More dangerously: Most websites are not aware of these leaks while implementing third-party services.

Are these problems hard to fix?

As a Software Engineer who has worked for some of the largest eCommerce companies, I understand the need to use third party services for optimising and enhancing not only the Digital Product but also how users interact with the product.

It is not the usage of third party services that is of concern in this case but the implementation of these services. Owners should always have control of their website and of what the website shares with third party services.

It is this control that needs to be exercised to limit the leakage of User information.

It is not a mammoth task. It is just a matter of commitment to preserving the basic right to privacy.

For example:

Private pages should have noindex meta tags.
Limit the presence of third-party services on private pages.
Set a Referrer-Policy on pages with sensitive data.
Implement CSP (Content Security Policy) and SRI (Subresource Integrity). Even with a huge footprint of third-party services, CSP and SRI are not enabled on the majority of websites.
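As a rough illustration of the measures listed above, the Python sketch below shows response headers a private page could send and a helper that drops sensitive query parameters before a URL is passed on. The parameter names in `SENSITIVE_PARAMS`, the header values chosen, and both function names are assumptions made for this example; `X-Robots-Tag` is the HTTP header counterpart of a noindex meta tag. Treat it as a starting point, not a complete policy.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query parameter names treated as sensitive.
SENSITIVE_PARAMS = {"email", "amount", "country", "token"}


def private_page_headers():
    """Headers a private page could send; the values are one possible choice."""
    return {
        "Referrer-Policy": "no-referrer",        # do not send the URL as referrer
        "X-Robots-Tag": "noindex",               # header form of the noindex tag
        "Content-Security-Policy": "default-src 'self'",  # only same-origin resources
    }


def strip_sensitive(url):
    """Return the URL with sensitive query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


clean = strip_sensitive(
    "https://example.org/thank-you?email=a%40b.org&amount=10&lang=en"
)
print(clean)
```

A strict `default-src 'self'` policy would block every third-party resource, so a real site would need to list the services it deliberately allows.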

Introducing Local Sheriff:

Given that such information leakage is dangerous to both users and organisations, why is it such a widespread problem?

One big reason that these issues exist is lack of awareness.

A good starting point for websites is to see what information is being leaked, or to detect the presence of TellTale URLs.

But in order to find out if the same is happening with the websites you maintain or visit, you need to learn some tools to inspect network traffic, understand first-party — third-party relationship and then make sure you have these tools open during the transaction process.

To help bridge this gap, we wanted to build a tool with the following guidelines:

Easy to install.
Monitors and stores all data being exchanged between websites and third-parties, locally on the user’s machine.
Helps users identify which companies are tracking them on the internet.
Interface to search information being leaked to third-parties.

Given the above guidelines, a browser extension seemed like a reasonable choice. After you install Local-Sheriff, in the background:

Using the WebRequest API, it monitors interactions between the first party and third parties.
Classifies which URLs are first-party and which are third-party.
Ships with a copy of the database from WhoTracksMe, to map which domain belongs to which company.
Provides an interface where you can search for values that you think are private to you and see which websites leak them to which third parties, e.g. name, email, address, date of birth, cookie.
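The classification and search steps can be sketched in a few lines of Python. This is not Local Sheriff's actual code (the extension runs in the browser on top of the WebRequest API); the function names, the sample domains, and the naive "last two hostname labels" rule for deciding which site a URL belongs to are all assumptions for illustration. A real implementation would need the Public Suffix List to handle domains such as `example.co.uk`.

```python
from urllib.parse import urlsplit, unquote


def site_of(url):
    """Naive site of a URL: the last two labels of its hostname (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_third_party(page_url, request_url):
    """A request is third-party when it goes to a different site than the page."""
    return site_of(page_url) != site_of(request_url)


def find_leaks(page_url, requests, needle):
    """Return the third-party sites whose request URL or referrer contains needle.

    requests is an iterable of (request_url, referrer) pairs.
    """
    leaks = set()
    for request_url, referrer in requests:
        if not is_third_party(page_url, request_url):
            continue
        if needle in unquote(request_url) or needle in unquote(referrer):
            leaks.add(site_of(request_url))
    return sorted(leaks)


# Made-up traffic for a page whose URL carries an email address.
page = "https://donate.example.org/thanks?email=jane%40example.com"
observed = [
    ("https://fonts.tracker-a.com/css", page),           # leaked via referrer
    ("https://donate.example.org/app.js", page),         # first-party, ignored
    ("https://px.tracker-b.net/collect?u=jane%40example.com", ""),  # in payload
    ("https://cdn.tracker-c.io/lib.js", ""),             # nothing leaked
]
leaked_to = find_leaks(page, observed, "jane@example.com")
print(leaked_to)
```

Mapping each leaking domain to the company behind it would be a further lookup step, which is the role the WhoTracksMe database plays in the extension.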

Revisiting EXAMPLE #1

Website: donate.mozilla.org

The user has Local-Sheriff installed and donates to mozilla.org.

PII in URL on donate.mozilla.org

Clicks on the icon to open search interface.

Local sheriff icon.

Enters emailID used on the website donate.mozilla.org.

Search interface Local-Sheriff

It can be seen that email address used at the time of donation was shared with ~7 third-party domains.

You can try it yourself by installing it:

Firefox: https://addons.mozilla.org/de/firefox/addon/local-sheriff/
Chrome: https://chrome.google.com/webstore/detail/local-sheriff/ckmkiloofgfalfdhcfdllaaacpjjejeg

Resources:

More details: https://www.ghacks.net/2018/08/12/local-sheriff-reveals-if-sites-leak-personal-information-with-third-parties/
Source Code: https://github.com/cliqz-oss/local-sheriff
Conferences: Defcon 26 Demo Labs, FOSDEM 2019

Thanks for reading and sharing ! :)

If you liked this story, feel free to clap a few times (up to 50 times, seriously).

Happy Hacking !

- Konark Modi

Credits:

_Special thanks to Remi and Pallavi for reviewing this post :)_
Title “Watching them watching us” comes from a joint talk between Local Sheriff and Trackula at FOSDEM 2019.

How serverless stream processing will make decision-making easier

freeCodeCamp — Tue, 09 Apr 2019 19:08:22 +0000

By Chamath Kirinde

About a year ago, we joined the digital transformation with the first ever cloud-based IDE for serverless development. It was no cakewalk: we’ve been burning the candle at both ends trying to cover the majority of AWS’s serverless stack. Working with AWS Kinesis made me realise the beauty of serverless, and my earlier exposure to streaming data with Kafka spared me some time going through the rudiments.

_Rational decision making: Photo by [Raj Eiamworakul](https://unsplash.com/photos/o4c2zoVhjSw?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on Unsplash_

TL;DR

Did you ever wonder…

How “Google Search” suggests things to you when you’re half-typing your query?

How “Cheapest Airlines” start to appear everywhere after you searched for a country?

How online role-playing games adjust according to your decisions?

How gambling sites predict the odds of a live game?

Why were Curry and Thompson benched while Portland was handing the Warriors their worst loss in a 73-win NBA season?

Google’s (sometimes so annoying) query autocomplete

The power of real-time streaming data analytics is astonishing indeed. Now, since serverless technology is gaining some momentum, maybe you won’t have to worry about taking risky decisions on your own at all. This post covers the basics of “Serverless Streaming Data Processing” and how it will be an influential component of our decision making in the future.

Data, Data Everywhere

Life is an endless series of events. The technology around us has made it a stream of digital actions emitting streams of data. If you turn back and investigate your life very carefully, you’ll see the never-ending string of data you have generated with your every digital action. It could be a lot to digest at first, but let’s explore some scenarios and try to find what applies to you and me.

Online banking and convenient e-commerce purchasing capabilities
Ride-sharing, modern-day traveling and transportation
Industrial equipment and agricultural use cases like monitored machinery, autonomous tractors, and precision farming
Automated power generation and smart grids, Zero-net Buildings, Smart metering
Real-estate property recommendations based on geo-location, predictive maintenance
Online dating and matchmaking relying on complex personality patterns and attribute distribution

Rational Romance: Will you be my Valentine?

Financial trading according to the real-time changes in the stock market, analytical risk management
Movies, songs and other digital media with a better experience depending on the demographics, preference, and emotions
Improved web and mobile application experience based on usage
Dynamic and personalised experiences in online gaming
Enhanced social media experiences with hyper-personalisation and predictive analytics
Telemetry from connected devices, or remote data centres from geospatial or spatial services like weather, resource assessment
Sports analytics to enhance the players’ performance reducing health risks

Welcome Analytics

All these events produce data — lots of it. Due to the frequency of this data emission, it has become an increasing burden to the digital space.

What is Streaming Data?

In a survey conducted last year about data, it’s estimated that with the current pace of data generation,

1.7 MB of data will be created every second for every person on earth by 2020

Data that is poured out continuously by a gazillion sources every second has become a fact we can’t just ignore. The Big Data discipline was an eye-opener for the tech world, showing how to put this once irritating data to use. This same irksome data is collected and analysed by a new species, namely data scientists. Because they are continuous and often small (on the order of kilobytes), these data flows, usually referred to by the moniker streaming data, are collected simultaneously as records and sent in for further processing.

From stream processing to smart decisions

A streaming data processing structure is usually comprised of two layers — a storage layer and a processing layer. The former is responsible for ordering large streams of records and facilitating persistence and accessibility at high speeds. The processing layer takes care of data consumption, executing computations, and notifying the storage layer to get rid of already processed records. Data processing is done for each record incrementally or by matching over sliding time windows. Processed data is then subjected to streaming analytics operations and the derived information is used to make context-based decisions.
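To illustrate incremental processing over a sliding time window, here is a small Python sketch that keeps a moving average of the values seen in the last few seconds. The class name, the window size, and the sample records are made up for this example, and it assumes records arrive in timestamp order; managed services handle ordering, scaling, and persistence that this sketch ignores.

```python
from collections import deque


class SlidingWindowAverage:
    """Moving average over records from the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.records = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Consume one record and return the updated average for the window."""
        self.records.append((timestamp, value))
        self.total += value
        # Evict records that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.records and self.records[0][0] <= cutoff:
            _, old_value = self.records.popleft()
            self.total -= old_value
        return self.total / len(self.records)


window = SlidingWindowAverage(window_seconds=10)
stream = [(0, 10.0), (5, 20.0), (12, 30.0), (14, 40.0)]  # (timestamp, value)
averages = [window.add(ts, value) for ts, value in stream]
print(averages)
```

Each record is processed once as it arrives, and old records are discarded as soon as they leave the window, which mirrors the processing layer telling the storage layer which records it no longer needs.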

For instance, companies can track public sentiment changes on their products by analysing social media streams continuously. The world’s most influential nations can intervene in decisive events like presidential elections in other powerful countries. And mobile apps can offer personalised recommendations for products based on geo-location of devices and user emotions.

Poor data analytics — Poor decisions

Most applications collect a portion of their data at the outset to produce simple summary reports and take simple decisions such as triggering alarms or calculating a moving average value. As time flies by, these become more and more sophisticated, and companies might want to access profound insights to perform intricate activities in turn with the aid of Machine Learning algorithms and data analysis techniques.

The continual growth of data has made data scientists work around the clock to come up with trailblazing solutions to utilise as much data as possible to fabricate alternate futures with better decisions.

Service Facilitators

Adoption of the ideal cloud provider to fit organisational requirements can be overwhelming. However, all the major cloud service providers are equipped with competitive options to accommodate stream processing due to its ubiquitous impact. Here’s a list of commonly used serverless services to bolster enterprise-grade applications, highly relying on streaming data.

Infographic: Serverless Stream Processing Components

Live Examples

Many companies use insights from stream analytics to enhance the visibility of their businesses. This allows them to deliver customers a personalised experience. Additionally, near real-time transparency gives these firms the flexibility to promptly address emergencies.

The emerging serverless architecture has driven all the leading cloud service platforms to present complementary solutions. Stream processing was made available for serverless application development with fully-managed, cloud-based services for real-time data processing over large Distributed Data Streams.

1. Hyper-personalised Television

_Netflix: Photo by [Jens Kreuter](https://unsplash.com/@jenskreuter?utm_source=medium&utm_medium=referral) on Unsplash_

… using Amazon Kinesis Streams. As a system processing billions of traffic flows every day, this eliminates plenty of complexity for them because of the absence of a database in the architecture. Due to the high scalability and lightning speed, they can discover and address issues as they arise, and monitor the application on a massive scale.

With the upgraded recommendation algorithm, video transcoding, and licensing popular media, this subsequently grants a seamless experience to subscribers. With the exponential growth of subscribers, the company’s responsibilities increase by the day. However, nothing seems to be a problem for Netflix since they are considered to have a sound decision-making model.

2. Improving the decisions of the decision makers

As a leading source of integrated and intelligent information for businesses and professionals, Thomson Reuters provide their services to decision makers in a wide range of domains like financing and risk, science, legal, technology. This company built an in-house analytics engine to take full control of data and moved to AWS because they were familiar with its capabilities and scale.

The new real-time pipeline, attached to an Amazon Kinesis stream, delivers a more perceptive customer experience, with accurate economic forecasts and financial trends for beneficiaries, including a range of government activities.

3. Unicorn: a solution to traffic congestion

_Unicorn: Photo by [Boudewijn Huysmans](https://unsplash.com/@boudewijn_huysmans?utm_source=medium&utm_medium=referral) on Unsplash_

… AWS, Google, Microsoft Azure, and IBM Cloud are exploited by companies to make their clients’ lives better and more secure.

Limitations of Serverless Stream Processing

Serverless stream processing is increasingly becoming a vital part of decision-making engines. However, with the current set of features, it’s not the ideal solution for some scenarios. Implementing real-time analytics for sliding windows and temporal event patterns is not a course for the faint-hearted.

The best way to assimilate never-ending data of this magnitude is through real-time dashboards, which require additional data organisation and persistence. These manoeuvres introduce undesirable latency and data management issues. However, technology is evolving and trying to catch up, using advanced cloud data management techniques to produce materialised views.

Security: A major concern

Stream Processing often uses a time-based or record-based window to be processed in contrast to the batch-based processing, which can lead to challenges in use cases that require query re-execution.

Nowadays, application requirements grow beyond aggregated analytics. Increasing the window size seems to be an appropriate temporary solution, but it develops another intractable problem — Memory Management. Modern-day solutions usually provide advanced memory management and scheduling techniques to overcome this, but the world will see further improvements.

Conclusion

All in all, it’s apparent that serverless stream processing has been playing a prominent role around us without us even knowing. With the power of serverless data stream processing, applications can evolve from traditional batch processing to real-time analytics. The revelation of profound insights will result in effective decision making without having to manage infrastructure.

Even today, many organizations practise orthodox decision-making strategies based on the analytics derived using the big data clusters that belonged to THE PAST. New horizons of serverless and real-time data processing are now equipped with the power to make effective decisions and create a more productive, relevant, and most importantly secure world around you.

Will serverless stream processing make emotional decision making obsolete and computerized rational judgement the norm?

What do you think?

What should you do now?

Clap. Appreciate and let others find this article.
Comment. Share your thoughts.
Follow me. Chamath Kirinde to receive updates on articles like this.
Keep in touch. LinkedIn, Twitter, Chummy Charms
Think Serverless. SLAppForge

Originally published at chummycharms.blogspot.com.

For the love of SQL: why you should learn it and how it’ll help you out

freeCodeCamp — Wed, 03 Apr 2019 21:22:53 +0000

By Matthew Oldham

I recently read a great article by the esteemed @craigkerstiens describing why he feels SQL is such a valuable skill for developers. This topic really resonated with me. It lined up well with notes I’d already started sketching out for a similar article about developing a love for data.

The more I fleshed out my topic, however, the more I realized that many of my points and examples seemed to be centering around SQL. Reading Craig’s article convinced me to redirect my focus and talk more about why I personally have such an affinity for SQL.

In short, Craig makes the following assertions about SQL (and I quote):

It is valuable across different roles and disciplines

Learning it once doesn’t really require re-learning

You seem like a superhero. You seem extra powerful when you know it because of the amount of people that aren’t fluent

I’ve found all these points to be true in my own experience, and I’d like to recast and expand on each one.

The Versatility Effect

The SQL skillset has proven to be an extremely valuable asset in my career. In fact, I believe SQL to be the single most powerful and versatile “programming” language I know.

I have been able to use SQL to solve many problems, and it’s my go-to tool anytime I face a new challenge. In fact, I keep an instance of PostgreSQL running on my laptop so I can quickly hop into my favorite SQL GUI whenever I need to test something out.

Here are just some of the cool things I’ve been able to do with SQL:

SQL FTW!

Are you having a hard time believing that list above? I promise you there’s not an ounce of exaggeration in it. Now, are there some items there that were dependent upon other capabilities of the RDBMS I was using at the time? Sure. Regardless, each of those solutions was implemented in SQL.

The Bicycle Effect

While Structured Query Language has certainly undergone changes and has been expanded over the years, I agree with Craig that the fundamentals have not changed. The overall level of volatility compared to other languages has been relatively low.

I would argue that this fact only strengthens the argument that one should invest the time to learn SQL. You can be confident that you’ll get a lot of mileage out of such an investment without having to look up the latest conventions the next time you need to use it.

So, learn SQL! Here are some great places to get started:

SQL Tutorial — Essential SQL For The Beginners
This SQL tutorial helps you get started with SQL quickly and effectively through many practical examples. After the… (www.sqltutorial.org)

There are even interactive tutorials:

SQLBolt — Learn SQL — Introduction to SQL
SQLBolt provides a set of interactive lessons and exercises to help you learn SQL (sqlbolt.com)

There are also some versatile sandboxes out there that allow you to run SQL in various dialects without having to install anything, for example SQL Fiddle or DB Fiddle.

The Superhero Effect

I remember a colleague once saying he broke into a cold sweat every time he had to write SQL.

It sounds exaggerated, but SQL can be intimidating to anyone who properly regards the database as the sensitive asset it is and is not familiar with how to safely interact with it. SQL, being one of the adults in the room, also doesn’t get as much attention as other shiny new programming languages. That means that it remains a less common skillset among contemporary and emerging developers.

As such, having a solid understanding of SQL and the insight to see the set-based facets of a given problem or challenge provides the opportunity to be a hero.

One of my favorite personal experiences was helping a customer debug a slow and complex SAS program. The goal of this program was to extract a list of state transitions from an audit table in order to measure the mean duration a widget spent in each phase of a given business workflow. The implementation of these calculations was complex and required building multiple local data sets.

I remember reverse engineering this program and realizing that I could solve the problem much more easily using a single SQL query and the magical LAG window function.

The customer was simply blown away.

Not just because he learned about the LAG function, but because he saw just how powerful SQL can be.
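As an illustration of the kind of query described above, here is a small sketch that uses LAG to measure the mean time spent in each state. The audit table, its columns, and the sample rows are hypothetical, and it runs on SQLite (window functions need SQLite 3.25 or newer) rather than on the customer's actual system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical audit table: one row per state change of a widget
conn.execute("CREATE TABLE audit (widget_id INTEGER, state TEXT, changed_at INTEGER)")
conn.executemany(
    "INSERT INTO audit VALUES (?, ?, ?)",
    [
        (1, "new", 0), (1, "review", 10), (1, "done", 40),
        (2, "new", 5), (2, "review", 25), (2, "done", 35),
    ],
)

query = """
WITH transitions AS (
    SELECT
        -- state the widget was in before this row
        LAG(state) OVER (PARTITION BY widget_id ORDER BY changed_at) AS prev_state,
        -- time spent in that previous state
        changed_at
          - LAG(changed_at) OVER (PARTITION BY widget_id ORDER BY changed_at) AS duration
    FROM audit
)
SELECT prev_state, AVG(duration) AS mean_duration
FROM transitions
WHERE prev_state IS NOT NULL
GROUP BY prev_state
ORDER BY prev_state
"""
mean_durations = conn.execute(query).fetchall()
print(mean_durations)
```

Each row is compared with the previous row of the same widget, so no intermediate data sets are needed.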

An even more dramatic example was during a large data warehouse migration where I replaced an entire Java program (that took more than 20 minutes to complete!) with a single SQL query that ran in seconds. The original author of the program was shocked! That was a really good day.

So, I encourage you to dive into SQL today and broaden your skillset with one of the most versatile tools I’ve had the pleasure of working with. If you already know SQL and agree, or if I’ve convinced you to give it a try, please consider leaving me a comment.

How to import Google BigQuery tables to AWS Athena

freeCodeCamp — Mon, 11 Mar 2019 18:55:49 +0000

By Aftab Ansari

As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is something you might have already faced or will face at some point.

In this post, I will show how I imported Google BigQuery tables to AWS Athena. If you only need a list of tools to be used with some very high-level guidance, you can quickly look at this post that shows how to import a single BigQuery table into Hive metastore. In this article, I will show one way of importing a full BigQuery project (multiple tables) into both Hive and Athena metastore.

There are a few import limitations: for example, when you import data from partitioned tables, you cannot import individual partitions. Please check the limitations before starting the process.

In order to successfully import Google BigQuery tables to Athena, I performed the steps shown below. I used AVRO format when dumping data and the schemas from Google BigQuery and loading them into AWS Athena.

Step 1. Dump BigQuery data to Google Cloud Storage

Step 2. Transfer data from Google Cloud Storage to AWS S3

Step 3. Extract AVRO schema from AVRO files stored in S3

Step 4. Create Hive tables on top of AVRO data, use schema from Step 3

Step 5. Extract Hive table definition from Hive tables

Step 6. Use the output of Steps 3 and 5 to create Athena tables

So why do I have to create Hive tables in the first place although the end goal is to have data in Athena? This is because:

Athena does not support using avro.schema.url to specify table schema.
Athena requires you to explicitly specify field names and their data types in CREATE statement.
Athena also requires the AVRO schema in JSON format under avro.schema.literal.
You can check this AWS doc for more details.

So, Hive tables can be created directly by pointing to AVRO schema files stored on S3. But to have the same in Athena, columns and schema are required in the CREATE TABLE statement.

One way to overcome this is to first extract the schema from the AVRO data to be supplied as avro.schema.literal. Second, for the field names and data types required for the CREATE statement, create Hive tables based on the AVRO schemas stored in S3 and use SHOW CREATE TABLE to dump/export the Hive table definitions, which contain the field names and data types. Finally, create the Athena tables by combining the extracted AVRO schema and the Hive table definition. I will discuss each of these in detail in subsequent sections.

For the demonstration, I have the following BigQuery tables that I would like to import to Athena.

So, let’s get started!

Step 1. Dump BigQuery data to Google Cloud Storage

It is possible to dump BigQuery data in Google storage with the help of the Google cloud UI. However, this can become a tedious task if you have to dump several tables manually.

To tackle this problem, I used Google Cloud Shell. In Cloud Shell, you can combine regular shell scripting with BigQuery commands and dump multiple tables relatively fast. You can activate Cloud Shell as shown in the picture below.

From Cloud Shell, the following operation provides the BigQuery extract commands to dump each table of the “backend” dataset to Google Cloud Storage.

bq ls backend | cut -d ' ' -f3 | tail -n+3 | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY .@ gs://@

In my case it prints:

aftab_ansari@cloudshell:~ (project-ark-archive)$ bq ls backend | cut -d ' ' -f3 | tail -n+3 | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs://plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro

bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_daily_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-*.avro

bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_detailed_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-*.avro

bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-*.avro

Please note that --compression SNAPPY is important, because large uncompressed files can cause the gsutil command (used to transfer data to AWS S3) to get stuck. The wildcard (*) makes bq extract split bigger tables (>1GB) into multiple output files. Running those commands in Cloud Shell copies the data to the following Google Cloud Storage directory:

gs://plr_data_transfer_temp/bigquery_data/backend/table_name/table_name-*.avro
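The command generation above can also be sketched in Python. This is only an illustration: the helper name bq_extract_command is hypothetical, the table list is hard-coded instead of being read from bq ls, and the function builds the command strings without running them.

```python
def bq_extract_command(dataset, table, bucket_prefix, location="US"):
    """Build one `bq extract` command line for a table (does not run it)."""
    # One folder per table; the wildcard lets bq split large tables into several files
    destination = f"{bucket_prefix}/{dataset}/{table}/{table}-*.avro"
    return (
        f"bq --location={location} extract --destination_format AVRO "
        f"--compression SNAPPY {dataset}.{table} {destination}"
    )


# Table names taken from the output shown above
tables = ["sessions_daily_phase2", "sessions_detailed_phase2", "sessions_phase2"]
commands = [
    bq_extract_command("backend", table, "gs://plr_data_transfer_temp/bigquery_data")
    for table in tables
]
for command in commands:
    print(command)
```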

Let’s do ls to see the dumped AVRO file.

aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil ls gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2

gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro

I can also browse from the UI and find the data like shown below.

Step 2. Transfer data from Google Cloud Storage to AWS S3

Transferring data from Google Storage to AWS S3 is straightforward. First, set up your S3 credentials. On Cloud Shell, create or edit the .boto file (vi ~/.boto) and add these:

[Credentials]
aws_access_key_id = 
aws_secret_access_key = 
[s3]
host = s3.us-east-1.amazonaws.com
use-sigv4 = True

Please note: s3.us-east-1.amazonaws.com has to correspond with the region where the bucket is.

After setting up the credentials, execute gsutil to transfer data from Google Storage to AWS S3. For example:

gsutil rsync -r gs://your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schema

Add the -n flag to the command above to display the operations that would be performed using the specified command without actually running them.

In this case, to transfer the data to S3, I used the following:

aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil rsync -r gs://plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend

Building synchronization state…
Starting synchronization…
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-000000000000.avro [Content-Type=application/octet-stream]...
| [3 files][987.8 KiB/987.8 KiB]
Operation completed over 3 objects/987.8 KiB.

Let’s check if the data got transferred to S3. I verified that from my local machine:

aws s3 ls --recursive  s3://my-bucket/bq_data/backend --profile smoke | awk '{print $4}'

bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro
bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro

Step 3. Extract AVRO schema from AVRO files stored in S3

To extract the schema from AVRO data, you can use the Apache avro-tools JAR with the getschema parameter. The benefit of using this tool is that it returns the schema in a form you can use directly in the WITH SERDEPROPERTIES statement when creating Athena tables.


You may have noticed that I got only one .avro file per table when dumping the BigQuery tables. This was because of the small data volume; otherwise, I would have gotten several files per table. Regardless of whether there are single or multiple files per table, it’s enough to run avro-tools against any single file per table to extract that table’s schema.
I downloaded the latest version of avro-tools which is avro-tools-1.8.2.jar. I first copied all .avro files from s3 to local disk:
[hadoop@ip-10-0-10-205 tmpAftab]$ aws s3 cp s3://my-bucket/bq_data/backend/ bq_data/backend/ --recursive
download: s3://my-bucket/bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro to bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
download: s3://my-bucket/bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro to bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro
download: s3://my-bucket/bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro to bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro
The avro-tools command should look like java -jar avro-tools-1.8.2.jar getschema your_data.avro > schema_file.avsc. This can become tedious if you have several AVRO files (in reality, I’ve done this for a project with many more tables). Again, I used a shell script to generate the commands. I created extract_schema_avro.sh with the following content:
schema_avro=(bq_data/backend/*)
for i in ${!schema_avro[*]}; do
  input_file=$(find ${schema_avro[$i]} -type f)
  output_file=$(ls -l ${schema_avro[$i]} | tail -n+2 \
    | awk -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9);
    print $9 }')
  commands=$(
    echo "java -jar avro-tools-1.8.2.jar getschema " \
      $input_file" > bq_data/schemas/backend/avro/"$output_file
  )
  echo $commands
done
Running extract_schema_avro.sh provides the following:
[hadoop@ip-10-0-10-205 tmpAftab]$ sh extract_schema_avro.sh
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
Executing the above commands copies the extracted schema under bq_data/schemas/backend/avro/ :
[hadoop@ip-10-0-10-205 tmpAftab]$ ls -l bq_data/schemas/backend/avro/* | awk '{print $9}'
bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
Let’s also check what’s inside an .avsc file.
[hadoop@ip-10-0-10-205 tmpAftab]$ cat bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
{"type" : "record","name" : "Root","fields" : [ {"name" : "uid","type" : [ "null", "string" ]}, {"name" : "platform","type" : [ "null", "string" ]}, {"name" : "version","type" : [ "null", "string" ]}, {"name" : "country","type" : [ "null", "string" ]}, {"name" : "sessions","type" : [ "null", "long" ]}, {"name" : "active_days","type" : [ "null", "long" ]}, {"name" : "session_time_minutes","type" : [ "null", "double" ]} ]}
As you can see, the schema is in the form that can be directly used in Athena WITH SERDEPROPERTIES. But before Athena, I used the AVRO schemas to create Hive tables. If you want to avoid Hive table creation, you can read the .avsc files to extract field names and data types, but then you have to map the data types yourself from AVRO format to Athena table creation DDL.
The complexity of the mapping task depends on how complex the data types are in your tables. For simplicity (and to cover most simple to complex data types), I let Hive do the mapping for me. So I created the tables first in Hive metastore. Then I used SHOW CREATE TABLE to get the field names and data types part of the DDL.
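For tables that only contain simple types, the manual mapping mentioned above could look roughly like the sketch below. It is not the approach used in this article (Hive does the mapping here). The helper name hive_columns is hypothetical, and the type table covers only primitive Avro types and nullable unions; the string, long, and double rows match the Hive output shown in this article, while the remaining rows are assumptions.

```python
import json

# Primitive Avro types mapped to Hive column types.
# string/long/double match this article's Hive DDL; the other rows are assumptions.
AVRO_TO_HIVE = {
    "string": "string",
    "long": "bigint",
    "double": "double",
    "int": "int",
    "float": "float",
    "boolean": "boolean",
    "bytes": "binary",
}


def hive_columns(avsc_text):
    """Return (field name, Hive type) pairs for a flat Avro record schema."""
    schema = json.loads(avsc_text)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # Nullable fields are written as a union such as ["null", "string"]
        if isinstance(avro_type, list):
            non_null = [t for t in avro_type if t != "null"]
            if len(non_null) != 1:
                raise ValueError(f"unsupported union for field {field['name']}")
            avro_type = non_null[0]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_HIVE:
            raise ValueError(f"unsupported type for field {field['name']}")
        columns.append((field["name"], AVRO_TO_HIVE[avro_type]))
    return columns


avsc = """
{"type": "record", "name": "Root", "fields": [
  {"name": "uid", "type": ["null", "string"]},
  {"name": "sessions", "type": ["null", "long"]},
  {"name": "active_days", "type": ["null", "long"]},
  {"name": "session_time_minutes", "type": ["null", "double"]}
]}
"""
columns = hive_columns(avsc)
print(columns)
```

Nested records, arrays, maps, and logical types are deliberately rejected, which is exactly where letting Hive do the mapping saves work.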
Step 4. Create Hive tables on top of AVRO data, use schema from Step 3
As discussed earlier, Hive allows creating tables by using avro.schema.url. So once you have schema (.avsc file) extracted from AVRO data, you can create tables as follows:
CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION 's3://your-aws-bucket/your-target-path/avro_data'
TBLPROPERTIES ('avro.schema.url'='s3://your-aws-bucket/your-target-path/your-avro-schema');
First, upload the extracted schemas to S3 so that avro.schema.url can refer to their S3 locations:
[hadoop@ip-10-0-10-205 tmpAftab]$ aws s3 cp bq_data/schemas s3://my-bucket/bq_data/schemas --recursive
upload: bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
upload: bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
upload: bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
After having both AVRO data and schema in S3, DDL for Hive table can be created using the template shown at the beginning of this section. I used another shell script create_tables_hive.sh (shown below) to cover any number of tables:
schema_avro=$(ls -l bq_data/backend | tail -n+2 | awk '{print $9}')
for table_name in $schema_avro; do
  file_name=$(ls -l bq_data/backend/$table_name | tail -n+2 | awk \
    -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9); print $9 }')
  table_definition=$(
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS backend."$table_name"\\nSTORED AS AVRO""\\nLOCATION 's3://my-bucket/bq_data/backend/"$table_name"'""\\nTBLPROPERTIES('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/"$file_name"');"
  )
  printf "\n$table_definition\n"
done
Running the script provides the following:
[hadoop@ip-10-0-10-205 tmpAftab]$ sh create_tables_hive.sh
CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');

CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');

CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
I ran the above on Hive console to actually create the Hive tables:
[hadoop@ip-10-0-10-205 tmpAftab]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
hive> CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
    > STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
OK
Time taken: 4.24 seconds
hive>
    > CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2 STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
    > TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
OK
Time taken: 0.563 seconds
hive>
    > CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
    > STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
OK
Time taken: 0.386 seconds
So I have created the Hive tables successfully. To verify that the tables work, I ran this simple query:
hive> select count(*) from backend.sessions_detailed_phase2;
Query ID = hadoop_20190214152548_2316cb5b-29f1-4416-922e-a6ff02ec1775
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1550010493995_0220)
----------------------------------------------------------------------------------------------
VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 8.17 s
----------------------------------------------------------------------------------------------
OK
6130
So it works!
Step 5. Extract Hive table definition from Hive tables
As discussed earlier, Athena requires you to explicitly specify field names and their data types in the CREATE statement. In Step 3, I extracted the AVRO schema, which can be used in WITH SERDEPROPERTIES of the Athena table DDL, but I also have to specify all the field names and their (Hive) data types. Now that I have the tables in the Hive metastore, I can easily get those by running SHOW CREATE TABLE. First, prepare the Hive DDL queries for all tables:
[hadoop@ip-10-0-10-205 tmpAftab]$ ls -l bq_data/backend | tail -n+2 | awk '{print "hive -e '\''SHOW CREATE TABLE backend."$9"'\'' > bq_data/schemas/backend/hql/backend."$9".hql;" }'
hive -e 'SHOW CREATE TABLE backend.sessions_daily_phase2' > bq_data/schemas/backend/hql/backend.sessions_daily_phase2.hql;
hive -e 'SHOW CREATE TABLE backend.sessions_detailed_phase2' > bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql;
hive -e 'SHOW CREATE TABLE backend.sessions_phase2' > bq_data/schemas/backend/hql/backend.sessions_phase2.hql;
Executing the above commands copies Hive table definitions under bq_data/schemas/backend/hql/. Let’s see what’s inside:
[hadoop@ip-10-0-10-205 tmpAftab]$ cat bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql
CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES (
  'avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc',
  'transient_lastDdlTime'='1550157659')
By now all the building blocks needed for creating AVRO tables in Athena are there:

Field names and data types can be obtained from the Hive table DDL (to be used in columns section of CREATE statement)
AVRO schema (JSON) can be obtained from the extracted .avsc files (to be supplied in WITH SERDEPROPERTIES).
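To show how the two building blocks above fit together, here is a minimal Python sketch that assembles an Athena-style CREATE statement from a column list and an Avro schema. The function name athena_ddl is hypothetical, and the sketch assumes the schema JSON contains no single quotes, since it does no escaping.

```python
import json


def athena_ddl(table, columns, avsc_text, location):
    """Combine Hive column definitions and an Avro schema into one CREATE statement."""
    column_sql = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    # Re-serialize the schema so it becomes a single-line JSON literal
    schema_literal = json.dumps(json.loads(avsc_text))
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n{column_sql}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        f"WITH SERDEPROPERTIES ('avro.schema.literal'='{schema_literal}')\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}';"
    )


avsc = """
{"type": "record", "name": "Root", "fields": [
  {"name": "uid", "type": ["null", "string"]},
  {"name": "sessions", "type": ["null", "long"]}
]}
"""
ddl = athena_ddl(
    "backend.sessions_phase2",
    [("uid", "string"), ("sessions", "bigint")],
    avsc,
    "s3://my-bucket/bq_data/backend/sessions_phase2",
)
print(ddl)
```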

Step 6. Use the output of Steps 3 and 5 to Create Athena tables
If you are still with me, you have done a great job coming this far. I am now going to perform the final step which is creating Athena tables. I used the following script to combine .avsc and .hql files to construct Athena table definitions:
[hadoop@ip-10-0-10-205 tmpAftab]$ cat create_tables_athena.sh
# directory where extracted avro schemas are stored
schema_avro=(bq_data/schemas/backend/avro/*)
# directory where extracted HQL schemas are stored
schema_hive=(bq_data/schemas/backend/hql/*)
for i in ${!schema_avro[*]}; do
  schema=$(awk -F '{print $0}' '/CREATE/{flag=1}/STORED/{flag=0}\
   flag' ${schema_hive[$i]})
  location=$(awk -F '{print $0}' '/LOCATION/{flag=1; next}\
  /TBLPROPERTIES/{flag=0} flag' ${schema_hive[$i]})
  properties=$(cat ${schema_avro[$i]})
  table=$(echo $schema '\n' \
    "WITH SERDEPROPERTIES ('avro.schema.literal'='\n"$properties \
    "\n""')STORED AS AVRO \n" \
    "LOCATION" $location";\n\n")
  printf "\n$table\n"
done \
  > bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql
Running the above script copies Athena table definitions to bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql. In my case it contains:
[hadoop@ip-10-0-10-205 all_athena_tables]$ cat all_athena_tables.hql
CREATE EXTERNAL TABLE `backend.sessions_daily_phase2`( `uid` string COMMENT '', `activity_date` string COMMENT '', `sessions` bigint COMMENT '', `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "activity_date", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2';

CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`( `uid` string COMMENT '', `platform` string COMMENT '', `version` string COMMENT '', `country` string COMMENT '', `sessions` bigint COMMENT '', `active_days` bigint COMMENT '', `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "platform", "type" : [ "null", "string" ] }, { "name" : "version", "type" : [ "null", "string" ] }, { "name" : "country", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] } ')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2';

CREATE EXTERNAL TABLE `backend.sessions_phase2`( `uid` string COMMENT '', `sessions` bigint COMMENT '', `active_days` bigint COMMENT '', `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2';
And finally, I ran the above scripts in Athena to create the tables:

There you have it.
I feel that the process is a bit lengthy. However, this has worked well for me. The other approach would be to use AWS Glue wizard to crawl the data and infer the schema. If you have used AWS Glue wizard, please share your experience in the comment section below.



An in-depth introduction to SQOOP architecture

freeCodeCamp — Tue, 26 Feb 2019 17:53:46 +0000

By Jayvardhan Reddy

Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data-stores such as relational databases, and vice-versa.

Image Credits: [hdfstutorial.com](https://www.hdfstutorial.com/sqoop-architecture/)
As part of this blog, I will be explaining how the architecture works on executing a Sqoop command. I’ll cover details such as the jar generation via Codegen, execution of MapReduce job, and the various stages involved in running a Sqoop import/export command.
Codegen
Understanding Codegen is essential, as internally this converts our Sqoop job into a jar which consists of several Java classes such as POJO, ORM, and a class that implements DBWritable, extending SqoopRecord to read and write the data from relational databases to Hadoop & vice-versa.
You can create a Codegen explicitly as shown below to check the classes present as part of the jar.
sqoop codegen \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products
The output jar will be written to your local file system. You will get a jar file, a Java source file, and the .class files it compiles into:

Let us see a snippet of the code that will be generated.
ORM class for table ‘products’ // Object-relational model generated for mapping:

Setter & Getter methods to get values:

Internally, the generated class uses JDBC prepared statements to write data to the database and a ResultSet to read data from it.

Sqoop Import
It is used to import data from traditional relational databases into Hadoop.

Image Credits: [dummies.com](https://www.dummies.com/programming/big-data/hadoop/hadoop-for-dummies-cheat-sheet/)
Let’s see a sample snippet for the same.
sqoop import \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products \
  --warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir \
  --delete-target-dir
The following steps take place internally during the execution of sqoop.
Step 1: Read data from MySQL in streaming fashion. It does various operations before writing the data into HDFS.

As part of this process, it will first generate code (typical Map reduce code) which is nothing but Java code. Using this Java code it will try to import.

Generate the code. (Hadoop MR)
Compile the code and generate the Jar file.
Submit the Jar file and perform the import operations

During the import, it has to make certain decisions as to how to divide the data into multiple threads so that Sqoop import can be scaled.
Step 2: Understand the structure of the data and perform CodeGen

Using the above SQL statement, it will fetch one record along with the column names. Using this information, it will extract the metadata information of the columns, datatype etc.

Image Credits: [cs.tut.fi](http://www.cs.tut.fi/~aaltone3/kurssit/hadoop/Sqoop_pdf.pdf)
Step 3: Create the java file, compile it and generate a jar file
As part of code generation, it needs to understand the structure of the data and it has to apply that object on the incoming data internally to make sure the data is correctly copied onto the target database. Each unique table has one Java file talking about the structure of data.

This jar file will be injected into Sqoop binaries to apply the structure to incoming data.
Step 4: Delete the target directory if it already exists.

Step 5: Import the data

Here, it connects to a resource manager, gets the resource, and starts the application master.

To distribute the data evenly among the map tasks, it internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of that key in the table. It then divides the range between those values by the number of mappers and assigns one sub-range to each mapper.
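The splitting described above can be illustrated with a small Python sketch. It is a simplified model, not Sqoop's actual splitter code: the function name split_ranges is hypothetical, the key is assumed to be an integer, and the minimum and maximum values (1 and 1345) are made-up inputs standing in for the result of the boundary query.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into one sub-range per mapper."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key so sizes differ by at most 1
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer keys than mappers: nothing left for this mapper
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Hypothetical boundary query result: minimum key 1, maximum key 1345
splits = split_ranges(1, 1345, num_mappers=4)
for lower, upper in splits:
    print(f"mapper reads keys {lower}..{upper}")
```

Because the split is based on key values rather than row counts, gaps or skew in the key can still leave some mappers with more rows than others.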

It uses 4 mappers by default:

It executes these jobs on different executors as shown below:

The default number of mappers can be changed by setting the following parameter:

So in our case, it uses 4 threads. Each thread processes a mutually exclusive subset; that is, each thread processes different data from the others.
To see the different values, check out the below:

Operations that are performed on each executor node:

In case you perform a Sqoop Hive import, one extra step takes place as part of the execution.
Step 6: Copy data to hive table

Sqoop Export
This is used to export data from Hadoop into traditional relational databases.

Image Credits: [slideshare.net](https://www.slideshare.net/gharriso/from-oracle-to-hadoop-with-sqoop-and-other-tools)
Let’s see a sample snippet for the same:
sqoop export \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_export \
  --username retail_user \
  --password ******* \
  --table product_sqoop_exp \
  --export-dir /user/jvanchir/sqoop_prac/import_table_dir/products
On executing the above command, execution steps 1–4 take place as in a Sqoop import, but the source data is read from the file system (which is nothing but HDFS). Here, Sqoop divides the data along block-size boundaries, and it takes care of this internally.
The processing splits are done as shown below:

After connecting to the database to which the records are to be exported, Sqoop reads the data from HDFS and issues JDBC insert commands to store it in the database, as shown below.

Now that we have seen how Sqoop works internally, you can determine the flow of execution from jar generation to execution of a MapReduce task on the submission of a Sqoop job.
Note: The commands that were executed related to this post are added as part of my GIT account.
Similarly, you can also read more here:

Hive Architecture in Depth with code.
HDFS Architecture in Depth with code.

If you would like to, you can connect with me on LinkedIn - Jayvardhan Reddy.
If you enjoyed reading this article, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response.
 


 How to work in Data Science, AI, or Big Data based on my experience 
freeCodeCamp — Wed, 30 Jan 2019 22:19:43 +0000
 By Richard Freeman, PhD
In summer 2013, I interviewed for a lead role in the data science and analytics team at tech-for-good company JustGiving. During the interview, I said I planned to deliver batch machine learning, graph analytics and streaming analytics systems, both in-house and in the cloud.
A few years later, my former boss Mike Bugembe and I were both presenting at international conferences, winning awards and becoming authors!
Here is my story, and what I learnt on the journey — plus my recommendations for you.
Why Big Data Engineering and Data Science?
I’ve always been interested in artificial intelligence (AI), machine learning (ML) and natural language processing (NLP). In particular, I’ve been interested in scalable systems, and making robots more intelligent and responsive.
My interest in data engineering comes from my background as a solutions architect. In that role, I enjoyed building cloud-based systems to store and process data to derive new insight and knowledge.
I also develop big data and ML pipelines to automate the whole ML process. This helps data scientists and analysts save time preparing data for training and testing their algorithms, running metrics and deriving key performance indicators at scale.
Data preparation is particularly important. Data scientists typically spend about 80% of their time on it. Having access to data shaped in the right way makes them more productive and happier.
My previous background
I previously earned a Master's degree in computer systems engineering, and a PhD in ML and NLP. I completed both at the University of Manchester.
Rather than join a specialised vendor in my PhD area of expertise, I decided to broaden my skills and gain more client exposure by joining Capgemini. Capgemini are a large global consulting, technology and outsourcing services company.
I worked my way from being a developer to a solution architect. There, I helped deliver large scale projects for Fortune Global 500 companies in sectors including insurance, retail banking, financial services, and central government.
I then joined PageGroup. There, I worked as a lead developer and architect on a global transformation programme across 34 countries. I led the technical delivery of search, multi-channel communication, business intelligence, text analytics, job board integration, and advertising solutions.
Current roles
Now I am a lead big data and machine learning engineer at JustGiving. JustGiving is a tech-for-good company that’s helped 26 million users in 164 countries raise $5 billion for good causes. It was acquired in 2017 by Blackbaud — the world’s leading software company powering social good.
I currently lead the delivery and architecture of our in-house data science platform RAVEN and production ML systems. These were initially deployed with Azure, but later hosted in AWS. I also dive in as a data scientist specialising in scalable streaming analytics, ML and NLP algorithms.

I share my technical experience and knowledge internally and externally relating to AWS, stream processing, serverless stacks, ML and NLP. I also present regularly at industry conferences, open source my code and write technical blog posts on Medium and for AWS such as Analyze a Time Series in Real Time.
I’m also an independent freelance advisor and consultant helping organisations with cloud architecture, serverless computing and ML at Starwolf.
A typical day in the office
JustGiving is still a start-up at heart, so there is no typical day. I get involved in various tasks, such as capturing data and report requirements, engineering new data pipelines, investigating operational issues, running data experiments, analysing unstructured data looking for useful patterns, exploring new ways to use the data to answer questions, presenting a data story, and sharing my knowledge and experience. This means that I work closely with marketing, product managers and product analysts to understand their data needs and what metrics and predictions are important for them.
Speaking to others outside your specialist area helps to broaden your views, gives you a new perspective, and reveals new areas where you can apply your skills.
On the technical side, I work with engineers, data analysts, developers, business intelligence analysts, operations, and data scientists to support their data and platform requirements.
Things I enjoy about work
I am passionate about working with huge data sets, as you face different kinds of performance, cost and operational issues that require you to think differently in order to scale your data warehouse, ETL processes and algorithms, and how you present your results. A lot of what you know about data warehousing with millions of records goes out the window when you hit hundreds of billions of rows and need to iterate or do complex joins to run ML data preparation queries.
Building and running large-scale data infrastructure and distributed model training are active areas in academia and industry. They are evolving at a fast pace, with new tooling being introduced every few months. I like to use cloud solutions in an innovative way to improve our in-house data science platform, enhance our business processes, and make data insights available to internal and external users.
I’ve found that a lot of companies give their power away by using third parties for their web analytics solutions, rather than building their own. That data is then siloed in marketing or sales departments, is difficult if not impossible to get back in its raw form, and cannot be streamed back, which prevents you from making real-time ML recommendations or predictions directly in your product.
At JustGiving we built an in-house web analytics product called KOALA and have this data available in real time as an AWS Serverless stack. This allowed us to have a full suite of data pipelines for ML training and analytics in-house, and the likes of MAGPIE, which allows us to create real-time metrics and insights that we can serve back to the users.
For example, here is an early version shown in this Tweet during a crowdfunding campaign for the families of the victims of the Manchester attack in May 2017.
In addition, KOALA allows us to make predictions from streaming data. It is an extremely cost-effective solution compared to paying for a vendor product. If you compare it to a vendor solution based on the same web traffic, KOALA is 10x cheaper, more developer friendly, and we get the raw streamed data back in real time, rather than in batches or having to use a proprietary locked-down querying or reporting system.
I am also a big fan of Python and have successfully encouraged its uptake in the company and wider community for data pipelines, ML and serverless computing. Why Python? It has extensive ML libraries, scales with the likes of PySpark, and is easy to read and write.
I also enjoy working with different organisations, charities, universities and giving back to the wider technical community with my experience and time such as at the AWS and British Heart Foundation Hackathon recently.
The Future of Big Data, Data Science and AI
I see more people using ML, real-time analytics, graph analytics and NLP in their products and applications, not just offline on their laptops. This is accelerating as the cloud providers offer ML and NLP application program interfaces (APIs).
For real-time analytics, there is a growing demand from consumers that are much more data aware and impatient. For example they want to know what is happening right now, see the results of their action, and use more intelligent applications and websites that adapt as they are interacting with them.
On the infrastructure side, I see serverless computing and Platform as a Service (PaaS) infrastructure in the public cloud such as AWS and Azure becoming more prominent. Functions in serverless computing are particularly interesting for me, as they can auto-scale in less than 100 milliseconds, are highly available and are low cost. They are low cost as you only pay for the time your code is executed, rather than for an always-on machine or container like in more traditional cloud infrastructure. I’ve even shown that you can implement most of the existing container-based microservices patterns using a serverless stack.
The open source frameworks and programming languages will also continue to grow compared to closed vendor specific products and languages, e.g. Apache Spark framework, Python, R, SQL. The same goes for data storage and access: cloud storage, data warehouses and data lakes will store data in more open rather than proprietary formats, and this will be more accessible over standard APIs or open protocols.
There will also be growing requirements to analyse unstructured and multimedia data sources, and again the cloud providers will have a growing role to play.
We will also see more companies making the transition from using strategies decided by a few on gut instinct at the top, to becoming more experiment-based, evidence-based, and data-driven as described by my former CAO Mike Bugembe in his book. For example the testing of new products or features, identifying new opportunities and strategic decisions will come more and more from the data analysis, insight and predictions.
This will require more staff to get involved in data capture, data preparation, running experiments using algorithms, data visualisation and presenting results.
As such, new data orientated jobs based on creating and training data models will emerge, disrupting some of the existing specialist fields such as health care, accountancy and law. AI, Internet of things (IoT) and robotics will also replace some existing blue and white collar jobs so we will need to think about training and upskilling people to the changing landscape, and possibly introduce some kind of universal basic income.
You can draw parallels with the shift seen during the industrial revolution from the agrarian or pre-industrial times. For AI to take off, we need two things to happen: the cost of human workers has to become higher than the AI alternative, and AI has to be deployed in a scalable way.
In the much longer term, quantum computing will also disrupt the field again in terms of how we process, analyse and store data, and will transform areas like cyber security, banking and existing AI.
How to inspire people to pursue careers in data science
I think it’s a lot easier to get people interested in big data and data science than it used to be, thanks to the likes of Google and Facebook that make it fashionable to be smart and work within technology.
In addition, the growing number of young and flexible startup companies with infrastructures in the public cloud are successfully competing and winning market shares from large established companies. Employers need to be willing to educate and upskill existing staff or graduates rather than solely recruit people with existing data engineering or data science skills.
For inspiring existing staff, we need to show the benefits, use cases and data sources most relevant to them, which makes them more productive and their jobs easier. With more data exploration tools available, staff in other departments outside IT or finance, such as customer support, marketing and product managers will be self-serving on the data and insights.
For people who have not worked in industry, I think we need to start early in schools and then universities. Teachers and lecturers in non-computer science subjects could make data more visual and interactive in their respective fields.
I think that almost any subject can benefit. For example, even in English literature you can draw a relationship graph of the characters and their connections linked to main themes, events and locations. In history classes, you could have interactive visual maps and time-evolving graph representations of key events and their dependencies.
Advice I would give to someone considering a career in Big Data and Data Science
Whether you are a graduate, already working in an organisation or not from a technical background, you can benefit from analysing and understanding data. For example, data journalists are typically not from a technical or scientific background, yet are able to do simple analysis and create an interesting data story for the general public.
It’s about self-motivation: when things move at such a fast pace, you can look broadly across the sector to gain a general understanding. But you also need to focus your energy on one specific course or project and complete it. The industry also tends to repackage old technologies with some improvements as new trending ones, like cyber security, cognitive computing, chatbots, virtual reality and deep learning at the moment. So I would follow your heart for the areas you are truly interested in and want to focus on rather than the latest trend.
Behind each viral trend there have usually been early explorers that have worked and struggled on that area for years!
In terms of gaining the knowledge, it is a lot easier than it used to be. For example in the past you had to pay for specific vendor training and there was the cost of the product itself. You can now access the learning materials, data sources, and tools all for free, so there is no excuse not to get started today!
For the learning materials, a lot of the content is available for free in massive open online courses, forums, blogs, and source code repositories. Equally, there are numerous free data sources like ML datasets, open data, news feeds and social media you can use.
There are many tools out there. Some are graphical, but in my view you should learn to program in SQL, Python, or R. All three have the ability to do data science at scale thanks to frameworks like Apache Spark. I particularly like Python as it benefits from being an efficient development language with a solid test framework and numerous data science packages.
As an ML engineer or data scientist, expect to spend a lot of time on data preparation. This is an important process to master, which involves cleaning, parsing, enriching and shaping the data so that it can be used in the ML algorithms and experiments. Overall, remember that the processes, tools and data sources are always evolving, so there is no one-off unicorn training course you can do. You will need to be self-motivated and open to constantly learning and adapting to the data ecosystem.
I would recommend that you learn another language such as Mandarin (1.1 billion speakers) or Spanish (0.5 billion speakers), to remain mobile, get more career opportunities, and be competitive within this interconnected world. This will also open your mind and give you an insight into other cultures and values, and how they use their data.
Cloud computing also means that you no longer need a physical presence in a country to operate in it, so you need to be open to building systems across regions and analysing data from many countries. Start using collaborative tools and participate in tech for good communities.
Some jobs and professions will be replaced, and some human expertise will be lost, but we will still rely on the data and algorithms. For example, once driverless transportation is widely adopted and considered safer, cheaper and more convenient than human drivers, future generations may not wish to drive a car or even have a driving license. However humans will still be involved in the systems that automate the driving, the creative analysis of the telemetry and IoT data, the supervision and monitoring of the ecosystem, and the wider participation in the transport industry and sharing economy.
Summary
If you want to have a career in data science, ML, or data engineering, the business needs still drive the software development and analysis. Think about the metrics you want to calculate that will benefit your business decisions, or the hypothesis you want to validate with an experiment.
What actions will your audience take with your results? What growth or cost savings opportunities for a business are out there? Then work back to see what data, models and infrastructure you need for the task. I think that being curious, inquisitive, and having an experimental mind are important qualities.
Feel free to connect with me on LinkedIn, follow me on Twitter, or message me for comments and questions. If you want to have a more personalised chat with me, based on your requests, I’m offering short 30min Skype calls on career advice or mentoring for a small fee. I also do short term consultancy, and provide expert advice and audit services to organisations building and running big data and data science platforms in the cloud.
 


 What to consider for painless Apache Kafka integration 
freeCodeCamp — Tue, 22 Jan 2019 18:04:27 +0000
 By Adi Polak
Apache Kafka’s real-world adoption is exploding, and it claims to dominate the world of stream data. It has a huge developer community all over the world that keeps on growing. But, it can be painful too. So, just before jumping head first and fully integrating with Apache Kafka, let’s check the water and plan ahead for painless integration.

Photo by Helloquence on [Unsplash](https://unsplash.com/photos/5fNmWej4tAA?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
What is it?
Apache Kafka is an open source framework for asynchronous messaging and it’s a distributed streaming platform. It is TCP based. The messages are persisted in topics. Message producers are called publishers and message consumers are called subscribers.

Consumers can subscribe to one or more topics and consume all the messages in that topic. Messages are written into the topic partitions.
Topics are always multi-subscriber: a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, Kafka maintains a partitioned log. Metadata for the partition logs and topics is usually managed by Zookeeper.
If you would like to learn more about Kafka message delivery semantics — like, at most once, at least once and exactly once — read here.
Many tech companies have already integrated Apache Kafka into their production as a message broker, user activities tracking pipeline, metrics gatherer, log aggregation mechanism, stream processing device and much more. Apache Kafka is written in Scala and Java.
Why Kafka?

Kafka provides highly available, fault-tolerant message logs. Kafka clusters retain all published records, and they are persisted to disk by default. Retention is configurable: out of the box the broker keeps records for seven days, and if you remove the retention limits, it will keep records until it runs out of disk space. When data loss means awful failure for the product, this is essential for recovery.
Multiple Topic Consumers — configuring the consumers under multiple consumer groups helps to reduce the old bottleneck of sending the data to multiple applications for processing. Kafka is distributed, hence it can send information to consumers from various physical machines/service instances. Replicating topics to a secondary cluster is also relatively easy using Apache Kafka’s mirroring feature, MirrorMaker — see an example of mirroring data between two HDInsight clusters. Just remember, if multiple consumers are defined as part of the same group (defined by the group.id), the data will be balanced over all the consumers within the group.
Kafka is polyglot — there are many clients in C#, Java, C, python and more. The ecosystem also provides a REST proxy which allows easy integration via HTTP and JSON.
Real-Time Handling — Kafka can handle real-time data pipelines for real time messaging for applications.
Scalable — due to distributed architecture, Kafka can scale out without incurring any downtime.
and more…
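To make the consumer group behaviour from the list above concrete, here is a minimal pure-Python sketch. It is only a simplified round-robin stand-in for what Kafka's real partition assignors do, and the helper `assign_partitions`, the group names and the consumer names are all hypothetical. It shows that every group sees all partitions, while consumers inside one group share them.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers_by_group):
    """Round-robin each group's partitions over that group's members.
    Simplified stand-in for Kafka's assignors (hypothetical helper)."""
    assignment = defaultdict(list)
    for group, members in consumers_by_group.items():
        for i, partition in enumerate(partitions):
            member = members[i % len(members)]
            assignment[(group, member)].append(partition)
    return dict(assignment)


groups = {
    "billing": ["billing-1", "billing-2"],  # two consumers share the load
    "audit": ["audit-1"],                   # a single consumer reads everything
}
result = assign_partitions([0, 1, 2, 3], groups)
for (group, member), parts in sorted(result.items()):
    print(group, member, parts)
```

In this sketch the two billing consumers each read half of the partitions, while the audit consumer, being alone in its group, reads all four.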

Let’s make integration with Kafka painless

Here are 7 things to know before integrating:
1 — Apache Zookeeper can become a pain point with a Kafka cluster
In the past (versions < 0.81), Kafka used Zookeeper to maintain the offsets of each topic and partition. Zookeeper used to take part in the read path, where too-frequent commits and too many consumers led to severe performance and stability issues.
On top of that, it is better to use commits manually with old Zookeeper-based consumers, since careless auto-commits could lead to data loss.
The newer versions of Kafka offer their own management, where the consumer can use Kafka itself to manage offsets. This means that there is a specific topic that manages the read offsets instead of Zookeeper.
Yet, Kafka still needs a Zookeeper cluster, even in the later 2.x versions. Zookeeper is used to store Kafka configs (reassigning partitions when needed) and to back the Kafka topics API, such as create topic, add partition, etc.
The load on Kafka is strictly related to the number of consumers, brokers, partitions and frequency of commits from the consumer.

2 — You shouldn’t send large messages or payloads through Kafka
According to Apache Kafka, for better throughput, the max message size should be 10KB. If the messages are larger than this, it is better to check the alternatives or find a way to chop the message into smaller parts before writing to Kafka. The best practice here is to use a message key, so that all the chopped parts are written to the same partition.
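Here is a rough sketch of how chopping a payload into keyed parts might look. It deliberately does not use any Kafka client: the record layout (`message_id`, `index`, `total`) and the helper names are assumptions for illustration, and a real producer would send each record with the shared key so that all parts land on the same partition.

```python
import uuid

MAX_CHUNK_BYTES = 10 * 1024  # the 10KB guideline mentioned above


def chunk_message(key, payload, max_bytes=MAX_CHUNK_BYTES):
    """Split payload (bytes) into records that all share one key
    (hypothetical record format, not a Kafka API)."""
    message_id = uuid.uuid4().hex
    pieces = [payload[i:i + max_bytes]
              for i in range(0, len(payload), max_bytes)] or [b""]
    return [
        {"key": key, "message_id": message_id, "index": i,
         "total": len(pieces), "value": piece}
        for i, piece in enumerate(pieces)
    ]


def reassemble(records):
    """Rebuild the payload on the consumer side; fail if a part is missing."""
    ordered = sorted(records, key=lambda r: r["index"])
    indexes = [r["index"] for r in ordered]
    if indexes != list(range(ordered[0]["total"])):
        raise ValueError("missing or duplicate chunks")
    return b"".join(r["value"] for r in ordered)


payload = b"x" * 25_000
records = chunk_message("order-42", payload)
print(len(records), {r["key"] for r in records})
```

The `index` and `total` fields let the consumer detect a missing part instead of silently rebuilding a truncated payload.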
3 — Apache Kafka can’t transform data
Many developers mistakenly think that they can create Kafka parsers or do data transformation in Kafka itself. However, Kafka does not enable transformation of data. If you are using Azure services, there is a great list of data factory services that connect to Kafka and that you can use to transform the data, such as Azure Databricks, HDInsight Spark and others.
Another solution is using Kafka Streams. This is a newer API that is built on top of Kafka’s producer and consumer clients. It’s significantly more powerful and also more expressive than the Kafka consumer client.
The [KafkaStreams](https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/KafkaStreams.html) client allows us to perform continuous computation on input coming from one or more input topics and sends output to zero, one, or more output topics. Internally a KafkaStreams instance contains a normal [KafkaProducer](https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html) and [KafkaConsumer](https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html) instance that is used for reading input and writing output.
Another option is using Flink, check it out here.
4 — Apache Kafka supports a binary protocol over TCP
The Apache Kafka communication protocol is a binary protocol over TCP. It doesn’t support MQTT, JMS, or other messaging protocols out of the box. However, many users have written adaptors to read data from those protocols and write to Apache Kafka, for example kafka-jms-client.

Simple TCP handshake
5 — Apache Kafka management / support and the steep learning curve
As of today, there are only a limited number of free UI-based management systems for Apache Kafka, and most of the DevOps engineers I have worked with use scripting tools. However, it can be tedious for a beginner to jump into the Apache Kafka scripting tools without taking the time for training. The learning curve is steep, and it takes some time to get moving and integrate Kafka into big running systems.
For experienced DevOps/ developers it might take a few months (2+) to fully understand how to integrate, support and work with Apache Kafka. It is important to learn how Kafka works in order to use the configuration in the way that will best suit the system’s needs.
Here’s a list of management tools that you can use for almost free (some are restricted to personal/community use):

KafkaTool — GUI application for managing and using Apache Kafka clusters.
Confluent platform — full enterprise streaming platform solution.
KafDrop — tool for displaying information such as brokers, topics, partitions, and even lets you view messages. It is a lightweight application that runs on Spring Boot and requires very little configuration.
Yahoo Kafka Manager —another tool for monitoring Kafka, yet it offers much less than the rest.

Supporting Managed Kafka on the cloud
Today almost all clouds support Kafka, whether as a fully managed service, through an integration with Confluent from the cloud store, or by simply purchasing machines that run Kafka:

Confluent Cloud- Kafka as a Service
Azure Event Hub- fully managed Kafka
Managed Kafka on HDInsight — Azure
Kafka Machine on Google cloud
Kafka on AWS using Confluent solution
... many more

6 — Kafka is no magic — There is still a possibility of data loss
Apache Kafka is probably the most popular tool for distributed asynchronous messaging. This is mainly due to its high throughput, low latency, scalability, centralised design and real-time abilities. Much of this comes from splitting topics into partitions and replicating them across brokers.
However, with misconfiguration there is a high chance of data loss when machines/processes are failing, and they will fail. Therefore, it’s important to understand how Kafka works and what the product/system requirements are.
7 — Kafka built-in failure testing framework Trogdor
To assist you in finding the right configuration, the Kafka team created Trogdor. Trogdor is a failure testing framework.
How it works

Configure Kafka the way you would in production
Create a producer that generates messages with sequence 1…X million.
Run the producer
Run the consumer
Create a failure by crashing and/or hanging a broker.
Test and check that every event produced was consumed.
… if that’s not the case, it is better to go back and update the configuration accordingly!
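The final check in the steps above, that every produced event was consumed, can be sketched as follows. This is not Trogdor's API, just a plain-Python illustration of the comparison. The helper `check_delivery` and the sample consumed sequence are assumptions.

```python
from collections import Counter


def check_delivery(produced_count, consumed):
    """Compare consumed sequence numbers against the expected
    1..produced_count and report lost and redelivered messages."""
    counts = Counter(consumed)
    missing = [n for n in range(1, produced_count + 1) if n not in counts]
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    return missing, duplicates


# Hypothetical run: 10 messages produced, a broker crashed mid-test
consumed = [1, 2, 3, 4, 4, 5, 8, 9, 10]
missing, duplicates = check_delivery(10, consumed)
print("missing:", missing)        # messages that were lost
print("duplicates:", duplicates)  # messages that were delivered more than once
```

A non-empty `missing` list is the signal to revisit the configuration. Duplicates may be acceptable if the system only needs at-least-once delivery.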

On top of that, it is important to remember that Apache Kafka …

Is not an RPC — Apache Kafka is a messaging system. For RPC, service X needs to be aware of service Y and the call signature. In Kafka, for example, sending a message doesn't mean that someone will ever consume it. In RPC, there is always a consumer, since the calling service is aware of service Y and calls its signature/function.
It is not a Database — it’s not a good place to save messages since you can’t jump between them or create a search without an expensive full scan.

Just a word about KSQL
An interesting library brought to us by the Confluent Community is KSQL. It is built on top of Kafka Streams. KSQL is a completely interactive SQL interface. You can use it without writing any code. KSQL is under the Confluent Community License.
TL;DR
Apache Kafka has many benefits, yet before adding it in production, one should be aware that:

It has a steep learning curve — make time to learn the bits and bytes of Kafka
You must manage cluster resources — be aware of the requirements like Zookeeper
You can still lose data with Apache Kafka
Most clouds provide managed Apache Kafka
It won’t transform data
It’s not a Database
It supports a binary protocol over TCP
At the moment, you can’t send large messages using Kafka
You should use Trogdor for fault testing of your system

All that being said, Apache Kafka is probably the best tool for messaging and streaming tasks.
Thank you Gwen Shapira for your input and guidance along the way.
If you enjoyed this story, feel free to leave a comment below.

Follow me here, or here for more posts about Scala, Kotlin, Big data, clean code and software engineers nonsense. Cheers!
 


 These Are The Best Free Open Data Sources Anyone Can Use 
freeCodeCamp — Thu, 10 Jan 2019 17:28:42 +0000
 By Hiren Patel
What is Open Data?
In simple terms, Open Data means the kind of data which is open for anyone and everyone for access, modification, reuse, and sharing.
Open Data derives its base from various “open movements” such as open source, open hardware, open government, open science etc.
Governments, independent organizations, and agencies have come forward to open the floodgates of data to create more and more open data for free and easy access.
Why Is Open Data Important?
Open data is important because the world has grown increasingly data-driven. But if there are restrictions on the access and use of data, the idea of data-driven business and governance will not materialize.
Therefore, open data has its own unique place. It can allow a fuller understanding of the global problems and universal issues. It can give a big boost to businesses. It can be a great impetus for machine learning. It can help fight global problems such as disease or crime or famine. Open data can empower citizens and hence can strengthen democracy. It can streamline the processes and systems that the society and governments have built. It can help transform the way we understand and engage with the world.
So here’s my list of 15 awesome Open Data sources:
1. World Bank Open Data
As a repository of the world’s most comprehensive data regarding what’s happening in different countries across the world, World Bank Open Data is a vital source of Open Data. It also provides access to other datasets, which are listed in the data catalog.
World Bank Open Data is massive because it has got 3000 datasets and 14000 indicators encompassing microdata, time series statistics, and geospatial data.
Accessing and discovering the data you want is also quite easy. All you need to do is to specify the indicator names, countries or topics and it will open up the treasure-house of Open Data for you. It also allows you to download data in different formats such as CSV, Excel, and XML.
If you are a journalist or academic, you will be enthralled by the array of tools available to you. You can get access to analysis and visualization tools that can bolster your research. They can facilitate a deeper and better understanding of global problems.
You can get access to the API which can help you create the data visualizations you need, live combinations with other data sources and many more such features.
Therefore, it’s no surprise that World Bank Open Data tops any list of Open Data sources!
2. WHO (World Health Organization) — Open data repository
WHO’s Open Data repository is how WHO keeps track of health-specific statistics of its 194 Member States.
The repository keeps the data systematically organized, so it can be accessed according to different needs. For instance, whether you are interested in mortality or the burden of disease, you can access data classified under 100 or more categories, such as the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), noncommunicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, equity, and so on.
For your specific needs, you can go through the datasets according to themes, category, indicator, and country.
The good thing is that you can download whatever data you need in Excel format. You can also monitor and analyze data by making use of its data portal.
The API to the World Health Organization’s data and statistics content is also available.
3. Google Public Data Explorer
Launched in 2010, Google Public Data Explorer can help you explore vast amounts of public-interest datasets. You can visualize and communicate the data for your respective uses.
It makes data from different agencies and sources available. For instance, you can access data from the World Bank, the U.S. Bureau of Labor Statistics, the U.S. Census Bureau, OECD, IMF, and others.
Different stakeholders access this data for a variety of purposes. Whether you are a student or a journalist, whether you are a policy maker or an academic, you can leverage this tool in order to create visualizations of public data.
You can deploy various ways of representing the data such as line graphs, bar graphs, maps and bubble charts with the help of Data Explorer.
The best part is that you would find these visualizations quite dynamic. It means that you will see them change over time. You can change topics, focus on different entries and modify the scale.
It is easily shareable too. As soon as you get the chart ready, you can embed it on your website or blog or simply share a link with your friends.
4. Registry of Open Data on AWS (RODA)
This is a repository containing public datasets. It is data which is available from AWS resources.
With RODA, you can discover and share data that is publicly available.
In RODA, you can use keywords and tags for common types of data, such as genomic, satellite imagery, and transportation, to search for whatever data you are looking for. All of this is possible through a simple web interface.
For every dataset, you will find a detail page, usage examples, license information, and tutorials or applications that use the data.
By making use of a broad range of compute and data analytics products, you can analyze the open data and build whatever services you want.
While the data you access is available through AWS resources, you need to bear in mind that it is not provided by AWS. This data belongs to different agencies, government organizations, researchers, businesses and individuals.
5. European Union Open Data Portal
You can access whatever open data EU institutions, agencies, and other organizations publish on a single platform, namely the European Union Open Data Portal.
The EU Open Data Portal is home to vital open data pertaining to EU policy domains. These policy domains include economy, employment, science, environment, and education.
Around 70 EU institutions, organizations, or departments, such as Eurostat, the European Environment Agency, the Joint Research Centre, and other European Commission Directorates General and EU Agencies, have made their datasets public and allowed access. More than 11,700 datasets have been published to date.
The portal enables easy access. You can easily search, explore, link, download, and reuse the data through a catalog of common metadata, for commercial or non-commercial purposes.
You can search the metadata catalog through an interactive search engine (Data tab) and SPARQL queries (Linked data tab).
By making use of this catalog, you can gain access to the data stored on the different websites of the EU institutions, agencies and organizations.
6. FiveThirtyEight
It is a great site for data-driven journalism and story-telling.
It provides its various sources of data for a variety of sectors such as politics, sports, science, economics etc. You can download the data as well.
When you access the data, you will come across a brief explanation regarding each dataset with respect to its source. You will also get to know what it stands for and how to use it.
To keep this data user-friendly, it provides datasets in formats that are as simple and non-proprietary as possible, such as CSV files. Needless to say, these formats can be easily accessed and processed by humans as well as machines.
With the help of these datasets, you can create stories and visualizations as per your own requirements and preference.
7. U.S. Census Bureau
The U.S. Census Bureau is the biggest statistical agency of the federal government. It stores and provides reliable facts and data regarding the people, places, and economy of America.
The Census Bureau considers it its mission to serve as the most reliable provider of quality data.
Whether it is a federal, state, local or tribal government, all of them make use of census data for a variety of purposes. These governments use this data to determine the location of new housing and public facilities. They also make use of it at the time of examining the demographic characteristics of communities, states, and the USA.
This data is also used in planning transportation systems and roadways. It comes in handy when deciding quotas and creating police and fire precincts. Governments also use it when they create localized areas for elections, schools, utilities, and so on. Population information is compiled once a decade, and this data is quite useful for that purpose.
There are various tools such as American Fact Finder, Census Data Explorer and Quick Facts which are useful in case you want to search, customize and visualize data.
For instance, Quick Facts alone contains statistics for all the states, counties, cities and even towns with a population of 5000 or more.
Likewise, American Fact Finder can help you discover popular facts such as population, income etc. It provides information that is frequently requested.
The good thing is that you can search, interact with the data, get to know popular statistics, and see the related charts through Census Data Explorer. Moreover, you can also use its visual tools to customize data on an interactive map.
8. Data.gov
Data.gov is the treasure-house of US government’s open data. It was only recently that the decision was made to make all government data available for free.
When it was launched, there were only 47 datasets. There are now 180,000.
Data.gov is a great resource because you can find data, tools, and resources that you can deploy for a variety of purposes. You can conduct your research, develop your web and mobile applications, and even design data visualizations.
All you need to do is enter keywords in the search box and browse through types, tags, formats, groups, organization types, organizations, and categories. This will facilitate easy access to data or datasets that you need.
Data.gov follows the Project Open Data Schema — a set of requisite fields (Title, Description, Tags, Last Update, Publisher, Contact Name, etc.) for every data set displayed on Data.gov.
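As a rough illustration of what "requisite fields" means in practice, here is a minimal sketch that checks a metadata record for the fields listed above. The mapping from the display names (Title, Tags, Last Update, and so on) to JSON keys such as `keyword`, `modified`, and `contactPoint` is my assumption about the Project Open Data metadata schema, and the sample record is made up for illustration. The helper name `missing_fields` is hypothetical.

```python
# Assumed mapping from the display names used on Data.gov to metadata keys.
REQUIRED_FIELDS = {
    "Title": "title",
    "Description": "description",
    "Tags": "keyword",
    "Last Update": "modified",
    "Publisher": "publisher",
    "Contact Name": "contactPoint",
}


def missing_fields(record):
    """Return the display names of required fields that are absent or empty."""
    return [
        label
        for label, key in REQUIRED_FIELDS.items()
        if not record.get(key)
    ]


# Made-up record for illustration only; it is not a real Data.gov entry.
sample_record = {
    "title": "Example dataset",
    "description": "A placeholder description.",
    "keyword": ["example", "demo"],
    "publisher": {"name": "Example Agency"},
}

print(missing_fields(sample_record))
```

A check like this is handy before publishing your own dataset, because it tells you which fields still need to be filled in.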
9. DBpedia
As you know, Wikipedia is a great source of information. DBpedia aims at getting structured content from the valuable information that Wikipedia created.
With DBpedia, you can semantically search and explore the relationships and properties of Wikipedia resources. This includes links to other related datasets as well.
There are around 4.58 million entities in the DBpedia dataset. 4.22 million are classified in ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species and 6,000 diseases.
There are labels and abstracts for these entities in around 125 languages. There are 25.2 million links to images. There are 29.8 million links to external web pages.
To use DBpedia, all you need to do is write SPARQL queries against its endpoint or download its data dumps.
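Here is a minimal sketch of what querying DBpedia with SPARQL could look like. It only builds the request URL, so nothing is sent over the network. The endpoint address `https://dbpedia.org/sparql`, the `format` parameter, and the `dbo:Film` class are my assumptions about the public DBpedia service rather than details from this article, and `sparql_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Ask for five films and their English labels (class name is an assumption).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?label WHERE {
  ?film a dbo:Film ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 5
"""


def sparql_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (hypothetical helper)."""
    params = urlencode({
        "query": query.strip(),
        # Assumed parameter asking the server for JSON results.
        "format": "application/sparql-results+json",
    })
    return f"{endpoint}?{params}"


request_url = sparql_url(ENDPOINT, QUERY)
print(request_url)
```

You can open the printed URL in a browser or fetch it with any HTTP client. If the endpoint behaves as assumed, the response lists one binding per film.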
DBpedia has benefitted several enterprises, such as Apple (via Siri), Google (via Freebase and Google Knowledge Graph), and IBM (via Watson), and particularly their respective prestigious projects associated with artificial intelligence.
10. freeCodeCamp Open Data
freeCodeCamp is an open source community. It matters because it enables you to learn to code, build pro bono projects for nonprofits, and get a job as a developer.
In order to make this happen, the freeCodeCamp.org community makes available enormous amounts of data every month. They have turned it into open data.
You will find a variety of things in this repository. You can find datasets, analysis of the same and even demos of projects based on the freeCodeCamp data. You can also find links to external projects involving the freeCodeCamp data.
It can help you with a wide range of projects and tasks that you may have in mind. Whether it is web analytics, social media analytics, social network analysis, education analysis, data visualization, data-driven web development, or bots, the data offered by this community can be extremely useful and effective.
11. Yelp Open Datasets
The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, made available for use in personal, educational, and academic pursuits.
There are 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan areas included in Yelp Open Datasets.
You can use them for different purposes. Since they are available as JSON files, you can use them in order to teach students about databases. You can use them to learn NLP or for sample production data while you understand how to design mobile apps.
In this dataset, you will find each file composed of a single object type, one JSON-object per-line.
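Files with one JSON object per line are easy to process because you can read them line by line instead of loading everything at once. Here is a minimal sketch of that idea. The two records and their field names (`business_id`, `stars`) are made up for illustration and are not taken from the actual Yelp files, and `read_json_lines` is a hypothetical helper.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON-lines file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for a real file; field names are hypothetical.
sample_file = io.StringIO(
    '{"business_id": "b1", "stars": 4.5}\n'
    '{"business_id": "b2", "stars": 3.0}\n'
)

records = list(read_json_lines(sample_file))
average_stars = sum(r["stars"] for r in records) / len(records)
print(len(records), average_stars)
```

With a real file, you would replace `sample_file` with `open(path, encoding="utf-8")`. Because the function is a generator, it works the same way on files too large to fit in memory, as long as you iterate instead of calling `list()`.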
12. UNICEF Dataset
Since UNICEF concerns itself with a wide variety of critical issues, it has compiled relevant data on education, child labor, child disability, child mortality, maternal mortality, water and sanitation, low birth-weight, antenatal care, pneumonia, malaria, iodine deficiency disorder, female genital mutilation/cutting, and adolescents.
UNICEF’s open datasets published on the IATI Registry (http://www.iatiregistry.org/publisher/unicef) have been extracted directly from UNICEF’s operating system (VISION) and other data systems, and they reflect inputs made by individual UNICEF offices.
The good thing is that there is a regular update when it comes to these datasets. Every month, the data is updated in order to make it more comprehensive, reliable and accurate.
You can freely and easily access this data. In order to do so, you can download this data in CSV format. You can also preview sample data prior to downloading it.
While anybody can explore and visualize UNICEF’s datasets, there are three principal publishers:
UNICEF’s AID TRANSPARENCY PORTAL : You can far more easily access the datasets if you use this portal. It also includes details for each country that UNICEF works in.
Publisher d-portal : It is, at the moment, in BETA. With this portal, you can explore IATI data. You can search for information related to development activities, budgets, and so on, and explore it country by country.
Publisher’s data platform : On this platform, you can easily access statistics, charts, and metrics on data accessed via the IATI Registry. If you click on the headers, you can also sort many of the tables that you see on the platform. You will also find many of the datasets in the platforms in machine-readable JSON format.
13. Kaggle
Kaggle is great because it promotes the use of different dataset publication formats. However, the better part is that it strongly recommends that the dataset publishers share their data in an accessible, non-proprietary format.
The platform supports open and accessible data formats. It is important not just for access but also for whatever you want to do with this data. Therefore, Kaggle Dataset clearly defines the file formats which are recommended while sharing data.
The unique thing about Kaggle datasets is that they are not just a data repository. Each dataset stands for a community that enables you to discuss data, find public code and techniques, and conceptualize your own projects in Kernels.
CSV, JSON, SQLite, archives, BigQuery, and so on are the file types that Kaggle supports. You can find a variety of resources to start working on your open data project.
The best part is that Kaggle allows you to publish and share datasets privately or publicly.
14. LODUM
It is the Open Data initiative of the University of Münster. Under this initiative, it is made possible for anyone to access any public information about the university in machine-readable formats. You can easily access and reuse it as per your needs.
Under this project, open data about scientific artifacts is encoded as linked data and made available.
With the help of Linked Data, it is possible to share and use data, ontologies and various metadata standards. It is, in fact, envisaged that it will be the accepted standard for providing metadata, and the data itself on the Web.
The LODUM team has co-initiated LinkedUniversities.org and LinkedScience.org.
You can use SPARQL editor or SPARQL package of R to analyze data.
The SPARQL package lets you connect to a SPARQL endpoint over HTTP and pose a SELECT query or an update query (LOAD, INSERT, DELETE).
15. UCI Machine Learning Repository
It serves as a comprehensive repository of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
In this repository, there are, at present, 463 datasets as a service to the machine learning community.
The Center for Machine Learning and Intelligent Systems at the University of California, Irvine hosts and maintains it. David Aha originally created it as a graduate student at UC Irvine.
Since then, students, educators, and researchers all over the world make use of it as a reliable source of machine learning datasets.
Each dataset has its own webpage that lists all the known details, including any relevant publications that investigate it. You can download these datasets as ASCII files, often in the useful CSV format.
The details of datasets are summarized by aspects like attribute types, number of instances, number of attributes and year published that can be sorted and searched.
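Once you have downloaded one of these CSV-style files, loading it takes only a few lines. Here is a minimal sketch that assumes the file has no header row, which is the case for some files in the repository, so the column names are supplied separately. The sample rows and the column names are made up for illustration, and `load_rows` is a hypothetical helper.

```python
import csv
import io


def load_rows(handle, columns):
    """Read a headerless CSV file into a list of dicts keyed by column name."""
    reader = csv.reader(handle)
    # Skip blank lines, which sometimes appear at the end of data files.
    return [dict(zip(columns, row)) for row in reader if row]


# Hypothetical column names; real ones come from the dataset's webpage.
COLUMNS = ["length", "width", "label"]

# Made-up sample standing in for a downloaded file.
sample_file = io.StringIO("5.1,3.5,a\n4.9,3.0,b\n\n")

rows = load_rows(sample_file, COLUMNS)
print(rows)
```

Note that `csv.reader` returns every value as a string, so numeric columns need an explicit `float()` conversion before you feed them to a machine learning algorithm.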
Open Data Portals and Search Engines:
While there are plenty of datasets published by numerous agencies every year, very few datasets become recognized and established.
The reason so few datasets endure as useful resources is that it is a challenge to develop, manage, and provide data in a way that people and organizations find useful and easy to use.
However, below is a list of a few other important open data portals and platforms that let users access open data quite easily, study its impact, and glean valuable insights.

Google dataset search
Dataverse
Open Data Kit
CKAN
Open Data Monitor
Plenar.io
Open Data Impact Map

Conclusion
Open data is the order of the day. The world has gradually started moving towards open systems and open data is rightly in sync with that.
The businesses and organizations that leverage open data will gain a competitive edge and will be able to dominate the future.

How to Read and Write Deeply Partitioned Files Using Apache Spark

Here’s what we’ll cover:

Prerequisite

Setup

False Starts

My Solution

Conclusion

Data-Driven Reality – Exploring the Power of AI, ML, Virtual and Augmented Reality

What Exactly Is Data?

Virtual Reality and Augmented Reality

Artificial Intelligence and Machine Learning

What can AI actually do?

How can machine learning help?

The Apache Kafka Handbook – How to Get Started Using Kafka

Why Should You Learn Apache Kafka?

Table of Contents

Event Streaming and Event-Driven Architectures

Core Kafka Concepts

Event Messages in Kafka

Topics in Kafka

Partitions in Kafka

Offsets in Kafka

Brokers in Kafka

Replication in Kafka

Producers in Kafka

Consumers in Kafka

Consumer Groups in Kafka

Kafka Zookeeper

How to Install Kafka on Your Computer

Install Kafka on macOS

Install Kafka on Windows (WSL2) and Linux

How to Start Zookeeper and Kafka

How to Start Kafka on macOS

How to Start Kafka on Windows (WSL2) and Linux

The Kafka CLI

How to List Topics

How to Create a Topic

How to Describe Topics

How to Partition a Topic

How to Set a Replication Factor

How to Delete a Topic

How to Use kafka-console-producer

How to Use kafka-console-consumer

How to Use kafka-consumer-groups

How to Build a Kafka Client App with Java

Preliminaries

How to Set Up the Project

How to Install the Dependencies

How to Create a Kafka Producer

How to Send Multiple Messages and Use Callbacks

How to Create a Kafka Consumer

How to Shut Down the Consumer

Where to Take it from Here

How to Use Object Storage for Data Parallelization and Experimentation

Block Storage vs Object Storage

What is Block Storage?

What is Object Storage?

What Problems Does Object Storage Solve?

What are Data Lakes?

How Data Experimentation and Parallelization Work with Object Storage

Why is this Git-like feature important?

How to Install LakeFS

How to Create a Repository in LakeFS

How to Add Data to your LakeFS Repository

How to Install the LakeFS CLI

In Closing

A Quick Overview of the Apache Hadoop Framework

What is Apache Hadoop?

Why is Hadoop useful?

Core Hadoop

Hadoop Ecosystem

More Information:

I ranked every Intro to Data Science course on the internet, based on thousands of data points

Now onto introductions to data science.

How we picked courses to consider

How we evaluated courses

What is the data science process?

Basic coding, stats, and probability experience required

Our pick for the best intro to data science course is…
