Scala - freeCodeCamp.org

How to Read and Write Deeply Partitioned Files Using Apache Spark

Arun Shanmugam Kumar — Sun, 31 Aug 2025 21:23:23 +0000

If you use Apache Spark to write your data pipeline, you might need to export or copy data from a source to destination while preserving the partition folders between the source and destination.

When I researched online on how to do this in Spark, I found very few tutorials giving an end-to-end solution that worked – especially when the partitions are deeply nested and you don't know beforehand the values these folder names will take (for example year=*/month=*/day=*/hour=*/*.csv).

In this tutorial, I have provided one such implementation using Spark.

Prerequisite

To follow along in this tutorial, you need to have basic understanding of distributed computing using frameworks like Hadoop and Spark, as well as code that’s programmed in Object Oriented languages like Scala/Java. The code is tested using the below dependencies:

Scala 2.12+
Java 17 (earlier versions might work)
Sbt

Setup

I’m assuming you have partition folders that are created at the source with the below pattern (which is a standard partition column involving date-time):

year/month/day/hour

Crucially, as I mentioned above, I’m assuming that you don’t know the full name of the folders – except that they have some constant prefix pattern in them.

False Starts

If you think of using recursiveFileLookup and pathGlobFilter option while both reading and writing, it doesn’t quite work, as the above functions are only available on read API.
If you think of parameterizing the reading and writing based on all the possible year/month/day/hour combination and skip export if the corresponding partition folder is not found, then it might work but won’t be very efficient.

My Solution

After a few trials and errors and searching in Stack Overflow and the Spark documentation, I hit upon an idea to use a combination of input_file_name(), regexp_extract(), and partitionBy() API's on the write side to achieve the end goal. You can find a Scala-based sample code below:

package main.scala.blog

/**
*  Spark stream example code to read and write from a partitioned folder
*  to a partitioned folder without explicitly known datetime.
*/

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions.{udf, input_file_name, col, lit, regexp_extract}

object PartitionedReaderWriter {

    def main(args: Array[String]) {
        // 1.
        val spark = SparkSession
                    .builder
                    .appName("PartitionedReaderWriterApp")
                    .getOrCreate()

        val sourceBasePath = "data/partitioned_files_source/user"
        // 2.
        val sourceDf = spark.read
                            .format("csv")
                            .schema("State STRING, Color STRING, Count INT")
                            .option("header", "true")
                            .option("pathGlobFilter", "*.csv")
                            .option("recursiveFileLookup", "true")
                            .load(sourceBasePath)

        val destinationBasePath = "data/partitioned_files_destination/user"
        // 3.
        val writeDf = sourceDf
                        .withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
                        .withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
                        .withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1))
                        .withColumn("hour", regexp_extract(input_file_name(), "hour=(\\d{2})", 1))

        // 4.
        writeDf.write
                .format("csv")
                .option("header", "true")
                .mode("overwrite")
                .partitionBy("year", "month", "day", "hour")
                .save(destinationBasePath)

        // 5.
        spark.stop()        
    }
}

Here’s what’s going on in the above code:

Inside main method, you begin by adding Spark initialization setup code to create a Spark session.
You read the data from sourceBasePath using spark read() API with the format as csv (you can also optionally provide the schema). Options recursiveFileLookup and pathGlobFilter are needed to recursively read through nested folders and to specify any csv file, respectively.
Th next section contains the core logic where you can use input_file_name() to return the full path of the file and regexp_extract() to extract year , month, day, and hour from the corresponding subfolders in the path and store them as auxiliary columns on the dataframe.
Finally, you write the dataframe using the csv format again and crucially use partitionBy to specify the previously created auxiliary columns as partition columns. Then save the dataframe in the destinationBasePath.
After the copy is done, you stop the Spark session by calling the stop() API.

Conclusion

In this article I have shown you how to export / copy a deeply nested data files from source to destination using Apache Spark in an efficient way. I hope you find it useful!

You can read my other articles at https://www.beyonddream.me.

How to Use Thread.sleep Without Blocking on the JVM

freeCodeCamp — Mon, 21 Dec 2020 17:39:16 +0000

By Daniel Sebban

JVM Languages like Java and Scala have the ability to run concurrent code using the Thread class. Threads are notoriously complex and very error prone, so having a solid understanding of how they work is essential.

Let's start with the Javadoc for Thread.sleep:

Causes the currently executing thread to sleep (temporarily cease execution) for the specified number of milliseconds

What are the implications of **cease execution**, also known as blocking, and what does it mean? Is it bad? And if so can we achieve non-blocking sleep?

What We'll Cover in This Article

This post covers a lot of ground and hopefully you will learn a lot of cool things.

What happens at the OS level when sleeping?
The problem with sleeping
Project Loom and virtual threads
Functional programming and design
ZIO Scala library for concurrency

Yes, all of this is coming up below.

But first, let’s start with this simple Scala snippet that we will change throughout the post to achieve what we want:

println("a")
Thread.sleep(1000)
println("b")

It's quite simple: it prints “a” and then 10 seconds later in prints “b”

Let’s focus on Thread.sleep and try to understand HOW it achieves sleeping. Once we understand the how, we will be able to see the problem and define it more concretely.

How Does Sleeping Work at the OS Level?

Here is what happens when you call Thread.sleep under the hood.

It calls the thread API of the underlying OS
Because the JVM uses a one to one mapping between Java and kernel threads, it asks the OS to give up the thread’s “rights” to the CPU for the specified time
When the time has elapsed the OS scheduler will wake the thread via an interrupt (this is efficient) and assign it a CPU slice to allow it to resume running

The critical point here is that the sleeping thread is completely taken out and is not reusable while sleeping.

Limitations of Threads

Here are few important limitations that come with threads:

There is a limit to how many threads you can create. After around 30K, you will get this error:

java.lang.OutOfMemoryError : unable to create new native Thread

JVM Threads can be expensive memory-wise to create, as they come with a dedicated stack
Too many JVM threads will incur overhead because of expensive context switches and the way they share finite hardware resources

Now that we understand more about what goes on behind the scenes let’s go back to the sleeping problem.

The Problem with Sleeping

Let’s define the problem more concretely and run a snippet to show the issue we are facing. We will use this function to illustrate the point:

def task(id: Int): Runnable = () => 
{
  println(s"${Thread.currentThread().getName()} start-$id")
  Thread.sleep(10000)
  println(s"${Thread.currentThread().getName()} end-$id")
}

This simple function will

print **start** followed by the thread id
sleeps for 10 seconds
print **end** followed by the thread id

Your mission if you accept it is to run 2 tasks concurrently with 1 thread

We want to run 2 tasks concurrently, meaning the whole program should take a total of 10 seconds. But we only have 1 thread available.

Are you up for this challenge?

Let’s play a little bit with the number of tasks and threads to get a sense of what the problem is exactly.

`1 task -> 1 thread`

new Thread(task(1)).start()

12:11:08 INFO  Thread-0 start-1
12:11:18 INFO  Thread-0 end-1

Let’s fire up jvisualvm to check out what the thread is doing:

You can see that the Thread-0 is in the purple sleeping state.

Hitting the thread dump button will print this:

"Thread-0" #13 prio=5 os_prio=31 tid=0x00007f9a3e0e2000 nid=0x5b03 waiting on condition [0x0000700004ac8000]  
  java.lang.Thread.State: TIMED_WAITING (sleeping)
  at java.lang.Thread.sleep(Native Method)
  at example.Blog$.$anonfun$task$1(Blog.scala:7)
  at example.Blog$$$Lambda$2/1359484306.run(Unknown Source)
  at java.lang.Thread.run(Thread.java:748)
  Locked ownable synchronizers:        - None

Clearly, this thread is not usable anymore until it finishes to sleep.

`2 tasks -> 1 thread`

Let’s illustrate the problem by running 2 such tasks with only one thread available:

import java.util.concurrent.Executors

// an executor with only 1 thread available
val oneThreadExecutor = Executors.newFixedThreadPool(1)

// send 2 tasks to the executor
(1 to 2).foreach(id =>
   oneThreadExecutor.execute(task(id)))

We get this output:

2020.09.28 21:49:56 INFO  pool-1-thread-1 start-1
2020.09.28 21:50:07 INFO  pool-1-thread-1 end-1
2020.09.28 21:50:07 INFO  pool-1-thread-1 start-2
2020.09.28 21:50:17 INFO  pool-1-thread-1 end-2

You can see the purple color (sleeping state) for the pool-1-thread-1. The tasks have no choice but to run one after the other because the thread is taken out each time Thread.sleep is used.

`2 tasks -> 2 threads`

Let’s run the same code with 2 threads available. We get this:

// an executor with 2 threads available
val oneThreadExecutor = Executors.newFixedThreadPool(2)

// send 2 tasks to the executor
(1 to 2).foreach(id =>
   oneThreadExecutor.execute(task(id)))

2020.09.28 22:42:04 INFO  pool-1-thread-2 start-2
2020.09.28 22:42:04 INFO  pool-1-thread-1 start-1
2020.09.28 22:42:14 INFO  pool-1-thread-1 end-1
2020.09.28 22:42:14 INFO  pool-1-thread-2 end-2

Each thread can run one task at a time. We finally accomplished what we wanted, running 2 tasks concurrently, and the whole program finished in 10 seconds.

That was easy because we used 2 threads (pool-1-thread-1 and pool-1-thread-2), but we want to do the same with only 1 thread.

Let’s identify the problem and then find a solution.

The problem : `Thread.sleep is blocking`

We now understand that we cannot use Thread.sleep – it blocks the thread.

This makes it unusable until it resumes, preventing us from running 2 tasks concurrently.

Fortunately, there are solutions, which we'll discuss next.

First Solution: Upgrade your JVM with Project Loom

I mentioned before that JVM threads map one to one to OS threads. And this fatal design mistake leads us here.

Project Loom aims to correct that by adding virtual threads.

Here is our code rewritten using virtual threads from Loom:

Thread.startVirtualThread(() -> {
  System.out.println("a")  
  Thread.sleep(1000)  
  System.out.println("b")
});

The amazing thing is that the Thread.sleep will not block anymore! It's fully async. And on top of that, virtual threads are super cheap. You could create hundreds of thousands of them without overhead or limitations.

All our problems are solved now – well besides the fact that Project Loom will not be available until at least JDK 17 (as of now scheduled for September 2021).

Oh well, let’s go back and try to solve the sleeping problem with what the JVM currently gives us.

Key insight: You can express sleeping in terms of scheduling a task in the future

If you tell your boss that you are busy and you will resume your work in 10 minutes, your boss does not know that you are about to take a nap. They only see that you started your work in the morning then paused for 10 minutes then resumed.

This:

start
sleep(10)
end

is equivalent from the outside to this:

start
resumeIn(10s, end)

What we did above is to SCHEDULE the task to end in 10 seconds.

That’s it, we don’t need to sleep anymore. We just need to be able to schedule things in the future instead.

We've reduced one problem with another, one that is easier and has a simpler solution.

The scheduling problem

Luckily for us, scheduling tasks is very simple to do. We just have to switch out the executor as follows:

val oneThreadScheduleExecutor = Executors.newScheduledThreadPool(1)

We can now use the schedule function instead of execute:

oneThreadScheduleExecutor.schedule
(task(1),10, TimeUnit.SECONDS)

Well that’s not exactly what we want. We want to split the start and end printing by 10 seconds, so let’s change our task function as follows:

def nonBlockingTask(id: Int): Runnable = () => {
  println(s"${Thread.currentThread().getName()} start-$id")
  val endTask: Runnable = () => 
  {
    println(s"${Thread.currentThread().getName()} end-$id")
  }
  //instead of Thread.sleep for 10s, we schedule it in the future, no     more blocking!
  oneThreadScheduleExecutor.schedule(endTask, 10, TimeUnit.SECONDS)
  }

2020.09.28 23:35:45 INFO  pool-1-thread-1 start-1
2020.09.28 23:35:45 INFO  pool-1-thread-1 start-2
2020.09.28 23:35:56 INFO  pool-1-thread-1 end-1
2020.09.28 23:35:56 INFO  pool-1-thread-1 end-2

Yes! We did it! Only one thread and 2 concurrent tasks that “sleep” 10 seconds each.

Ok great, but you cannot really write code like this. What if you want another task in the middle as follows:

00:00:00 start
00:00:10 middle
00:00:20 end

You would need to change the implementation of the nonBlockingTask and add another call to schedule in there. And that will get pretty messy very quickly.

How to Use Functional Programming to Write a DSL with a Non-Blocking Sleep

Functional Programming in Scala is a joy, and writing a DSL (domain-specific language) using FP principles is quite easy.

Let’s start at the end. We would like our final program to look something like this:

def nonBlockingFunctionalTask(id: Int) = {
  Print(id,"start") andThen 
  Print(id,"middle").sleep(1000) andThen
  Print(id,"end").sleep(1000)
}

This mini-language will achieve exactly the same behavior as our previous solution but without exposing all the nasty internals of the scheduled executor and threads.

The model

Let’s define our data types:

object Task {
sealed trait Task { self =>
  def andThen(other: Task) = AndThen(self,other)
  def sleep(millis: Long) = Sleep(self,millis)
}

case class AndThen(t1: Task, t2: Task) extends Task
case class Print(id: Int, value: String) extends Task 
case class Sleep(t1: Task, millis: Long) extends Task

In FP the data types only hold data and no behavior. So this whole code does “nothing" – it just captures the language structure and information we want.

We need 2 functions:

sleep to make a task sleep
andThen to chain tasks

Notice that their implementation does nothing. It just wraps it in the correct class and that’s it.

Let’s use our nonBlockingFunctionalTask function:

import Task._
//create 2 tasks, this does not run them, no threads involved here
(1 to 2).toList.map(nonBlockingFunctionalTask)

It’s a description of the problem. It does nothing, it just builds a list with 2 tasks, each one describing what to do.

If we print the result in the REPL we get this:

res3: List[Task] = List(
//first task  
AndThen(AndThen(Print(1,start),Sleep(Print(1,middle),10000)),Sleep(Print(1,end),10000)), 
//second task  
AndThen(AndThen(Print(2,start),Sleep(Print(2,middle),10000)),Sleep(Print(2,end),10000))
)

Let’s write the interpreter that will turn this tree into one that's actually running the tasks.

The interpreter

In FP the function that turns a description into an executable program is called the interpreter. It takes the description of the program, the model, and interprets it into an executable form. Here it will execute and schedule the tasks directly.

We first need a Stack that will allow us to encode the dependencies between tasks. Think that start >>= middle >>= end will each be pushed to the stack and then popped in order of execution. This will be evident in the implementation.

And now the interpreter (don’t worry if you don't understand this code, it’s a bit complicated, there is a simpler solution coming up):

def interpret(task: Task, executor: ScheduledExecutorService): Unit = {
  def loop(current: Task, stack: Stack[Task]): Unit =
  current match {
    case AndThen(t1, t2) =>
      loop(t1,stack.push(t2))
    case Print(id, value) =>  
      stack.pop match {
        case Some((t2, b)) => 
          executor.execute(() => {
          println(s"${Thread.currentThread().getName()} $value-$id")
          })   
        loop(t2,b)
        case None => 
          executor.execute(() => {
          println(s"${Thread.currentThread().getName()} $value-$id")
          })
    case Sleep(t1,millis) => 
      val r: Runnable = () =>{loop(t1,stack)}
      executor.schedule(r, millis, TimeUnit.MILLISECONDS)
}
loop(task,Nil)
}

And the output is what we want:

2020.09.29 00:06:39 INFO  pool-1-thread-1 start-1
2020.09.29 00:06:39 INFO  pool-1-thread-1 start-2
2020.09.29 00:06:50 INFO  pool-1-thread-1 middle-1
2020.09.29 00:06:50 INFO  pool-1-thread-1 middle-2
2020.09.29 00:07:00 INFO  pool-1-thread-1 end-1
2020.09.29 00:07:00 INFO  pool-1-thread-1 end-2

One thread running 2 concurrent sleeping tasks. That’s a lot of code and a lot of work. As usual, you should always ask yourself whether there a library that already solves this problem. Turns out there is: ZIO.

Non-Blocking Sleep in ZIO

**[ZIO](https://zio.dev/)** is a functional library for asynchronous and concurrent programming. It works in a similar fashion to our little DSL, because it gives you a few types you can mix and match to describe your program and nothing more.

And then it gives us an interpreter that lets you run a ZIO program.

As I said this interpreter pattern is pervasive in the world of FP. Once you get it, a new world opens up to you.

`ZIO.sleep` – a better version of `Thread.sleep`

ZIO gives us the ZIO.sleep function, a non-blocking version of Thread.sleep. Here is our function written using ZIO:

import zio._
import zio.console._
import zio.duration._
object ZIOApp extends zio.App {
def zioTask(id: Int) = 
  for {
  _ <- putStrLn(s"${Thread.currentThread().getName()} start-$id")
  _ <- ZIO.sleep(10.seconds)
  _ <- putStrLn(s"${Thread.currentThread().getName()} end-$id")
} yield ()

It’s strikingly similar to the first snippet:

def task(id: Int): Runnable = () => 
{
  println(s"${Thread.currentThread().getName()} start-$id")
  Thread.sleep(10000)
  println(s"${Thread.currentThread().getName()} end-$id")
}

The clear difference is the for syntax that allows us to chain statements with the ZIO type. It's very similar to the andThen function from our previous mini-language.

As before with our mini-language, this program is just a description. It’s pure data, and it does nothing. To do something we need the interpreter.

The ZIO interpreter

To interpret a ZIO program, you just have to extend the ZIO.App interface and put it in the run method and ZIO will take care of running it, like this:

object ZIOApp extends zio.App 
{ 
override def run(args: List[String]) = {
  ZIO
  //start 2 ZIO tasks in parallel
  .foreachPar((1 to 2))(zioTasks)
  //complete program when done
  .as(ExitCode.success) 
}

And we get this output – the tasks complete correctly in 10 seconds:

2020.09.29 00:45:12 INFO  zio-default-async-3-1594199808 start-2
2020.09.29 00:45:12 INFO  zio-default-async-2-1594199808 start-1
2020.09.29 00:45:33 INFO  zio-default-async-7-1594199808 end-1
2020.09.29 00:45:33 INFO  zio-default-async-8-1594199808 end-2

Takeaways

Each JVM Thread maps to an OS thread, in a one to one fashion. And this is the root of a lot of problems.
Thread.sleep is bad! It blocks the current thread and renders it unusable for further work.
Project Loom (that will be available in JDK 17) will solve a lot of issues. Here is a cool talk about it.
You can use ScheduledExecutorService to achieve non-blocking sleep.
You can use Functional Programming to model a language where doing sleep is non-blocking.
The ZIO library provides a non-blocking sleep out of the box.

How to implement Change Data Capture using Kafka Streams

freeCodeCamp — Mon, 20 Jan 2020 21:10:30 +0000

By Luca Florio

Change Data Capture (CDC) involves observing the changes happening in a database and making them available in a form that can be exploited by other systems.

One of the most interesting use-cases is to make them available as a stream of events. This means you can, for example, catch the events and update a search index as the data are written to the database.

Interesting right? Let's see how to implement a CDC system that can observe the changes made to a NoSQL database (MongoDB), stream them through a message broker (Kafka), process the messages of the stream (Kafka Streams), and update a search index (Elasticsearch)!?

TL;DR

The full code of the project is available on GitHub in this repository. If you want to skip all my jibber jabber and just run the example, go straight to the How to run the project section near the end of the article!?

Use case & infrastructure

We run a web application that stores photos uploaded by users. People can share their shots, let others download them, create albums, and so on. Users can also provide a description of their photos, as well as Exif metadata and other useful information.

We want to store such information and use it to improve our search engine. We will focus on this part of our system that is depicted in the following diagram.

Photo information storage architecture

The information is provided in JSON format. Since I like to post my shots on Unsplash, and the website provides free access to its API, I used their model for the photo JSON document.

Once the JSON is sent through a POST request to our server, we store the document inside a MongoDB database. We will also store it in Elasticsearch for indexing and quick search.

However, we love long exposure shots, and we would like to store in a separate index a subset of information regarding this kind of photo. It can be the exposure time, as well as the location (latitude and longitude) where the photo has been taken. In this way, we can create a map of locations where photographers usually take long exposure photos.

Here comes the interesting part: instead of explicitly calling Elasticsearch in our code once the photo info is stored in MongoDB, we can implement a CDC exploiting Kafka and Kafka Streams.

We listen to modifications to MongoDB oplog using the interface provided by MongoDB itself. When the photo is stored we send it to a photo Kafka topic. Using Kafka Connect, an Elasticsearch sink is configured to save everything sent to that topic to a specific index. In this way, we can index all photos stored in MongoDB automatically.

We need to take care of the long exposure photos too. It requires some processing of the information to extract what we need. For this reason, we use Kafka Streams to create a processing topology to:

Read from the photo topic
Extract Exif and location information
Filter long exposure photos (exposure time > 1 sec.)
Write to a long-exposure topic.

Then another Elasticsearch sink will read data from the long-exposure topic and write it to a specific index in Elasticsearch.

It is quite simple, but it's enough to have fun with CDC and Kafka Streams! ?

Server implementation

Let's have a look at what we need to implement: our server exposing the REST APIs!

Models and DAO

First things first, we need a model of our data and a Data Access Object (DAO) to talk to our MongoDB database.

As I said, the model for the photo JSON information is the one used by Unsplash. Check out the free API documentation for an example of the JSON we will use.

I created the mapping for the serializaion/deserialization of the photo JSON using spray-json. I'll skip the details about this, if you are curious just look at the repo!

Let's focus on the model for the long exposusure photo.

case class LongExposurePhoto(id: String, exposureTime: Float, createdAt: Date, location: Location)

object LongExposurePhotoJsonProtocol extends DefaultJsonProtocol {
  implicit val longExposurePhotoFormat:RootJsonFormat[LongExposurePhoto] = jsonFormat(LongExposurePhoto, "id", "exposure_time", "created_at", "location")
}

This is quite simple: we keep from the photo JSON the information about the id, the exposure time (exposureTime), when the photo has been created (createdAt), and the location where it has been taken. The location comprehends the city, the country, and the position composed of latitude and longitude.

The DAO consists of just the PhotoDao.scala class.

class PhotoDao(database: MongoDatabase, photoCollection: String) {

  val collection: MongoCollection[Document] = database.getCollection(photoCollection)

  def createPhoto(photo: Photo): Future[String] = {
    val doc = Document(photo.toJson.toString())
    doc.put("_id", photo.id)
    collection.insertOne(doc).toFuture()
      .map(_ => photo.id)
  }
}

Since I want to keep this example minimal and focused on the CDC implementation, the DAO has just one method to create a new photo document in MongoDB.

It is straightforward: create a document from the photo JSON, and insert it in mongo using id as the one of the photo itself. Then, we can return the id of the photo just inserted in a Future (the MongoDB API is async).

Kafka Producer

Once the photo is stored inside MongoDB, we have to send it to the photo Kafka topic. This means we need a producer to write the message in its topic. The PhotoProducer.scala class looks like this.

case class PhotoProducer(props: Properties, topic: String) {

  createKafkaTopic(props, topic)
  val photoProducer = new KafkaProducer[String, String](props)

  def sendPhoto(photo: Photo): Future[RecordMetadata] = {
    val record = new ProducerRecord[String, String](topic, photo.id, photo.toJson.compactPrint)
    photoProducer.send(record)
  }

  def closePhotoProducer(): Unit = photoProducer.close()
}

I would say that this is pretty self-explanatory. The most interesting part is probably the createKafkaTopic method that is implemented in the utils package.

def createKafkaTopic(props: Properties, topic: String): Unit = {
    val adminClient = AdminClient.create(props)
    val photoTopic = new NewTopic(topic, 1, 1)
    adminClient.createTopics(List(photoTopic).asJava)
  }

This method creates the topic in Kafka setting 1 as a partition and replication factor (it is enough for this example). It is not required, but creating the topic in advance lets Kafka balance partitions, select leaders, and so on. This will be useful to get our stream topology ready to process as we start our server.

Event Listener

We have the DAO that writes in MongoDB and the producer that sends the message in Kafka. We need to glue them together in some way so that when the document is stored in MongoDB the message is sent to the photo topic. This is the purpose of the PhotoListener.scala class.

case class PhotoListener(collection: MongoCollection[Document], producer: PhotoProducer) {

  val cursor: ChangeStreamObservable[Document] = collection.watch()

  cursor.subscribe(new Observer[ChangeStreamDocument[Document]] {
    override def onNext(result: ChangeStreamDocument[Document]): Unit = {
      result.getOperationType match {
        case OperationType.INSERT => {
          val photo = result.getFullDocument.toJson().parseJson.convertTo[Photo]
          producer.sendPhoto(photo).get()
          println(s"Sent photo with Id ${photo.id}")
        }
        case _ => println(s"Operation ${result.getOperationType} not supported")
      }
    }
    override def onError(e: Throwable): Unit = println(s"onError: $e")
    override def onComplete(): Unit = println("onComplete")})
}

We exploit the Chage Streams interface provided by the MongoDB scala library.

Here is how it works: we watch() the collection where photos are stored. When there is a new event (onNext) we run our logic.

For this example we are interested only in the creation of new documents, so we explicitly check that the operation is of type OperationType.INSERT. If the operation is the one we are interested in, we get the document and convert it to a Photo object to be sent by our producer.

That's it! With few lines of code we connected the creation of documents in MongoDB to a stream of events in Kafka.?

As a side note, be aware that to use the Change Streams interface we have to setup a MongoDB replica set. This means we need to run 3 instances of MongoDB and configure them to act as a replica set using the following command in mongo client:

rs.initiate({_id : "r0", members: [{ _id : 0, host : "mongo1:27017", priority : 1 },{ _id : 1, host :"mongo2:27017", priority : 0 },{ _id : 2, host : "mongo3:27017", priority : 0, arbiterOnly: true }]})

Here our instances are the containers we will run in the docker-compose file, that is mongo1, mongo2, and mongo3.

Processing Topology

Time to build our processing topology! It will be in charge of the creation of the long-exposure index in Elasticsearch. The topology is described by the following diagram:

Processing topology

and it is implemented in the LongExposureTopology.scala object class. Let's analyse every step of our processing topology.

val stringSerde = new StringSerde

val streamsBuilder = new StreamsBuilder()

val photoSource: KStream[String, String] = streamsBuilder.stream(sourceTopic, Consumed.`with`(stringSerde, stringSerde))

The first step is to read from a source topic. We start a stream from the sourceTopic (that is photo topic) using the StreamsBuilder() object. The stringSerde object is used to serialise and deserialise the content of the topic as a String.

Please notice that at each step of the processing we create a new stream of data with a KStream object. When creating the stream, we specify the key and the value produced by the stream. In our topology the key will always be a String. In this step the value produced is still a String.

val covertToPhotoObject: KStream[String, Photo] =
      photoSource.mapValues((_, jsonString) => {
        val photo = jsonString.parseJson.convertTo[Photo]
        println(s"Processing photo ${photo.id}")
        photo
      })

The next step is to convert the value extracted from the photo topic into a proper Photo object.

So we start from the photoSource stream and work on the values using the mapValues function. We simply parse the value as a JSON and create the Photo object that will be sent in the convertToPhotoObject stream.

val filterWithLocation: KStream[String, Photo] = covertToPhotoObject.filter((_, photo) => photo.location.exists(_.position.isDefined))

There is no guarantee that the photo we are processing will have the info about the location, but we want it in our long exposure object. This step of the topology filters out from the covertToPhotoObject stream the photos that have no info about the location, and creates the filterWithLocation stream.

val filterWithExposureTime: KStream[String, Photo] = filterWithLocation.filter((_, photo) => photo.exif.exists(_.exposureTime.isDefined))

Another important fact for our processing is the exposure time of the photo. For this reason, we filter out from the filterWithLocation stream the photos without exposure time info, creating the filterWithExposureTime.

val dataExtractor: KStream[String, LongExposurePhoto] =
      filterWithExposureTime.mapValues((_, photo) => LongExposurePhoto(photo.id, parseExposureTime(photo.exif.get.exposureTime.get), photo.createdAt, photo.location.get))

We now have all we need to create a LongExposurePhoto object! That is the result of the dataExtractor: it takes the Photo coming from the filterWithExposureTime stream and produces a new stream containing LongExposurePhoto.

val longExposureFilter: KStream[String, String] =
      dataExtractor.filter((_, item) => item.exposureTime > 1.0).mapValues((_, longExposurePhoto) => {
        val jsonString = longExposurePhoto.toJson.compactPrint
        println(s"completed processing: $jsonString")
        jsonString
      })

We are almost there. We now have to keep the photos with a long exposure time (that we decided is more then 1 sec.). So we create a new longExposureFilter stream without the photos that are not long exposure.

This time we also serialise the LongExposurePhotos into the corresponding JSON string, which will be written to Elasticsearch in the next step.

longExposureFilter.to(sinkTopic, Produced.`with`(stringSerde, stringSerde))

streamsBuilder.build()

This is the last step of our topology. We write to our sinkTopic (that is long-exposure topic) using the string serialiser/deserialiser what is inside the longExposureFilter stream. The last command simply builds the topology we just created.

Now that we have our topology, we can use it in our server. The PhotoStreamProcessor.scala class is what manages the processing.

class PhotoStreamProcessor(kafkaProps: Properties, streamProps: Properties, sourceTopic: String, sinkTopic: String) {

  createKafkaTopic(kafkaProps, sinkTopic)
  val topology: Topology = LongExposureTopology.build(sourceTopic, sinkTopic)
  val streams: KafkaStreams = new KafkaStreams(topology, streamProps)

  sys.ShutdownHookThread {
    streams.close(java.time.Duration.ofSeconds(10))
  }

  def start(): Unit = new Thread {
    override def run(): Unit = {
      streams.cleanUp()
      streams.start()
      println("Started long exposure processor")
    }
  }.start()
}

First we create the sinkTopic, using the same utility method we saw before. Then we build the stream topology and initialize a KafkaStreams object with that topology.

To start the stream processing, we need to create a dedicated Thread that will run the streaming while the server is alive. According to the official documentation, it is always a good idea to cleanUp() the stream before starting it.

Our PhotoStreamProcessor is ready to go!?

REST API

The server exposes REST APIs to send it the photo information to store. We make use of Akka HTTP for the API implementation.

trait AppRoutes extends SprayJsonSupport {

  implicit def system: ActorSystem
  implicit def photoDao: PhotoDao
  implicit lazy val timeout = Timeout(5.seconds)

  lazy val healthRoute: Route = pathPrefix("health") {
    concat(
      pathEnd {
        concat(
          get {
            complete(StatusCodes.OK)
          }
        )
      }
    )
  }

  lazy val crudRoute: Route = pathPrefix("photo") {
    concat(
      pathEnd {
        concat(
          post {
            entity(as[Photo]) { photo =>
              val photoCreated: Future[String] =
                photoDao.createPhoto(photo)
              onSuccess(photoCreated) { id =>
              complete((StatusCodes.Created, id))
              }
            }
          }
        )
      }
    )
  }

}

To keep the example minimal, we have only two routes:

GET /health - to check if the server is up & running
POST /photo - to send to the system the JSON of the photo information we want to store. This endpoint uses the DAO to store the document in MongoDB and returns a 201 with the id of the stored photo if the operation succeeded.

This is by no means a complete set of APIs, but it is enough to run our example.?

Server main class

OK, we implemented all the components of our server, so it's time to wrap everything up. This is our Server.scala object class.

implicit val system: ActorSystem = ActorSystem("kafka-stream-playground")
implicit val materializer: ActorMaterializer = ActorMaterializer()

First a couple of Akka utility values. Since we use Akka HTTP to run our server and REST API, these implicit values are required.

val config: Config = ConfigFactory.load()
val address = config.getString("http.ip")
val port = config.getInt("http.port")

val mongoUri = config.getString("mongo.uri")
val mongoDb = config.getString("mongo.db")
val mongoUser = config.getString("mongo.user")
val mongoPwd = config.getString("mongo.pwd")
val photoCollection = config.getString("mongo.photo_collection")

val kafkaHosts = config.getString("kafka.hosts").split(',').toList
val photoTopic = config.getString("kafka.photo_topic")
val longExposureTopic = config.getString("kafka.long_exposure_topic")

Then we read all the configuration properties. We will come back to the configuration file in a moment.

val kafkaProps = new Properties()
kafkaProps.put("bootstrap.servers", kafkaHosts.mkString(","))
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val streamProps = new Properties()
streamProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "long-exp-proc-app")
streamProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaHosts.mkString(","))

val photoProducer = PhotoProducer(kafkaProps, photoTopic)
val photoStreamProcessor = new PhotoStreamProcessor(kafkaProps, streamProps, photoTopic, "long-exposure")
photoStreamProcessor.start()

We have to configure both our Kafka producer and the stream processor. We also start the stream processor, so the server will be ready to process the documents sent to it.

val client = MongoClient(s"mongodb://$mongoUri/$mongoUser")
val db = client.getDatabase(mongoDb)
val photoDao: PhotoDao = new PhotoDao(db, photoCollection)
val photoListener = PhotoListener(photoDao.collection, photoProducer)

Also MongoDB needs to be configured. We setup the connection and initialize the DAO as well as the listener.

lazy val routes: Route = healthRoute ~ crudRoute

Http().bindAndHandle(routes, address, port)
Await.result(system.whenTerminated, Duration.Inf)

Everything has been initialized. We create the REST routes for the communication to the server, bind them to the handlers, and finally start the server!?

Server configuration

This is the configuration file used to setup the server:

http {
  ip = "127.0.0.1"
  ip = ${?SERVER_IP}

  port = 8000
  port = ${?SERVER_PORT}
}
mongo {
  uri = "127.0.0.1:27017"
  uri = ${?MONGO_URI}
  db = "kafka-stream-playground"
  user = "admin"
  pwd = "admin"
  photo_collection = "photo"
}
kafka {
  hosts = "127.0.0.1:9092"
  hosts = ${?KAFKA_HOSTS}
  photo_topic = "photo"
  long_exposure_topic = "long-exposure"
}

I think that this one does not require much explanation, right??

Connectors configuration

The server we implemented writes in two Kafka topics: photo and long-exposure. But how are messages written in Elasticsearch as documents? Using Kafka Connect!

We can setup two connectors, one per topic, and tell the connectors to write every message going through that topic in Elasticsearch.

First we need Kafka Connect. We can use the container provided by Confluence in the docker-compose file:

connect:
    image: confluentinc/cp-kafka-connect
    ports:
      - 8083:8083
    networks:
      - kakfa_stream_playground
    depends_on:
      - zookeeper
      - kafka
    volumes:
      - $PWD/connect-plugins:/connect-plugins
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.storage.StringConverter"
      CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_ZOOKEEPER_CONNECT: zookeeper:2181
      CONNECT_PLUGIN_PATH: /connect-plugins
      CONNECT_LOG4J_ROOT_LOGLEVEL: INFO

I want to focus on some of the configuration values.

First of all, we need to expose the port 8083 - that will be our endpoint to configure the connectors (CONNECT_REST_PORT).

We also need to map a volume to the /connect-plugins path, where we will place the Elasticsearch Sink Connector to write to Elasticsearch. This is reflected also in the CONNECT_PLUGIN_PATH.

The connect container should know how to find the Kafka servers, so we set CONNECT_BOOTSTRAP_SERVERS as kafka:9092.

Once Kafka Connect is ready, we can send the configurations of our connectors to the http://localhost:8083/connectors endpoint. We need 2 connectors, one for the photo topic and one for the long-exposure topic. We can send the configuration as a JSON with a POST request.

{
  "name": "photo-connector",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "photo",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "schema.ignore": "true",
    "connection.url": "http://elastic:9200",
    "type.name": "kafka-connect",
    "behavior.on.malformed.documents": "warn",
    "name": "photo-connector"
  }
}

We explicitly say we are gonna use the ElasticsearchSinkConnector as the connector.class , as well as the topics that we want to sink - in this case photo.

We don't want to use a schema for the value.converter, so we can disable it (value.converter.schemas.enable) and tell the connector to ignore the schema (schema.ignore).

The connector for the long-exposure topic is exactly like this one. The only difference is the name and of course the topics.

How to run the project

We have all we need to test the CDC! How can we do it? It's quite easy: simply run the setup.sh script in the root folder of the repo!

What will the script do?

Run the docker-compose file with all the services.
Configure MongoDB replica set. This is required to enable the Change Stream interface to capture data changes. More info about this here.
Configure the Kafka connectors.
Connect to the logs of the server.

The docker-compose will run the following services:

Our Server
3 instances of MongoDB (required for the replica set)
Mongoku, a MongoDB client
Kafka (single node)
Kafka connect
Zookeeper (required by Kafka)
Elasticsearch
Kibana

There are a lot of containers to run, so make sure you have enough resources to run everything properly. If you want, remove Mongoku and Kibana from the compose-file, since they are used just for a quick look inside the DBs.

Once everything is up and running, you just have to send data to the server.

I collected some JSON documents of photos from Unplash that you can use to test the system in the photos.txt file.

There are a total of 10 documents, with 5 of them containing info about long exposure photos. Send them to the server running the send-photos.sh script in the root of the repo. Check that everything is stored in MongoDB connecting to Mongoku at http://localhost:3100. Then connect to Kibana at http://localhost:5601 and you will find two indexes in Elasticsearch: photo, containing the JSON of all the photos stored in MongoDB, and long-exposure, containing just the info of the long exposure photos.

Amazing, right? ?

Conclusion

We made it guys!?

Starting from the design of the use-case, we built our system that connected a MongoDB database to Elasticsearch using CDC.

Kafka Streams is the enabler, allowing us to convert database events to a stream that we can process.

Do you need to see the whole project? Just checkout the repository on GitHub!?

That's it, enjoy! ?

How to understand Scala variances by building restaurants

freeCodeCamp — Wed, 24 Jul 2019 16:34:04 +0000

By Luca Florio

I understand that type variance is not fundamental to writing Scala code. It's been more or less a year since I've been using Scala for my day-to-day job, and honestly, I've never had to worry much about it.

However, I think it is an interesting "advanced" topic, so I started to study it. It is not easy to grasp it immediately, but with the right example, it might be a little bit easier to understand. Let me try using a food-based analogy...

What is type variance?

First of all, we have to define what type variance is. When you develop in an Object-Oriented language, you can define complex types. That means that a type may be parametrized using another type (component type).

Think of List for example. You cannot define a List without specifying which types will be inside the list. You do it by putting the type contained in the list inside square brackets: List[String]. When you define a complex type, you can specify how it will vary its subtype relationship according to the relation between the component type and its subtypes.

Ok, sounds like a mess... Let's get a little practical.

Building a restaurant empire

Our goal is to build an empire of restaurants. We want generic and specialised restaurants. Every restaurant we will open needs a menu composed of different recipes, and a (possibly) starred chef.

The recipes can be composed of different kinds of food (fish, meat, white meat, vegetables, etc.), while the chef we hire has to be able to cook that kind of food. This is our model. Now it's coding time!

Different types of food

For our food-based example, we start by defining the Trait Food, providing just the name of the food.

trait Food {

  def name: String

}

Then we can create Meat and Vegetable, that are subclasses of Food.

class Meat(val name: String) extends Food

class Vegetable(val name: String) extends Food

In the end, we define a WhiteMeat class that is a subclass of Meat.

class WhiteMeat(override val name: String) extends Meat(name)

Sounds reasonable right? So we have this hierarchy of types.

We can create some food instances of various type. They will be the ingredients of the recipes we are going to serve in our restaurants.

// Food <- Meat
val beef = new Meat("beef")

// Food <- Meat <- WhiteMeat
val chicken = new WhiteMeat("chicken")
val turkey = new WhiteMeat("turkey")

// Food <- Vegetable
val carrot = new Vegetable("carrot")
val pumpkin = new Vegetable("pumpkin")

Recipe, a covariant type

Let's define the covariant type Recipe. It takes a component type that expresses the base food for the recipe - that is, a recipe based on meat, vegetable, etc.

trait Recipe[+A] {

  def name: String

  def ingredients: List[A]

}

The Recipe has a name and a list of ingredients. The list of ingredients has the same type of Recipe. To express that the Recipe is covariant in its type A, we write it as Recipe[+A]. The generic recipe is based on every kind of food, the meat recipe is based on meat, and a white meat recipe has just white meat in its list of ingredients.

case class GenericRecipe(ingredients: List[Food]) extends Recipe[Food] {

  def name: String = s"Generic recipe based on ${ingredients.map(_.name)}"

}

case class MeatRecipe(ingredients: List[Meat]) extends Recipe[Meat] {

  def name: String = s"Meat recipe based on ${ingredients.map(_.name)}"

}

case class WhiteMeatRecipe(ingredients: List[WhiteMeat]) extends Recipe[WhiteMeat] {

  def name: String = s"Meat recipe based on ${ingredients.map(_.name)}"

}

A type is covariant if it follows the same relationship of subtypes of its component type. This means that Recipe follows the same subtype relationship of its component Food.

Let's define some recipes that will be part of different menus.

// Recipe[Food]: Based on Meat or Vegetable
val mixRecipe = new GenericRecipe(List(chicken, carrot, beef, pumpkin))
// Recipe[Food] <- Recipe[Meat]: Based on any kind of Meat
val meatRecipe = new MeatRecipe(List(beef, turkey))
// Recipe[Food] <- Recipe[Meat] <- Recipe[WhiteMeat]: Based only on WhiteMeat
val whiteMeatRecipe = new WhiteMeatRecipe(List(chicken, turkey))

Chef, a contravariant type

We defined some recipes, but we need a chef to cook them. This gives us the chance to talk about contravariance. A type is contravariant if it follows an inverse relationship of subtypes of its component type. Let's define our complex type Chef, that is contravariant in the component type. The component type will be the food that the chef can cook.

trait Chef[-A] {

  def specialization: String

  def cook(recipe: Recipe[A]): String
}

A Chef has a specialisation and a method to cook a recipe based on a specific food. We express that it is contravariant writing it as Chef[-A]. Now we can create a chef able to cook generic food, a chef able to cook meat and a chef specialised on white meat.

class GenericChef extends Chef[Food] {

  val specialization = "All food"

  override def cook(recipe: Recipe[Food]): String = s"I made a ${recipe.name}"
}

class MeatChef extends Chef[Meat] {

  val specialization = "Meat"

  override def cook(recipe: Recipe[Meat]): String = s"I made a ${recipe.name}"
}

class WhiteMeatChef extends Chef[WhiteMeat] {

  override val specialization = "White meat"

  def cook(recipe: Recipe[WhiteMeat]): String = s"I made a ${recipe.name}"
}

Since Chef is contravariant, Chef[Food] is a subclass of Chef[Meat] that is a subclass of Chef[WhiteMeat]. This means that the relationship between subtypes is the inverse of its component type Food.

Ok, we can now define different chef with various specialization to hire in our restaurants.

// Chef[WhiteMeat]: Can cook only WhiteMeat
val giuseppe = new WhiteMeatChef
giuseppe.cook(whiteMeatRecipe)

// Chef[WhiteMeat] <- Chef[Meat]: Can cook only Meat
val alfredo = new MeatChef
alfredo.cook(meatRecipe)
alfredo.cook(whiteMeatRecipe)

// Chef[WhiteMeat]<- Chef[Meat] <- Chef[Food]: Can cook any Food
val mario = new GenericChef
mario.cook(mixRecipe)
mario.cook(meatRecipe)
mario.cook(whiteMeatRecipe)

Restaurant, where things come together

We have recipes, we have chefs, now we need a restaurant where the chef can cook a menu of recipes.

trait Restaurant[A] {

  def menu: List[Recipe[A]]
  def chef: Chef[A]

  def cookMenu: List[String] = menu.map(chef.cook)
}

We are not interested in the subtype relationship between restaurants, so we can define it as invariant. An invariant type does not follow the relationship between the subtypes of the component type. In other words, Restaurant[Food] is not a subclass or superclass of Restaurant[Meat]. They are simply unrelated. We will have a GenericRestaurant, where you can eat different type of food. The MeatRestaurant is specialised in meat-based dished and the WhiteMeatRestaurant is specialised only in dishes based on white meat. Every restaurant to be instantiated needs a menu, that is a list of recipes, and a chef able to cook the recipes in the menu. Here is where the subtype relationship of Recipe and Chef comes into play.

case class GenericRestaurant(menu: List[Recipe[Food]], chef: Chef[Food]) extends Restaurant[Food]

case class MeatRestaurant(menu: List[Recipe[Meat]], chef: Chef[Meat]) extends Restaurant[Meat]

case class WhiteMeatRestaurant(menu: List[Recipe[WhiteMeat]], chef: Chef[WhiteMeat]) extends Restaurant[WhiteMeat]

Let's start defining some generic restaurants. In a generic restaurant, the menu is composed of recipes of various type of food. Since Recipe is covariant, a GenericRecipe is a superclass of MeatRecipe and WhiteMeatRecipe, so I can pass them to my GenericRestaurant instance. The thing is different for the chef. If the Restaurant requires a chef that can cook generic food, I cannot put in it a chef able to cook only a specific one. The class Chef is covariant, so GenericChef is a subclass of MeatChef that is a subclass of WhiteMeatChef. This implies that I cannot pass to my instance anything different from GenericChef.

val allFood = new GenericRestaurant(List(mixRecipe), mario)
val foodParadise = new GenericRestaurant(List(meatRecipe), mario)
val superFood = new GenericRestaurant(List(whiteMeatRecipe), mario)

The same goes for MeatRestaurant and WhiteMeatRestaurant. I can pass to the instance only a menu composed of more specific recipes then the required one, but chefs that can cook food more generic than the required one.

val meat4All = new MeatRestaurant(List(meatRecipe), alfredo)
val meetMyMeat = new MeatRestaurant(List(whiteMeatRecipe), mario)

val notOnlyChicken = new WhiteMeatRestaurant(List(whiteMeatRecipe), giuseppe)
val whiteIsGood = new WhiteMeatRestaurant(List(whiteMeatRecipe), alfredo)
val wingsLovers = new WhiteMeatRestaurant(List(whiteMeatRecipe), mario)

That's it, our empire of restaurants is ready to make tons of money!

Conclusion

Ok guys, in this story I did my best to explain type variances in Scala. It is an advanced topic, but it is worth to know just out of curiosity. I hope that the restaurant example can be of help to make it more understandable. If something is not clear, or if I wrote something wrong (I'm still learning!) don't hesitate to leave a comment!

See you! ?

Futures Made Easy with Scala

freeCodeCamp — Fri, 26 Apr 2019 22:06:18 +0000

By Martin Budi

Future is an abstraction to represent the completion of an asynchronous operation. Today it is commonly used in popular languages from Java to Dart. However, as modern applications are becoming more complex, composing them is also becoming more difficult. Scala utilizes a functional approach that makes it easy to visualize and construct Future composition.

This article aims to explain the basics in a pragmatic way. No jargon, no foreign terminology. You don’t even have to be a Scala programmer (yet). All you need to have is some understanding of a couple of higher-order functions: map and foreach. So let’s get started.

In Scala, a future can be created as simple as this:

Future {"Hi"}

Now let’s run it and make a “Hi World”.

Future {"Hi"} .foreach (z => println(z + " World"))

That’s all there is. We just ran a future using foreach, manipulated the result a bit, and printed it to the console.

But how is it possible? So we normally associate foreach and map with collections: we unwrap the content and tinker with it. If you look at it, it’s conceptually similar to a future in the way we want to unwrap the output from Future{}and manipulate it. To have this happen the future needs to be completed first, hence “running” it. This is the reasoning behind the functional composition of Scala Future.

In realistic applications, we want to coordinate not just one but several futures at once. A particular challenge is how to arrange them to run sequentially or simultaneously.

Sequential run

When several futures start one after another like a relay race we call it sequential run. A typical solution would simply be placing a task in the previous task’s callback, a technique known as chaining. The concept is correct but it doesn’t look pretty.

In Scala, we can use for-comprehension to help us abstract it. To see how it looks, let’s just go straight to an example.

import scala.concurrent.ExecutionContext.Implicits.global

object Main extends App {

  def job(n: Int) = Future {
    Thread.sleep(1000)
    println(n) // for demo only as this is side-effecting 
    n + 1
  }

  val f = for {
    f1 <- job(1)
    f2 <- job(f1)
    f3 <- job(f2)
    f4 <- job(f3)
    f5 <- job(f4)
  } yield List(f1, f2, f3, f4, f5)
  f.map(z => println(s"Done. ${z.size} jobs run"))
  Thread.sleep(6000) // needed to prevent main thread from quitting 
                     // too early 
}

The first thing to do is importing ExecutionContext whose role is to manage thread pool. Without it, our future will not run.

Next, we define our “big job” which simply waits for a second and returns its input incremented by one.

Then we have our for-comprehension block. In this structure, each line inside assigns a job’s result to a value with &lt;- which will then be available for any subsequent futures. We have arranged our jobs so that except for the first one, each one takes in the output of the previous job.

Also, note that the result of a for-comprehension is also a future with output determined by yield. After the execution, the result will be available inside map. For our purpose, we simply put all the jobs’ outputs in a list and take its size.

Let’s run it.

Sequential run

We can see the five futures fired one-by-one. It is important to note that this arrangement should only be used when the future is dependent on the previous future.

Simultaneous or Parallel run

If the futures are independent of each other then they should be fired simultaneously. For this purpose, we’re going to use Future.sequence. The name is a bit confusing, but in principle it simply takes a list of futures and transforms it into a future of list. The evaluation, however, is done asynchronously.

Let’s create an example of mixed sequential and parallel futures.

val f = for {
  f1 <- job(1)
  f2 <- Future.sequence(List(job(f1), job(f1)))
  f3 <- job(f2.head)
  f4 <- Future.sequence(List(job(f3), job(f3)))
  f5 <- job(f4.head)
} yield f2.size + f4.size
f.foreach(z => println(s"Done. $z jobs run in parallel"))

Future.sequence takes a list of futures that we wish to run simultaneously. So here we have f2 and f4 containing two parallel jobs. As the argument fed into Future.sequence is a list, the result is also a list. In a realistic application, the results may be combined for further computation. Here we’ll take the first element from each list with .head then pass it to f3 and f5 respectively.

Let’s see it in action:

Parallel run

We can see the jobs in 2 and 4 fired simultaneously indicating successful parallelism. It is worth noting that parallel execution is not always guaranteed since it depends on available threads. If there are not enough threads then only some of the jobs will run in parallel. The others, however, will wait until some more threads are freed.

Recovering from errors

Scala Future incorporates recover that acts as a back-up future when an error occurs. This allows the future composition to finish even with failures. To illustrate, consider this code:

Future {"abc".toInt}
.map(z => z + 1)

Of course, this will not work, as “abc” is not an int. With recover, we can salvage it by passing a default value. Let’s try passing a zero:

Future {"abc".toInt}
.recover {case e => 0}
.map(z => z + 1)

Now the code will run and produce one as a result. In composition, we can fine-tune each future like this to make sure the process won’t fail.

However, there are also times when we want to reject errors explicitly. For this purpose, we can use Future.succesful and Future.failed to signal validation result. And if we don’t care about individual failure we can position recover to catch any error inside the composition.

Let’s work another bit of code using for-comprehension that checks if the input is a valid int and lower than 100. Future.failed and Future.successful are both futures so we don’t need to wrap it in one. Future.failed in particular requires a Throwable so we’re going to create a custom one for input larger than 100. After putting it all up together we would have as follows:

val input = "5" // let's try "5", "200", and "abc"
case class NumberTooLarge() extends Throwable()
val f = for {
   f1 <- Future{ input.toInt }
   f2 <- if (f1 > 100) {
            Future.failed(NumberTooLarge())
          } else {
            Future.successful(f1)
          }
} yield f2
f map(println) recover {case e => e.printStackTrace()}

Notice the positioning of recover. With this configuration, it will simply intercept any error occurring inside the block. Let’s test it with several different inputs “5”, “200”, and “abc”:

"5"   -> 5
"200" -> NumberTooLarge stacktrace
"abc" -> NumberFormatException stacktrace

“5” reached the end no problem. “200” and “abc” arrived in recover. Now, what if we want to handle each error separately? This is where pattern matching comes into play. Expanding the recover block, we can have something like this:

case e => 
  e match {
    case t: NumberTooLarge => // deal with number > 100
    case t: NumberFormatException => // deal with not a number
    case _ => // deal with any other errors
  }
}

You might probably have guessed it but an all-or-nothing scenario like this is commonly used in public APIs. Such service wouldn’t process invalid input but needs to return a message to inform the client what they did wrong. By separating exceptions, we can pass a custom message for each error. If you like to build such service (with a very fast web framework), head over to my Vert.x article.

The world outside Scala

We have talked a lot about how easy Scala Future is. But is it really? To answer it we need to look at how it’s done in other languages. Arguably the closest language to Scala is Java as both operate on JVM. Furthermore, Java 8 has introduced Concurrency API with CompletableFuture which is also able to chain futures. Let’s rework the first sequence example with it.

Sequential run in Java

That’s sure a lot of stuff. And to code this I had to look up supplyAsync and thenApply among so many methods in the documentation. And even if I know all these methods, they can only be used within the context of the API.

On the other hand, Scala Future is not based on API or external libraries but a functional programming concept that is also used in other aspects of Scala. So with an initial investment in covering the fundamentals, you can reap the reward of less overhead and higher flexibility.

Wrapping up

That’s all for the basics. There’s more to Scala Future but what we have here has covered enough ground to build real-life applications. If you like to read more about Future or Scala, in general, I’d recommend Alvin Alexander tutorials, AllAboutScala, and Sujit Kamthe’s article that offers easy to grasp explanations.

An introduction to Law Testing in Scala

freeCodeCamp — Mon, 22 Apr 2019 16:48:22 +0000

By dor sever

Property-based law testing is one of the most powerful tools in the scala ecosystem. In this post, I’ll explain how to use law testing and the value it’ll give you using in-depth code examples.

This post is aimed for Scala developers who want to improve their testing knowledge and skills. It assumes some knowledge of Scala, cats, and other functional libraries.

Introduction

You might be familiar with types which are a set of values (for example Int values are : 1,2,3… String values are : “John Doe” etc).
You also might be familiar with functions which is a mapping from Input type to Output type.
A property is defined on a type or a function and describes its desired behavior.

So what is a Law? Keep on reading!

A concrete example

Here is our beloved _Person_ data type:

case class Person(name: String, age: Int)

And serialization code using the Play-Json , a library that allows transforming your Person type into JSON :

val personFormat: OFormat[Person] = new OFormat[Person] {
  override def reads(json: JsValue): JsResult[Person] = {
    val name = (json \ "name").as[String]
    val age = (json \ "age").as[Int]
    JsSuccess(Person(name, age))
  }
override def writes(o: Person): JsObject =
    JsObject(Seq("name" -> JsString(o.name), 
                 "age" -> JsNumber(o.age)))
}

We can now test this serialization function on a specific input like this:

import org.scalatest._
class PersonSerdeSpec extends WordSpecLike with Matchers {
  "should serialize and deserialize a person" in {
    val person = Person("John Doe", 32)
    val actualPerson =
      personFormat.reads(personFormat.writes(person))
    actualPerson.asOpt.shouldEqual(Some(person))
  }
}

But, we now need to ask ourselves, whether all people will serialize successfully? What about a person with invalid data (such as negative age)? Will we want to repeat this thought process of finding edge-cases for all our test data?

And most importantly, will this code remain readable over time? (e.g.: changing the person data type [adding a LastName field], repeated tests for other data types, etc)

“ We can solve any problem by introducing an extra level of indirection”.

Property-based testing

The first weapon in our disposal is Property-based testing (PBT). PBT works by defining a property, which is a high-level specification of behavior that should hold for all values of the specific type.

In our example, the property will be:

For every person p, if we serialize and deserialize them, we should get back the same person.

Writing this property using scala check looks like this:

object PersonSerdeSpec extends org.scalacheck.Properties("PersonCodec") {
  property("round trip consistency") = 
org.scalacheck.Prop.forAll { a: Person =>
    personFormat.reads(personFormat.writes(a)).asOpt.contains(a)
  }
}

The property check requires a way to generate Persons. This is done by using an Arbitrary[Person] which can be defined like this:

implicit val personArb: Arbitrary[Person] = Arbitrary {
  for {
    name <- Gen.alphaStr
    age  <- Gen.chooseNum(0, 120)
  } yield Person(name, age)
}

Furthermore, we can use “scalacheck-shapeless”- an amazing library which eliminates (almost) all needs for the verbose (quite messy and highly bug-prone) arbitrary type definition by generating it for us!

This can be done by adding:

libraryDependencies += "com.github.alexarchambault" %% "scalacheck-shapeless_1.14" % "1.2.0"

and importing the following in our code:

import org.scalacheck.ScalacheckShapeless._

And then we can remove the _personArb_ instance we defined earlier.

The Codec Law

Let’s try to abstract further, by defining the laws of our data type:

trait CodecLaws[A, B] {
  def serialize: A => B
  def deserialize: B => A
  def codecRoundTrip(a: A): Boolean = serialize.
andThen(deserialize)(a) == a
}

This means That given

The types A, B
A function from A to B
A function from B to A

We define a function called “codecRoundTrip” which takes an “a: A” and takes it through the functions and makes sure we get the same value of type A back.

This Law states (without giving away any implementation details), that the roundtrip we do on the given input does not “lose” any information.

Another way of saying just that is by claiming that our types A and B are isomorphic.

We can abstract even more, by using the cats-laws library with the IsEq case-class for defining an Equality description.

import cats.laws._
trait CodecLaws[A, B] {
  def serialize: A => B
  def deserialize: B => A
  def codecRoundTrip(a: A): cats.laws.IsEq[A] = serialize.andThen(deserialize)(a) <-> a
}
/** Represents two values of the same type that are expected to be equal. */
final case class IsEq[A](lhs: A, rhs: A)

What we get from this type and syntax is a description of equality between the two values instead of the equality result like before.

The Codec Test

It is time to test the laws we just defined. In order to do that, we will use the “discipline” library.

import cats.laws.discipline._
import org.scalacheck.{ Arbitrary, Prop }
trait CodecTests[A, B] extends org.typelevel.discipline.Laws {
  def laws: CodecLaws[A, B]
  def tests(
    implicit
    arbitrary: Arbitrary[A],
    eqA: cats.Eq[A]
  ): RuleSet =
    new DefaultRuleSet(
      name   = name,
      parent = None,
      "roundTrip" -> Prop.forAll { a: A =>
        laws.codecRoundTrip(a)
      }
    )
}

We define a CodecTest trait that takes 2 type parameters A and B, which in our example will be Person and JsResult.

The trait holds an instance of the laws and defines a test method that takes an Arbitrary[A] and an equality checker (of type Eq[A]) and returns a rule-set for scalacheck to run.

Note that no tests actually run here. This gives us the power to run these tests which are defined just once for all the types we want

We can now commit to a specific type and implementation (like Play-Json serialization) by instantiating a CodecTest with the proper types.

object JsonCodecTests {
  def apply[A: Arbitrary](implicit format: Format[A]): CodecTests[A, JsValue] =
    new CodecTests[A, JsValue] {
      override def laws: CodecLaws[A, JsValue] =
        CodecLaws[A, JsValue](format.reads, format.writes)
    }
}

A (type) detour

But now we get the error:

Error:(11, 38) type mismatch;
 found   : play.api.libs.json.JsResult[A]
 required: A

We expected the types to flow from:

  A  =>  B  =>  A

But Play-Json types go from:

 A  =>  JsValue  =>  JsResult[A]

This means that our deserialize function can succeed or fail and will not always return an A, but rather a container of A.

In order to abstract over the types, we now need to use the F[_] type constructor syntax:

trait CodecLaws[F[_],A, B] {
  def serialize: A => B
  def deserialize: B => F[A]
  def codecRoundTrip(a: A)(implicit app:Applicative[F]): IsEq[F[A]] =
    serialize.andThen(deserialize)(a) <-> app.pure(a)
}

The Applicative instance is used to take a simple value of type A and lift it into the Applicative context which returns a value of type F[A].

This process is similar to taking some value x and lifting it to an Option context using Some(x), or in our concrete example taking a value a:A and lifting it to the JsResult type using [JsSuccess](https://www.playframework.com/documentation/2.7.0/api/scala/play/api/libs/json/JsSuccess.html)(a).

We can now finish the implementation for CodecTests and JsonCodecTests like this:

trait CodecTests[F[_], A, B] extends org.typelevel.discipline.Laws {
  def laws: CodecLaws[F, A, B]
  def tests(
    implicit
    arbitrary: Arbitrary[A],
    eqA: cats.Eq[F[A]],
    applicative: Applicative[F]
  ): RuleSet =
    new DefaultRuleSet(
      name   = name,
      parent = None,
      "roundTrip" -> Prop.forAll { a: A =>
        laws.codecRoundTrip(a)
      }
    )
}
object JsonCodecTests {
  def apply[A: Arbitrary](implicit format: Format[A]): CodecTests[JsResult, A, JsValue] =
    new CodecTests[JsResult, A, JsValue] {
      override def laws: CodecLaws[JsResult, A, JsValue] =
        CodecLaws[JsResult, A, JsValue](format.reads, format.writes)
    }
}

And to define a working Person serialization test in 1 line of code:

import JsonCodecSpec.Person
import play.api.libs.json._
import org.scalacheck.ScalacheckShapeless._
import org.scalatest.FunSuiteLike
import org.scalatest.prop.Checkers
import org.typelevel.discipline.scalatest.Discipline
class JsonCodecSpec extends Checkers with FunSuiteLike with Discipline { 
  checkAll("PersonSerdeTests", JsonCodecTests[Person].tests)
}

The power of abstraction

We were able to define our tests and laws without giving away any implementation details. This means we can switch to using a different library for serialization tomorrow and all our laws and tests will still hold.

Another example

We can test this theory by adding support to BSON serialization using the reactive-mongo library:

import cats.Id
import io.bigpanda.laws.serde.{ CodecLaws, CodecTests }
import org.scalacheck.Arbitrary
import reactivemongo.bson.{ BSONDocument, BSONReader, BSONWriter }
object BsonCodecTests {
  def apply[A: Arbitrary](
    implicit
    reader: BSONReader[BSONDocument, A],
    writer: BSONWriter[A, BSONDocument]
  ): CodecTests[Id, A, BSONDocument] =
    new CodecTests[Id, A, BSONDocument] {
      override def laws: CodecLaws[Id, A, BSONDocument] =
        CodecLaws[Id, A, BSONDocument](reader.read, writer.write)
override def name: String = "BSON serde tests"
    }
}

The types here flow from

A => BsonDocument => A

and not F[A] as we had expected. Luckily for us, we have a solution and use the Id-type to represent just that.

And given the (very long) serializer definition:

implicit val personBsonFormat
  : BSONReader[BSONDocument, Person] with BSONWriter[Person, BSONDocument] =
  new BSONReader[BSONDocument, Person] with BSONWriter[Person, BSONDocument] {
    override def read(bson: BSONDocument): Person =
      Person(bson.getAs[String]("name").get, bson.getAs[Int]("age").get)
override def write(t: Person): BSONDocument =
      BSONDocument("name" -> t.name, "age" -> t.age)
  }

we can now define BsonCodecTests in all its 1 line of logic glory.

class BsonCodecSpec extends Checkers with FunSuiteLike with Discipline {
    checkAll("PersonSerdeTests", BsonCodecTests[Person].tests)
}

A (First-order) logic perspective on our code

Our first test attempt can be described as follows:

∃p:Person,s:OFormat that holds : s.read(s.write(p)) <-> p

Meaning, for the specific person p(“John Doe”,32) and for the format **s**, the following statement is true: decode(encode(p)) <-> p.

The second attempt (using PBT) can be:

∃s:OFormat, ∀p:Person the following should hold :  s.read(s.write(p)) <-> p

Which means, for all persons p and for the specific format **s**, the following is true: decode(encode(p))<->p.

The third (and most powerful statement thus far) using law testing:

∀s:Encoder, ∀p:Person the the following should hold :  s.read(s.write(p)) <-> p

Which means, for all formats s, and **for all** persons p, the following is true: decode(encode(p))<->p.

Summary

Law testing allows you to reason about your data-types and functions in a mathematical and concise way and provides a totally new paradigm for testing your code!
Most of the type level libraries you use (like cats, circe and many more) use law testing internally to test their data-types.
Avoid writing specific test-cases for your data-types and functions and try to generalize them using property-based law tests.

Thank you for reaching this far! I am super excited about finding more abstract and useful laws that I can use in my code! Please let me know about any you’ve used or can think of.

More inspiring and detailed content can be found in the cats-laws site or circe.

The complete code examples can be found here.

Beyond unit tests: an intro to property and law testing in Scala

freeCodeCamp — Fri, 05 Apr 2019 15:42:57 +0000

By Daniel Sebban

Taking your unit tests to the next level.

I have been using ScalaCheck testing library for at least 2 years now. It allows you to take your unit tests to the next level.

You can do Property-based testing by generating a lot of tests with random data and asserting properties on your functions. A simple code example is described below.
You can do Law testing that is even more powerful and allows you to check mathematical properties on your types.

Property-based testing

Here is our beloved User data type:

case class User(name: String, age: Int)

And a random User generator:

import org.scalacheck.{ Gen, Arbitrary }
import Arbitrary.arbitrary
implicit val randomUser: Arbitrary[User] = Arbitrary(for {
  randomName <- Gen.alphaStr
  randomAge  <- Gen.choose(0,80)
  } yield User(randomName, randomAge))

We can now generate a User like this:

scala> randomUser.arbitrary.sample
res0: Option[User] = Some(User(OtwlaaxGbmdhuorlmgvXitbmGfbgetm,22))

Let’s define some functions on the User:

def isAdult: User => Boolean = _.age >= 18
def isAllowedToDrink : User => Boolean = _.age >= 21

Let’s claim that:

All adults are allowed to drink.

Can we somehow prove this? Is this correct for all users?

This is where property testing comes to the rescue. It allows us not to write specific unit-tests. Here they would be:

18-year-olds are not allowed to drink
19-year-olds are not allowed to drink
20-year-olds are not allowed to drink

All of these statements can be replaced by a single property check:

import org.scalacheck.Prop.forAll
val allAdultsCanDrink = forAll { u: User =>
    if(isAdult(u)) isAllowedToDrink(u) else true }

Let’s run it:

scala> allAdultsCanDrink.check()
! Falsified after 0 passed tests.
> ARG_0: User(,19)

It fails as expected for a 19-year-old.

Property testing is awesome for a few reasons:

Saves time by writing less specific tests
Finds new use cases generated by Scala check that you forgot to handle
Forces you think in a more general way
Gives you more confidence for refactoring than conventional unit tests

Law testing

It gets better: let’s take it to the next level and define an Ordering between Users:

import scala.math.Orderingimplicit val userOrdering: Ordering[User] = Ordering.by(_.age)

We want to make sure that we didn't forget any edge cases and that we defined our order properly. This property has a name, and it’s called a total order. It needs to holds for the following properties:

Totality
Antisymmetry
Transitivity

Can we somehow prove this? Is this correct for all users?

This is possible without writing a single test!

We use cats-laws library to define the laws we want to test on the ordering we defined:

import cats.kernel.laws.discipline.OrderTests
import cats._
import org.scalatest.FunSuite
import org.typelevel.discipline.scalatest.Discipline
import org.scalacheck.ScalacheckShapeless._
class UserOrderSpec extends FunSuite with Discipline  {
//needed boilerplate to satisfy the dependencies of the framework
  implicit def eqUser[A: Eq]: Eq[Option[User]] = Eq.fromUniversalEquals
//convert our standard ordering to a `cats` order 
  implicit val catsUserOrder: Order[User] = Order.fromOrdering(userOrdering)

//check all mathematical properties on our ordering
  checkAll("User", OrderTests[User].order)
}

Let’s run it:

scala> new UserOrderSpec().execute()
UserOrderSpec:
- User.order.antisymmetry *** FAILED ***
  GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
   (Discipline.scala:14)
    Falsified after 1 successful property evaluations.
    Location: (Discipline.scala:14)
    Occurred when passed generated values (
      arg0 = User(h,17),
      arg1 = User(edsb,17),
      arg2 = org.scalacheck.GenArities$$Lambda$2739/1277317528@41d7b4cf
    )
    Label of failing property:
      Expected: true
  Received: false
- User.order.compare
- User.order.gt
- User.order.gteqv
- User.order.lt
- User.order.max
- User.order.min
- User.order.partialCompare
- User.order.pmax
- User.order.pmin
- User.order.reflexitivity
- User.order.reflexitivity gt
- User.order.reflexitivity lt
- User.order.symmetry
- User.order.totality
- User.order.transitivity

Sure enough, it fails on the antisymmetry law! Same age and different names are not supposed to be equals. We forgot to use the name in our original Ordering, so let's fix it and rerun the laws:

implicit val userOrdering: Ordering[User] = Ordering.by( u => (u.age, u.name))
scala> new UserOrderSpec().execute()
UserOrderSpec:
- User.order.antisymmetry
- User.order.compare
- User.order.gt
- User.order.gteqv
- User.order.lt
- User.order.max
- User.order.min
- User.order.partialCompare
- User.order.pmax
- User.order.pmin
- User.order.reflexitivity
- User.order.reflexitivity gt
- User.order.reflexitivity lt
- User.order.symmetry
- User.order.totality
- User.order.transitivity

And now it passes :)

If you are wondering what can you test besides Orders, go check out the docs here: https://typelevel.org/cats/typeclasses/lawtesting.html

Summary

Property tests are more powerful than unit tests. They allow us to define properties on functions and generate a large number of tests using random data generators.
Law testing takes it to the next level and uses the mathematical properties of structures like Order to generate the properties and the tests.
Next time you define an ordering and wonder if it’s well-defined, go ahead and run the laws on it!

An introduction to Vert.x, the fastest Java framework today

freeCodeCamp — Wed, 06 Mar 2019 16:52:04 +0000

By Martin Budi

If you’ve recently googled “best web framework” you might have stumbled upon the Techempower benchmarks where more than three hundred frameworks are ranked. There you might have noticed that Vert.x is one of the top ranked, if not the first by some measures.

So let’s talk about it.

Vert.x is a polyglot web framework that shares common functionalities among its supported languages Java, Kotlin, Scala, Ruby, and Javascript. Regardless of language, Vert.x operates on the Java Virtual Machine (JVM). Being modular and lightweight, it is geared toward microservices development.

Techempower benchmarks measure the performance of updating, fetching and delivering data from a database. The more requests served per second, the better. In such an IO scenario where little computing is involved, any non-blocking framework would have an edge. In recent years, such a paradigm is almost inseparable from Node.js which popularized it with its single-threaded event loop.

Vert.x, like Node, operates a single event loop. But Vert.x also takes advantage of the JVM. Whereas Node runs on a single core, Vert.x maintains a thread pool with a size that can match the number of available cores. With greater concurrency support, Vert.x is suitable for not only IO but also CPU-heavy processes that require parallel computing.

Event loops, however, are half of the story. The other half has little to do with Vert.x.

To connect to a database, a client requires a connector driver. In the Java realm, the most common driver for Sql is JDBC. The problem is, this driver is blocking. And it’s blocking at the socket level. A thread will always get stuck there until it returns with a response.

Needless to say, the driver has been a bottleneck in realizing a fully non-blocking application. Fortunately there has been progress (albeit unofficial) on an async driver with several active forks, among them:

https://github.com/jasync-sql/jasync-sql (for Postgres and MySql)
https://github.com/reactiverse/reactive-pg-client (Postgres)

The golden rule

Vert.x is pretty simple to work with, and an http server can be brought up with a few lines of code.

The method requestHandler is where the event loop delivers the request event. As Vert.x is un-opinionated, handling it is free style. But keep in mind the single important rule of non-blocking thread: don’t block it.

When working with concurrency we can draw from so many options available today such as Promise, Future, Rx, as well as Vert.x’s own idiomatic way. But as the complexity of an application grows, having async functionality alone is not enough. We also need the ease of coordinating and chaining calls while avoiding callback hell, as well as passing any error gracefully.

Scala Future satisfies all the conditions above with the additional advantage of being based on functional programming principles. Although this article doesn’t explore Scala Future in depth, we can try it with a simple app. Let’s say the app is an API service to find a user given their id:

There are three operations involved: checking request parameter, checking if the id is valid, and fetching the data. We will wrap each of these operations in a Future and coordinate the execution in a “for comprehension” structure.

The first step is to match the request with a service. Scala has a powerful pattern matching feature that we can use for this purpose. Here we intercept any mention of “/user” and pass it into our service.
Next is the core of this service where our futures are arranged in a sequential for-comprehension. The first future f1 wraps parameter check. We specifically want to retrieve the id from the get request and cast it into int. (Scala doesn’t require explicit return if the return value is the last line in the method.) As you see, this operation could potentially throw an exception as id might not be an int or not even available, but that is okay for now.
The second future f2 checks the validity of id. We block any id lower than 100 by explicitly calling Future.failed with our own CustomException. Otherwise we pass an empty Future in the form of Future.unit as successful validation.
The last future f3 retrieves the user with the id provided by f1. As this is just a sample, we don’t really connect to a database. We just return some mock string.
map runs the arrangement that yields the user data from f3 then prints it into the response.
Now if in any part of the sequence an error occurs, a Throwable is passed to recover. Here we can match its type to a suitable recovery strategy. Looking back in our code, we have anticipated several potential failures such as missing id, or id that was not int or not valid which would throw specific exceptions. We are handling each of them in handleException by passing an error message to client.

This arrangement provides not only an asynchronous flow from the start to the end but also a clean approach to handling errors. And as it is streamlined across handlers we can focus on things that matter, like database query.

Verticles, Event Bus, and other gotchas

Vert.x also offers a concurrency model called verticle which resembles the Actor system. (If you’d like to learn more, head over to my Akka Actor guide.) Verticle isolates its state and behavior to provide a thread-safe environment. The only way to communicate with it is through an event bus.

However, the Vert.x event bus requires its messages to be String or JSON. This makes it difficult to pass arbitrary non-POJO objects. And in a high performance system, dealing with JSON conversion is undesirable as it imposes some computing cost. If you are developing IO applications, you may be better off not using either verticle or event bus, as such applications have little need for local state.

Working with some Vert.x components can also be pretty challenging. You might find lack of documentation, unexpected behavior, and even failure to function. Vert.x might be suffering from its own ambition, as developing new components would require porting across many languages. This is a difficult undertaking. For that reason sticking to the core would be the best.

If you are developing a public API, then vertx-core should be enough. If it’s a web app, you may add vertx-web which provides http parameter handling and JWT/Session authentication. These two are the ones that dominated the benchmarks anyway. There is some decrease in performance in some tests for using vertx-web, but as it seems to have stemmed from optimization, it might get ironed out in the subsequent releases.

How to make a simple application with Akka Cluster

freeCodeCamp — Tue, 26 Feb 2019 21:08:29 +0000

By Luca Florio

If you read my previous story about Scalachain, you probably noticed that it is far from being a distributed system. It lacks all the features to properly work with other nodes. Add to it that a blockchain composed by a single node is useless. For this reason I decided it is time to work on the issue.

Since Scalachain is powered by Akka, why not take the chance to play with Akka Cluster? I created a simple project to tinker a bit with Akka Cluster, and in this story I’m going to share my learnings. We are going to create a cluster of three nodes, using Cluster Aware Routers to balance the load among them. Everything will run in a Docker container, and we will use docker-compose for an easy deployment.

Ok, Let’s roll! ?

Quick introduction to Akka Cluster

Akka Cluster provides great support to the creation of distributed applications. The best use case is when you have a node that you want to replicate N times in a distributed environment. This means that all the N nodes are peers running the same code. Akka Cluster gives you out-of-the-box the discovery of members in the same cluster. Using Cluster Aware Routers it is possible to balance the messages between actors in different nodes. It is also possible to choose the balancing policy, making load-balancing a piece of cake!

Actually you can chose between two types of routers:

Group Router — The actors to send the messages to — called routees — are specified using their actor path. The routers share the routees created in the cluster. We will use a Group Router in this example.

Group Router

Pool Router — The routees are created and deployed by the router, so they are its children in the actor hierarchy. Routees are not shared between routers. This is ideal for a primary-replica scenario, where each router is the primary and its routees the replicas.

Pool Router

This is just the tip of the iceberg, so I invite you to read the official documentation for more insights.

A Cluster for mathematical computations

Let’s picture a use-case scenario. Suppose to design a system to execute mathematical computations on request. The system is deployed online, so it needs a REST API to receive the computation requests. An internal processor handles these requests, executing the computation and returning the result.

Right now the processor can only compute the Fibonacci number. We decide to use a cluster of nodes to distribute the load among the nodes and improve performance. Akka Cluster will handle cluster dynamics and load-balancing between nodes. Ok, sounds good!

Actor hierarchy

First things first: we need to define our actor hierarchy. The system can be divided in three functional parts: the business logic, the cluster management, and the node itself. There is also the server but it is not an actor, and we will work on that later.

Business logic

The application should do mathematical computations. We can define a simple Processor actor to manage all the computational tasks. Every computation that we support can be implemented in a specific actor, that will be a child of the Processor one. In this way the application is modular and easier to extend and maintain. Right now the only child of Processor will be the ProcessorFibonacci actor. I suppose you can guess what its task is. This should be enough to start.

Cluster management

To manage the cluster we need a ClusterManager. Sounds simple, right? This actor handles everything related to the cluster, like returning its members when asked. It would be useful to log what happens inside the cluster, so we define a ClusterListener actor. This is a child of the ClusterManager, and subscribes to cluster events logging them.

Node

The Node actor is the root of our hierarchy. It is the entry point of our system that communicates with the API. The Processor and the ClusterManager are its children, along with the ProcessorRouter actor. This is the load balancer of the system, distributing the load among Processors. We will configure it as a Cluster Aware Router, so every ProcessorRouter can send messages to Processors on every node.

Actor hierarchy

Actor Implementation

Time to implement our actors! Fist we implement the actors related to the business logic of the system. We move then on the actors for the cluster management and the root actor (Node) in the end.

ProcessorFibonacci

This actor executes the computation of the Fibonacci number. It receives a Compute message containing the number to compute and the reference of the actor to reply to. The reference is important, since there can be different requesting actors. Remember that we are working in a distributed environment!

Once the Compute message is received, the fibonacci function computes the result. We wrap it in a ProcessorResponse object to provide information on the node that executed the computation. This will be useful later to see the round-robin policy in action.

The result is then sent to the actor we should reply to. Easy-peasy.

object ProcessorFibonacci {
  sealed trait ProcessorFibonacciMessage
  case class Compute(n: Int, replyTo: ActorRef) extends ProcessorFibonacciMessage

  def props(nodeId: String) = Props(new ProcessorFibonacci(nodeId))

  def fibonacci(x: Int): BigInt = {
    @tailrec def fibHelper(x: Int, prev: BigInt = 0, next: BigInt = 1): BigInt = x match {
      case 0 => prev
      case 1 => next
      case _ => fibHelper(x - 1, next, next + prev)
    }
    fibHelper(x)
  }
}

class ProcessorFibonacci(nodeId: String) extends Actor {
  import ProcessorFibonacci._

  override def receive: Receive = {
    case Compute(value, replyTo) => {
      replyTo ! ProcessorResponse(nodeId, fibonacci(value))
    }
  }
}

Processor

The Processor actor manages the specific sub-processors, like the Fibonacci one. It should instantiate the sub-processors and forward the requests to them. Right now we only have one sub-processor, so the Processor receives one kind of message: ComputeFibonacci. This message contains the Fibonacci number to compute. Once received, the number to compute is sent to a FibonacciProcessor, along with the reference of the sender().

object Processor {

  sealed trait ProcessorMessage

  case class ComputeFibonacci(n: Int) extends ProcessorMessage

  def props(nodeId: String) = Props(new Processor(nodeId))
}

class Processor(nodeId: String) extends Actor {
  import Processor._

  val fibonacciProcessor: ActorRef = context.actorOf(ProcessorFibonacci.props(nodeId), "fibonacci")

  override def receive: Receive = {
    case ComputeFibonacci(value) => {
      val replyTo = sender()
      fibonacciProcessor ! Compute(value, replyTo)
    }
  }
}

ClusterListener

We would like to log useful information about what happens in the cluster. This could help us to debug the system if we need to. This is the purpose of the ClusterListener actor. Before starting, it subscribes itself to the event messages of the cluster. The actor reacts to messages like MemberUp, UnreachableMember, or MemberRemoved, logging the corresponding event. When ClusterListener is stopped, it unsubscribes itself from the cluster events.

object ClusterListener {
  def props(nodeId: String, cluster: Cluster) = Props(new ClusterListener(nodeId, cluster))
}

class ClusterListener(nodeId: String, cluster: Cluster) extends Actor with ActorLogging {

  override def preStart(): Unit = {
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])
  }

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member) =>
      log.info("Node {} - Member is Up: {}", nodeId, member.address)
    case UnreachableMember(member) =>
      log.info(s"Node {} - Member detected as unreachable: {}", nodeId, member)
    case MemberRemoved(member, previousStatus) =>
      log.info(s"Node {} - Member is Removed: {} after {}",
        nodeId, member.address, previousStatus)
    case _: MemberEvent => // ignore
  }
}

ClusterManager

The actor responsible of the management of the cluster is ClusterManager. It creates the ClusterListener actor, and provides the list of cluster members upon request. It could be extended to add more functionalities, but right now this is enough.

object ClusterManager {

  sealed trait ClusterMessage
  case object GetMembers extends ClusterMessage

  def props(nodeId: String) = Props(new ClusterManager(nodeId))
}

class ClusterManager(nodeId: String) extends Actor with ActorLogging {

  val cluster: Cluster = Cluster(context.system)
  val listener: ActorRef = context.actorOf(ClusterListener.props(nodeId, cluster), "clusterListener")

  override def receive: Receive = {
    case GetMembers => {
      sender() ! cluster.state.members.filter(_.status == MemberStatus.up)
        .map(_.address.toString)
        .toList
    }
  }
}

ProcessorRouter

The load-balancing among processors is handled by the ProcessorRouter. It is created by the Node actor, but this time all the required information are provided in the configuration of the system.

class Node(nodeId: String) extends Actor {

  //...

  val processorRouter: ActorRef = context.actorOf(FromConfig.props(Props.empty), "processorRouter")

  //...
}

Let’s analyse the relevant part in the application.conf file.

akka {
  actor {
    ...
    deployment {
      /node/processorRouter {
        router = round-robin-group
        routees.paths = ["/user/node/processor"]
        cluster {
          enabled = on
          allow-local-routees = on
        }
      }
    }
  }
  ...
}

The first thing is to specify the path to the router actor, that is /node/processorRouter. Inside that property we can configure the behaviour of the router:

router: this is the policy for the load balancing of messages. I chose the round-robin-group, but there are many others.
routees.paths: these are the paths to the actors that will receive the messages handled by the router. We are saying: “When you receive a message, look for the actors corresponding to these paths. Choose one according to the policy and forward the message to it.” Since we are using Cluster Aware Routers, the routees can be on any node of the cluster.
cluster.enabled: are we operating in a cluster? The answer is on, of course!
cluster.allow-local-routees: here we are allowing the router to choose a routee in its node.

Using this configuration we can create a router to load balance the work among our processors.

Node

The root of our actor hierarchy is the Node. It creates the children actors — ClusterManager, Processor, and ProcessorRouter — and forwards the messages to the right one. Nothing complex here.

object Node {

  sealed trait NodeMessage

  case class GetFibonacci(n: Int)

  case object GetClusterMembers

  def props(nodeId: String) = Props(new Node(nodeId))
}

class Node(nodeId: String) extends Actor {

  val processor: ActorRef = context.actorOf(Processor.props(nodeId), "processor")
  val processorRouter: ActorRef = context.actorOf(FromConfig.props(Props.empty), "processorRouter")
  val clusterManager: ActorRef = context.actorOf(ClusterManager.props(nodeId), "clusterManager")

  override def receive: Receive = {
    case GetClusterMembers => clusterManager forward GetMembers
    case GetFibonacci(value) => processorRouter forward ComputeFibonacci(value)
  }
}

Server and API

Every node of our cluster runs a server able to receive requests. The Server creates our actor system and is configured through the application.conf file.

object Server extends App with NodeRoutes {

  implicit val system: ActorSystem = ActorSystem("cluster-playground")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val config: Config = ConfigFactory.load()
  val address = config.getString("http.ip")
  val port = config.getInt("http.port")
  val nodeId = config.getString("clustering.ip")

  val node: ActorRef = system.actorOf(Node.props(nodeId), "node")

  lazy val routes: Route = healthRoute ~ statusRoutes ~ processRoutes

  Http().bindAndHandle(routes, address, port)
  println(s"Node $nodeId is listening at http://$address:$port")

  Await.result(system.whenTerminated, Duration.Inf)

}

Akka HTTP powers the server itself and the REST API, exposing three simple endpoints. These endpoints are defined in the NodeRoutes trait.

The first one is /health, to check the health of a node. It responds with a 200 OK if the node is up and running

lazy val healthRoute: Route = pathPrefix("health") {
    concat(
      pathEnd {
        concat(
          get {
            complete(StatusCodes.OK)
          }
        )
      }
    )
  }

The /status/members endpoint responds with the current active members of the cluster.

lazy val statusRoutes: Route = pathPrefix("status") {
    concat(
      pathPrefix("members") {
        concat(
          pathEnd {
            concat(
              get {
                val membersFuture: Future[List[String]] = (node ? GetClusterMembers).mapTo[List[String]]
                onSuccess(membersFuture) { members =>
                  complete(StatusCodes.OK, members)
                }
              }
            )
          }
        )
      }
    )
  }

The last (but not the least) is the /process/fibonacci/n endpoint, used to request the Fibonacci number of n.

lazy val processRoutes: Route = pathPrefix("process") {
    concat(
      pathPrefix("fibonacci") {
        concat(
          path(IntNumber) { n =>
            pathEnd {
              concat(
                get {
                  val processFuture: Future[ProcessorResponse] = (node ? GetFibonacci(n)).mapTo[ProcessorResponse]
                  onSuccess(processFuture) { response =>
                    complete(StatusCodes.OK, response)
                  }
                }
              )
            }
          }
        )
      }
    )
  }

It responds with a ProcessorResponse containing the result, along with the id of the node where the computation took place.

Cluster Configuration

Once we have all our actors, we need to configure the system to run as a cluster! The application.conf file is where the magic takes place. I’m going to split it in pieces to present it better, but you can find the complete file here.

Let’s start defining some useful variables.

clustering {
  ip = "127.0.0.1"
  ip = ${?CLUSTER_IP}

  port = 2552
  port = ${?CLUSTER_PORT}

  seed-ip = "127.0.0.1"
  seed-ip = ${?CLUSTER_SEED_IP}

  seed-port = 2552
  seed-port = ${?CLUSTER_SEED_PORT}

  cluster.name = "cluster-playground"
}

Here we are simply defining the ip and port of the nodes and the seed, as well as the cluster name. We set a default value, then we override it if a new one is specified. The configuration of the cluster is the following.

akka {
  actor {
    provider = "cluster"
    ...
    /* router configuration */
    ...
  }
  remote {
    log-remote-lifecycle-events = on
    netty.tcp {
      hostname = ${clustering.ip}
      port = ${clustering.port}
    }
  }
  cluster {
    seed-nodes = [
      "akka.tcp://"${clustering.cluster.name}"@"${clustering.seed-ip}":"${clustering.seed-port}
    ]
    auto-down-unreachable-after = 10s
  }
}
...
/* server vars */
...
/* cluster vars */
}

Akka Cluster is build on top of Akka Remoting, so we need to configure it properly. First of all, we specify that we are going to use Akka Cluster saying that provider = "cluster". Then we bind cluster.ip and cluster.port to the hostname and port of the netty web framework.

The cluster requires some seed nodes as its entry points. We set them in the seed-nodes array, in the format akka.tcp://"{clustering.cluster.name}"@"{clustering.seed-ip}":”${clustering.seed-port}”. Right now we have one seed node, but we may add more later.

The auto-down-unreachable-after property sets a member as down after it is unreachable for a period of time. This should be used only during development, as explained in the official documentation.

Ok, the cluster is configured, we can move to the next step: Dockerization and deployment!

Dockerization and deployment

To create the Docker container of our node we can use sbt-native-packager. Its installation is easy: add addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.3.15") to the plugin.sbt file in the project/ folder. This amazing tool has a plugin for the creation of Docker containers. it allows us to configure the properties of our Dockerfile in the build.sbt file.

// other build.sbt properties

enablePlugins(JavaAppPackaging)
enablePlugins(DockerPlugin)
enablePlugins(AshScriptPlugin)

mainClass in Compile := Some("com.elleflorio.cluster.playground.Server")
dockerBaseImage := "java:8-jre-alpine"
version in Docker := "latest"
dockerExposedPorts := Seq(8000)
dockerRepository := Some("elleflorio")

Once we have setup the plugin, we can create the docker image running the command sbt docker:publishLocal. Run the command and taste the magic… ?

We have the Docker image of our node, now we need to deploy it and check that everything works fine. The easiest way is to create a docker-compose file that will spawn a seed and a couple of other nodes.

version: '3.5'

networks:
  cluster-network:

services:
  seed:
    networks:
      - cluster-network
    image: elleflorio/akka-cluster-playground
    ports:
      - '2552:2552'
      - '8000:8000'
    environment:
      SERVER_IP: 0.0.0.0
      CLUSTER_IP: seed
      CLUSTER_SEED_IP: seed

  node1:
    networks:
      - cluster-network
    image: elleflorio/akka-cluster-playground
    ports:
      - '8001:8000'
    environment:
      SERVER_IP: 0.0.0.0
      CLUSTER_IP: node1
      CLUSTER_PORT: 1600
      CLUSTER_SEED_IP: seed
      CLUSTER_SEED_PORT: 2552

  node2:
    networks:
      - cluster-network
    image: elleflorio/akka-cluster-playground
    ports:
      - '8002:8000'
    environment:
      SERVER_IP: 0.0.0.0
      CLUSTER_IP: node2
      CLUSTER_PORT: 1600
      CLUSTER_SEED_IP: seed
      CLUSTER_SEED_PORT: 2552

I won’t spend time going through it, since it is quite simple.

Let’s run it!

Time to test our work! Once we run the docker-compose up command, we will have a cluster of three nodes up and running. The seed will respond to requests at port :8000, while node1 and node2 at port :8001 and :8002. Play a bit with the various endpoints. You will see that the requests for a Fibonacci number will be computed by a different node each time, following a round-robin policy. That’s good, we are proud of our work and can get out for a beer to celebrate! ?

Conclusion

We are done here! We learned a lot of things in these ten minutes:

What Akka Cluster is and what can do for us.
How to create a distributed application with it.
How to configure a Group Router for load-balancing in the cluster.
How to Dockerize everything and deploy it using docker-compose.

You can find the complete application in my GitHub repo. Feel free to contribute or play with it as you like! ?

See you! ?

A survival guide to the Either monad in Scala

freeCodeCamp — Fri, 21 Dec 2018 16:37:01 +0000

By Luca Florio

I started to work with Scala few months ago. One of the concepts that I had the most difficulties to understand is the Either monad. So, I decided to play around with it and better understand its power.

In this story I share what I’ve learned, hoping to help coders approaching this beautiful language.

The Either monad

Either is one of the most useful monads in Scala. If you are wondering what a monad is, well… I cannot go into the details here, maybe in a future story!

Imagine Either like a box containing a computation. You work inside this box, until you decide to get the result out of it.

In this specific case, our Either box can have two “forms”. It can be (either) a Left or a Right, depending on the result of the computation inside it.

I can hear you asking: “OK, and what is it useful for?”

The usual answer is: error handling.

We can put a computation in the Either, and make it a Left in case of errors, or a Right containing a result in case of success. The use of Left for errors, and Right for success is a convention. Let’s understand this with some code!

https://scalafiddle.io/sf/sgAxvBI/0

In this snippet we are only defining an Either variable.

We can define it as a Right containing a valid value, or as Left containing an error. We also have a computation that return an Either, meaning it can be a Left or a Right. Simple, isn’t it?

Right and left projection

Once we have the computation in the box, we may want to get the value out of it. I’m sure you expect to call a .get on the Either and extract your result.

That’s not so simple.

Think about it: you put your computation in the Either, but you don’t know if it resulted in a Left or a Right. So what should a .get call return? The error, or the value?

This is why to get the result you should make an assumption about the outcome of the computation.

Here is where the projection comes into play.

Starting from an Either, you can get a RightProjection or a LeftProjection. The former means that you assume the computation resulted in a Right, the latter in a Left.

I know, I know… this may be a little confusing. It’s better to understand it with some code. After all, code always tells the truth.

https://scalafiddle.io/sf/8amYK4r

That’s it. Note that when you try to get the result from a RightProjection, but it is a Left, you get an exception. The same goes for a LeftProjection and you have a Right.

The cool thing is that you can map on projections. This means you can say: “assume it is a Right: do this with it”, leaving the Left unchanged (and the other way around).

From Option to Either

Option is another common way to deal with invalid values.

An Option can have a value or be empty (it’s value is Nothing). I bet you noticed a similarity with Either… It’s even better, because we can actually transform an Option into an Either! Code time!

https://scalafiddle.io/sf/KCtNRkL

It is possible to transform an Option to a Left or a Right. The resulting side of the Either will contain the value of the Option if it is defined. Cool. Wait a minute… What if the Option is empty? We get the other side, but we need to specify what we expect to find in it.

Inside out

Either is magic, we all agree on that. So we decide to use it for our uncertain computations. A typical scenario when doing functional programming is the mapping a function on a List of elements, or on a Map. Let’s do it with our fresh new Either-powered computation…

https://scalafiddle.io/sf/WozPbpV

Huston, we have a “problem” (ok, it’s not a BIG problem, but it is a bit uncomfortable). It would be better to have the collection inside the Either than lots of Either inside the collection. We can work on that.

List

Let’s start with List. First we reason about it, then we can play with code.

We have to extract the value from the Either, put it in the List, and put the list inside an Either. Good, I like it.

The point is that we can have a Left or a Right, so we need to handle both cases. Until we find a Right, we can put its value inside a new List. We proceed this way accumulating every value in the new List.

Eventually we will reach the end of the List of Either, meaning we have a new List containing all the values. We can pack it in a Right and we are done. This was the case where our computation didn’t return an Error inside a Left.

If this happens, it means that something went wrong in our computation, so we can return the Left with the Error. We have the logic, now we need the code.

https://scalafiddle.io/sf/LjaXKPT

Map

The work on Map is quite simple once we have done the homework for the List (despite needing to make it generic):

Step one: transform the Map in a List of Either containing the tuple (key, value).
Step two: pass the result to the function we defined on List.
Step three: transform the List of tuples inside the Either in a Map.

Easy Peasy.

https://scalafiddle.io/sf/vlY7xV0

Let’s get classy: a useful implicit converter

We introduced Either and understood it is useful for error handling. We played a bit with projections. We saw how to pass from an Option to an Either. We also implemented some useful functions to “extract” Either from List and Map. So far so good.

I would like to conclude our journey in the Either monad going a little bit further. The utility functions we defined do their jobs, but I feel like something is missing…

It would be amazing to do our conversion directly on the collection. We would have something like myList.toEitherList or myMap.toEitherMap. More or less like what we do with Option.toRight or Option.toLeft.

Good news: we can do it using implicit classes!

https://scalafiddle.io/sf/j8Ixqz5

Using implicit classes in Scala lets us extend the capabilities of another class.

In our case, we extend the capability of List and Map to automagically “extract” the Either. The implementation of the conversion is the same we defined before. The only difference is that now we make it generic. Isn’t Scala awesome?

Since this can be a useful utility class, I prepared for you a gist you can copy and paste with ease.

object EitherConverter {
  implicit class EitherList[E, A](le: List[Either[E, A]]){
    def toEitherList: Either[E, List[A]] = {
      def helper(list: List[Either[E, A]], acc: List[A]): Either[E, List[A]] = list match {
        case Nil => Right(acc)
        case x::xs => x match {
          case Left(e) => Left(e)
          case Right(v) => helper(xs, acc :+ v)
        }
      }

      helper(le, Nil)
    }
  }

  implicit class EitherMap[K, V, E](me: Map[K, Either[E, V]]) {
    def toEitherMap: Either[E, Map[K, V]] = me.map{
        case (k, Right(v)) => Right(k, v)
        case (_, e) => e
      }.toList.toEitherList.map(l => l.asInstanceOf[List[(K, V)]].toMap)
  }
}

Conclusion

That’s all folks. I hope this short story may help you to better understand the Either monad.

Please note that my implementation is quite simple. I bet there are more complex and elegant ways to do the same thing. I’m a newbie in Scala and I like to KISS, so I prefer readability over (elegant) complexity.

If you have a better solution, especially for the utility class, I will be happy to see it and learn something new! :-)

An introduction to Akka HTTP routing

freeCodeCamp — Wed, 19 Dec 2018 19:17:41 +0000

By Miguel Lopez

Akka HTTP’s routing DSL might seem complicated at first, but once you get the hang of it you’ll see how powerful it is.

In this tutorial we will focus on creating routes and their structure. We won’t cover parsing to and from JSON, we have other tutorials that cover that topic.

What are directives?

One of the first concepts we’ll find when learning server-side Akka HTTP (there’s a client-side library as well) is directives.

So, what are they?

You can think of them as building blocks, Lego pieces if you will, that you can use to construct your routes. They are composable, which means we can create directives on top of other directives.

If you want a more in-depth reading, feel free to check out Akka HTTP’s official documentation.

Before moving on, let’s discuss what we’ll build.

Blog-like API

We’ll create a sample of a public facing API for a blog, where we will allow users to:

query a list of tutorials
query a single tutorial by ID
query the list of comments in a tutorial
add comments to a tutorial

The endpoints will be:

- List all tutorials GET /tutorials

- Create a tutorial GET /tutorials/:id

- Get all comments in a tutorial GET /tutorials/:id/comments

- Add a comment to a tutorial POST /tutorials/:id/comments

We will only implement the endpoints, no logic in them. This way we’ll learn how to create this structure and the common pitfalls when starting with Akka HTTP.

Project Setup

We’ve created a repo for this tutorial, in it you’ll find a branch per each section that requires coding. Feel free to clone it and use it as a base project or even just change between branches to look at the differences.

Otherwise, create a new SBT project, and then add the dependencies in the build.sbt file:

name := "akkahttp-routing-dsl"

version := "0.1"

scalaVersion := "2.12.7"

val akkaVersion = "2.5.17" val akkaHttpVersion = "10.1.5"

libraryDependencies ++= Seq(   "com.typesafe.akka" %% "akka-actor" % akkaVersion,   "com.typesafe.akka" %% "akka-testkit" % akkaVersion % Test,  "com.typesafe.akka" %% "akka-stream" % akkaVersion,   "com.typesafe.akka" %% "akka-stream-testkit" % akkaVersion % Test,   "com.typesafe.akka" %% "akka-http" % akkaHttpVersion,   "com.typesafe.akka" %% "akka-http-testkit" % akkaHttpVersion % Test,      "org.scalatest" %% "scalatest" % "3.0.5" % Test )

We added Akka HTTP and its dependencies, Akka Actor and Streams. And we will also use Scalatest for testing.

Listing all the tutorials

We’ll take a TDD approach to build our directive hierarchy, creating the tests first to make sure when don’t break our routes when adding others. Taking this approach is quite helpful when starting with Akka HTTP.

Let’s start with our route to listing all the tutorials. Create a new file under src/test/scala (if the folders don't exist, create them) named RouterSpec:

import akka.http.scaladsl.testkit.ScalatestRouteTest import org.scalatest.{Matchers, WordSpec}

class RouterSpec extends WordSpec with Matchers with ScalatestRouteTest {

WordSpec and Matchers are provided by Scalatest, and we'll use them to structure our tests and assertions. ScalatestRouteTest is a trait provided by Akka HTTP's test kit, it will allow us to test our routes in a convenient way. Let's see how we can accomplish that.

Because we’re using Scalatest’s WordSpec, we’ll start by creating a scope for our Router object that we will create soon and the first test:

"A Router" should {   "list all tutorials" in {   } }

Next, we want to make sure can send a GET request to the path /tutorials and get the response we expect, let's see how we can accomplish that:

Get("/tutorials") ~> Router.route ~> check {   status shouldBe StatusCodes.OK   responseAs[String] shouldBe "all tutorials" }

It won’t even compile because we haven’t created our Router object. Let's do that now.

Create a new Scala object under src/main/scala named Router. In it we will create a method that will return a Route:

import akka.http.scaladsl.server.Route

object Router {

  def route: Route = ???

Don’t worry too much about the ???, it's just a placeholder to avoid compilation errors temporarily. However, if that code is executed, it'll throw a NotImplementedError as we'll see soon.

Now that our tests and project are compiling, let’s run the tests (Right-click the spec and “Run ‘RouterSpec’”).

The test failed with the exception we were expecting, we haven’t implemented our routes. Let’s begin!

Creating the listing route

By looking into the official documentation we see that the route begins with the path directive. Let's mimic what they're doing and build our route:

import akka.http.scaladsl.server.{Directives, Route}

object Router extends Directives {

  def route: Route = path("tutorials") {    get {      complete("all tutorials")    }  }}

Seems reasonable, let’s run our spec. And it passes, great!

For reference, our entire RouterSpec now looks like:

import akka.http.scaladsl.model.StatusCodesimport akka.http.scaladsl.testkit.ScalatestRouteTestimport org.scalatest.{Matchers, WordSpec}class RouterSpec extends WordSpec with Matchers with ScalatestRouteTest {  "A Router" should {    "list all tutorials" in {      Get("/tutorials") ~> Router.route ~> check {        status shouldBe StatusCodes.OK        responseAs[String] shouldBe "all tutorials"      }    }  }}

Getting a single tutorial by ID

Next, we will allow our users to retrieve a single tutorial.

Let’s add a test for our new route:

"return a single tutorial by id" in {  Get("/tutorials/hello-world") ~> Router.route ~> check {    status shouldBe StatusCodes.OK    responseAs[String] shouldBe "tutorial hello-world"  }}

We expect to get back a message that includes the tutorial ID.

The test will fail because we haven’t created our route, let’s do that now.

From the same resource we used earlier to base our route on, we can see how we can place multiple directives at the same level in the hierarchy using the ~ directive.

We will have to nest path directives because need another segment after the /tutorials route for the tutorial ID. In the documentation they use IntNumber to extract a number from the path, but we'll use a string and for that we use can Segment instead.

Our route looks like:

def route: Route = path("tutorials") {  get {    complete("all tutorials")  } ~ path(Segment) { id =>    get {      complete(s"tutorial $id")    }  }}

Let’s run the tests. And you should get a similar error:

Request was rejectedScalaTestFailureLocation: RouterSpec at (RouterSpec.scala:17)org.scalatest.exceptions.TestFailedException: Request was rejected

What’s going on?!

Well, a request is rejected when it doesn’t match our directive hierarchy. This is one of the things that got me when starting.

Now is probably a good time to look into how these directives match the incoming request as it goes through the hierarchy.

Different directives will match different aspects of an incoming request, we’ve seen path and get, one matches the URL of the request and the other the method. If a request matches a directive it will go inside it, if it doesn't it will continue to the next one. This also tells us that order matters. If it doesn't match any directive the request is rejected.

Now that we now that our request is not matching our directives, let’s start looking into why.

If we look the documentation for the path directive (Cmd + Click on Mac) we'll find:

/** * Applies the given [[PathMatcher]] to the remaining unmatched path after consuming a leading slash. * The matcher has to match the remaining path completely. * If matched the value extracted by the [[PathMatcher]] is extracted on the directive level. * * @group path */

So, the path directive has to match exactly the path, meaning our first path directive will only match /tutorials and never /tutorials/:id.

In the same PathDirectives trait that contains the path directive we can see another directive named pathPrefix:

/** * Applies the given [[PathMatcher]] to a prefix of the remaining unmatched path after consuming a leading slash. * The matcher has to match a prefix of the remaining path. * If matched the value extracted by the PathMatcher is extracted on the directive level. * * @group path */

pathPrefix matches only a prefix and removes it. Sounds like this is what we're looking for, let's update our routes:

def route: Route = pathPrefix("tutorials") {  get {    complete("all tutorials")  } ~ path(Segment) { id =>    get {      complete(s"tutorial $id")    }  }}

Run the tests, and… we get another error. ?

"[all tutorials]" was not equal to "[tutorial hello-world]"ScalaTestFailureLocation: RouterSpec at (RouterSpec.scala:18)Expected :"[tutorial hello-world]"Actual   :"[all tutorials]"

Looks like our request matched the first get directive. It now matches the pathPrefix, and because it also is a GET request it will match the first get directive. Order matters.

There are a couple of things we can do. The simplest solution would be to move the first get request to the end of the hierarchy, however, we would have to remember this or document it. Not ideal.

Personally, I prefer avoiding such solutions and instead make the intend clear through code. If we look in the PathDirectives trait from earlier, we'll find a directive called pathEnd:

/** * Rejects the request if the unmatchedPath of the [[RequestContext]] is non-empty, * or said differently: only passes on the request to its inner route if the request path * has been matched completely. * * @group path */

That’s exactly what we want, so let’s wrap our first get directive with pathEnd:

def route: Route = pathPrefix("tutorials") {  pathEnd {    get {      complete("all tutorials")    }  } ~ path(Segment) { id =>    get {      complete(s"tutorial $id")    }  }}

Run the tests again, and… finally, the tests are passing! ?

Listing all comments in a tutorial

Let’s put into practice what we learned about nesting routes by taking it a bit further.

First the test:

"list all comments of a given tutorial" in {  Get("/tutorials/hello-world/comments") ~> Router.route ~> check {    status shouldBe StatusCodes.OK    responseAs[String] shouldBe "comments for the hello-world tutorial"  }}

It’s a similar case as before: we know we’ll need to place a route next to another one, which means we need to:

change the path(Segmenter) to pathPrefix(Segmenter)
wrap the first get with the pathEnd directive
place the new route next to the pathEnd

Our routes end up looking like:

def route: Route = pathPrefix("tutorials") {  pathEnd {    get {      complete("all tutorials")    }  } ~ pathPrefix(Segment) { id =>    pathEnd {      get {        complete(s"tutorial $id")      }    } ~ path("comments") {      get {        complete(s"comments for the $id tutorial")      }    }  }}

Run the tests, and they should pass! ?

Adding comments to a tutorial

Our last endpoint is similar to the previous, but it will match POST requests. We’ll use this example to see the difference between implementing and testing a GET request versus a POST request.

The test:

"add comments to a tutorial" in {  Post("/tutorials/hello-world/comments", "new comment") ~> Router.route ~> check {    status shouldBe StatusCodes.OK    responseAs[String] shouldBe "added the comment 'new comment' to the hello-world tutorial"  }}

We’re using the Post method instead of the Get we've been using, and we're giving it an additional parameter which is the request body. The rest is familiar to us now.

To implement our last route, we can refer to the documentation and look at how it’s usually done.

We have a post directive just as we have a get one. To extract the request body we need two directives, entity and as, to which we supply the type we expect. In our case it's a string.

Let’s give that a try:

post {  entity(as[String]) { comment =>    complete(s"added the comment '$comment' to the $id tutorial")  }}

Looks reasonable. We extract the request body as a string and use it in our response. Let’s add it to our route method next to the previous route we worked on:

def route: Route = pathPrefix("tutorials") {  pathEnd {    get {      complete("all tutorials")    }  } ~ pathPrefix(Segment) { id =>    pathEnd {      get {        complete(s"tutorial $id")      }    } ~ path("comments") {      get {        complete(s"comments for the $id tutorial")      } ~ post {        entity(as[String]) { comment =>          complete(s"added the comment '$comment' to the $id tutorial")        }      }    }  }}

If you’d like to learn how to parse Scala classes to and from JSON we’ve got tutorials for that as well.

Run the tests, and they should all pass.

Conclusion

Akka HTTP’s routing DSL might seem confusing at first, but after overcoming some bumps it just clicks. After a while it’ll come naturally and it can be very powerful.

We learned how to structure our routes, but more importantly, we learned how to create that structure guided by tests which will make sure we don’t break them at some point in the future.

Even though we only worked on four endpoints, we ended up with a somewhat complicated and deep structure. Stay tuned and we’ll explore different ways to simplify our routes and make them more manageable!

Learn how to build REST APIs with Scala and Akka HTTP with this step-by-step free course!

Originally published at www.codemunity.io.

How to write composable functions and correct programs

freeCodeCamp — Sun, 02 Dec 2018 11:52:18 +0000

By Pavel Zaytsev

An overview with monads and Kleisli arrows

A few words on the traditions ?

If you have not written about monads are you even a programmer?

It is a well-known fact that any respectable programmer who blogs about tech must at least once in their life write a tutorial on monads. Most of the times it starts from functors and builds all the way up. The difficult comparisons are drawn here and there with the short snippets of code sprinkled around to demonstrate a couple of applications — all seem completely orthogonal, of course.

The problem is that there is a gigantic conceptual gap between “What is a monad?” and “How is a monad used in programming?”

The main reason for this gap is that people don’t usually think of their programs in such a way that they would need a monad in the first place. If you come from the imperative background — and most of us do — even when you finally go functional, it usually takes a while to see the entire picture and understand the correct way of doing things. We often preach referential transparency and purity. We often have a general idea of what these things mean separately, but it takes some time to piece them together in one coherent concept.

This article is not about monads but instead about writing composable functions and correct programs. The monads appear as the by-product of this goal. The following discussion also, hopefully, will facilitate bridging the aforementioned conceptual gap.

It’s all about that composition ?‍

In programming, the interesting and important problems are usually too complex too to deal with as a whole. Programmers break them up into smaller comprehensible pieces that they solve one-by-one and then bring the solutions to these subproblems into a clear picture. There are many ways of doing it, but there can be clearly defined two schools of thought — declarative and imperative approaches.

The imperative approach views a program as a sequence of computations in time, and the unit of a program is a statement. A statement does something. For example, it can print output to a console, assign a value to a variable, or compute a result.

The declarative approach is global. It cares only about the initial and final state of the system, and a unit of a program is an expression. The expression always evaluates to a value. In general, it’s harder to reason about a statement rather than about an expression because a statement is far too broad and unbounded, while an expression always deterministically evaluates to the same value for all the programs it can potentially appear in.

If you are a functional programmer and you follow a declarative paradigm, you see a program as a composition of functions, and functions are just the expressions. You can think of a composition as just a fusion of two or more functions into one.

If you have a function f that takes an argument of type A and it returns a result of type B and another function g that takes an argument of type B, and it returns a result of type C, you can compose them by passing a result of f to g. What you get is a new function that accepts an argument of type A and produces the result of type C. Do a couple of these, and you get a working program. This is similar to deductive reasoning, which is just a composition of logical steps, that follow from premises to the conclusions.

The problem is that in real programming it’s not so simple, and many functions do not evaluate to the same value all the time even if their arguments do. These functions do network calls, they read from files and databases, they write to them, they process a lot of different data, they print to a console, etc. Functions become context-dependent, and the more context-dependent they are, the more difficult it is to compose them. So if you have to deal with functions where each one of them introduces some of this complexity here and there, you better find a way to compose them.

The ever-lasting problem ?

As I have already mentioned, writing useful code entails dealing with the chaos of the outside world — a problem of handling effects and state. By effect, I mean an effect on the outside world, like writing to a database or printing to a console. Even a simple beaten-to-death HelloWorld program produces an effect.

Many situations are traditionally solved by abandoning the purity of functions, including but not limited to:

computations that access/modify a state outside of the context of a program.
computations that may fail or never terminate
computations whose result might be only known sometime in the future, aka asynchronous computations
console input and output

As you can imagine, the functions that behave like this are not so easy to compose. Let’s say a function, given some username in the form of an input string, reads from a file, finds the given user, and then passes their last name to the next function for later processing. How do you represent a function like that? String => String? But that would not be entirely correct, as you do get a string and you do return one, but you also do something else in between and it depends on the context — a file system and a file.

In functional programming, we usually encode this information in the function’s return type. We place the result of a computation in a box, so to speak, and our function turns from A =>; B to A => Box[B].

Whatever happens inside this box is implementation-specific. But the definition of a function now describes what it does and allows to handle the function's behavior and compose it, much better.

Let’s look at the situations above and try to come up with a box for every one of them.

Box all the things ?

Disclaimer: In this article, I will be using Scala to express my ideas, as it’s arguably one of the mainstream functional programming languages out there. But the same concepts can be easily translated into any functional programming language with a strong static type system and type constructors (like Haskell and OCaml).

A function that reads but does not modify the external state can be thought of as a function that also accepts this external state as an argument and, given some input parameter, produces a result. An example is reading from a database or config file.

We also supply a run function to materialize a reader and compute a result. The classic use case would be to:

provide a configuration file
stack the readers to set up an application
execute the run function in the end on the instance of the config file.

In the OOP world, a similar approach goes by Dependency Injection.

Functions that carry along the state of a computation accept two arguments — a regular one, that participates in the actual computation, and the state that will be propagated all the way through the composition.

We supply a run function that yields the result which a current writer holds. A typical use case of Writer is to propagate messages and errors during the program execution, and extracts the log alongside the final result. Another use case would be to record the sequences of steps in a multi-threaded environment. Since the result of a computation is tied to a particular log, the messages don’t get intertwined.

A function that never terminates (for example a long running process) is lifted to a bottom type of Nothing. If you expect that your computation might either return a value or run forever, you can model it as a disjoint union of a result type and Nothing. An example would be a stream consumer that processes data indefinitely until some event makes it yield.

A function that might fail can be modeled as a partial function — the one that’s not defined for some values. So we do just that — for some cases return nothing and, for others, an actual value. If you want the information regarding the failure and to interop with Java APIs that throw exceptions, you can carry the information of an error with you in the disjoint union.

Asynchronous computations do not execute on the current call-stack or a main flow of the program. They can be modeled as a function that accepts a handler, or callback. When this handler eventually is called — for example by some other thread or a web server — the result is produced.

ExecutionContext manages the details of the asynchronous execution. On JVM it is a thread-pool, but it does not have to be threads that deal with asynchrony. Function run takes it implicitly because callbacks need to be called asynchronously. Every function that calls run needs to have an implicit ExecutionContext in its signature as well. ExecutionContext also makes recursive asynchronous calls stack-safe because you introduce an asynchronous boundary every time you call a function. As I mentioned earlier, asynchronous computations do not execute on the same call-stack. Pretty neat.

Console input can be modeled as a two-phase process. In the first phase we describe the computation and a result that a console input might produce. In the second phase, we run or interpret this computation. The data type that holds the outcome of this computation is called IO. So our function takes a singleton and produces the result inside of an IO.

Notice, since I mentioned that we describe the computation first, we pass call-by-name parameters =>; A. They execute only when we need them to.
Console output can be modeled through IO as well, with our run function producing a singleton type of Unit.

So, now what ? ?

We, of course, implemented all the boxes with a type parameter to account for variability in types. We transform, or map, from some stuff to other stuff by supplying a function. A mapping between a category of some things to a category of other things is called a functor.

Functors are cool because they allow us to transform stuff inside, while preserving the structure of the original category, without messing things up. When using a functor, following the functor laws proves that things work as expected. It will get clear later in the article why these laws in particular.

Let’s define a map method that does transformation for each of the derived data types:

Notice that here we have defined a function pure that permits us to lift a pure computation into an asynchronous value.

Cool, now we can map things inside of the boxes. The problem is — it’s not useful because, although we can sequence computations within a functor, we can’t compose functor-producing functions:

F1: A -> Functor[B] ==> F2: A -> Functor[B] ==> F3: A -> Functor[B]

Each of F1, F2, and F3 may do something completely different. We need to account for that, we need to compose them. Fortunately, there is an excellent way of doing this.

Oh no, it’s that guy again! ?

Ok, I need to write a function that composes the functions for each of the functors in A => Functor[B]. The mathematical definition of composition is:

If A =>; B and B => C then A => C

So in our case:

If A => Functor[B] and B => Functor[C] then A => Functor[C]

Let’s start with a reader. Again, just follow the types:

We defined composition as andThen . We also defined pure , which lifts a value into a functor.

Composition in any category follows two simple laws.

It is associative:
There exists such a function, called identity, that, when composed with any function from left or right, produces that function again:

You can take a pen and a piece of paper and try to draw these two laws. Substituting entities with circles and connections between them as arrows, you can visually prove to yourself that this is the case. Remember functor laws earlier? It’s the same thing. These laws establish the composability in every category.

As programmers, we mostly deal with a category of sets — objects are sets or types, and arrows are functions. Not all mathematical structures need to have an identity, but category certainly does.

Identity, for example, can be used to show whether two objects, “sets,” are isomorphic, or equal. It can also provide guarantees in certain instances as we will see later in the article.

Category only has two properties — simple enough.

A functor that participates in A => Functor[B] and follows these composition laws, and hence composes for all the objects in the same category is not a burrito. It’s a monad.

Accordingly, the functions that we try to compose are actually A => Monad[B] . A wrapper around them, a category that is naturally associated with a Monad[B], is called a Kleisli category. A => Monad[B] or Kleisli arrows is just a way to compose these sort of functions, nothing more.

Kleisli has the same objects as the original category but arrows between a and b in Kleisli correspond to the arrows between a and Fb, where F is a functor.

Compose all the things! ?

We replace Java-like andThen with an arrow >==> . This is called a fish operator because it looks like a fish ?. We also define a function flatMap that, given a Monad[A] and A => Monad[B], produces A =&gt; Monad[B]. This will make implementing arrows easier:

Monoids

For a writer to be a monad, a second type should be composable too, and it’s composable only if it’s a monoid. Something is a monoid only if it can be associatively combined and has an empty element, which is an identity element.

For example, integers that sum up are a monoid with an identity element of 0. Strings that concat are a monoid with identity element of an empty string. Sets that union are a monoid with an identity element of an empty set. Sets that intersect form a semigroup and not a monoid. Intersecting a non-empty set with an empty one produces an empty set. You see why identity is useful?

Every other monad is easy to implement:

We will go into detail why IO looks like this and have a discussion about Free probably in another article.

When to use which? ?

Kleisli and Monad are two sides of the same coin. Many functional languages support them natively, but for Scala, at least in the current version 2.12.8, you can get them from the libraries like Cats and Scalaz. They have type classes for Kleisli and Monad: Kleisli[F[_], A, B] and Monad[A] respectively.

Monads are better at expressing the sequencing of the computations that happen within some context. In Scala, we do this contextual sequencing usually by utilizing for comprehensions.

Fun fact: If you compose Kleisli arrows for IO monad, you will get a description of your computer program. Your computer program is essentially one gigantic Kleisli arrow, with some input and output of Unit that acts as a description, and a runtime environment that executes this program works as an interpreter.

So every program, by default, has an IO context. If you have a function that produces an error, it creates a context of Option or Either. So every subsequent function should have a signature of A => Option[A] or A => Either[A, B]. Within an IO program it’s A => IO[Option[A]] and A => IO[Either[A, B]]. You decide when to collapse this context or nest it even further. Monad transformers can help with the sequencing of the nested contexts, but that is material for another article.

Asynchronous computations are only sequenced with the help of monads because monads explicitly solve the problems of synchronization and ordering in time. If you use Future combinators, for example, it is within a monadic context.

Kleisli is better when the arrow combinator is more suitable. For example, we can have a bunch of effectful or impure functions, and we can fuse them without wrapping the result into IO. You can also create a bunch of individual programs and combine them into one Kleisli in the end and run if you want, of course.

Reader[A] is better expressed with Kleisli than with monadic compositions because we can easily compose small local Kleisli arrows into a Kleisli arrow that represents our program’s global configuration. The large Kleisli arrow depends on some environment, let’s say a configuration file for production and development. Then we can run this at the very end, and configure the entire app.

This approach mimics the concept of dependency injection. Here we demonstrate KleisliIO, an effectful Kleisli from ZIO that performs a reader functionality to configure the application (error type omitted to save some space):

End ?

Whew, that was a blast. I hope you enjoyed this article and now understand what’s going a little bit better :).

Here are some useful links to learn more about Kleisli and composition:

A streaming library with a superpower: FS2 and functional programming

freeCodeCamp — Mon, 19 Nov 2018 21:20:26 +0000

By Daniel Sebban

Scala has a very special streaming library called FS2 (Functional Streams for Scala). This library embodies all the advantages of functional programming (FP). By understanding its design goals you will get exposure to the core ideas that make FP so appealing.

FS2 has one central type: **Stream[Effect,Output]**

You might get from this type that it’s a Stream and that it emits values of type Output.

The obvious question here is what is Effect? What is the link between Effect and Output? And what advantages does FS2 have over other streaming libraries?

Overview

I will start by reviewing what problems FS2 solves. Then I compare List and Stream with several code examples. After that, I will focus on how to use Stream with a DB or any other IO. That is where FS2 shines and where the Effecttype is used. Once you will understand what Effect is, the advantages of Functional Programming should be evident to you.

At the end of this post you will get the answers to the following questions:

What problems can I solve with FS2?
What can I do with Stream that List cannot ?
How can I feed data from an API/File/DB to Stream ?
What is this Effect type and how does it relate to functional programming?

Note: The code is in Scala and should be understandable even without prior knowledge of the syntax.

What problems can I solve with FS2?

Streaming I/O: Loading incrementally big data sets that would not fit in memory and operating on them without blowing your heap.
Control Flow (not covered): Moving data from one/several DBs/files/APIs to others in a nice declarative way.
Concurrency (not covered): Run different streams in parallel and make them communicate together. For example loading data from multiple files and processing them concurrently as opposed to sequentially. You can do some advanced stuff here. Streams can communicate together during the processing stage and not only at the end.

`List` vs `Stream`

List is the most well known and used data structure. To get a feel for how it differs from an FS2 Stream, we will go through a few use cases. We will see how Stream can solve problems that List cannot.

Your data is too big and does not fit in memory

Let’s say you have a very big file (40GB) fahrenheit.txt. The file has a temperature on each line and you want to convert it to celsius.txt.

Loading a big file using `List`

import scala.io.Source
val list = Source.fromFile("testdata/fahrenheit.txt").getLines.toList
java.lang.OutOfMemoryError: Java heap space
  java.util.Arrays.copyOfRange(Arrays.java:3664)
  java.lang.String.(String.java:207)
  java.io.BufferedReader.readLine(BufferedReader.java:356)
  java.io.BufferedReader.readLine(BufferedReader.java:389)

List fails miserably because of course, the file is too big to fit in memory. If you are curious, you can go check the full solution using Stream here — but do that later, read on :)

When List won’t do…Stream to the rescue!

Let’s say I succeeded in reading my file and I want to write it back. I would like to preserve the line structure. I need to insert a newline character \n after each temperature.

I can use the intersperse combinator to do that

import fs2._
Stream(1,2,3,4).intersperse("\n").toList

Another nice one is zipWithNext

scala> Stream(1,2,3,4).zipWithNext.toList
res1: List[(Int, Option[Int])] = List((1,Some(2)), (2,Some(3)), (3,Some(4)), (4,None))

It bundles consecutive things together, very useful if you want to remove consecutive duplicates.

These are only a few from a lot of very useful ones, here is the full list.

Obviously Stream can do a lot of things that List cannot, but the best feature is coming in the next section, it's all about how to use Stream in the real world with DBs/Files/APIs ...

How can I feed data from an API/File/DB to `Stream`?

Let’s just say for now that this our program

scala> Stream(1,2,3)
res2: fs2.Stream[fs2.Pure,Int] = Stream(..)

What does this Pure mean? Here is the scaladoc from the source code:

/**
    * Indicates that a stream evaluates no effects.
    *
    * A `Stream[Pure,O]` can be safely converted to a `Stream[F,O]` for all `F`.
*/
type Pure[A] <: Nothing

It means no effects, ok …, but What is an effect? and more specifically what is the effect of our program Stream(1,2,3)?

This program has literally no effect on the world. Its only effect will be to make your CPU work and consumes some power!! It does not affect the world around you.

By affecting the world I mean it consumes any meaningful resource like a file, a database, or it produces anything like a file, uploading some data somewhere, writing to your terminal, and so on.

How do I turn a `Pure` stream to something useful?

Let’s say I want to load user ids from a DB, I am given this function, it does a call to the DB and returns the userId as a Long.

import scala.concurrent.Future
def loadUserIdByName(userName: String): Future[Long] = ???

It returns a [Future](https://www.scala-lang.org/api/2.12.3/scala/concurrent/Future.html) which indicates that this call is asynchronous and the value will be available at some point in the future. It wraps the value returned by the DB.

I have this Pure stream.

scala> val names = Stream("bob", "alice", "joe")
names: fs2.Stream[fs2.Pure,String] = Stream(..)

How do I get a Stream of ids?

The naive approach would be to use the map function, it should run the function for each value in the Stream.

scala> userIdsFromDB.compile
res5: fs2.Stream.ToEffect[scala.concurrent.Future,Long] = fs2.Stream$ToEffect@fc0f18da

I still got back a Pure! I gave the Stream a function that affects the world and I still got a Pure, not cool ... It would have been neat if FS2 would have detected automatically that the loadUserIdByName function has an effect on the world and returned me something that is NOT Pure but it does not work like that. You have to use a special combinator instead of map: you have to use evalMap.

scala> userIdsFromDB.toList
:18: error: value toList is not a member of fs2.Stream[scala.concurrent.Future,Long]
       userIdsFromDB.toList
                     ^

No more Pure! we got Future instead, yay! What just happened?

It took:

loadUserIdByName: Future[Long]
Stream[Pure, String]

And switched the types of the stream to

Stream[Future, Long]

It separated the Future and isolated it! The left side that was the Effect type parameter is now the concrete Future type.

Neat trick, but how does it help me?

You just witnessed true separation of concerns. You can continue to operate on the stream with all the nice List like combinators and you don't have to worry about if the DB is down, slow or all the stuff that is related to the network (effect) concerns.

It all works until I want to use toList to get the values back

scala> userIdsFromDB.toList
:18: error: value toList is not a member of fs2.Stream[scala.concurrent.Future,Long]
       userIdsFromDB.toList
                     ^

What???!!! I could swear that I used toList before and it worked, how can it say that toList is not a member of fs2.Stream[Future,String] any more? It is as if this function was removed the moment I started using an effect-ful stream, that's impressive! But how do I get my values back?

scala> userIdsFromDB.compile
res5: fs2.Stream.ToEffect[scala.concurrent.Future,Long] = fs2.Stream$ToEffect@fc0f18da

First we use compile to tell the Stream to combine all the effects into one, effectively it folds all the calls to loadUserIdByName into one big Future. This is needed by the framework, and it will become apparent why this step is needed soon.

Now toList should work

scala> userIdsFromDB.compile.toList
:18: error: could not find implicit value for parameter F: cats.effect.Sync[scala.concurrent.Future]
       userIdsFromDB.compile.toList
                             ^

What?! the compiler is still complaining. That’s because Future is not a good Effect type — it breaks the philosophy of separation of concerns as explained in the next very important section.

IMPORTANT: The ONE thing to take away from this post

A key point here, is that the DB has not been called at this point. Nothing happened really, the full program does not produce anything.

def loadUserIdByName(userName: String): Future[Long] = ???
Stream("bob", "alice", "joe").evalMap(loadUserIdByName).compile

Separating program description from evaluation

Yes it might be surprising but the major theme in FP is separating the

Description of your program: a good example is the program we just wrote, it’s a pure description of the problem “I give you names and a DB, give me back IDs”

And the

Execution of your program: running the actual code and asking it to go to the DB

One more time our program has literally no effect on the world besides making your computer warm, exactly like our Pure stream.

Code that does not have an effect is called pure and that’s what all Functional Programming is about: writing programs with functions that are pure. Bravo, you now know what FP is all about.

Why would you want write code this way? Simple: to achieve separation of concerns between the IO parts and the rest of our code.

Now let’s fix our program and take care of this Future problem.

As we said Future is a bad Effect type, it goes against the separation of concerns principle. Indeed, Future is eager in Scala: the moment you create one it starts to executes on some thread, you don't have control of the execution and thus it breaks. FS2 is well aware of that and does not let you compile. To fix this we have to use a type called IO that wraps our bad Future.

That brings us to the last part, what is this IO type? and how do I finally get my list of usedIds back?

scala> import cats.effect.IO
import cats.effect.IO
scala> Stream("bob", "alice", "joe").evalMap(name => IO.fromFuture(IO(loadUserIdByName(name)))).compile.toList
res8: cats.effect.IO[List[Long]] = IO$2104439279

It now gives us back a List but still, we didn't get our IDs back, so one last thing must be missing.

What does `IO` really mean?

IO comes from cats-effect library. First let's finish our program and finally get out the ids back from the DB.

scala> userIds.compile.toList.unsafeRunSync
:18: error: not found: value userIds
       userIds.compile.toList.unsafeRunSync
       ^

The proof that it’s doing something is the fact that it’s failing.

loadUserIdByName(userName: String): Future[Long] = ???

When ??? is called you will get this exception, it means the function was executed (as opposed to before when we made the point that nothing was really happening). When we implement this function it will go to the DB and load the ids, and it will have an effect on the world (network/files system).

IO[Long] is a description of how to get a value of type Long and it most certainly involves doing some I/O i.e going to the network, loading a file,...

It’s the How and not the What. It describes how to get the value from the network. If you want to execute this description, you can use unsafeRunSync (or other functions prefixed unsafe). You can guess why they are called this way: indeed a call to a DB is inherently unsafe as it could fail if, for example, your Internet connection is out.

Recap

Let’s take a last look at **Stream[Effect,Output]**.

Output is the type that the stream emits (could be a stream of String, Long or whatever type you defined).

Effect is the way (the recipe) to produce the Output (i.e go to the DB and give me an id of type Long).

It’s important to understand that if these types are separated to make things easier, breaking down a problem in subproblems allows you to reason about the subproblems independently. You can then solve them and combine their solutions.

The link between these 2 types is the following :

In order for the Stream to emit an element of type

Output

It needs to evaluate a type

Effect

A special type that encodes an effective action as a value of type IO, this IO value allows the separation of 2 concerns:

Description:IO is a simple immutable value, it’s a recipe to get a type A by doing some kind of IO(network/filesystem/…)
Execution: in order forIO to do something, you need to execute/run it using io.unsafeRunSync

Putting it all together

Stream[IO,Long] says:

This is a Stream that emits values of type Long and in order to do so, it needs to run an effective function that producesIO[Long] for each value.

That’s a lot of details packed in this very short type. The more details you get about how things happen the fewer errors you make.

Takeaways

Stream is a super charged version of List
Stream(1,2,3) is of type Stream[Pure, Int] , the second type Int is the type of all values that this stream will emit
Pure means no effect on the world. It just makes your CPU work and consumes some power, but besides that it does not affect the world around you.
Use evalMap instead of map when you want to apply a function that has an effect like loadUserIdByName to a Stream.
Stream[IO, Long] separates the concerns of What and How by letting you work only with the values and not worrying about how to get them (loading from the db).
Separating program description from evaluation is a key aspect of FP.
All the programs you write with Stream will do nothing until you use unsafeRunSync. Before that your code is effectively pure.
IO[Long] is an effect type that tells you: you will get Long values from IO (could be a file, the network, the console ...). It's a description and not a wrapper!r
Future does not abide by this philosophy and thus is not compatible with FS2, you have to use IO type instead.

FS2 videos

Hands on screencast by Michael Pilquist: https://www.youtube.com/watch?v=B1wb4fIdtn4
Talk by Fabio Labella https://www.youtube.com/watch?v=x3GLwl1FxcA

How to build a simple actor-based blockchain

freeCodeCamp — Thu, 15 Nov 2018 20:32:40 +0000

By Luca Florio

Scalachain is a blockchain built using the Scala programming language and the actor model (Akka Framework).

In this story I will show the development process to build this simple prototype of a blockchain. This means that the project is not perfect, and there may be better implementations. For all these reasons any contribution — may it be a suggestion, or a PR on the GitHub repository — is very welcome! :-)

Let’s start with a little introduction to the blockchain. After that we can define the simplified model that we will implement.

Quick Introduction to the blockchain

There a lot of good articles that explain how a blockchain works, so I will do a high level introduction just to provide some context to this project.

The blockchain is a distributed ledger: it registers some transaction of values (like coins) between a sender and a receiver. What makes a blockchain different from a traditional database is the decentralized nature of the blockchain: it is distributed among several communicating nodes that guarantee the validity of the transactions registered.

The blockchain stores transactions in blocks, that are created —we say mined — by nodes investing computational power. Every block is created by solving a cryptographic puzzle that is hard to solve, but easy to verify. In this way, every block represents the work needed to solve such puzzle. This is the reason why the cryptographic puzzle is called the Proof of Work: the solution of the puzzle is the proof that a node spent a certain amount of work to solve it and mine the block.

Why do nodes invest computational power to mine a block? Because the creation of a new block is rewarded by a predefined amount of coins. In this way nodes are encouraged to mine new blocks, contributing in the growth and strength of the blockchain.

Simple blockchain

The solution of the Proof Of Work depends on the values stored in the last mined block. In this way every block is chained to the previous one. This means that, to change a mined block, a node should mine again all the blocks above the modified one. Since every block represents an amount of work, this operation would be unfeasible once several blocks are mined upon the modified one. This is the foundation of the distributed consensus, The agreement of all the nodes on the validity of the blocks (that is the transactions) stored in the blockchain.

It may happen that different nodes mine a block at the same time, creating different “branches” from the same blockchain — this is called a fork in the blockchain. This situation is solved when a branch becomes longer than the others: the longest chain always wins, so the winning branch becomes the new blockchain.

The blockchain model

Scalachain is based on a blockchain model that is a simplification of the Bitcoin one.

The main components of our blockchain model are the Transaction, the Chain, the Proof of Work (PoW) algorithm, and the Node. The transactions are stored inside the blocks of the chain, that are mined using the PoW. The node is the server that runs the blockchain.

Scalachain model

Transaction

Transactions register the movement of coins between two entities. Every transaction is composed by a sender, a recipient, and an amount of coin. Transactions will be registered inside the blocks of our blockchain.

Chain

The chain is a linked list of blocks containing a list of transactions. Every block of the chain has an index, the proof that validates it (more on this later), the list of transactions, the hash of the previous block, the list of previous blocks, and a timestamp. Every block is chained to the previous one by its hash, that is computed converting the block to a JSON string and then hashing it through a SHA-256 hashing function.

PoW

The PoW algorithm is required to mine the blocks composing the blockchain. The idea is to solve a cryptographic puzzle that is hard to solve, but easy to verify having the proof. The PoW algorithm that is implemented in Scalachain is similar to the Bitcoin one (based on Hashcash). It consists in finding a hash with N leading zeros, that is computed starting from the hash of the last block and a number, that is the proof of our algorithm.

We can formalize it as:

NzerosHash = SHA-256(previousNodeHash + proof)

The higher is N, the harder is to find the proof. In Scalachain N=4 (It will be configurable eventually).

Node

The Node is the server running our blockchain. It provides some REST API to interact with it and perform basic operations such as send a new transaction, get the list of pending transactions, mine a block, and get the current status of the blockchain.

Blockchain implementation in Scala

We are going to implement the defined model using the Scala Programming Language. From an high level view, the things we need to implement a blockchain are:

transactions
the chain of blocks containing lists of transactions
the PoW algorithm to mine new blocks

These components are the essential parts of a blockchain.

Transaction

The transaction is a very simple object: it has a sender, a recipient and a value. We can implement it as a simple case class.

case class Transaction(sender: String, recipient: String, value: Long)

Chain

The chain is the core of our blockchain: it is a linked list of blocks containing transactions.


sealed trait Chain {

  val index: Int
  val hash: String
  val values: List[Transaction]
  val proof: Long
  val timestamp: Long
}

We start by creating a sealed trait that represents the block of our chain. The Chain can have two types: it can be an EmptyChain or a ChainLink. The former is our block zero (the genesis block), and it is implemented as a singleton (it is a case object), while the latter is a regular mined block.

case class ChainLink(index: Int, proof: Long, values: List[Transaction], previousHash: String = "", tail: Chain = EmptyChain, timestamp: Long = System.currentTimeMillis()) extends Chain {
  val hash = Crypto.sha256Hash(this.toJson.toString)
}

case object EmptyChain extends Chain {
  val index = 0
  val hash = "1"
  val values = Nil
  val proof = 100L
  val timestamp = System.currentTimeMillis()
}

Let’s look more in detail at our chain. It provides an index, that is the current height of the blockchain. There is the list of Transaction, the proof that validated the block, and the timestamp of the block creation. The hash value is set to a default one in the EmptyChain, while in the ChainLink it is computed converting the object to its JSON representation and hashing it with an utility function (see the crypto package in the repository). The ChainLink provides also the hash of the previous block in the chain (our link between blocks). The tail field is a reference to the previously mined blocks. This may not be the most efficient solution, but it is useful to see how the blockchain grows in our simplified implementation.

We can improve our Chain with some utilities. We can add it a companion object that defines an apply method to create a new chain passing it a list of blocks. A companion object is like a “set of static methods” — doing an analogy with Java — that has complete access rights on the fields and methods of the class/trait.

object Chain {
  def apply[T](b: Chain*): Chain = {
    if (b.isEmpty) EmptyChain
    else {
      val link = b.head.asInstanceOf[ChainLink]
      ChainLink(link.index, link.proof, link.values, link.previousHash, apply(b.tail: _*))
    }
  }
}

If the list of blocks is empty, we simply initialize our blockchain with an EmptyChain. Otherwise we create a new ChainLink adding as a tail the result of the apply method on the remaining blocks of the list. In this way the list of blocks is added following the order of the list.

It would be nice to have the possibility to add a new block to our chain using a simple addition operator, like the one we have on List. We can define our own addition operator :: inside the Chain trait.

sealed trait Chain {

  val index: Int
  val hash: String
  val values: List[Transaction]
  val proof: Long
  val timestamp: Long

  def ::(link: Chain): Chain = link match {
    case l:ChainLink => ChainLink(l.index, l.proof, l.values, this.hash, this)
    case _ => throw new InvalidParameterException("Cannot add invalid link to chain")
  }
}

We pattern match on the block that is passed as an argument: if it is a valid ChainLink object we add it as the head of our chain, putting the chain as the tail of the new block, otherwise we throw an exception.

PoW

The PoW algorithm is fundamental for the mining of new blocks. We implement it as a simple algorithm:

Take the hash of the last block and a number representing the proof.
Concatenate the hash and the proof in a string.
hash the resulting string using the SHA-256 algorithm.
check the 4 leading characters of the hash: if they are four zeros return the proof.
otherwise repeat the algorithm increasing the proof by one.

This a simplification of the HashCash algorithm used in the Bitcoin blockchain.

Since it is a recursive function, we can implement it as a tail recursive one to improve the usage of resources.

object ProofOfWork {

  def proofOfWork(lastHash: String): Long = {
    @tailrec
    def powHelper(lastHash: String, proof: Long): Long = {
      if (validProof(lastHash, proof))
        proof
      else
        powHelper(lastHash, proof + 1)
    }

    val proof = 0
    powHelper(lastHash, proof)
  }

  def validProof(lastHash: String, proof: Long): Boolean = {
    val guess = (lastHash ++ proof.toString).toJson.toString
    val guessHash = Crypto.sha256Hash(guess)
    (guessHash take 4) == "0000"
  }
}

The validProof function is used to check if the proof we are testing is the correct one. The powHelper function is a helper function that executes our loop using tail recursion, increasing the proof at each step. The proofOfWork function wrap all the things up, and is exposed by the ProofOfWork object.

The actor model

The actor model is a programming model designed for concurrent processing, scalability, and fault tolerance. The model defines the atomic elements that compose the software systems — the actors — and the way this elements interact between them. In this project we will use the actor model implemented in Scala by the Akka Framework.

Actor Model

Actor

The actor is the atomic unit of the actor model. it is a computational unit that can send and receive messages. Every actor has an internal private state and a mailbox. When an actor receives and compute a message, it can react in 3 ways:

Send a message to another actor.
Change its internal state.
Create another actor.

Communication is asynchronous, and messages are popped out from the mailbox and processed in series. To enable the parallel computation of messages you need to create several actors. Many actors together crate an actor system. The behavior of the application arises from the interaction between actors providing different functionalities.

Actors are independent

Actors are independent one to another, and they do not share their internal state. This fact has a couple of important consequences:

Actors can process messages without side-effects one to another.
It’s not important where an actor is — be it your laptop, a sever, or in the cloud — once we know its address we can request its services sending it a message.

The first point makes concurrent computation very easy. We can be sure that the processing of a message will not interfere with the processing of another one. To achieve concurrent processing we can deploy several actors able to process the same kind of message.

The second point is all about scalability: we need more computational power? No problem: we can start a new machine and deploy new actors that will join the existing actor system. Their mailbox addresses will be discoverable by existing actors, that will start communicate with them.

Actors are supervised

As we said in the description of the actor, one of the possible reaction to a message is the creation of other actors. When this happens, the father becomes the supervisor of its children. If a children fails, the supervisor can decide the action to take, may it be create a new actor, ignore the failure, or throw it up to its own supervisor. In this way the Actor System becomes a hierarchy tree, each node supervising its children. This is the way the actor model provides fault tolerance.

Actor hierarchy

Broker, a simple actor

The first actor we are going to implement is the Broker Actor: it is the manager of the transactions of our blockchain. Its responsibilities are the addition of new transactions, and the retrieval of pending ones.

The Broker Actor reacts to three kind of messages, defined in the companion object of the Broker class:


object Broker {
  sealed trait BrokerMessage
  case class AddTransaction(transaction: Transaction) extends BrokerMessage
  case object GetTransactions extends BrokerMessage
  case object Clear extends BrokerMessage

  val props: Props = Props(new Broker)
}

We create a trait BrokerMessage to identify the messages of the Broker Actor. Every other message will extend this trait. AddTransaction adds a new transaction to the list of pending ones. GetTransaction retrieve the pending transactions, and Clear empties the list. The props value is used to initialize the actor when it will be created.

class Broker extends Actor with ActorLogging {
  import Broker._

  var pending: List[Transaction] = List()

  override def receive: Receive = {
    case AddTransaction(transaction) => {
      pending = transaction :: pending
      log.info(s"Added $transaction to pending Transaction")
    }
    case GetTransactions => {
      log.info(s"Getting pending transactions")
      sender() ! pending
    }
    case Clear => {
      pending = List()
      log.info("Clear pending transaction List")
    }
  }
}

The Broker class contains the business logic to react to the different messages. I won’t go into the details because it is trivial. The most interesting thing is how we respond to a request of the pending transactions. We send them to the sender() of the GetTransaction message using the tell (!) operator. This operator means “send the message and don’t wait for a response” — aka fire-and-forget.

Miner, an actor with different states

The Miner Actor is the one mining new blocks for our blockchain. Since we don’t want mine a new block while we are mining another one, the Miner Actor will have two states: ready, when it is ready to mine a new block, and busy, when it is mining a block.

Let’s start by defining the companion object with the messages of the Miner Actor. The pattern is the same, with a sealed trait — MinerMessage — used to define the kind of messages this actor reacts to.

object Miner {
  sealed trait MinerMessage
  case class Validate(hash: String, proof: Long) extends MinerMessage
  case class Mine(hash: String) extends MinerMessage
  case object Ready extends MinerMessage

  val props: Props = Props(new Miner)
}

The Validate message asks for a validation of a proof, and pass to the Miner the hash and the proof to check. Since this component is the one interacting with the PoW algorithm, it is its duty to execute this check. The Mine message asks for the mining starting from a specified hash. The last message, Ready, triggers a state transition.

Same actor, different states

The peculiarity of this actor is that it reacts to the messages according to its state: busy or ready. Let’s analyze the difference in the behavior:

busy: the Miner is busy mining a block. If a new mining request comes, it should deny it. If it is requested to be ready, the Miner should change its state to the ready one.
ready: the Miner is idle. If a mining request come, it should start mining a new block. If it is requested to be ready, it should say: “OK, I’m ready!”
both: the Miner should be always available to verify the correctness of a proof, both in a ready or busy state.

Time so see how we can implement this logic in our code. We start by defining the common behavior, the validation of a proof.

We define a function validate that reacts to the Validate message: if the proof is valid we respond to the sender with a success, otherwise with a failure. The ready and the busy states are defined as functions that “extends” the validate one, since that is a behavior we want in both states.

def validate: Receive = {
    case Validate(hash, proof) => {
      log.info(s"Validating proof $proof")
      if (ProofOfWork.validProof(hash, proof)){
        log.info("proof is valid!")
        sender() ! Success
      }
      else{
        log.info("proof is not valid")
        sender() ! Failure(new InvalidProofException(hash, proof))
      }
    }
  }

A couple of things to highlight here.

The state transition is triggered using the become function, provided by the Akka Framework. This takes as an argument a function that returns a Receive object, like the ones we defined for the validation, busy, and ready state.
When a mining request is received by the Miner, it responds with a Future containing the execution of the PoW algorithm. In this way we can work asynchronously, making the Miner free to do other tasks, such as the validation one.
The supervisor of this Actor controls the state transition. The reason of this choice is that the Miner is agnostic about the state of the system. It doesn’t know when the mining computation in the Future will be completed, and it can’t know if the block that it is mining has been already mined from another node. This would require to stop mining the current hash, and start mining the hash of the new block.

The last thing is to provide an initial state overriding the receive function.

override def receive: Receive = {
    case Ready => become(ready)
  }

We start waiting for a Ready message. When it comes, we start our Miner.

Blockchain, a persistent actor

The Blockchain Actor interacts with the business logic of the blockchain. It can add a new block to the blockchain, and it can retrieve information about the state of the blockchain. This actor has another superpower: it can persist and recover the state of the blockchain. This is possible implementing the PersistentActor trait provided by the Akka Framework.

object Blockchain {
  sealed trait BlockchainEvent
  case class AddBlockEvent(transactions: List[Transaction], proof: Long) extends BlockchainEvent

  sealed trait BlockchainCommand
  case class AddBlockCommand(transactions: List[Transaction], proof: Long) extends BlockchainCommand
  case object GetChain extends BlockchainCommand
  case object GetLastHash extends BlockchainCommand
  case object GetLastIndex extends BlockchainCommand

  case class State(chain: Chain)

  def props(chain: Chain, nodeId: String): Props = Props(new Blockchain(chain, nodeId))
}
view raw

We can see that the companion object of this actor has more elements than the other ones. The State class is where we store the state of our blockchain, that is its Chain. The idea is to update the state every time a new block is created.

For this purpose, there are two different traits: BlockchainEvent and BlockchainCommand. The former is to handle the events that will trigger the persistence logic, the latter is used to send direct commands to the actor. The AddBlockEvent message is the event that will update our state. The AddBlockCommand, GetChain, GetLastHash, and LastIndex commands are the one used to interact with the underlying blockchain.

The usual props function initializes the Blockchain Actor with the initial Chain and the nodeId of the Scalachain node.

class Blockchain(chain: Chain, nodeId: String) extends PersistentActor with ActorLogging{
  import Blockchain._

  var state = State(chain)

  override def persistenceId: String = s"chainer-$nodeId"

  //Code...
}

The Blockchain Actor extends the trait PersistentActor provided by the Akka framework. In this way we have out-of-the-box all the logic required to persist and recover our state.

We initialize the state using the Chain provided as an argument upon creation. The nodeId is part of the persistenceId that we override. The persistence logic will use it to identify the persisted state. Since we can have multiple Scalachain nodes running in the same machine, we need this value to correctly persist and recover the state of each node.

def updateState(event: BlockchainEvent) = event match {
    case AddBlockEvent(transactions, proof) =>
      {
        state = State(ChainLink(state.chain.index + 1, proof, transactions) :: state.chain)
        log.info(s"Added block ${state.chain.index} containing ${transactions.size} transactions")
      }
  }

The updateState function executes the update of the Actor state when the AddBlockEvent is received.

override def receiveRecover: Receive = {
    case SnapshotOffer(metadata, snapshot: State) => {
      log.info(s"Recovering from snapshot ${metadata.sequenceNr} at block ${snapshot.chain.index}")
      state = snapshot
    }
    case RecoveryCompleted => log.info("Recovery completed")
    case evt: AddBlockEvent => updateState(evt)
  }

The receiveRecover function reacts to the recovery messages sent by the persistence logic. During the creation of an actor a persisted state (snapshot) may be offered to it using the SnapshotOffer message. In this case the current state becomes the one provided by the snapshot.

RecoveryCompleted message informs us that the recovery process completed successfully. The AddBlockEvent triggers the updateState function passing the event itself.

override def receiveCommand: Receive = {
    case SaveSnapshotSuccess(metadata) => log.info(s"Snapshot ${metadata.sequenceNr} saved successfully")
    case SaveSnapshotFailure(metadata, reason) => log.error(s"Error saving snapshot ${metadata.sequenceNr}: ${reason.getMessage}")
    case AddBlockCommand(transactions : List[Transaction], proof: Long) => {
      persist(AddBlockEvent(transactions, proof)) {event =>
        updateState(event)
      }

      // This is a workaround to wait until the state is persisted
      deferAsync(Nil) { _ =>
        saveSnapshot(state)
        sender() ! state.chain.index
      }
    }
    case AddBlockCommand(_, _) => log.error("invalid add block command")
    case GetChain => sender() ! state.chain
    case GetLastHash => sender() ! state.chain.hash
    case GetLastIndex => sender() ! state.chain.index
  }

The receiveCommand function is used to react to the direct commands sent to the actor. Let’s skip the GetChain, GetLastHash, and GetLastIndex commands, since they are trivial. The AddBlockCommand is the interesting part: it creates and fires an AddBlock event, that is persisted in the event journal of the Actor. In this way events can be replayed in case of recovery.

The deferAsync function waits until the state is updated after the processing of the event. Once the event has been executed the actor can save the snapshot of the state, and inform the sender of the message with the updated last index of the Chain. The SaveSnapshotSucces and SaveSnapshotFailure messages helps us to keep track of possible failures.

Node, an actor to rule them all

The Node Actor is the backbone of our Scalachain node. It is the supervisor of all the other actors (Broker, Miner, and Blockchain), and the one communicating with the outside world through the REST API.

object Node {

  sealed trait NodeMessage

  case class AddTransaction(transaction: Transaction) extends NodeMessage

  case class CheckPowSolution(solution: Long) extends NodeMessage

  case class AddBlock(proof: Long) extends NodeMessage

  case object GetTransactions extends NodeMessage

  case object Mine extends NodeMessage

  case object StopMining extends NodeMessage

  case object GetStatus extends NodeMessage

  case object GetLastBlockIndex extends NodeMessage

  case object GetLastBlockHash extends NodeMessage

  def props(nodeId: String): Props = Props(new Node(nodeId))

  def createCoinbaseTransaction(nodeId: String) = Transaction("coinbase", nodeId, 100)
}

The Node Actor has to handle all the high level messages that coming from the REST API. This is the reason why we find in the companion object more or less the same messages we implemented in the children actors. The props function takes a nodeId as an argument to create our Node Actor. This will be the one used for the initialization of Blockchain Actor. The createCoinbaseTransaction simply creates a transaction assigning a predefined coin amount to the node itself. This will be the reward for the successful mining of a new block of the blockchain.

class Node(nodeId: String) extends Actor with ActorLogging {

  import Node._

  implicit lazy val timeout = Timeout(5.seconds)

  val broker = context.actorOf(Broker.props)
  val miner = context.actorOf(Miner.props)
  val blockchain = context.actorOf(Blockchain.props(EmptyChain, nodeId))

  miner ! Ready

  //Code...
}

Let’s look at the initialization of the Node Actor. The timeout value is used by the ask (?) operator (this will be explained shortly). All our actors are created in the actor context, using the props function we defined in each actor.

The Blockchain Actor is initialized with the EmptyChain and the nodeId of the Node. Once everything is created, we inform the Miner Actor to be ready to mine sending it a Ready message. Ok, we are now ready to receive some message and react to it.

override def receive: Receive = {
    case AddTransaction(transaction) => {
      //Code...
    }
    case CheckPowSolution(solution) => {
      //Code...
    }
    case AddBlock(proof) => {
      //Code...
    }
    case Mine => {
      //Code...
    }
    case GetTransactions => broker forward Broker.GetTransactions
    case GetStatus => blockchain forward GetChain
    case GetLastBlockIndex => blockchain forward GetLastIndex
    case GetLastBlockHash => blockchain forward GetLastHash
  }

This is an overview of the usual receive function that we should override. I will analyze the logic of the most complex cases later, now let’s look at the last four. Here we forward the messages to the Blockchain Actor, since it isn’t required any processing. Using the forward operator the sender() of the message will be the one that originated the message, not the Node Actor. In this way the Blockchain Actor will respond to the original sender of the message (the REST API layer).

override def receive: Receive = {
    case AddTransaction(transaction) => {
      val node = sender()
      broker ! Broker.AddTransaction(transaction)
      (blockchain ? GetLastIndex).mapTo[Int] onComplete {
        case Success(index) => node ! (index + 1)
        case Failure(e) => node ! akka.actor.Status.Failure(e)
      }
    }

  //Code...
}

The AddTransaction message triggers the logic to store a new transaction in the list of pending ones of our blockchain. The Node Actor responds with the index of the block that will contain the transaction.

First of all we store the “address” of the sender() of the message in a node value to use it later. We send to the Broker Actor a message to add a new transaction, then we ask to the Blockchain Actor the last index of the chain. The ask operator — the one expressed with ? — is used to send a message to an actor and wait for a response. The response (mapped to an Int value) can be a Success or a Failure.

In the first case we send back to the sender (node) the index+1, since it will be the index of the next mined block. In case of failure, we respond to the sender with a Failure containing the reason of the failure. Remember this pattern:

ask → wait for a response → handle success/failure

because we will see it again.

override def receive: Receive = {
    //Code...

    case CheckPowSolution(solution) => {
      val node = sender()
      (blockchain ? GetLastHash).mapTo[String] onComplete {
        case Success(hash: String) => miner.tell(Validate(hash, solution), node)
        case Failure(e) => node ! akka.actor.Status.Failure(e)
      }
    }

  //Code...
}
view raw

This time we have to check if a solution to the PoW algorithm is correct. We ask to the Blockchain Actor the hash of the last block, and we tell the Miner Actor to validate the solution against the hash. In the tell function we pass to the Miner the Validate message along with the address of the sender, so that the miner can respond directly to it. This is another approach, like the forward one we saw before.

override def receive: Receive = {
    //Code...

    case AddBlock(proof) => {
      val node = sender()
      (self ? CheckPowSolution(proof)) onComplete {
        case Success(_) => {
          (broker ? Broker.GetTransactions).mapTo[List[Transaction]] onComplete {
            case Success(transactions) => blockchain.tell(AddBlockCommand(transactions, proof), node)
            case Failure(e) => node ! akka.actor.Status.Failure(e)
          }
          broker ! Clear
        }
        case Failure(e) => node ! akka.actor.Status.Failure(e)
      }
    }

    //Code...
}

Other nodes can mine blocks, so we may receive a request to add a block that we didn’t mine. The proof is enough to add the new block, since we assume that all the nodes share the same list of pending transactions.

override def receive: Receive = {
    //Code...

    case Mine => {
      val node = sender()
      (blockchain ? GetLastHash).mapTo[String] onComplete {
        case Success(hash) => (miner ? Miner.Mine(hash)).mapTo[Future[Long]] onComplete {
          case Success(solution) => waitForSolution(solution)
          case Failure(e) => log.error(s"Error finding PoW solution: ${e.getMessage}")
        }
        case Failure(e) => node ! akka.actor.Status.Failure(e)
      }
    }

    //Code...
  }

  def waitForSolution(solution: Future[Long]) = Future {
    solution onComplete {
      case Success(proof) => {
        broker ! Broker.AddTransaction(createCoinbaseTransaction(nodeId))
        self ! AddBlock(proof)
        miner ! Ready
      }
      case Failure(e) => log.error(s"Error finding PoW solution: ${e.getMessage}")
    }
  }

This is a simplification, in the Bitcoin network there cannot be such assumption. First of all we should check if the solution is valid. We do this sending a message to the node itself: self ? CheckPowSolution(proof). If the proof is valid, we get the list of pending transaction from the Broker Actor, then we tell to the Blockchain Actor to add to the chain a new block containing the transactions and the validated proof. The last thing to do is to command the Broker Actor to clear the list of pending transactions.

The last message is the request to start mining a new block. We need the hash of the last block in the chain, so we request it to the Blockchain Actor. Once we have the hash, we can start mining a new block.

The PoW algorithm is a long-running operation, so the Miner Actor responds immediately with a Future containing the computation. The waitForSolution function waits for the computation to complete, while the Node Actor keeps doing its business.

When we have a solution, we reward ourselves adding the coinbase transaction to the list of pending transactions. Then we add the new block to the chain and tell the Miner Actor to be ready to mine another block.

REST API with Akka HTTP

This last section describes the server and REST API. This is the most “external” part of our application, the one connecting the outside world to the Scalachain node. We will make use of Akka HTTP library, which is part of the Akka Framework. Let’s start looking at the server, the entry point of our application.

object Server extends App with NodeRoutes {

  val address = if (args.length > 0) args(0) else "localhost"
  val port = if (args.length > 1) args(1).toInt else 8080

  implicit val system: ActorSystem = ActorSystem("scalachain")

  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val node: ActorRef = system.actorOf(Node.props("scalaChainNode0"))

  lazy val routes: Route = statusRoutes ~ transactionRoutes ~ mineRoutes

  Http().bindAndHandle(routes, address, port)

  println(s"Server online at http://$address:$port/")

  Await.result(system.whenTerminated, Duration.Inf)

}

Since the Server is our entry point, it needs to extend the App trait. It extends also NodeRoutes, a trait that contains all the http routes to the various endpoint of the node.

The system value is where we store our ActorSystem. Every actor created in this system will be able to talk to the others inside it. Akka HTTP requires also the definition of another value, the ActorMaterializer. This relates to the Akka Streams module, but since Akka HTTP is built on top of it, we still need this object to be initialized in our server (if you want to go deep on the relation with streams, look here).

The Node Actor is created along with the HTTP routes of the node, that are chained using the ~ operator. Don’t worry about the routes now, we will be back to them in a moment.

The last thing to do is to start our server using the function Http().bindHandle, that will also bind the routes we pass to it as an argument. The Await.result function will wait the termination signal to stop the server.

The server will be useless without the routes to trigger the business logic of the application. We define the routes in the trait NodeRoutes, differentiating them according to the different logic they trigger:

statusRoutes contains the endpoints to ask the Scalachain node for its status.
transactionRoutes handles everything related to transactions.
mineRoutes has the endpoint to start the mining process

Notice that this differentiation is a logic one, just to keep things ordered and readable. The three routes will be chained in a single one after their initialization in the server.

//Imports...
import com.elleflorio.scalachain.utils.JsonSupport._
// Imports...

trait NodeRoutes extends SprayJsonSupport {

  implicit def system: ActorSystem

  def node: ActorRef

  implicit lazy val timeout = Timeout(5.seconds)

  //Code...
}

The NodeRoutes trait extends SprayJsonSupport to add JSON serialization/deserialization. SprayJson is a Scala library analogous to Jackson in Java, and it comes for free with Akka HTTP.

To convert our objects to a JSON string we import the class JsonSupport defined in the utils package, which contains custom reader/writer for every object. I won’t go into the details, you can find the class in the repository if you want to look at the implementation.

We have a couple of implicit values. The ActorSystem is used to define the system of actors, while the Timeout is used by the OnSuccess function that waits for a response from the actors. The ActorRef is defined by overriding in the server implementation.

//Code...

lazy val statusRoutes: Route = pathPrefix("status") {
    concat(
      pathEnd {
        concat(
          get {
            val statusFuture: Future[Chain] = (node ? GetStatus).mapTo[Chain]
            onSuccess(statusFuture) { status =>
              complete(StatusCodes.OK, status)
            }
          }
        )
      }
    )
  }

//Code...

The endpoint to get the status of the blockchain is defined in the statusRoutes. We define the pathPrefix as "status" so the node will respond to the path ` http://

//Code...

lazy val transactionRoutes: Route = pathPrefix("transactions") {
    concat(
      pathEnd {
        concat(
          get {
            val transactionsRetrieved: Future[List[Transaction]] =
              (node ? GetTransactions).mapTo[List[Transaction]]
            onSuccess(transactionsRetrieved) { transactions =>
              complete(transactions.toList)
            }
          },
          post {
            entity(as[Transaction]) { transaction =>
              val transactionCreated: Future[Int] =
                (node ? AddTransaction(transaction)).mapTo[Int]
              onSuccess(transactionCreated) { done =>
                complete((StatusCodes.Created, done.toString))
              }
            }
          }
        )
      }
    )
  }

//Code...

The transactionRoutes allows the interaction with the pending transactions of the node. We define the HTTP action get to retrieve the list of pending transactions. This time we also define the HTTP action post to add a new transaction to the list of pending ones. The entity(as[Transaction]) function is used to deserialize the JSON body into a Transaction object.

//Code...
lazy val mineRoutes: Route = pathPrefix("mine") {
    concat(
      pathEnd {
        concat(
          get {
            node ! Mine
            complete(StatusCodes.OK)
          }
        )
      }
    )
  }

//Code...

The last route is the MineRoutes. This is a very simple one, used only to ask the Scalachain node to start mine a new block. We define a get action since we do not need to send anything to start the mining process. It is not required to wait for a response, since it may take some time, so we immediately respond with an Ok status.

The API to interact with the Scalachain node are documented here.

Conclusion

With the last section, we concluded our tour inside Scalachain. This prototype of a blockchain is far from a real implementation, but we learned a lot of interesting things:

How a blockchain works, at least from an high level perspective.
How to use functional programming (Scala) to build a blockchain.
How the Actor Model works, and its application to our use case using the Akka Framework.
How to use the Akka HTTP library to create a sever to run our blockchain, along with the APIs to interact with it.

The code is not perfect, and some things can be implemented in a better way. For this reason, feel free to contribute to the project! ;-)

How to (Un)marshal JSON in Akka HTTP with Circe

freeCodeCamp — Fri, 26 Oct 2018 08:38:53 +0000

By Miguel Lopez

Even though the usual library to (un)marshal JSON in Akka HTTP applications is spray-json, I decided to give circe a try. I use it in the Akka HTTP beginners course I’m working on. cough it’s free ? c**_ough_**

In this post I’d like to show you why I tried it out.

To use circe with Akka HTTP — and other JSON libraries for that matter — we have to create the marshalers and unmarshalers manually. Thankfully, there is an additional library that already does that for us.

Project setup and overview

Clone the project’s repository, and checkout the branch 3.3-repository-implementation.

Under src/main/scala you'll find the following files:

$ tree srcsrc└── main    └── scala        ├── Main.scala        ├── Todo.scala        └── TodoRepository.scala2 directories, 3 files

The Main object is the application's entry point. So far it has a hello world route and it binds it to a given host and route.

Todo is our application’s model. And the TodoRepository is in charge of persisting and accessing todos. So far it only has an in-memory implementation to keep things simple and focused.

Listing all the todos

We’ll change the Main object’s route to list all the todos in a repository. We will also add some initial todos for testing:

import akka.actor.ActorSystemimport akka.http.scaladsl.Httpimport akka.stream.ActorMaterializerimport scala.concurrent.Awaitimport scala.util.{Failure, Success}object Main extends App {  val host = "0.0.0.0"  val port = 9000  implicit val system: ActorSystem = ActorSystem(name = "todoapi")  implicit val materializer: ActorMaterializer = ActorMaterializer()  import system.dispatcher  val todos = Seq(    Todo("1", "Clean the house", "", done = false),    Todo("2", "Learn Scala", "", done = true),  )  val todoRepository = new InMemoryTodoRepository(todos)  import akka.http.scaladsl.server.Directives._  def route = path("todos") {    get {      complete(todoRepository.all())    }  }  val binding = Http().bindAndHandle(route, host, port)  binding.onComplete {    case Success(_) => println("Success!")    case Failure(error) => println(s"Failed: ${error.getMessage}")  }  import scala.concurrent.duration._  Await.result(binding, 3.seconds)}

Now we’re listening to requests under /todos and we respond with all the todos we have in our todoRepository.

However, if we try to run this it won’t compile:

Error:(26, 34) type mismatch; found   : scala.concurrent.Future[Seq[Todo]] required: akka.http.scaladsl.marshalling.ToResponseMarshallable      complete(todoRepository.all())

The compilation error is telling us it doesn’t know how to marshal our todos into JSON.

We need to import circe and the support library:

import akka.http.scaladsl.server.Directives._import de.heikoseeberger.akkahttpcirce.FailFastCirceSupport._import io.circe.generic.auto._def route = path("todos") {  get {    complete(todoRepository.all())  }}

With those two extra lines we can now run our Main object and test our new route.

Make a GET request to http://localhost:9000/todos :

And we get our todos back! ?

Creating todos

Turns out that unmarshaling JSON into our models doesn’t take much effort either. But our TodoRepository doesn’t support saving todos at the moment. Let’s add that functionality first:

import scala.concurrent.{ExecutionContext, Future}trait TodoRepository {  def all(): Future[Seq[Todo]]  def done(): Future[Seq[Todo]]  def pending(): Future[Seq[Todo]]  def save(todo: Todo): Future[Todo]}class InMemoryTodoRepository(initialTodos: Seq[Todo] = Seq.empty)(implicit ec: ExecutionContext) extends TodoRepository {  private var todos: Vector[Todo] = initialTodos.toVector  override def all(): Future[Seq[Todo]] = Future.successful(todos)  override def done(): Future[Seq[Todo]] = Future.successful(todos.filter(_.done))  override def pending(): Future[Seq[Todo]] = Future.successful(todos.filterNot(_.done))  override def save(todo: Todo): Future[Todo] = Future.successful {    todos = todos :+ todo    todo  }}

We added a method save to the trait and the implementation. Because we are using a Vector our implementation of save will store duplicated todos. That’s fine for the purposes of this tutorial.

Let’s add a new route that will listen to POST requests. This route receive a Todo as the request’s body and saves it into our repository:

def route = path("todos") {  get {    complete(todoRepository.all())  } ~ post {    entity(as[Todo]) { todo =>      complete(todoRepository.save(todo))    }  }}

Using the entity directive we can build a route that automatically parses incoming JSON to our model. It also rejects requests with invalid JSON:

We sent the done field as a string, and it should have been a boolean, which our API responded to with bad request.

Let’s make a valid request to create a new todo:

This time we sent the property as done := false, which tells HTTPie to send the value as Boolean instead of String.

We get our todo back and a 200 status code, which means it went well. We can confirm it worked by querying the todos again:

We get three todos, the hardcoded ones and the new one we created.

Wrapping up

We added JSON marshaling and unmarshaling to our application by adding the dependencies (it was already done in the project) and by importing both libraries.

Circe figures out how to handle our models without much intervention from us.

In a future post, we will explore how to accomplish the same functionality with spray-json instead.

Stay tuned!

If you liked this tutorial and want to learn how to build an API for a todo application, check out our new free course! ???

Akka HTTP Quickstart
_Learn how to create web applications and APIs with Akka HTTP in this free course!_link.codemunity.io

Originally published at www.codemunity.io.

Scala - freeCodeCamp.org

How to Read and Write Deeply Partitioned Files Using Apache Spark

Here’s what we’ll cover:

Prerequisite

Setup

False Starts

My Solution

Conclusion

How to Use Thread.sleep Without Blocking on the JVM

What We'll Cover in This Article

How Does Sleeping Work at the OS Level?

Limitations of Threads

The Problem with Sleeping

Your mission if you accept it is to run 2 tasks concurrently with 1 thread

1 task -> 1 thread

2 tasks -> 1 thread

2 tasks -> 2 threads

The problem : Thread.sleep is blocking

First Solution: Upgrade your JVM with Project Loom

Key insight: You can express sleeping in terms of scheduling a task in the future

The scheduling problem

How to Use Functional Programming to Write a DSL with a Non-Blocking Sleep

The model

The interpreter

Non-Blocking Sleep in ZIO

ZIO.sleep – a better version of Thread.sleep

The ZIO interpreter

Takeaways

How to implement Change Data Capture using Kafka Streams

TL;DR

Use case & infrastructure

Server implementation

Models and DAO

Kafka Producer

Event Listener

Processing Topology

REST API

Server main class

Server configuration

Connectors configuration

How to run the project

Conclusion

How to understand Scala variances by building restaurants

What is type variance?

Building a restaurant empire

Different types of food

Recipe, a covariant type

Chef, a contravariant type

Restaurant, where things come together

Conclusion

Futures Made Easy with Scala

Sequential run

Simultaneous or Parallel run

Recovering from errors

The world outside Scala

Wrapping up

An introduction to Law Testing in Scala

Introduction

A concrete example

Property-based testing

The Codec Law

The Codec Test

A (type) detour

The power of abstraction

Another example

A (First-order) logic perspective on our code

Summary

Beyond unit tests: an intro to property and law testing in Scala

Taking your unit tests to the next level.

Property-based testing

Law testing

Summary

An introduction to Vert.x, the fastest Java framework today

The golden rule

Verticles, Event Bus, and other gotchas

How to make a simple application with Akka Cluster

Quick introduction to Akka Cluster

A Cluster for mathematical computations

Actor hierarchy

Actor Implementation

`1 task -> 1 thread`

`2 tasks -> 1 thread`

`2 tasks -> 2 threads`

The problem : `Thread.sleep is blocking`

`ZIO.sleep` – a better version of `Thread.sleep`

`List` vs `Stream`

Loading a big file using `List`

How can I feed data from an API/File/DB to `Stream`?

How do I turn a `Pure` stream to something useful?

What does `IO` really mean?