<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Neo4j - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Neo4j - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 08 May 2026 22:32:38 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/neo4j/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Learn to Build Graph Databases with Neo4j (Full Course) ]]>
                </title>
                <description>
                    <![CDATA[ Neo4j is revolutionizing the way we handle complex relationships between data points. Its intuitive graph-based structure provides a flexible and efficient solution for various applications. We just published a Neo4j course on the freeCodeCamp.org Yo... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-neo4j-database-course/</link>
                <guid isPermaLink="false">66b204c9a8b92c9329236499</guid>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 01 Jun 2023 16:49:15 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/06/neo4j.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Neo4j is revolutionizing the way we handle complex relationships between data points. Its intuitive graph-based structure provides a flexible and efficient solution for various applications.</p>
<p>We just published a Neo4j course on the freeCodeCamp.org YouTube channel. Whether you're a developer, a data scientist, or an aspiring technology enthusiast, this course is designed to equip you with the knowledge and skills needed to harness the full potential of Neo4j.</p>
<p>The course is taught by freeCodeCamp team members Farhan Chowdhury and Gavin Lon. They will teach you the basics of Neo4j and how to integrate it into real-world applications.</p>
<p>The course begins with a comprehensive introduction to Neo4j and graph database management systems. You'll learn how incorporating Neo4j into your applications can bring numerous benefits, such as improved performance, simplified querying, and enhanced data modeling capabilities. By understanding the fundamentals, you'll be well-prepared to dive deeper into the practical aspects of using Neo4j.</p>
<p>One of the highlights of this course is the hands-on project that guides you through building a real-world application using Java and Spring Boot. You'll discover how to leverage Neo4j as the backend storage for your application, enabling you to effectively model and manage relationships between data entities. From creating the initial database and connecting to it, to implementing courses, lessons, users, and authentication, you'll gain invaluable experience in building a robust application powered by Neo4j.</p>
<p>But that's not all! The course takes a holistic approach to application development by also covering the frontend implementation. You'll learn how to create a dynamic user interface using React to interact with the data stored in Neo4j. By combining the power of Neo4j's graph database with a modern frontend framework like React, you'll have the tools to create cutting-edge applications that excel in performance and usability.</p>
<p>Neo4j provided a grant to make this course possible. Their support has enabled us to bring you this comprehensive and immersive learning experience, empowering you to leverage the full potential of graph databases.</p>
<p>To fully benefit from this course, it is recommended that you have some basic knowledge of databases and programming. Familiarity with Java, Spring Boot, React, and JavaScript will also be advantageous.</p>
<p>So if you are ready to start learning about this powerful graph database system, watch the full course on the <a target="_blank" href="https://www.youtube.com/watch?v=_IgbB24scLI">freeCodeCamp.org YouTube channel</a> (5-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/_IgbB24scLI" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to produce and consume data streams directly via Cypher with Streams Procedures ]]>
                </title>
                <description>
                    <![CDATA[ By Andrea Santurbano Leveraging Neo4j Streams — Part 3 This article is the third part of the Leveraging Neo4j Streams series (Part 1 is here, Part 2 is here). In it, I’ll show you how to bring Neo4j into your Apache Kafka flow by using the streams pr... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-produce-and-consume-data-streams-directly-via-cypher-with-streams-procedures-52cbc5f543f1/</link>
                <guid isPermaLink="false">66c353fd5f85c1948b3fabaf</guid>
                
                    <category>
                        <![CDATA[ Apache Kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ streaming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 09 May 2019 17:13:07 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*47Ktwi-Gdj5S7keZpiteZA.gif" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Andrea Santurbano</p>
<h4 id="heading-leveraging-neo4j-streams-part-3">Leveraging Neo4j Streams — Part 3</h4>
<p>This article is the third part of the <strong>Leveraging Neo4j Streams</strong> series (Part 1 is <a target="_blank" href="https://medium.freecodecamp.org/how-to-leverage-neo4j-streams-and-build-a-just-in-time-data-warehouse-64adf290f093">here</a>, Part 2 is <a target="_blank" href="https://medium.freecodecamp.org/how-to-ingest-data-into-neo4j-from-a-kafka-stream-a34f574f5655">here</a>). In it, I’ll show you how to bring Neo4j into your <strong>Apache Kafka</strong> flow by using the streams procedures available with <a target="_blank" href="https://medium.com/neo4j/a-new-neo4j-integration-with-apache-kafka-6099c14851d2"><strong>Neo4j Streams</strong></a>.</p>
<p>To show how to integrate them, simplify the integration, and let you test the whole project by hand, I’ll use <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70"><strong>Apache Zeppelin</strong></a>, a notebook runner that lets you <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70">natively interact with Neo4j</a>.</p>
<h3 id="heading-what-is-a-neo4j-stored-procedure">What is a Neo4j Stored Procedure?</h3>
<p>Starting from Neo4j 3.x, the concept of <a target="_blank" href="https://neo4j.com/docs/java-reference/current/extending-neo4j/procedures/"><strong>user-defined procedures and functions</strong></a> was introduced. These are custom implementations of certain functionalities and/or business rules that can’t be (easily) expressed in Cypher itself.</p>
<p>Neo4j provides a number of built-in procedures. The <a target="_blank" href="http://neo4j-contrib.github.io/neo4j-apoc-procedures/">APOC</a> library adds another 450, covering everything from data integration to graph refactorings.</p>
<h3 id="heading-what-are-the-streams-procedures">What are the streams procedures?</h3>
<p>The Neo4j Streams project ships with two procedures:</p>
<ul>
<li><code>streams.publish</code>: allows custom message streaming from Neo4j to the configured environment by using the underlying configured Producer</li>
<li><code>streams.consume</code>: allows consuming messages from a given topic.</li>
</ul>
<h3 id="heading-set-up-the-environment">Set-Up the Environment</h3>
<p>In the following <a target="_blank" href="https://github.com/conker84/leveraging-neo4j-streams">GitHub repo</a> you’ll find everything you need to replicate what I’m presenting in this article. All you need to get started is <a target="_blank" href="https://docs.docker.com/"><strong>Docker</strong></a>; then you can spin up the stack by entering the directory and running the following command from the terminal:</p>
<pre><code>$ docker-compose up
</code></pre><p>This will start up the whole environment, which comprises:</p>
<ul>
<li>Neo4j + Neo4j Streams module + APOC procedures</li>
<li>Apache Kafka</li>
<li>Apache Spark (which is not necessary in this article, but it’s used in the previous two)</li>
<li>Apache Zeppelin</li>
</ul>
<p>Open Apache Zeppelin at <code>http://localhost:8080</code> and, in the <code>Medium/Part 3</code> directory, you’ll find a notebook called “<strong>Streams Procedures</strong>”, which is the subject of this article.</p>
<h3 id="heading-streamspublish">streams.publish</h3>
<p>This procedure allows custom message streaming from Neo4j to the configured environment by using the underlying configured Producer.</p>
<p>It takes two variables as input and returns nothing (as it sends its payload asynchronously to the stream):</p>
<ul>
<li><strong>topic</strong>, <em>type String</em>: where the data will be published</li>
<li><strong>payload</strong>, <em>type Object</em>: what you want to stream.</li>
</ul>
<p>Example:</p>
<pre><code>CALL streams.publish(<span class="hljs-string">'my-topic'</span>, <span class="hljs-string">'Hello World from Neo4j!'</span>)
</code></pre><p>The message retrieved from the Consumer is the following:</p>
<pre><code>{<span class="hljs-string">"payload"</span>: <span class="hljs-string">"Hello world from Neo4j!"</span>}
</code></pre><p>You can send any kind of data in the payload: <strong>nodes, relationships, paths, lists, maps, scalar values and nested versions thereof</strong>.</p>
<p>In the case of nodes and/or relationships, if the topic is covered by the patterns defined in the Change Data Capture (CDC) configuration, their properties will be filtered according to that configuration.</p>
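<p>For example (a sketch of my own, assuming a <code>Person</code> node with that name already exists in your graph), publishing a node instead of a plain string looks like this:</p>
<pre><code>MATCH (p:Person {name: 'Andrea'})
CALL streams.publish('my-topic', p)
</code></pre>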
<p>Following is a simple video that shows the procedure in action:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/jmaPyKRDXsCEdwZEdeMzNyE2BoKXolGMEbjR" alt="Image" width="600" height="400" loading="lazy">
<em>The streams.publish procedure in action</em></p>
<h3 id="heading-streamsconsume">streams.consume</h3>
<p>This procedure allows for consuming messages from a given topic.</p>
<p>It takes two variables as input:</p>
<ul>
<li><strong>topic</strong>, <em>type String</em>: the topic you want to consume data from</li>
<li><strong>config</strong>, <em>type Map</em>: the configuration parameters</li>
</ul>
<p>and returns a list of collected events.</p>
<p>The <strong>config</strong> params are:</p>
<ul>
<li><strong>timeout</strong>, <em>type Long</em>: the value (in milliseconds) passed to the Kafka <a target="_blank" href="https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll-long-"><code>Consumer#poll</code></a> method. Default 1000.</li>
<li><strong>from</strong>, <em>type String</em>: it’s the Kafka configuration parameter <code>auto.offset.reset</code></li>
</ul>
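<p>Putting both parameters together (the values here are illustrative, not recommended defaults), a call with an explicit configuration looks like this:</p>
<pre><code>CALL streams.consume('my-topic', {timeout: 5000, from: 'earliest'})
YIELD event
RETURN event
</code></pre>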
<p>Use:</p>
<pre><code>CALL streams.consume(<span class="hljs-string">'my-topic'</span>, {<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">config</span>&gt;</span>}) YIELD event RETURN event</span>
</code></pre><p>Example: Imagine you have a producer that publishes events like this:</p>
<pre><code>{<span class="hljs-string">"name"</span>: <span class="hljs-string">"Andrea"</span>, <span class="hljs-string">"surname"</span>: <span class="hljs-string">"Santurbano"</span>}
</code></pre><p>We can create user nodes in this way:</p>
<pre><code>CALL streams.consume(<span class="hljs-string">'my-topic'</span>, {<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">config</span>&gt;</span>}) YIELD event
CREATE (p:Person {firstName: event.data.name, lastName: event.data.surname})</span>
</code></pre><p>Following is a simple video that shows the procedure in action:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/m0Lui2cBqiT0OQO9DTuiQfOYHZmpAKwRB4Ys" alt="Image" width="600" height="400" loading="lazy">
<em>The stream.consume procedure in action</em></p>
<p>So this is the end of the “Leveraging Neo4j Streams” series. I hope you enjoyed it!</p>
<p>If you have already tested the Neo4j-Streams module or tested it via this notebook, please fill out our <a target="_blank" href="https://goo.gl/forms/VLwvqwsIvdfdm9fL2"><strong>feedback survey</strong></a>.</p>
<p>If you run into any issues or have thoughts about improving our work, <a target="_blank" href="http://github.com/neo4j-contrib/neo4j-streams/issues">please raise a GitHub issue</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to detect a user’s preferred color scheme in JavaScript ]]>
                </title>
                <description>
                    <![CDATA[ By Oskar Hane In recent versions of macOS (Mojave) and Windows 10, users have been able to enable a system level dark mode. This works well and is easy to detect for native applications. Websites have been the odd apps where it’s up to the website pu... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-detect-a-users-preferred-color-scheme-in-javascript-ec8ee514f1ef/</link>
                <guid isPermaLink="false">66c351ab0107ba195e79f70c</guid>
                
                    <category>
                        <![CDATA[ CSS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ UX ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 18 Mar 2019 16:54:59 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*QPIhIZte1bW0DKQoLoXwxw.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Oskar Hane</p>
<p>In recent versions of macOS (Mojave) and Windows 10, users have been able to enable a system level dark mode. This works well and is easy to detect for native applications.</p>
<p>Websites have been the odd ones out: it’s up to the website publisher to decide which color scheme users get. Some websites do offer theme support, but to switch, users have to find the setting and update it manually for each individual website.</p>
<p>Would it be possible to have this detection done automatically and have websites present a theme that respects the user’s preference?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*QPIhIZte1bW0DKQoLoXwxw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Light vs Dark theme in Neo4j Browser</em></p>
<h3 id="heading-css-media-query-prefers-color-scheme-draft">CSS media query <code>prefers-color-scheme</code> draft</h3>
<p>There is a CSS media queries draft level 5 where <a target="_blank" href="https://drafts.csswg.org/mediaqueries-5/#descdef-media-prefers-color-scheme">prefers-color-scheme</a> is specified. It is meant to detect if the user has requested the system to use a light or dark color theme.</p>
<p>This sounds like something we can work with! We need to stay up to date with any changes to the draft, though, as it might change at any time since it’s just a… draft. The <code>prefers-color-scheme</code> query can have three different values: <code>light</code>, <code>dark</code>, and <code>no-preference</code>.</p>
<h3 id="heading-web-browser-support-as-of-march-2019">Web browser support as of March 2019</h3>
<p>The current browser support is <em>very</em> limited, and it’s not available in any of the stable releases of any vendor. We can only enjoy this in <a target="_blank" href="https://developer.apple.com/safari/technology-preview/">Safari Technology Preview of version 12.1</a> and in <a target="_blank" href="https://www.mozilla.org/en-US/firefox/67.0a1/releasenotes/">Firefox 67.0a1</a>. What’s great is that there are binaries that do support it, so we can work with it and try it out in web browsers. For current browser support, check out <a target="_blank" href="https://caniuse.com/#search=prefers-color-scheme">https://caniuse.com/#search=prefers-color-scheme</a>.</p>
<h3 id="heading-why-css-only-detection-isnt-enough">Why CSS only detection isn’t enough</h3>
<p>The common approach I’ve seen so far is a CSS-only one: override CSS rules for certain classes when a media query matches.<br>Something like this:</p>
<pre><code class="lang-css"><span class="hljs-comment">/* global.css */</span>

<span class="hljs-selector-class">.themed</span> {
  <span class="hljs-attribute">display</span>: block;
  <span class="hljs-attribute">width</span>: <span class="hljs-number">10em</span>;
  <span class="hljs-attribute">height</span>: <span class="hljs-number">10em</span>;
  <span class="hljs-attribute">background</span>: black;
  <span class="hljs-attribute">color</span>: white;
}

<span class="hljs-keyword">@media</span> (<span class="hljs-attribute">prefers-color-scheme:</span> light) {
  <span class="hljs-selector-class">.themed</span> {
    <span class="hljs-attribute">background</span>: white;
    <span class="hljs-attribute">color</span>: black;
  }
}
</code></pre>
<p>While this works fine for many use cases, some styling techniques don’t use CSS this way. If <a target="_blank" href="https://www.styled-components.com">styled-components</a> is used for theming, for example, a JS object is replaced when the theme changes.</p>
<p>Having access to the preferred scheme is also useful for analytics and for more predictable CSS overrides, as well as for more fine-grained control over which elements should be themed.</p>
<h3 id="heading-initial-js-approach">Initial JS approach</h3>
<p>I’ve learned in the past that you can do media query detection by setting the CSS <code>content</code> of an element to a value if a media query is matched. This is definitely a hack, but it works!</p>
<p>Something like this:</p>
<pre><code class="lang-css"><span class="hljs-comment">/* global.css */</span>

<span class="hljs-selector-tag">html</span> {
  <span class="hljs-attribute">content</span>: <span class="hljs-string">""</span>;
}

<span class="hljs-keyword">@media</span> (<span class="hljs-attribute">prefers-color-scheme:</span> light) {
  <span class="hljs-selector-tag">html</span> {
    <span class="hljs-attribute">content</span>: <span class="hljs-string">"light"</span>;
  }
}

<span class="hljs-keyword">@media</span> (<span class="hljs-attribute">prefers-color-scheme:</span> dark) {
  <span class="hljs-selector-tag">html</span> {
    <span class="hljs-attribute">content</span>: <span class="hljs-string">"dark"</span>;
  }
}
</code></pre>
<p>So when a user loads the CSS and the media query matches one of the above color schemes, the <code>content</code> property value of the <code>html</code> element is set to either ‘light’ or ‘dark’.</p>
<p>The question then is, how do we read the <code>content</code> value of the <code>html</code> element?</p>
<p>We can use <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/Window/getComputedStyle">window.getComputedStyle</a>, like this:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> value = <span class="hljs-built_in">window</span>
  .getComputedStyle(<span class="hljs-built_in">document</span>.documentElement)
  .getPropertyValue(<span class="hljs-string">'content'</span>)
  .replace(<span class="hljs-regexp">/"/g</span>, <span class="hljs-string">''</span>)

<span class="hljs-comment">// value is now "dark", "light" or empty string</span>
</code></pre>
<p>And this works fine! This approach is fine for a <strong>one-time read</strong>, but it’s not reactive: it doesn’t automatically update when the user changes their system color scheme. To pick up a change, a page reload is needed (or the read above has to be repeated at an interval).</p>
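<p>The same one-time read can also be done without the CSS hack, by querying <code>window.matchMedia</code> directly. The sketch below is mine, not from the original article: the helper name and the injected <code>win</code> parameter (which makes the no-support fallback explicit) are assumptions.</p>

```javascript
// Hypothetical one-shot helper: reads the preferred color scheme once.
// `win` is the window-like object to query; it is injected so the
// no-support fallback is explicit and the function is easy to test.
function getPreferredScheme(win) {
  // No matchMedia available (older browsers): report no preference
  if (!win || typeof win.matchMedia !== 'function') {
    return 'no-preference'
  }
  if (win.matchMedia('(prefers-color-scheme: dark)').matches) {
    return 'dark'
  }
  if (win.matchMedia('(prefers-color-scheme: light)').matches) {
    return 'light'
  }
  return 'no-preference'
}

// In a browser you would call it as getPreferredScheme(window)
```

<p>This is still a one-time read, though, so the reactivity question remains.</p>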
<h3 id="heading-reactive-js-approach">Reactive JS approach</h3>
<p>How can we know when the user changes the system color scheme? Are there any events we can listen to?</p>
<p>Yes there are!</p>
<p>There is <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/Window/matchMedia">window.matchMedia</a> available in <a target="_blank" href="https://caniuse.com/#feat=matchmedia">modern web browsers</a>.</p>
<p>What’s great with <code>matchMedia</code> is that we can attach a listener to it that will be called anytime the match changes.</p>
<p>The listener is called with an object that tells us whether the media query started or stopped matching. With this info, we can skip the CSS altogether and work with JS alone.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> DARK = <span class="hljs-string">'(prefers-color-scheme: dark)'</span>
<span class="hljs-keyword">const</span> LIGHT = <span class="hljs-string">'(prefers-color-scheme: light)'</span>

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">changeWebsiteTheme</span>(<span class="hljs-params">scheme</span>) </span>{
  <span class="hljs-comment">// 'dark' or 'light' string is in scheme here</span>
  <span class="hljs-comment">// so the website theme can be updated</span>
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">detectColorScheme</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">window</span>.matchMedia) {
    <span class="hljs-keyword">return</span>
  }

  <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">listener</span>(<span class="hljs-params">{ matches, media }</span>) </span>{
    <span class="hljs-keyword">if</span> (!matches) {
      <span class="hljs-comment">// Not matching anymore = not interesting</span>
      <span class="hljs-keyword">return</span>
    }

    <span class="hljs-keyword">if</span> (media === DARK) {
      changeWebsiteTheme(<span class="hljs-string">'dark'</span>)
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (media === LIGHT) {
      changeWebsiteTheme(<span class="hljs-string">'light'</span>)
    }
  }

  <span class="hljs-keyword">const</span> mqDark = <span class="hljs-built_in">window</span>.matchMedia(DARK)
  mqDark.addListener(listener)

  <span class="hljs-keyword">const</span> mqLight = <span class="hljs-built_in">window</span>.matchMedia(LIGHT)
  mqLight.addListener(listener)
}
</code></pre>
<p>This approach works really well in the supported web browsers and just opts out if <code>window.matchMedia</code> isn't supported.</p>
<h3 id="heading-react-hook">React hook</h3>
<p>Since we are using React in <a target="_blank" href="https://github.com/neo4j/neo4j-browser">neo4j-browser</a>, I wrote this as a custom React hook to make it easy to re-use in all of our apps and fit into the React system.</p>
<pre><code class="lang-js"><span class="hljs-comment">// useDetectColorScheme.js</span>
<span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>

<span class="hljs-comment">// Define available themes</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> colorSchemes = {
  <span class="hljs-attr">DARK</span>: <span class="hljs-string">'(prefers-color-scheme: dark)'</span>,
  <span class="hljs-attr">LIGHT</span>: <span class="hljs-string">'(prefers-color-scheme: light)'</span>,
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">useDetectColorScheme</span>(<span class="hljs-params">defaultScheme = <span class="hljs-string">'light'</span></span>) </span>{
  <span class="hljs-comment">// Hook state</span>
  <span class="hljs-keyword">const</span> [scheme, setScheme] = useState(defaultScheme)

  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-comment">// No support for detection</span>
    <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">window</span>.matchMedia) {
      <span class="hljs-keyword">return</span>
    }

    <span class="hljs-comment">// The listener</span>
    <span class="hljs-keyword">const</span> listener = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
      <span class="hljs-comment">// No match = not interesting</span>
      <span class="hljs-keyword">if</span> (!e || !e.matches) {
        <span class="hljs-keyword">return</span>
      }

      <span class="hljs-comment">// Look for the matching color scheme</span>
      <span class="hljs-comment">// and update the hook state</span>
      <span class="hljs-keyword">const</span> schemeNames = <span class="hljs-built_in">Object</span>.keys(colorSchemes)
      <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; schemeNames.length; i++) {
        <span class="hljs-keyword">const</span> schemeName = schemeNames[i]

        <span class="hljs-keyword">if</span> (e.media === colorSchemes[schemeName]) {
          setScheme(schemeName.toLowerCase())
          <span class="hljs-keyword">break</span>
        }
      }
    }

    <span class="hljs-comment">// Loop through and setup listeners for the</span>
    <span class="hljs-comment">// media queries we want to monitor</span>
    <span class="hljs-keyword">let</span> activeMatches = []
    <span class="hljs-built_in">Object</span>.keys(colorSchemes).forEach(<span class="hljs-function">(<span class="hljs-params">schemeName</span>) =&gt;</span> {
      <span class="hljs-keyword">const</span> mq = <span class="hljs-built_in">window</span>.matchMedia(colorSchemes[schemeName])

      mq.addListener(listener)
      activeMatches.push(mq)
      listener(mq)
    })

    <span class="hljs-comment">// Remove listeners, no memory leaks</span>
    <span class="hljs-keyword">return</span> <span class="hljs-function">() =&gt;</span> {
      activeMatches.forEach(<span class="hljs-function">(<span class="hljs-params">mq</span>) =&gt;</span> mq.removeListener(listener))
      activeMatches = []
    }
    <span class="hljs-comment">// Run on first load of hook only</span>
  }, [])

  <span class="hljs-comment">// Return the current scheme from state</span>
  <span class="hljs-keyword">return</span> scheme
}
</code></pre>
<p>It’s a bit more code than the first short proof of concept. We have better error detection, and we also remove the event listeners when the hook unmounts.</p>
<p>In our use case, users can choose to override the auto-detected scheme (we offer an outlined theme, for example, often used when doing presentations).</p>
<p>And then use it like this in the application layer:</p>
<pre><code class="lang-jsx"><span class="hljs-comment">// App.jsx</span>
<span class="hljs-keyword">import</span> React <span class="hljs-keyword">from</span> <span class="hljs-string">'react'</span>
<span class="hljs-keyword">import</span> ThemeProvider <span class="hljs-keyword">from</span> <span class="hljs-string">'./ThemeProvider'</span>
<span class="hljs-keyword">import</span> useDetectColorScheme <span class="hljs-keyword">from</span> <span class="hljs-string">'./useDetectColorScheme'</span>
<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params">{ configuredTheme, themeData, children }</span>) </span>{
  <span class="hljs-comment">// Detect scheme and have 'light' as the default</span>
  <span class="hljs-keyword">const</span> autoScheme = useDetectColorScheme(<span class="hljs-string">'light'</span>)

  <span class="hljs-comment">// Check if the user wants to override the auto-detected scheme</span>
  <span class="hljs-keyword">const</span> scheme = configuredTheme === <span class="hljs-string">'auto'</span> ? autoScheme : configuredTheme

  <span class="hljs-comment">// Pass the theme data to a theme provider component</span>
  <span class="hljs-keyword">return</span> <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">ThemeProvider</span> <span class="hljs-attr">theme</span>=<span class="hljs-string">{themeData[scheme]}</span>&gt;</span>{children}<span class="hljs-tag">&lt;/<span class="hljs-name">ThemeProvider</span>&gt;</span></span>
}
</code></pre>
<p>The last part depends on how theming is made in your application. In the example above, the theme data object is passed into a context provider that makes this object available throughout the whole React application.</p>
<h3 id="heading-end-result">End result</h3>
<p>Here’s a gif with the end result, and as you can see, it’s instant.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*dp2Nj97f12YMhEXuUiybTA.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-final-thoughts">Final thoughts</h3>
<p>This was a fun experiment made during a so-called “Lab Day” we have in the UX team at <a target="_blank" href="https://neo4j.com">Neo4j</a>. The early stage of the spec and (therefore) the lack of browser support don’t justify shipping this in any product yet. But support might come sooner rather than later.</p>
<p>And besides, we do ship some Electron-based products, and there is an <a target="_blank" href="https://github.com/electron/electron/blob/master/docs/api/system-preferences.md#systempreferencesisdarkmode-macos"><code>electron.systemPreferences.isDarkMode()</code></a> available there...</p>
<h3 id="heading-about-the-author">About the author</h3>
<p><a target="_blank" href="https://twitter.com/oskarhane">Oskar Hane</a> is a team lead / senior engineer at <a target="_blank" href="https://neo4j.com">Neo4j</a>.<br>He works on several of Neo4j’s end-user applications and code libraries and has authored two tech books.</p>
<p>Follow Oskar on Twitter: <a target="_blank" href="https://twitter.com/oskarhane">@oskarhane</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to ingest data into Neo4j from a Kafka stream ]]>
                </title>
                <description>
                    <![CDATA[ By Andrea Santurbano This article is the second part of the Leveraging Neo4j Streams series (Part 1 is here). I’ll show how to bring Neo4j into your Apache Kafka flow by using the Sink module of the Neo4j Streams project in combination with Apache Sp... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-ingest-data-into-neo4j-from-a-kafka-stream-a34f574f5655/</link>
                <guid isPermaLink="false">66c352d879660e79296c1dcf</guid>
                
                    <category>
                        <![CDATA[ Apache Kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #apache-spark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 15 Feb 2019 16:47:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*I3lIfJ7LFzRpfk0hdAbsww.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Andrea Santurbano</p>
<p>This article is the second part of the <strong>Leveraging Neo4j Streams</strong> series (<a target="_blank" href="https://medium.freecodecamp.org/how-to-leverage-neo4j-streams-and-build-a-just-in-time-data-warehouse-64adf290f093">Part 1 is here</a>). I’ll show how to bring Neo4j into your <strong>Apache Kafka</strong> flow by using the Sink module of the <a target="_blank" href="https://medium.com/neo4j/a-new-neo4j-integration-with-apache-kafka-6099c14851d2"><strong>Neo4j Streams</strong></a> project in combination with <strong>Apache Spark</strong>’s Structured Streaming Apis.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ucRQaTumqnuCgJXKJTQfxR5pnsTeSxEQN0-k" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" rel="noopener" href="https://unsplash.com/photos/-qrcOR33ErA?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hendrik Cornelissen</a> on <a target="_blank" rel="noopener" href="https://unsplash.com/search/photos/stream?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></em></p>
<p>In order to show how to integrate them, simplify the integration, and let you test the whole project yourself, I’ll use <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70"><strong>Apache Zeppelin</strong></a> <strong>— a notebook runner that simply allows you to <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70">natively interact with Neo4j</a>.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/btZPmk6Xpd650yTa-cU7FgXPvdlMMbEGrM7K" alt="Image" width="600" height="400" loading="lazy">
<em>The result</em></p>
<h3 id="heading-leveraging-neo4j-streams">Leveraging Neo4j Streams</h3>
<p>The Neo4j Streams project is composed of three main pillars:</p>
<ul>
<li>The <strong>Change Data Capture</strong> that allows you to stream database changes over Kafka topics</li>
<li>The <strong>Sink</strong> (the subject of this article) that allows consuming data streams from Kafka topics</li>
<li>A <strong>set of procedures</strong> that allows you to Produce/Consume data to/from Kafka Topics</li>
</ul>
<h3 id="heading-the-neo4j-streams-sink">The Neo4j Streams Sink</h3>
<p>This module allows Neo4j to consume data from a Kafka topic. It does so in a “smart” way, by letting you define your own custom queries. All you need to do is write something like this in your neo4j.conf:</p>
<pre><code>streams.sink.topic.cypher.&lt;TOPIC&gt;=<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">CYPHER_QUERY</span>&gt;</span></span>
</code></pre><p>So if you define a query just like this:</p>
<pre><code>streams.sink.topic.cypher.my-topic=MERGE (n:Person{<span class="hljs-attr">id</span>: event.id}) \
    ON CREATE SET n += event.properties
</code></pre><p>And for events like this:</p>
<pre><code>{<span class="hljs-attr">id</span>:<span class="hljs-string">"alice@example.com"</span>,<span class="hljs-attr">properties</span>:{<span class="hljs-attr">name</span>:<span class="hljs-string">"Alice"</span>,<span class="hljs-attr">age</span>:<span class="hljs-number">32</span>}}
</code></pre><p>Under the hood the Sink module will execute a query like this:</p>
<pre><code>UNWIND {batch} AS event
MERGE (n:Person {<span class="hljs-attr">id</span>: event.id})
    ON CREATE SET n += event.properties
</code></pre><p>The <code>batch</code> parameter is a set of Kafka events gathered by the Sink and processed in a single transaction in order to maximize execution efficiency.</p>
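<p>To make the batching concrete, here’s a small Python sketch (illustrative only, not the module’s actual code) of what the generated <code>UNWIND</code> query does: every event in the batch flows through the same template within one unit of work:</p>

```python
# Illustrative sketch (not the module's actual code): the Sink gathers a
# batch of Kafka events and applies one Cypher template to all of them in
# a single unit of work, like the generated UNWIND query does.

def run_batched(events, apply_event):
    # Mirrors: UNWIND {batch} AS event ... (one transaction for the whole batch)
    return [apply_event(event) for event in events]

def merge_person(event):
    # Mirrors: MERGE (n:Person {id: event.id}) ON CREATE SET n += event.properties
    node = {"label": "Person", "id": event["id"]}
    node.update(event["properties"])
    return node

batch = [
    {"id": "alice@example.com", "properties": {"name": "Alice", "age": 32}},
    {"id": "bob@example.com", "properties": {"name": "Bob", "age": 42}},
]

nodes = run_batched(batch, merge_person)
print(nodes[0]["name"])  # Alice
```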
<p>So continuing with the example above, a possible full representation could be:</p>
<pre><code>WITH [{<span class="hljs-attr">id</span>:<span class="hljs-string">"alice@example.com"</span>,<span class="hljs-attr">properties</span>:{<span class="hljs-attr">name</span>:<span class="hljs-string">"Alice"</span>,<span class="hljs-attr">age</span>:<span class="hljs-number">32</span>}},
    {<span class="hljs-attr">id</span>:<span class="hljs-string">"bob@example.com"</span>,<span class="hljs-attr">properties</span>:{<span class="hljs-attr">name</span>:<span class="hljs-string">"Bob"</span>,<span class="hljs-attr">age</span>:<span class="hljs-number">42</span>}}] AS batch
UNWIND batch AS event
MERGE (n:Person {<span class="hljs-attr">id</span>: event.id})
    ON CREATE SET n += event.properties
</code></pre><p>This gives developers the power to define their own business rules: you can choose to update, add to, remove, or adapt your graph data based on the events you receive.</p>
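<p>As a rough illustration of why <code>MERGE ... ON CREATE SET</code> is a good fit for event streams, this hypothetical Python sketch mimics its semantics on an in-memory store: the node is created on first sight, and replayed or duplicate events leave it untouched:</p>

```python
# Hypothetical in-memory mimic of MERGE ... ON CREATE SET semantics:
# create on first sight, leave existing data untouched on replays.

graph = {}  # node id -> properties

def merge_on_create(event):
    if event["id"] not in graph:      # only ON CREATE runs the SET
        graph[event["id"]] = dict(event["properties"])

events = [
    {"id": "alice@example.com", "properties": {"name": "Alice", "age": 32}},
    {"id": "alice@example.com", "properties": {"name": "Alice", "age": 99}},  # duplicate
]
for e in events:
    merge_on_create(e)

print(graph["alice@example.com"]["age"])  # 32
```

Because the operation is idempotent, re-delivering the same Kafka event does not corrupt the graph.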
<h3 id="heading-a-simple-use-case-ingest-data-from-open-data-apis">A simple use case: Ingest data from Open Data APIs</h3>
<p>Imagine your data pipeline needs to read data from an Open Data API, enrich it with some other internal source, and in the end persist it into Neo4j. The best solution for doing this is Apache Spark, which lets you manage different data sources with the same Dataset abstraction.</p>
<h4 id="heading-set-up-the-environment">Set-Up the Environment</h4>
<p>In the following <a target="_blank" href="https://github.com/conker84/leveraging-neo4j-streams">GitHub repo</a>, you’ll find all the code necessary to replicate what I’m presenting in this article. All you need to get started is <a target="_blank" href="https://docs.docker.com/"><strong>Docker</strong></a>; then you can simply spin up the stack by entering the directory and executing the following command from the terminal:</p>
<pre><code>$ docker-compose up
</code></pre><p>This will start up the whole environment that comprises:</p>
<ul>
<li>Neo4j + Neo4j Streams module + APOC procedures</li>
<li>Apache Kafka</li>
<li>Apache Spark</li>
<li>Apache Zeppelin</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/lMf2OG4Sw0iv1hVUHehYkcF9b0xFk5H9I8Qe" alt="Image" width="600" height="400" loading="lazy">
<em>The whole architecture based on Docker containers</em></p>
<p>By going into Apache Zeppelin @ <code>http://localhost:8080</code> you’ll find in the directory <code>Medium/Part 2</code> one notebook “<strong>From Open Data to Sink</strong>” which is the subject of this article.</p>
<h4 id="heading-the-open-data-api">The Open Data API</h4>
<p>We’ll choose the Italian Ministry of Health dataset of Pharmacy stores.</p>
<h4 id="heading-define-the-sink-query">Define the Sink Query</h4>
<p>If you go into the <code>docker-compose.yml</code> file you’ll find a new property that corresponds to the Sink query definition:</p>
<pre><code>NEO4J_streams_sink_topic_cypher_pharma: <span class="hljs-string">"
MERGE (p:Pharmacy{fiscalId: event.FISCAL_ID}) ON CREATE SET p.name = event.NAME
MERGE (t:PharmacyType{type: event.TYPE_NAME})
MERGE (a:Address{name: event.ADDRESS + ', ' + event.CITY})
  ON CREATE SET a.latitude = event.LATITUDE,
                a.longitude = event.LONGITUDE,
                a.code = event.POSTAL_CODE,
                a.point = event.POINT
MERGE (c:City{name: event.CITY})
MERGE (p)-[:IS_TYPE]-&gt;(t)
MERGE (p)-[:HAS_ADDRESS]-&gt;(a)
MERGE (a)-[:IS_LOCATED_IN]-&gt;(c)"</span>
</code></pre><p>The <code>NEO4J_streams_sink_topic_cypher_pharma</code> property defines that all the data that comes from a topic named <code>pharma</code> will be consumed with the corresponding query.</p>
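<p>The routing implied by that naming convention can be sketched in a few lines of Python (an illustration of the idea, not the module’s actual parser): strip the <code>streams.sink.topic.cypher.</code> prefix and use the remainder as the topic name:</p>

```python
# Assumption: this mirrors the idea, not the module's actual parser.
# Each streams.sink.topic.cypher.<TOPIC> property routes one Kafka topic
# to its own Cypher template.

PREFIX = "streams.sink.topic.cypher."

def topic_queries(config):
    # keep only the sink properties, keyed by topic name
    return {key[len(PREFIX):]: query
            for key, query in config.items()
            if key.startswith(PREFIX)}

config = {
    "streams.sink.topic.cypher.pharma": "MERGE (p:Pharmacy {fiscalId: event.FISCAL_ID}) ...",
    "dbms.connector.bolt.enabled": "true",
}

mapping = topic_queries(config)
print(sorted(mapping))  # ['pharma']
```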
<p>The graph model that results from the query above is:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/mMpAsz0co84uukQ95JDiNSWYQvnMrM7PcYsk" alt="Image" width="600" height="400" loading="lazy">
<em>Our data model</em></p>
<h4 id="heading-the-notebook-from-open-data-to-sink">The Notebook — <strong>From Open Data to Sink</strong></h4>
<p>The first step is to download the CSV from the Open Data Portal and load it into a Spark DataFrame:</p>
<pre><code>val fileUrl = z.input(<span class="hljs-string">"File Url"</span>).toString

val url = <span class="hljs-keyword">new</span> java.net.URL(fileUrl)
val localFilePath = s<span class="hljs-string">"/zeppelin/spark-warehouse/${url.getPath.split("</span>/<span class="hljs-string">").last}"</span>

val src = scala.io.Source.fromURL(fileUrl)(<span class="hljs-string">"ISO-8859-1"</span>)
val out = <span class="hljs-keyword">new</span> java.io.FileWriter(localFilePath)
out.write(src.mkString)
out.close

val csvDF = (spark.read
    .format(<span class="hljs-string">"csv"</span>)
    .option(<span class="hljs-string">"delimiter"</span>, <span class="hljs-string">";"</span>)
    .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>)
    .load(localFilePath))
</code></pre><p>Now let’s explore the structure of the <code>csvDF</code>:</p>
<pre><code>root
|-- CODICEIDENTIFICATIVOFARMACIA: string (nullable = <span class="hljs-literal">true</span>)
|-- CODFARMACIAASSEGNATODAASL: string (nullable = <span class="hljs-literal">true</span>)
|-- INDIRIZZO: string (nullable = <span class="hljs-literal">true</span>)
|-- DESCRIZIONEFARMACIA: string (nullable = <span class="hljs-literal">true</span>)
|-- PARTITAIVA: string (nullable = <span class="hljs-literal">true</span>)
|-- CAP: string (nullable = <span class="hljs-literal">true</span>)
|-- CODICECOMUNEISTAT: string (nullable = <span class="hljs-literal">true</span>)
|-- DESCRIZIONECOMUNE: string (nullable = <span class="hljs-literal">true</span>)
|-- FRAZIONE: string (nullable = <span class="hljs-literal">true</span>)
|-- CODICEPROVINCIAISTAT: string (nullable = <span class="hljs-literal">true</span>)
|-- SIGLAPROVINCIA: string (nullable = <span class="hljs-literal">true</span>)
|-- DESCRIZIONEPROVINCIA: string (nullable = <span class="hljs-literal">true</span>)
|-- CODICEREGIONE: string (nullable = <span class="hljs-literal">true</span>)
|-- DESCRIZIONEREGIONE: string (nullable = <span class="hljs-literal">true</span>)
|-- DATAINIZIOVALIDITA: string (nullable = <span class="hljs-literal">true</span>)
|-- DATAFINEVALIDITA: string (nullable = <span class="hljs-literal">true</span>)
|-- DESCRIZIONETIPOLOGIA: string (nullable = <span class="hljs-literal">true</span>)
|-- CODICETIPOLOGIA: string (nullable = <span class="hljs-literal">true</span>)
|-- LATITUDINE: string (nullable = <span class="hljs-literal">true</span>)
|-- LONGITUDINE: string (nullable = <span class="hljs-literal">true</span>)
|-- LOCALIZE: string (nullable = <span class="hljs-literal">true</span>)
</code></pre><p>We want to focus on two fields:</p>
<ul>
<li><strong>CODICEIDENTIFICATIVOFARMACIA</strong>: it “should” be the unique identifier given by the Italian Ministry of Health to a Pharmacy Store</li>
<li><strong>DATAFINEVALIDITA</strong>: it indicates if the Pharmacy Store is still active (if it has no value it is active, otherwise it is closed)</li>
</ul>
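<p>The filter applied later in the notebook’s Spark SQL can be sketched in plain Python (the sample rows here are made up):</p>

```python
# Plain-Python sketch of the WHERE clause used in the notebook's Spark SQL;
# the sample rows are made up for illustration.

rows = [
    {"CODICEIDENTIFICATIVOFARMACIA": "12345", "DATAFINEVALIDITA": "-"},
    {"CODICEIDENTIFICATIVOFARMACIA": "-", "DATAFINEVALIDITA": "-"},
    {"CODICEIDENTIFICATIVOFARMACIA": "67890", "DATAFINEVALIDITA": "2018-06-30"},
]

def keep(row):
    # mirrors: WHERE DATAFINEVALIDITA <> '-' AND CODICEIDENTIFICATIVOFARMACIA <> '-'
    return (row["DATAFINEVALIDITA"] != "-"
            and row["CODICEIDENTIFICATIVOFARMACIA"] != "-")

kept = [r for r in rows if keep(r)]
print(len(kept))  # 1
```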
<p>We now save the Dataframe into a Spark temp view called <code>OPEN_DATA</code>:</p>
<pre><code>csvDF.createOrReplaceTempView(<span class="hljs-string">"open_data"</span>)
</code></pre><p>Let’s now overwrite the <code>OPEN_DATA</code> temp view by filtering the dataset for valid records and renaming some fields:</p>
<pre><code>%sql
CREATE OR REPLACE TEMP VIEW OPEN_DATA AS
SELECT CODICEIDENTIFICATIVOFARMACIA AS PHARMA_ID,
 INDIRIZZO AS ADDRESS,
 DESCRIZIONEFARMACIA AS NAME,
 PARTITAIVA AS FISCAL_ID,
 CAP AS POSTAL_CODE,
 DESCRIZIONECOMUNE AS CITY,
 DESCRIZIONEPROVINCIA AS PROVINCE,
 DATAFINEVALIDITA,
 DESCRIZIONETIPOLOGIA AS TYPE_NAME,
 CODICETIPOLOGIA AS TYPE,
 REPLACE(LATITUDINE, ',', '.') AS LATITUDE,
 REPLACE(LONGITUDINE, ',', '.') AS LONGITUDE,
 REPLACE(LATITUDINE, ',', '.') || ',' || REPLACE(LONGITUDINE, ',', '.') AS POINT
FROM OPEN_DATA
WHERE DATAFINEVALIDITA &lt;&gt; '-'
AND CODICEIDENTIFICATIVOFARMACIA &lt;&gt; '-'
</code></pre><p>Let’s now create the <code>OPEN_DATA_KAFKA_STAGE</code> temp view that must contain two columns:</p>
<ul>
<li><strong>VALUE</strong>: JSON that represents the data that we want to send to the Kafka topic</li>
<li><strong>KEY</strong>: a key that identifies the row</li>
</ul>
<p>You may notice that this is exactly the minimum requirement for a <code>ProducerRecord</code>:</p>
<pre><code>%sql
CREATE OR REPLACE TEMP VIEW OPEN_DATA_KAFKA_STAGE AS
SELECT TO_JSON(
    STRUCT(PHARMA_ID,
        ADDRESS,
        NAME,
        FISCAL_ID,
        POSTAL_CODE,
        CITY,
        PROVINCE,
        TYPE_NAME,
        TYPE,
        LATITUDE,
        LONGITUDE,
        POINT)
    ) AS VALUE,
    PHARMA_ID AS KEY
FROM OPEN_DATA
</code></pre><p>Let’s now send the data to the <code>pharma</code> topic via Spark:</p>
<pre><code>(spark.table(<span class="hljs-string">"OPEN_DATA_KAFKA_STAGE"</span>).selectExpr(<span class="hljs-string">"CAST(key AS STRING)"</span>, <span class="hljs-string">"CAST(value AS STRING)"</span>)
    .write
    .format(<span class="hljs-string">"kafka"</span>)
    .option(<span class="hljs-string">"kafka.enable.auto.commit"</span>, <span class="hljs-string">"true"</span>)
    .option(<span class="hljs-string">"kafka.bootstrap.servers"</span>, <span class="hljs-string">"broker:9093"</span>)
    .option(<span class="hljs-string">"topic"</span>, <span class="hljs-string">"pharma"</span>)
    .save())
</code></pre><p>The data streamed to the <code>pharma</code> topic by the Spark job will now be consumed by the Neo4j Streams Sink module, thanks to the Cypher template that we defined at the beginning of the article.</p>
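<p>What the Spark job stages for each row boils down to a (key, value) pair, with the value serialized as JSON. A minimal Python sketch (with made-up field values, not the job itself):</p>

```python
import json

# Illustrative sketch (made-up field values): each staged row becomes a
# (key, value) pair -- the minimum a Kafka ProducerRecord needs -- with
# PHARMA_ID as the key and the row serialized to JSON as the value.

def to_record(row):
    return row["PHARMA_ID"], json.dumps(row, sort_keys=True)

key, value = to_record({
    "PHARMA_ID": "12345",
    "NAME": "Farmacia Centrale",
    "CITY": "Torino",
})

print(key)  # 12345
```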
<p>Now in the final paragraph, we can explore the ingested data. In the following video we are exploring all the Pharmacy stores located in Turin:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/w4WY8A3yV-wnxKngp167PaDrrlZxNnZKfwyZ" alt="Image" width="600" height="400" loading="lazy">
<em>Explore the data just ingested</em></p>
<h3 id="heading-wrapping-up">Wrapping up</h3>
<p>In this second article (<a target="_blank" href="https://medium.freecodecamp.org/how-to-leverage-neo4j-streams-and-build-a-just-in-time-data-warehouse-64adf290f093">please check the first one</a> if you haven’t already) we have seen how to use the Sink module to transform Apache Kafka events into arbitrary graph structures. You can do it in a very simple way by using the Apache Spark APIs.</p>
<p>In Part 3 we’ll discover how to use the Streams procedure in order to produce/consume data directly via Cypher queries, so please stay tuned!</p>
<p>If you have already tested the Neo4j-Streams module or tested it via this notebook please fill out our <a target="_blank" href="https://goo.gl/forms/VLwvqwsIvdfdm9fL2"><strong>feedback survey</strong></a>.</p>
<p>If you run into any issues or have thoughts about improving our work, <a target="_blank" href="http://github.com/neo4j-contrib/neo4j-streams/issues">please raise a GitHub issue</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to leverage Neo4j Streams and build a just-in-time data warehouse ]]>
                </title>
                <description>
                    <![CDATA[ By Andrea Santurbano In this article, we’ll show how to create a Just-In-Time Data Warehouse by using Neo4j and the Neo4j Streams module with Apache Spark’s Structured Streaming Apis and Apache Kafka. In order to show how to integrate them, simplify ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-leverage-neo4j-streams-and-build-a-just-in-time-data-warehouse-64adf290f093/</link>
                <guid isPermaLink="false">66c3531b0107ba195e79f72a</guid>
                
                    <category>
                        <![CDATA[ Apache Kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ kafka ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ streaming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 29 Jan 2019 16:37:47 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*lwaAjWM8LuAvRZ1T67vWQw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Andrea Santurbano</p>
<p>In this article, we’ll show how to create a <a target="_blank" href="https://databricks.com/blog/2015/11/30/building-a-just-in-time-data-warehouse-platform-with-databricks.html">Just-In-Time Data Warehouse</a> by using <a target="_blank" href="https://neo4j.com/"><strong>Neo4j</strong></a> <strong>and the <a target="_blank" href="https://medium.com/neo4j/a-new-neo4j-integration-with-apache-kafka-6099c14851d2">Neo4j Streams</a></strong> module with <strong>Apache Spark</strong>’s Structured Streaming Apis and <strong>Apache Kafka.</strong></p>
<p>In order to show how to integrate them, simplify the integration, and let you test the whole project by hand, I’ll use <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70"><strong>Apache Zeppelin</strong></a>, <strong>a notebook runner that simply allows you to <a target="_blank" href="https://towardsdatascience.com/building-a-graph-data-pipeline-with-zeppelin-spark-and-neo4j-8b6b83f4fb70">natively interact with Neo4j</a>.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qrtYkmywS6MwhLmsVKauFIQHuuu2vKwNUmp7" alt="Image" width="600" height="400" loading="lazy">
<em>The final result: how a kafka event streamed by Neo4j gets collected by Apache Spark</em></p>
<h3 id="heading-leveraging-neo4j-streams">Leveraging Neo4j Streams</h3>
<p>The Neo4j Streams project is composed of three main pillars:</p>
<ul>
<li>The <strong>Change Data Capture</strong> (the subject of this first article) that allows us to stream database changes over Kafka topics</li>
<li>The <strong>Sink</strong> that allows consuming data streams from the Kafka topic</li>
<li>A <strong>set of procedures</strong> that allows us to Produce/Consume data to/from Kafka Topics</li>
</ul>
<h3 id="heading-what-is-a-change-data-capture">What is a Change Data Capture?</h3>
<p>It’s a system that automatically captures changes from a source system (a database, for instance) and provides them to downstream systems for a variety of use cases.</p>
<p>CDC typically forms part of an ETL pipeline. This is an important component for ensuring Data Warehouses (DWH) are kept up to date with any record changes.</p>
<p>Traditionally, CDC applications work off transaction logs, which allows them to replicate a database without much impact on the performance of its operation.</p>
<h3 id="heading-how-does-the-neo4j-streams-cdc-module-deal-with-database-changes">How does the Neo4j Streams CDC module deal with database changes?</h3>
<p>Every transaction inside Neo4j gets captured and transformed into a stream of the transaction’s atomic elements.</p>
<p>Let’s suppose we have a simple creation of two nodes and one relationship between them:</p>
<pre><code>CREATE (andrea:Person{<span class="hljs-attr">name</span>:<span class="hljs-string">"Andrea"</span>})-[knows:KNOWS{<span class="hljs-attr">since</span>:<span class="hljs-number">2014</span>}]-&gt;(michael:Person{<span class="hljs-attr">name</span>:<span class="hljs-string">"Michael"</span>})
</code></pre><p>The CDC module will transform this transaction into 3 events (2 node creations, 1 relationship creation).</p>
<p>The Event structure was inspired by the <a target="_blank" href="https://debezium.io/">Debezium</a> format and has the following general structure:</p>
<pre><code>{
  <span class="hljs-string">"meta"</span>: { <span class="hljs-comment">/* transaction meta-data */</span> },
  <span class="hljs-string">"payload"</span>: { <span class="hljs-comment">/* the data related to the transaction */</span>
    <span class="hljs-string">"before"</span>: { <span class="hljs-comment">/* the data before the transaction */</span> },
    <span class="hljs-string">"after"</span>: { <span class="hljs-comment">/* the data after the transaction */</span> }
  }
}
</code></pre><p>Node source <code>(andrea)</code>:</p>
<pre><code>{
  <span class="hljs-string">"meta"</span>: {
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-number">1532597182604</span>,
    <span class="hljs-string">"username"</span>: <span class="hljs-string">"neo4j"</span>,
    <span class="hljs-string">"tx_id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"tx_event_id"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"tx_events_count"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"operation"</span>: <span class="hljs-string">"created"</span>,
    <span class="hljs-string">"source"</span>: {
      <span class="hljs-string">"hostname"</span>: <span class="hljs-string">"neo4j.mycompany.com"</span>
    }
  },
  <span class="hljs-string">"payload"</span>: {
    <span class="hljs-string">"id"</span>: <span class="hljs-string">"1004"</span>,
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"node"</span>,
    <span class="hljs-string">"after"</span>: {
      <span class="hljs-string">"labels"</span>: [<span class="hljs-string">"Person"</span>],
      <span class="hljs-string">"properties"</span>: {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"Andrea"</span>
      }
    }
  }
}
</code></pre><p>Node target <code>(michael)</code>:</p>
<pre><code>{
  <span class="hljs-string">"meta"</span>: {
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-number">1532597182604</span>,
    <span class="hljs-string">"username"</span>: <span class="hljs-string">"neo4j"</span>,
    <span class="hljs-string">"tx_id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"tx_event_id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"tx_events_count"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"operation"</span>: <span class="hljs-string">"created"</span>,
    <span class="hljs-string">"source"</span>: {
      <span class="hljs-string">"hostname"</span>: <span class="hljs-string">"neo4j.mycompany.com"</span>
    }
  },
  <span class="hljs-string">"payload"</span>: {
    <span class="hljs-string">"id"</span>: <span class="hljs-string">"1006"</span>,
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"node"</span>,
    <span class="hljs-string">"after"</span>: {
      <span class="hljs-string">"labels"</span>: [<span class="hljs-string">"Person"</span>],
      <span class="hljs-string">"properties"</span>: {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"Michael"</span>
      }
    }
  }
}
</code></pre><p>Relationship <code>knows</code>:</p>
<pre><code>{
  <span class="hljs-string">"meta"</span>: {
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-number">1532597182604</span>,
    <span class="hljs-string">"username"</span>: <span class="hljs-string">"neo4j"</span>,
    <span class="hljs-string">"tx_id"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"tx_event_id"</span>: <span class="hljs-number">2</span>,
    <span class="hljs-string">"tx_events_count"</span>: <span class="hljs-number">3</span>,
    <span class="hljs-string">"operation"</span>: <span class="hljs-string">"created"</span>,
    <span class="hljs-string">"source"</span>: {
      <span class="hljs-string">"hostname"</span>: <span class="hljs-string">"neo4j.mycompany.com"</span>
    }
  },
  <span class="hljs-string">"payload"</span>: {
    <span class="hljs-string">"id"</span>: <span class="hljs-string">"1007"</span>,
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"relationship"</span>,
    <span class="hljs-string">"label"</span>: <span class="hljs-string">"KNOWS"</span>,
    <span class="hljs-string">"start"</span>: {
      <span class="hljs-string">"labels"</span>: [<span class="hljs-string">"Person"</span>],
      <span class="hljs-string">"id"</span>: <span class="hljs-string">"1004"</span>
    },
    <span class="hljs-string">"end"</span>: {
      <span class="hljs-string">"labels"</span>: [<span class="hljs-string">"Person"</span>],
      <span class="hljs-string">"id"</span>: <span class="hljs-string">"1006"</span>
    },
    <span class="hljs-string">"after"</span>: {
      <span class="hljs-string">"properties"</span>: {
        <span class="hljs-string">"since"</span>: <span class="hljs-number">2014</span>
      }
    }
  }
}
</code></pre><p>By default, all the data will be streamed on the <code>neo4j</code> topic. The CDC module allows controlling which nodes are sent to Kafka, and which of their properties you want to send to the topic:</p>
<pre><code>streams.source.topic.nodes.&lt;TOPIC_NAME&gt;=<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">PATTERN</span>&gt;</span></span>
</code></pre><p>With the following example:</p>
<pre><code>streams.source.topic.nodes.products=Product{name, code}
</code></pre><p>The CDC module will send to the <code>products</code> topic all the nodes that have the label <code>Product</code>, and only the changes to their <code>name</code> and <code>code</code> properties. Please go to the official documentation for a full description of <a target="_blank" href="https://neo4j-contrib.github.io/neo4j-streams/#_patterns">how label filtering works</a>.</p>
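<p>If you want to experiment with consumers before wiring up the real modules, a hypothetical Python helper can assemble test events in the Debezium-inspired shape shown above (all values here are made up):</p>

```python
import time

# Hypothetical test-event builder following the Debezium-inspired shape
# shown above; every value is made up for illustration.

def node_created_event(node_id, labels, properties, tx_id, tx_event_id, total):
    return {
        "meta": {
            "timestamp": int(time.time() * 1000),
            "username": "neo4j",
            "tx_id": tx_id,
            "tx_event_id": tx_event_id,
            "tx_events_count": total,
            "operation": "created",
            "source": {"hostname": "neo4j.mycompany.com"},
        },
        "payload": {
            "id": str(node_id),
            "type": "node",
            "after": {"labels": labels, "properties": properties},
        },
    }

event = node_created_event(1004, ["Person"], {"name": "Andrea"},
                           tx_id=1, tx_event_id=0, total=3)
print(event["payload"]["after"]["properties"]["name"])  # Andrea
```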
<p>For a more in-depth description of the Neo4j Streams project and how/why we at <a target="_blank" href="http://www.larus-ba.it/"><strong>LARUS</strong></a> and <a target="_blank" href="https://neo4j.com/"><strong>Neo4j</strong></a> built it, check out this article that provides an <a target="_blank" href="https://medium.com/neo4j/a-new-neo4j-integration-with-apache-kafka-6099c14851d2">in-depth description</a>.</p>
<h3 id="heading-beyond-the-traditional-data-warehouse">Beyond the traditional Data Warehouse</h3>
<p>A traditional DWH requires data teams to constantly build multiple costly and time-consuming Extract Transform Load (ETL) pipelines to ultimately derive business insights.</p>
<p>One of the biggest pain points is that Enterprise Data Warehouses are <strong>inherently rigid</strong>, with an architecture that’s difficult to change. That’s because:</p>
<ul>
<li>they are <strong>based on the</strong> <strong>Schema-On-Write architecture:</strong> first, you define your schema, then you write your data, then you read your data and it comes back in the schema you defined up-front</li>
<li>they are <strong>based</strong> on (expensive) <strong>batched/scheduled jobs</strong></li>
</ul>
<p><strong>This results in having to build costly and time-consuming ETL pipelines</strong> to access and manipulate the data. And as <strong>new data types</strong> and sources are introduced, the need to augment your ETL pipelines <strong>exacerbates the problem</strong>.</p>
<p>Thanks to the <strong>combination</strong> of the stream data processing with the <strong>Neo4j Streams CDC module</strong> and the <strong>Schema-On-Read</strong> approach provided by Apache Spark, we can <strong>overcome this rigidity</strong> and build a new kind of (flexible) DWH.</p>
<h3 id="heading-a-paradigm-shift-just-in-time-data-warehouse">A paradigm shift: Just-In-Time Data Warehouse</h3>
<p>A JIT-DWH solution is designed to easily handle a wider variety of data from different sources and starts from a different approach about how to deal with and manage data: <strong>Schema-On-Read.</strong></p>
<h3 id="heading-schema-on-read">Schema-On-Read</h3>
<p><a target="_blank" href="https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/">Schema-On-Read</a> follows a different sequence: <strong>it just loads the data as-is and applies your own lens to the data when you read it back out</strong>. With this kind of approach, you can present data in a schema that is adapted best to the queries being issued. You’re not stuck with a one-size-fits-all schema. With schema-on-read, you can present the data back in a schema that is most relevant to the task at hand.</p>
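<p>Schema-on-read can be demonstrated in a few lines of Python (a toy illustration, not part of the pipeline): the raw records are stored as-is, and a “lens” of fields is applied only when reading them back:</p>

```python
import json

# Toy illustration of schema-on-read (not part of the pipeline): records
# are stored raw, and a "lens" of fields is applied only at read time.

raw_lines = [
    '{"name": "Alice", "age": 32}',
    '{"name": "Bob", "age": 42, "gender": "M"}',  # a new field appears later
]

def read_with_schema(lines, fields):
    # choose the schema at read time; older records simply yield None
    return [{f: rec.get(f) for f in fields} for rec in map(json.loads, lines)]

people = read_with_schema(raw_lines, ["name", "gender"])
print(people[1]["gender"])  # M
```

Old records need no migration when a new field appears, which is exactly the flexibility the JIT-DWH relies on.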
<h4 id="heading-set-up-the-environment">Set-Up the Environment</h4>
<p>In the following <a target="_blank" href="https://github.com/conker84/leveraging-neo4j-streams"><strong>Github repo</strong></a> you’ll find everything you need to replicate what I’m presenting in this article. All you need to start is <a target="_blank" href="https://docs.docker.com/"><strong>Docker</strong></a>. Then you can simply spin up the stack by entering the directory and executing the following command from the terminal:</p>
<pre><code>$ docker-compose up
</code></pre><p>This will start up the whole environment that comprises:</p>
<ul>
<li>Neo4j + Neo4j Streams module + APOC procedures</li>
<li>Apache Kafka</li>
<li>Apache Spark</li>
<li>Apache Zeppelin</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/j4n2GGDDTdZZFuoNuyP9eioEs8C4aAk5hfg0" alt="Image" width="600" height="400" loading="lazy">
<em>The whole architecture based on Docker containers</em></p>
<p>By going into Apache Zeppelin at <code>http://localhost:8080</code> you’ll find two notebooks in the directory <code>Medium/Part 1</code>:</p>
<ul>
<li><strong>Create a Just-In-Time Data Warehouse</strong>: in this notebook, we will build the JIT-DWH</li>
<li><strong>Query The JIT-DWH</strong>: in this notebook, we will perform some queries over the JIT-DWH</li>
</ul>
<h3 id="heading-the-use-case">The Use Case</h3>
<p>We’ll create a fake social-network-like dataset. Creating it will activate the CDC module of Neo4j Streams, and via Apache Spark we’ll intercept these events and persist them on the file system as JSON.</p>
<p>Then we’ll demonstrate how new fields added to our nodes are automatically picked up by our JIT-DWH without any modification of the ETL pipeline, thanks to the Schema-On-Read approach.</p>
<p>We’ll execute the following steps:</p>
<ol>
<li>Create the fake data set</li>
<li>Build our data pipeline that intercepts the Kafka events published by the Neo4j Streams CDC module</li>
<li>Make the first query over our JIT-DWH on Spark</li>
<li>Add a new field in our graph model</li>
<li>Show how the new field is automatically exposed in real time thanks to the Neo4j Streams CDC module (without the need for changes over our ETL pipeline thanks to the Schema-On-Read approach).</li>
</ol>
<h3 id="heading-notebook-1-create-a-just-in-time-data-warehouse">Notebook 1: Create a Just-In-Time Data Warehouse</h3>
<p>We’ll create a fake social network by using the APOC <code>apoc.periodic.repeat</code> procedure that executes this query every 15 seconds:</p>
<pre><code>WITH [<span class="hljs-string">"M"</span>, <span class="hljs-string">"F"</span>, <span class="hljs-string">""</span>] AS genderUNWIND range(<span class="hljs-number">1</span>, <span class="hljs-number">10</span>) AS idCREATE (p:Person {<span class="hljs-attr">id</span>: apoc.create.uuid(), <span class="hljs-attr">name</span>: <span class="hljs-string">"Name-"</span> +  apoc.text.random(<span class="hljs-number">10</span>), <span class="hljs-attr">age</span>: round(rand() * <span class="hljs-number">100</span>), <span class="hljs-attr">index</span>: id, <span class="hljs-attr">gender</span>: gender[toInteger(size(gender) * rand())]})WITH collect(p) AS peopleUNWIND people AS p1UNWIND range(<span class="hljs-number">1</span>, <span class="hljs-number">3</span>) AS friendWITH p1, people[(p1.index + friend) % size(people)] AS p2CREATE (p1)-[:KNOWS{<span class="hljs-attr">years</span>: round(rand() * <span class="hljs-number">10</span>), <span class="hljs-attr">engaged</span>: (rand() &gt; <span class="hljs-number">0.5</span>)}]-&amp;gt;(p2)
</code></pre><p>If you need more details about the APOC project, please follow this <a target="_blank" href="https://neo4j-contrib.github.io/neo4j-apoc-procedures/">link</a>.</p>
<p>So the resulting graph model is quite straightforward:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ZxNqMi-FAYLjHLNr7hFEJ21Bdoq6UtA2Qus2" alt="Image" width="600" height="400" loading="lazy">
<em>The Graph Model</em></p>
<p>Let’s create an index over the Person node:</p>
<pre><code>%neo4j
CREATE INDEX ON :Person(id)
</code></pre><p>Now let’s set the Background Job in Neo4j:</p>
<pre><code>%neo4j
CALL apoc.periodic.repeat('create-fake-social-data',
  'WITH ["M", "F", "X"] AS gender UNWIND range(1, 10) AS id CREATE (p:Person {id: apoc.create.uuid(), name: "Name-" + apoc.text.random(10), age: round(rand() * 100), index: id, gender: gender[toInteger(size(gender) * rand())]}) WITH collect(p) AS people UNWIND people AS p1 UNWIND range(1, 3) AS friend WITH p1, people[(p1.index + friend) % size(people)] AS p2 CREATE (p1)-[:KNOWS{years: round(rand() * 10), engaged: (rand() &gt; 0.5)}]-&gt;(p2)',
  15) YIELD name
RETURN name AS created
</code></pre><p>This background query causes the Neo4j Streams CDC module to stream the related events to the “neo4j” Kafka topic (the default topic of the CDC).</p>
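<p>As an aside, the <code>people[(p1.index + friend) % size(people)]</code> expression in the query above simply wires each person to the next three people by index, wrapping around the list. A quick Python illustration (a hypothetical helper, using 0-based indices for simplicity):</p>

```python
def friends_of(index, num_people=10, fan_out=3):
    """Mirror the Cypher expression people[(p1.index + friend) % size(people)]:
    each person KNOWS the next `fan_out` people by index, modulo the list size."""
    return [(index + friend) % num_people for friend in range(1, fan_out + 1)]

# The person at the end of the list wraps around to the beginning.
last_persons_friends = friends_of(9)   # -> [0, 1, 2]
first_persons_friends = friends_of(0)  # -> [1, 2, 3]
```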
<p>Now let’s create a Structured Streaming Dataset that consumes the data from the “neo4j” topic:</p>
<pre><code>val kafkaStreamingDF = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9093")
    .option("startingoffsets", "earliest")
    .option("subscribe", "neo4j")
    .load())
</code></pre><p>The <code>kafkaStreamingDF</code> DataFrame is basically a <code>ProducerRecord</code> representation; in fact, its schema is:</p>
<pre><code>root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
</code></pre><p>Now let’s define, via the Spark APIs, the structure of the data streamed by the CDC so that we can read it:</p>
<pre><code>val cdcMetaSchema = (new StructType()
    .add("timestamp", LongType)
    .add("username", StringType)
    .add("operation", StringType)
    .add("source", MapType(StringType, StringType, true)))

val cdcPayloadSchemaBeforeAfter = (new StructType()
    .add("labels", ArrayType(StringType, false))
    .add("properties", MapType(StringType, StringType, true)))

val cdcPayloadSchema = (new StructType()
    .add("id", StringType)
    .add("type", StringType)
    .add("label", StringType)
    .add("start", MapType(StringType, StringType, true))
    .add("end", MapType(StringType, StringType, true))
    .add("before", cdcPayloadSchemaBeforeAfter)
    .add("after", cdcPayloadSchemaBeforeAfter))

val cdcSchema = (new StructType()
    .add("meta", cdcMetaSchema)
    .add("payload", cdcPayloadSchema))
</code></pre><p>The <code>cdcSchema</code> is suitable for both node and relationship events.</p>
<p>What we need now is to extract only the CDC event from the Dataframe, so let’s perform a simple transformation query over Spark:</p>
<pre><code>val cdcDataFrame = (kafkaStreamingDF
    .selectExpr("CAST(value AS STRING) AS VALUE")
    .select(from_json('VALUE, cdcSchema) as 'JSON))
</code></pre><p>The <code>cdcDataFrame</code> contains just one column, <strong>JSON</strong>, which holds the data streamed from the Neo4j-Streams CDC module.</p>
<p>Let’s perform a simple ETL query in order to extract fields of interest:</p>
<pre><code>val dataWarehouseDataFrame = (cdcDataFrame
    .where("json.payload.type = 'node' and (array_contains(nvl(json.payload.after.labels, json.payload.before.labels), 'Person'))")
    .selectExpr("json.payload.id AS neo_id",
        "CAST(json.meta.timestamp / 1000 AS Timestamp) AS timestamp",
        "json.meta.source.hostname AS host",
        "json.meta.operation AS operation",
        "nvl(json.payload.after.labels, json.payload.before.labels) AS labels",
        "explode(json.payload.after.properties)"))
</code></pre><p>This query is quite important, because it determines how the data will be persisted on the filesystem. Every node will be <strong>exploded</strong> into a number of JSON snippets, one for each node property, just like this:</p>
<pre><code>{<span class="hljs-string">"neo_id"</span>:<span class="hljs-string">"35340"</span>,<span class="hljs-string">"timestamp"</span>:<span class="hljs-string">"2018-12-19T23:07:10.465Z"</span>,<span class="hljs-string">"host"</span>:<span class="hljs-string">"neo4j"</span>,<span class="hljs-string">"operation"</span>:<span class="hljs-string">"created"</span>,<span class="hljs-string">"labels"</span>:[<span class="hljs-string">"Person"</span>],<span class="hljs-string">"key"</span>:<span class="hljs-string">"name"</span>,<span class="hljs-string">"value"</span>:<span class="hljs-string">"Name-5wc62uKO5l"</span>}
</code></pre><pre><code>{<span class="hljs-string">"neo_id"</span>:<span class="hljs-string">"35340"</span>,<span class="hljs-string">"timestamp"</span>:<span class="hljs-string">"2018-12-19T23:07:10.465Z"</span>,<span class="hljs-string">"host"</span>:<span class="hljs-string">"neo4j"</span>,<span class="hljs-string">"operation"</span>:<span class="hljs-string">"created"</span>,<span class="hljs-string">"labels"</span>:[<span class="hljs-string">"Person"</span>],<span class="hljs-string">"key"</span>:<span class="hljs-string">"index"</span>,<span class="hljs-string">"value"</span>:<span class="hljs-string">"8"</span>}
</code></pre><pre><code>{<span class="hljs-string">"neo_id"</span>:<span class="hljs-string">"35340"</span>,<span class="hljs-string">"timestamp"</span>:<span class="hljs-string">"2018-12-19T23:07:10.465Z"</span>,<span class="hljs-string">"host"</span>:<span class="hljs-string">"neo4j"</span>,<span class="hljs-string">"operation"</span>:<span class="hljs-string">"created"</span>,<span class="hljs-string">"labels"</span>:[<span class="hljs-string">"Person"</span>],<span class="hljs-string">"key"</span>:<span class="hljs-string">"id"</span>,<span class="hljs-string">"value"</span>:<span class="hljs-string">"944e58bf-0cf7-49cf-af4a-c803d44f222a"</span>}
</code></pre><pre><code>{<span class="hljs-string">"neo_id"</span>:<span class="hljs-string">"35340"</span>,<span class="hljs-string">"timestamp"</span>:<span class="hljs-string">"2018-12-19T23:07:10.465Z"</span>,<span class="hljs-string">"host"</span>:<span class="hljs-string">"neo4j"</span>,<span class="hljs-string">"operation"</span>:<span class="hljs-string">"created"</span>,<span class="hljs-string">"labels"</span>:[<span class="hljs-string">"Person"</span>],<span class="hljs-string">"key"</span>:<span class="hljs-string">"gender"</span>,<span class="hljs-string">"value"</span>:<span class="hljs-string">"F"</span>}
</code></pre><p>This kind of structure can easily be turned into a tabular representation (we’ll see how in the next few steps).</p>
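<p>The explode step can be imitated in plain Python: one CDC node event fans out into one flat record per property. A simplified sketch (not the Spark implementation; the helper name is made up):</p>

```python
def explode_properties(event):
    """Fan one CDC node event out into one flat record per property,
    mirroring Spark's explode() over the properties map."""
    meta = {k: event[k] for k in ("neo_id", "timestamp", "host", "operation", "labels")}
    return [dict(meta, key=k, value=str(v)) for k, v in event["properties"].items()]

node_event = {
    "neo_id": "35340", "timestamp": "2018-12-19T23:07:10.465Z",
    "host": "neo4j", "operation": "created", "labels": ["Person"],
    "properties": {"name": "Name-5wc62uKO5l", "index": 8},
}
records = explode_properties(node_event)  # one record per property
```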
<p>Now let's write a Spark continuous streaming query that saves the data to the file system as JSON:</p>
<pre><code>val writeOnDisk = (dataWarehouseDataFrame
    .writeStream
    .format("json")
    .option("checkpointLocation", "/zeppelin/spark-warehouse/jit-dwh/checkpoint")
    .option("path", "/zeppelin/spark-warehouse/jit-dwh")
    .queryName("nodes")
    .start())
</code></pre><p>We have now created a simple JIT-DWH. In the second notebook we’ll learn how to query it and see how simple it is to deal with dynamic changes in the data structures, thanks to schema-on-read.</p>
<h3 id="heading-notebook-2-query-the-jit-dwh">Notebook 2: Query The JIT-DWH</h3>
<p>The first paragraph lets us query and display our JIT-DWH:</p>
<pre><code>val flattenedDF = (spark.read.format("json").load("/zeppelin/spark-warehouse/jit-dwh/**")
    .where("neo_id is not null")
    .groupBy("neo_id", "timestamp", "host", "labels", "operation")
    .pivot("key")
    .agg(first($"value")))

z.show(flattenedDF)
</code></pre><p>Remember how we saved the data as JSON a few rows above? The <code>flattenedDF</code> simply pivots the JSON records over the <code>key</code> field, grouping the data by the 5 columns that represent the “unique key” (<em>neo_id, timestamp, host, labels, operation</em>). This gives us the following tabular representation of the source data:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/XA4vskTWdUra50ncym941E78zHZcwGM6TY4q" alt="Image" width="600" height="400" loading="lazy">
<em>The result of the query</em></p>
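<p>The pivot that <code>flattenedDF</code> performs can be sketched in plain Python: group the flat key/value records by their identifying columns, then collapse each group back into one row. This is a simplified stand-in for Spark's <code>groupBy(...).pivot("key")</code>, with invented sample values:</p>

```python
from collections import defaultdict

def pivot_records(records):
    """Group flat key/value records by their identifying columns and
    collapse each group back into one tabular row.
    ("labels" is left out of the key here only because Python lists
    aren't hashable; Spark groups on it as well.)"""
    key_cols = ("neo_id", "timestamp", "host", "operation")
    rows = defaultdict(dict)
    for rec in records:
        row_key = tuple(rec[c] for c in key_cols)
        rows[row_key][rec["key"]] = rec["value"]
    return dict(rows)

records = [
    {"neo_id": "35340", "timestamp": "t0", "host": "neo4j",
     "operation": "created", "key": "name", "value": "Name-5wc62uKO5l"},
    {"neo_id": "35340", "timestamp": "t0", "host": "neo4j",
     "operation": "created", "key": "gender", "value": "F"},
]
table = pivot_records(records)  # one row, with name and gender as columns
```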
<p>Now imagine that our Person dataset gets a new field: <strong>birth.</strong> Let's add this new field to one node; in this case, you must choose an id from your dataset and update it with the following paragraph:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/onz1jNYkzyiaEAFAcYi-4f2lay8rtqs55PBM" alt="Image" width="600" height="400" loading="lazy">
<em>Just fill the form with your data and execute the paragraph</em></p>
<p>Now the final step: reuse the same query, filtering the DWH by the id we just changed, to check that our dataset reflects the change made in Neo4j.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dTgF8U8-F--yYJOUm3BLa70MaOd0E5XAKnaH" alt="Image" width="600" height="400" loading="lazy">
<em>The birth field is present without changes to our queries</em></p>
<h3 id="heading-conclusions">Conclusions</h3>
<p>In this first part, we learned how to leverage the events produced by the Neo4j Streams CDC module to build a simple (real-time) JIT-DWH that uses the Schema-On-Read approach.</p>
<p>In Part 2 we’ll discover how to use the Sink module in order to ingest data into Neo4j directly from Kafka.</p>
<p>If you have already tried the Neo4j-Streams module, or have tested it via these notebooks, please fill out our <a target="_blank" href="https://goo.gl/forms/VLwvqwsIvdfdm9fL2"><strong>feedback survey</strong></a>.</p>
<p>If you run into any issues or have thoughts about improving our work, <a target="_blank" href="http://github.com/neo4j-contrib/neo4j-streams/issues">please raise a GitHub issue</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to embrace event-driven graph analytics using Neo4j and Apache Kafka ]]>
                </title>
                <description>
                    <![CDATA[ By Ljubica Lazarevic Introduction With the new Neo4j Kafka streams now available, my fellow Neo4j colleague Tom Geudens and I were keen to try it out. We have many use-cases in mind that leverage the power of graph databases and event-driven architec... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-embrace-event-driven-graph-analytics-using-neo4j-and-apache-kafka-474c9f405e06/</link>
                <guid isPermaLink="false">66c351e4765a634c3485fe12</guid>
                
                    <category>
                        <![CDATA[ analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 24 Jan 2019 08:12:47 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*MUKvlO22WXUc03qd" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ljubica Lazarevic</p>
<h3 id="heading-introduction">Introduction</h3>
<p>With the new <a target="_blank" href="https://neo4j-contrib.github.io/neo4j-streams/">Neo4j Kafka streams</a> now available, my fellow Neo4j colleague <a target="_blank" href="https://twitter.com/tomgeudens"><strong>Tom Geudens</strong></a> and I were keen to try it out. We have many use-cases in mind that leverage the power of graph databases and event-driven architectures. The first one we explore combines the power of Graph Algorithms with a transactional database.</p>
<p>The new Neo4j Kafka streams library is a Neo4j plugin that you can add to each of your Neo4j instances. It enables three types of Apache Kafka mechanisms:</p>
<ul>
<li>Producer: based on the topics set up in the Neo4j configuration file. Outputs to said topics will happen when specified node or relationship types change</li>
<li>Consumer: based on the topics set up in the Neo4j configuration file. When events for said topics are picked up, the specified Cypher query for each topic will be executed</li>
<li>Procedure: a direct call in Cypher to publish a given payload to a specified topic</li>
</ul>
<p>You can get a more detailed overview of what each of these looks like <a target="_blank" href="https://neo4j-contrib.github.io/neo4j-streams/">here</a>.</p>
<h3 id="heading-overview-of-the-situation">Overview of the situation</h3>
<p>Graph algorithms provide powerful analytical abilities. They help us understand the context of our data better by analysing relationships. For example, graph algorithms are used to:</p>
<ul>
<li>Understand network dependencies</li>
<li>Detect communities</li>
<li>Identify influencers</li>
<li>Calculate recommendations</li>
<li>And so forth.</li>
</ul>
<p>Neo4j offers a set of <a target="_blank" href="https://neo4j.com/docs/graph-algorithms/current/">graph algorithms</a> out of the box via a plugin that can run directly on data within Neo4j. This library of algorithms has been very well received. Many times I’ve heard feedback that the plugins are as fast as or faster than what clients have used before. With such wonderful feedback, why wouldn’t we want to apply these optimised, performant algorithms to a Neo4j database?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/p5CjHoBNN8tRfzY09tA-td5jG7N2Rybn-3GW" alt="Image" width="600" height="400" loading="lazy">
<em>The Neo4j graph algorithm categories</em></p>
<p>Getting the full advantage of any analytical process needs resources. To get a nice, performant experience, we want to provide as much CPU and memory as we can afford.</p>
<p>Now, we could run this kind of work on our transactional cluster. But in this typical architecture we run into some challenges. For example, if one machine is big, the other machines in the cluster should match it, which can make the architecture expensive.</p>
<p>The other challenge we face is that our cluster is supposed to be managing transactions — day-to-day queries such as processing requests. We don’t want to weigh it down with crunching through various iterations and permutations of a model. Ideally, we want to offload this along with associated analytical work.</p>
<p>If we know that the heavy querying that is going to take place is read-only, then it’s an easy solution. We can spin up read replicas to manage the load. This keeps the cluster focussed on what it’s supposed to be doing, supporting an operational, transactional system.</p>
<p>But how do we handle write backs to the operational graph as part of the analytical processing? We want those results, such as recommendations, as soon as they are available.</p>
<p>Read replicas are, as the name suggests, for read-only applications. They are involved neither in the election of cluster leaders nor in writing. Using Neo4j-Streams, we can stream the results from the read replica back to the cluster for consumption.</p>
<p>The big advantages of approaching it this way include:</p>
<ul>
<li>We have our high availability/disaster recovery afforded to us from the cluster.</li>
<li>The data is going to be identical on both the read replica and the cluster. We don’t have to worry about updating the read replica because the cluster is going to take care of that for us.</li>
<li>The IDs for nodes and relationships will be identical on both the cluster servers and the read replica. This makes updating really easy.</li>
<li>We can provision resources as necessary to the read replica, which is likely to be very different from the cluster.</li>
</ul>
<p>Our architecture will look like the figure below. A is our read replica, and B is our causal cluster. A will receive transactional information from B. Any results calculated by A will be streamed back to B via Kafka messages.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dlUfqTqASS6Q4yXg1zHCJZ97Ez-ufEMiqESh" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So with our updated pattern, let’s continue with our simple example.</p>
<h3 id="heading-the-example-data-set">The Example Data Set</h3>
<p>We’re going to use the Movie Database data set available from the <code>:play movie-guide</code> guide in Neo4j Browser. For this example we are going to use four Neo4j instances:</p>
<ul>
<li>The analytics instance — this is going to be our read replica, and on this instance we’re going to run PageRank on all Person nodes in the data set. We will call the <code>streams.publish()</code> procedure to post the output to our Kafka topic.</li>
<li>The operational instances — this is going to be our three-server causal cluster, which will listen for any changes to Person nodes and apply updates as changes come in.</li>
</ul>
<p>For Kafka, we’ll follow the instructions from the <a target="_blank" href="https://kafka.apache.org/quickstart">quick start guide</a> up until step 2. Before we get Kafka up and running, we will need to set up the consumer elements in the Neo4j configuration files. We also will set up the cluster itself. Please note that at the moment neo4j-streams only works with <strong>Neo4j version 3.4.x</strong>.</p>
<p>To set up the three server clusters and a read replica, we will follow the instructions provided in the <a target="_blank" href="https://neo4j.com/docs/operations-manual/current/tutorial/local-causal-cluster/">Neo4j operations manual</a>. Follow the instructions for the cores, and also for one read replica.</p>
<p>Additionally, we’re going to need to add the following to <strong>neo4j.config</strong> for the causal cluster servers:</p>
<pre><code>#************
# Kafka Config — Consumer
#************
kafka.zookeeper.connect=localhost:2181
kafka.bootstrap.servers=localhost:9092
kafka.group.id=neo4j-core1
streams.sink.enabled=true
streams.sink.topic.cypher.neorr=WITH event.payload as payload MATCH (p:Person) WHERE ID(p)=payload[0] SET p.pagerank = payload[1]
</code></pre><p>Note that for the second and third core servers we change <code>kafka.group.id</code> to <code>neo4j-core2</code> and <code>neo4j-core3</code> respectively.</p>
<p>For the read replica, we’ll need to add the following to <strong>neo4j.config</strong>:</p>
<pre><code>#************
# Kafka Config - Procedure
#************
kafka.zookeeper.connect=localhost:2181
kafka.bootstrap.servers=localhost:9092
kafka.group.id=neo4j-read1
</code></pre><p>You will need to download and save the neo4j-streams jar into the <strong>plugins</strong> folder. You also need to add the graph algorithms library, via Neo4j Desktop or <a target="_blank" href="https://neo4j.com/docs/graph-algorithms/current/introduction/#_installation">manually</a>.</p>
<p>With these changes saved to the respective config files and the plugins installed, we start everything up in the following order:</p>
<ul>
<li>Apache Zookeeper</li>
<li>Apache Kafka</li>
<li>The three instances for the Neo4j causal cluster</li>
<li>The read replica</li>
</ul>
<p>Once all of the Neo4j instances are up and running and the cluster has discovered all of the members, we can now run the following query on the read replica:</p>
<pre><code>CALL algo.pageRank.stream(
  'MATCH (p:Person) RETURN id(p) AS id',
  'MATCH (p1:Person)--&gt;()&lt;--(p2:Person) RETURN distinct id(p1) AS source, id(p2) AS target',
  {graph:'cypher'}) YIELD nodeId, score
WITH [nodeId, score] AS res
CALL streams.publish('neorr', res)
RETURN COUNT(*)
</code></pre><p>This Cypher query calls the <a target="_blank" href="https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/">PageRank</a> algorithm with the specified configuration. Once the algorithm completes, we stream the returned node IDs and PageRank scores to the specified topic.</p>
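<p>For intuition about what the algorithm streams back, here is a toy power-iteration PageRank in plain Python. This is an illustration only, not the Neo4j implementation (which is far more optimised), and the sample graph is invented:</p>

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over {node: [outgoing neighbours]}."""
    n = len(adjacency)
    ranks = {node: 1.0 / n for node in adjacency}
    for _ in range(iterations):
        # Every node gets the teleport share, then receives its neighbours' votes.
        new_ranks = {node: (1.0 - damping) / n for node in adjacency}
        for node, neighbours in adjacency.items():
            if not neighbours:
                continue
            share = damping * ranks[node] / len(neighbours)
            for neighbour in neighbours:
                new_ranks[neighbour] += share
        ranks = new_ranks
    return ranks

# Tiny graph: 'c' is pointed at by both 'a' and 'b', so it scores highest.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

The `(nodeId, score)` pairs streamed by the query above are conceptually the items of this `scores` mapping.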
<p>We can have a look at what the neorr topic looks like by running Step 5 of the <a target="_blank" href="https://kafka.apache.org/quickstart">Apache Kafka quick start guide</a> (replacing <code>test</code> with <code>neorr</code>):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/hmHR0G3NWw8HQVhnN10JN0XpCSBqWhj6i2Jy" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/gfuq2bK5PKK67Whox2xOyllilj5XazInfuU2" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-summary">Summary</h3>
<p>In this post we’ve demonstrated:</p>
<ul>
<li>Separating transactional and analytical data concerns</li>
<li>Painlessly flowing analytical results back for real-time consumption</li>
</ul>
<p>Whilst we’ve used a simple example, you can see how complex analytical work can be carried out, supporting an event-driven architecture.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Monitoring the French Presidential Election on Twitter with Python ]]>
                </title>
                <description>
                    <![CDATA[ By Romain Thalineau A while ago I read this nice article from Laurent Luce where he explained how he implemented a system that collected the tweets related to the 2012 French presidential election. The article is very well written, and I highly recom... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/monitoring-the-french-presidential-election-on-twitter-with-python-6a2a9310e6f4/</link>
                <guid isPermaLink="false">66c35b7d9de50ee9ca7fa70d</guid>
                
                    <category>
                        <![CDATA[ Neo4j ]]>
                    </category>
                
                    <category>
                        <![CDATA[ politics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Web Development ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sun, 12 Feb 2017 09:19:26 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*Gm6Q_bRGS6yJWRuESpPx5w.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Romain Thalineau</p>
<p>A while ago I read <a target="_blank" href="http://www.laurentluce.com/posts/python-twitter-statistics-and-the-2012-french-presidential-election/">this nice article</a> from Laurent Luce where he explained how he implemented a system that collected the tweets related to the 2012 French presidential election. The article is very well written, and I highly recommend reading it.</p>
<p>This gave me the idea to implement something similar for the 2017 election. But I wanted to add some features:</p>
<ul>
<li>Instead of using a SQL database for storing the data, I wanted to use a Graph database. The main reason was to experiment with such a system, but it’s fairly easy to see how this is a good fit for social media data.</li>
<li>I wanted to be able to monitor the data in real time. Practically speaking, this means that the data need to be processed as they arrive. This would also involve serving the analyzed data to a web site with data visualizations.</li>
<li>Ideally I wanted to run a sentiment analysis on the tweets. I would train a learning algorithm and implement it along the data pipeline to serve its results in real time.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*y9G8AIt2rJnWwhjdv_Zn0w.png" alt="Image" width="800" height="552" loading="lazy">
<em><a target="_blank" href="https://www.auguratech.com/#/twitter/time_series">Time Series Analysis</a></em></p>
<p>Well, I managed to build all of this. You can see what it looks like on <a target="_blank" href="https://www.auguratech.com/#/twitter">my personal website</a>. So far, there are two simple analyses:</p>
<ul>
<li><a target="_blank" href="https://www.auguratech.com/#/twitter/time_series">The first one</a> is a time series analysis, which shows the number of tweets per candidate as a function of the date. Besides being able to select the starting/ending date and the period, you can also display just the candidates you would like to see by clicking on their names in the visualization.</li>
<li><a target="_blank" href="https://www.auguratech.com/#/twitter/geospatial">The second analysis</a> displays the geolocation of the tweets. The options are relatively similar to the first analysis.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*G8iD7P81--DVJf1NTDTmbA.png" alt="Image" width="800" height="624" loading="lazy">
<em><a target="_blank" href="https://www.auguratech.com/#/twitter/geospatial">Tweet geolocation analysis</a></em></p>
<p>For collecting the data from Twitter, I used an approach similar to Laurent Luce’s. Instead of focusing on the similarities, I’ll show you the approaches I took that were different.</p>
<h4 id="heading-storing-the-tweets-in-a-graph-database">Storing the tweets in a graph database</h4>
<p>As I said, I wanted to store the data in a graph database. I chose to use <a target="_blank" href="https://neo4j.com/">Neo4J</a>. In a graph database, data are modeled using a combination of nodes, edges, and properties structures.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*XlHtECBpilVo7Jk7ujcCbA.png" alt="Image" width="330" height="271" loading="lazy">
<em><a target="_blank" rel="noopener" href="http://network.graphdemos.com/">Image credit</a></em></p>
<p>In our case, nodes can represent a tweet, a user or even a hashtag. They can be distinguished by using a label. The relationship between nodes is handled by connecting them through edges. For example, a user node can be connected to a tweet node via a POSTS relationship.</p>
<p>The relationships are directional. A tweet can’t POST a user, but it can MENTION a user.</p>
<p>Finally both nodes and edges (relationships) can hold properties. For example, a user has a name and a tweet has text.</p>
<p>When interacting with a graph database, an Object Graph Mapper (OGM) is particularly useful. In this project, I’ve been using <a target="_blank" href="https://github.com/robinedwards/neomodel">Neomodel</a>. It exposes an API relatively similar to the Django models API: each node type is a Python class, with its properties and relationships declared as class attributes.</p>
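<p>A minimal sketch of what such a Neomodel definition might look like (the class and property names here are illustrative, not the repo’s exact definitions):</p>

```python
# Hedged sketch of a Neomodel data model. Names are illustrative; the
# real definitions live in the repo's models.py. Requires the neomodel
# package, and a running Neo4J instance to actually save nodes.
from neomodel import (StructuredNode, StringProperty, DateTimeProperty,
                      RelationshipTo)

class Tweet(StructuredNode):
    id_str = StringProperty(unique_index=True)  # Twitter's tweet id
    text = StringProperty()                     # tweet body
    created_at = DateTimeProperty()

class User(StructuredNode):
    id_str = StringProperty(unique_index=True)
    name = StringProperty()
    # Directed relationship: (User)-[:POSTS]->(Tweet)
    posts = RelationshipTo(Tweet, 'POSTS')
```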
<p>Both the properties and the relationships are defined on the model classes. I invite you to check the models file in <a target="_blank" href="https://github.com/romaintha/twitter/blob/master/twitter/models.py">my github repo</a> to see the full data model definition.</p>
<p>Neo4J being a NoSQL database, it uses its own query language, called Cypher. It’s a pretty straightforward language. For instance, the following query will return all the tweets posted by a user that contain the word “fillon” (one of the candidates):</p>
<pre><code>MATCH (u:User)-[:POSTS]-&gt;(t:Tweet) WHERE t.text CONTAINS "fillon" RETURN t
</code></pre><p>Neomodel being an OGM, it provides an API so you don’t have to write many queries manually. You can obtain the same result as above by running:</p>
<pre><code>Tweet.nodes.filter(text__contains="fillon")
</code></pre><h4 id="heading-streaming-from-twitter">Streaming from Twitter</h4>
<p>Twitter provides two ways to get their data. The first one is through a standard REST API. Each endpoint is rate limited, so it isn’t the preferred solution in our case.</p>
<p>Luckily, Twitter also provides a streaming API. By setting a filter, we can receive all the tweets that pass this filter (with a limit of 1% of the global amount of tweets published at instant t). The library <a target="_blank" href="https://github.com/tweepy/tweepy">Tweepy</a> facilitates this process.</p>
<p>As you can see in <a target="_blank" href="https://github.com/romaintha/twitter/blob/master/twitter/streaming_api.py">my repo</a>, you need to define a Listener class, which will trigger some actions while streaming. For instance, the method “on_status” is called any time a tweet is streamed.</p>
<p>In addition, I defined a Streaming class whose responsibilities are to authenticate to Twitter, to instantiate a Tweepy stream with the above Listener, and to expose a method to start streaming. The “start_streaming” method accepts a “to_track” argument, which is a list of words on which you want to filter.</p>
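<p>Put together, the Listener and Streaming classes might look roughly like this (a simplified sketch assuming tweepy 3.x, where <code>StreamListener</code> still exists; the real code is in the repo):</p>

```python
# Simplified sketch of the streaming setup, assuming tweepy 3.x
# (StreamListener was removed in tweepy 4). Class and argument names
# mirror the article but are illustrative, not the repo's exact code.
import tweepy

class BatchListener(tweepy.StreamListener):
    def __init__(self, pipeline, batch_size=100):
        super().__init__()
        self.pipeline = pipeline      # function that receives a batch of tweets
        self.batch_size = batch_size
        self.batch = []

    def on_status(self, status):      # called once per streamed tweet
        self.batch.append(status)
        if len(self.batch) >= self.batch_size:
            self.pipeline(self.batch) # hand the whole batch off at once
            self.batch = []

class Streaming:
    def __init__(self, consumer_key, consumer_secret,
                 access_token, access_token_secret,
                 pipeline, batch_size=100):
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        self.stream = tweepy.Stream(auth, BatchListener(pipeline, batch_size))

    def start_streaming(self, to_track):
        # to_track is the list of words to filter the stream on
        self.stream.filter(track=to_track)
```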
<p>You have to instantiate the Streaming class with a bunch of arguments. Aside from the Twitter API credentials, you need “pipeline” and “batch_size” arguments. The latter is a number specifying how many tweets are processed at once.</p>
<p>Since processing a tweet involves saving it to Neo4J, doing it one by one is a very costly operation. Saving them by batches of 100 (or even more in some cases) improves performance considerably.</p>
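<p>The batching pattern itself is plain Python. A minimal illustrative helper (not the repo’s actual code) looks like this:</p>

```python
# Illustrative helper showing the batching pattern: group incoming items
# into fixed-size batches so each database round trip saves many tweets
# instead of one.
def batched(items, batch_size=100):
    """Yield successive batches of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the trailing, possibly partial, batch
        yield batch
```

<p>For example, <code>list(batched(range(5), 2))</code> yields <code>[[0, 1], [2, 3], [4]]</code>.</p>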
<p>The “pipeline” argument must be a reference to a function, which will receive the batch of tweets. Inside of this, you are free to do whatever you want. I provided an example of it in the <a target="_blank" href="https://github.com/romaintha/twitter/blob/master/twitter/utils.py">utils.py</a> module.</p>
<p>As you can see, this function makes a call to an asynchronous Celery task defined in the <a target="_blank" href="https://github.com/romaintha/twitter/blob/master/twitter/tasks.py">tasks.py</a> module. <a target="_blank" href="http://www.celeryproject.org/">Celery</a> is a Python distributed task queue library. I used it with <a target="_blank" href="https://www.rabbitmq.com/">RabbitMQ</a> as a message broker. So how does it work? Let us get back to the “streaming_pipeline” function in the <a target="_blank" href="https://github.com/romaintha/twitter/blob/master/twitter/utils.py">utils.py</a> module, and focus on this line:</p>
<pre><code>bulk_parsing.delay(users_attributes, tweets_attributes)
</code></pre><p>When this line is processed, instead of running the “bulk_parsing” function synchronously, a message is published to a broker (here RabbitMQ). This allows consumers (workers) to retrieve these messages and process the “bulk_parsing” task asynchronously and in parallel. Why does this matter? Because it enables horizontal scaling of tweet processing. If the messages accumulate faster than you can process them, you can add more workers to help consume them.</p>
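<p>The task side is a standard Celery setup. A hedged sketch (the broker URL and task body are illustrative placeholders, not the repo’s code):</p>

```python
# Sketch of the asynchronous step, assuming a RabbitMQ broker running
# locally. The task body is a placeholder; the real task parses the
# batch and saves the users and tweets to Neo4J.
from celery import Celery

app = Celery('twitter', broker='amqp://guest@localhost//')

@app.task
def bulk_parsing(users_attributes, tweets_attributes):
    # Executed on a worker process, possibly on another machine:
    # persist the batch to the graph database here.
    pass

# Calling .delay() only publishes a message to the broker and returns
# immediately, so the streaming process never blocks on the write:
#   bulk_parsing.delay(users_attributes, tweets_attributes)
```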
<p>One final remark. I wanted the process to be as versatile as possible, in the sense that if the processing needed to be changed, or if something needed to be added, it should be easy to do so. In this case, I can just change the “streaming_pipeline” function and add some asynchronous tasks. It’s quick and easy to modify.</p>
<p>Thanks for reading!</p>
<ul>
<li>Be sure to check out the code <a target="_blank" href="https://github.com/romaintha/twitter">in my GitHub repo</a>.</li>
<li>You can see all this in action <a target="_blank" href="https://www.auguratech.com/#/twitter">on my site</a>, where I used this to feed some analysis.</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
