NoSQL - freeCodeCamp.org

Firestore Data Modeling Guide: Embedded Documents vs Referencing (with a Blog Case Study)

Caleb Mintoumba — Fri, 24 Jul 2026 15:19:10 +0000

When developers transition from the relational world (MySQL, PostgreSQL) to Firestore, Firebase's NoSQL document database, they often bring their old habits with them. They try to replicate tables, foreign keys, and joins.

The result? Complex queries, skyrocketing read costs, and a database structure that becomes a nightmare to maintain after just a few features.

To understand how Firestore works, we first need to look at our point of comparison: the relational model. Once we map out how SQL does things, we can see exactly where Firestore diverges, and how to structure NoSQL data correctly.

In this guide, we'll cover NoSQL design principles, embedding vs. referencing, and relationship modeling (1-1, 1-N, N-N). We'll also walk through a concrete blog case study.

Prerequisites
The Relational Mindset: How SQL Handles Data
The Firestore Paradigm: NoSQL with Relationships
The Core Building Blocks: Documents and Collections
The Golden Rule: Model for Reads, Not Writes
Embedding vs. Referencing (Denormalization)
How to Model Relationships (1-1, 1-N, N-N)
Best Practices and Pitfalls to Avoid
Case Study: Designing a Scalable Blog Database

Prerequisites

This guide is conceptual, so you don't need a running Firestore project to follow along. A little context is enough. You will need:

Basic JavaScript syntax, since every code example uses the modular Firebase JS SDK (v9+)
Basic familiarity with JSON objects (keys, values, nested objects, arrays)
Some exposure to SQL or relational databases helps, since the guide leans on that comparison throughout (but it's not required)
(Optional) A free Firebase project, if you want to try the examples yourself. The Firestore quickstart walks you through setting one up.

No prior NoSQL or Firestore experience is needed.

The Relational Mindset: How SQL Handles Data

In a relational database, data is organized into tables linked by explicit relationships. This approach relies on normalization to eliminate data redundancy.

For example, to store users and their respective countries, we split the data into two tables:

Users: columns id (PK), last_name, first_name, #country_id (FK, a foreign key)
Countries: columns country_id (PK), country_name

With a row like 1, MINTOUMBA, Caleb, 1 in Users and 1, Canada in Countries, we automatically know that Caleb belongs to Canada through the foreign key #country_id. We never had to write the word "Canada" inside the Users table itself.

The SQL trade-off: writes are lightweight (you only update data in one place), but reads are heavier, because you have to perform a database join (JOIN) every time you want to display a user's country name.

That's exactly the opposite of how Firestore works, as we'll see next.
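To make the read-time cost concrete, here is a minimal sketch of what a join does, written in plain JavaScript over in-memory arrays. The arrays and the joinUsersWithCountries helper are hypothetical stand-ins for the Users and Countries tables, not a real database API.

```javascript
// Hypothetical in-memory "tables" mirroring the Users and Countries example.
const users = [
  { id: 1, last_name: "MINTOUMBA", first_name: "Caleb", country_id: 1 },
];
const countries = [{ country_id: 1, country_name: "Canada" }];

// Emulates the work a JOIN on country_id performs at read time.
function joinUsersWithCountries(userRows, countryRows) {
  // Index countries by primary key so each lookup is cheap.
  const byId = new Map(countryRows.map((c) => [c.country_id, c]));
  return userRows.map((u) => ({
    first_name: u.first_name,
    last_name: u.last_name,
    // The country name is resolved at read time, never stored on the user.
    country_name: byId.get(u.country_id)?.country_name ?? null,
  }));
}

const joinedRows = joinUsersWithCountries(users, countries);
console.log(joinedRows[0].country_name); // "Canada"
```

Every read that needs the country name repeats this lookup, which is the price paid for storing "Canada" in only one place.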

The Firestore Paradigm: NoSQL with Relationships

Firestore is a NoSQL document database – literally Not Only SQL. It stores JSON-like documents grouped into collections, with no enforced schema.

For most of Firestore's history, that also meant no native joins and no GROUP BY. The standard query engine simply didn't support them, and any aggregation beyond count(), sum(), and average() had to happen in your application code.

That's still true today for Standard edition, which remains the default, is the edition most mobile and web apps run on, and is the one this guide focuses on.

Google has since introduced Firestore Enterprise edition, built around a new Pipeline query engine that reached general availability in April 2026. Pipelines add a multi-stage query syntax and hundreds of new functions, including relational-style joins through correlated subqueries and a real aggregate(...) step with grouping, which is Firestore's equivalent of SQL's GROUP BY.

Does this mean data modeling doesn't matter anymore? Not for most apps. Pipeline queries run within a 60-second timeout and a 128 MiB working-memory limit, can fall back to full collection scans when no index exists, and critically, Enterprise edition drops real-time listeners and offline support (which most Firestore client apps depend on).

Pipelines are a genuine escape hatch for analytical, admin, or reporting queries. They're not a drop-in replacement for the read-optimized structure your app's everyday screens still need.

If you're building a typical client-facing app on Standard edition, the embedding and denormalization strategies below are still how you model relationships.

But NoSQL doesn't mean "no relationships", even on Standard edition. You can and should build robust relationships between your collections. The difference is that, by default, Firestore won't enforce or resolve them for you the way a SQL JOIN does. It's up to you, the developer, to build and query those relationships explicitly, and to maintain data integrity through your application code or Cloud Functions, unless you've specifically opted into Enterprise edition for Pipeline-powered joins.

The Core Building Blocks: Documents and Collections

Before designing any schema, let's define Firestore's two core building blocks:

Document: the basic unit of storage. It's a JSON-like object, identified by a unique ID, containing typed fields (strings, numbers, booleans, timestamps, geopoints, or references to other documents).
Collection: a container for documents. Unlike SQL tables, documents in the same collection don't need to share the same structure.

What makes Firestore unique is its hierarchical nature: a document can contain sub-collections, which contain more documents, which can themselves contain more sub-collections, and so on.

For example, picture a root posts collection that contains the document post_001, which itself hosts a comments sub-collection containing the individual comment documents comment_001 and comment_002. You can nest collections and documents several levels deep, but as we'll see later, it's best to do so sparingly.

Crucial rule: sub-collections are never retrieved automatically when you read a parent document. Unlike a SQL JOIN, you must always perform a separate, explicit query to read a sub-collection.
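The rule above can be illustrated with a toy model. The sketch below is not the Firebase SDK: it is a plain in-memory map from document path to data, with two hypothetical helpers (getDocument and getCollection) that mimic the behavior of reading a document versus listing a sub-collection.

```javascript
// Toy in-memory stand-in for Firestore: a flat map from document path to data.
// This is NOT the Firebase SDK, only a model of how paths relate.
const store = new Map([
  ["posts/post_001", { title: "Introduction to Firestore" }],
  ["posts/post_001/comments/comment_001", { text: "Great post!" }],
  ["posts/post_001/comments/comment_002", { text: "Thanks for the examples" }],
]);

// Reading a document returns only its own fields.
function getDocument(path) {
  return store.get(path) ?? null;
}

// Listing a collection is a separate operation that matches direct children only.
function getCollection(collectionPath) {
  const prefix = collectionPath + "/";
  const docs = [];
  for (const [path, data] of store) {
    const rest = path.startsWith(prefix) ? path.slice(prefix.length) : null;
    if (rest !== null && !rest.includes("/")) docs.push({ id: rest, ...data });
  }
  return docs;
}

const post = getDocument("posts/post_001");
const comments = getCollection("posts/post_001/comments");
console.log("comments" in post); // false
console.log(comments.length); // 2
```

The post comes back without any comments field. The comments only appear once the second, explicit call is made.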

The Golden Rule: Model for Reads, Not Writes

This is the single most important concept in NoSQL modeling, and the one developers coming from SQL forget most often: structure your data based on how your app queries it, not on how it gets written.

Before writing any database code, ask yourself:

Which screens in my app will display this data?
Do I need this piece of data on its own, or always alongside another one?
Do I read this information significantly more often than I write or update it?

If your users view a writer's profile 10,000 times for every single time that writer updates their username, optimize for the reads: duplicate the username directly inside each post. That's the exact opposite of the SQL instinct we saw earlier, where you normalize first to avoid redundancy, even if it makes reads heavier.

Embedding vs. Referencing (Denormalization)

There are two primary strategies for representing a relationship in Firestore.

Option A: Embedding (Nesting)

You store the related data directly inside the parent document, as an array or a map (object).

// A post with its comments embedded
{
  title: "Introduction to Firestore",
  author: "Caleb",
  comments: [
    { user: "Ama", text: "Great post!" },
    { user: "Kofi", text: "Thanks for the examples" }
  ]
}

Pros: a single read retrieves everything, and consistency is guaranteed.
Cons: Firestore documents have a hard size limit of 1 MiB (1,048,576 bytes). If the nested list grows indefinitely (comments on a viral post, for instance), your writes will start failing once you hit that limit. Every write to the parent document also re-sends the whole document to any client listening in real time.
Best for: small, bounded lists (tags on an article, a user's settings, a short list of favorites).
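To get a feel for how quickly an embedded list can approach the size limit, here is a rough estimate in plain JavaScript. It measures the UTF-8 length of the JSON text, which is only an approximation: Firestore calculates document size with its own rules. The buildPost helper and the 200-character comment length are assumptions made for illustration.

```javascript
const MAX_DOC_BYTES = 1_048_576; // Firestore's documented 1 MiB document limit

// Rough estimate only: Firestore computes size with its own rules, which are
// not the same as the UTF-8 length of the JSON text used here.
function approxDocBytes(doc) {
  return new TextEncoder().encode(JSON.stringify(doc)).length;
}

// Builds a post with the given number of embedded comments.
function buildPost(commentCount) {
  const comments = Array.from({ length: commentCount }, (_, i) => ({
    user: `user_${i}`,
    text: "x".repeat(200), // hypothetical 200-character comment
  }));
  return { title: "Introduction to Firestore", author: "Caleb", comments };
}

const smallPost = buildPost(10);
const viralPost = buildPost(10_000);

console.log(approxDocBytes(smallPost) < MAX_DOC_BYTES); // true
console.log(approxDocBytes(viralPost) < MAX_DOC_BYTES); // false
```

Under these assumptions, ten comments are harmless, while ten thousand embedded comments would already be well past the limit.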

Option B: Referencing (Denormalization)

You split the entities into separate collections or sub-collections, and deliberately duplicate a few fields to avoid a second read.

// posts/post_001
{
  title: "Introduction to Firestore",
  authorId: "uid_123",
  authorName: "Caleb",      // denormalized: avoids a second read to "users"
  authorAvatar: "https://...",
  commentCount: 12          // denormalized counter
}

// posts/post_001/comments/comment_001
{
  userId: "uid_456",
  userName: "Ama",
  text: "Great post!",
  createdAt: Timestamp
}

Here, we duplicate the author's name and avatar into every post so we don't need an extra read to users every time the post list is displayed.

That's denormalization: we accept controlled redundancy in exchange for faster reads, the exact opposite of SQL normalization. The cost is that these copies need updating if the user changes their name (usually handled by a Cloud Function triggered when the users document is updated).

Pros: no document size limits, and entities can be queried independently.
Cons: requires multiple reads if you didn't denormalize enough. If a duplicated value changes, you need code (often a Cloud Function) to propagate the update everywhere it's copied.
Best for: dynamic, fast-growing data (comments, order history, activity logs).

A more precise rule of thumb: whether to reference instead of embed depends on volume. Sub-collections handle unbounded growth (comments, order history) better than arrays.

Whether to denormalize a given field depends on the cost of keeping it in sync, not how often it changes: a counter you update in place with an atomic increment (commentCount, likeCount) has no other copy to synchronize, so it's cheap to denormalize regardless of frequency.

A copied value like authorName, on the other hand, is duplicated across every document that references it. It's safe to denormalize only if it changes rarely, since any change means propagating the update everywhere it's been copied.
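The sync cost of a copied value can be sketched as a simple fan-out. The example below works on a hypothetical in-memory array of posts rather than a real Firestore collection. In a real app, similar logic would typically live in a Cloud Function, as mentioned above. The propagateAuthorName name is an assumption made for illustration.

```javascript
// Hypothetical in-memory posts, each carrying a denormalized authorName.
const postDocs = [
  { id: "post_001", authorId: "uid_123", authorName: "Caleb" },
  { id: "post_002", authorId: "uid_123", authorName: "Caleb" },
  { id: "post_003", authorId: "uid_456", authorName: "Ama" },
];

// Returns updated copies plus the number of writes the rename would cost.
function propagateAuthorName(posts, authorId, newName) {
  let writes = 0;
  const updated = posts.map((p) => {
    // Skip other authors' posts and posts that already carry the new name.
    if (p.authorId !== authorId || p.authorName === newName) return p;
    writes += 1;
    return { ...p, authorName: newName };
  });
  return { updated, writes };
}

const renameResult = propagateAuthorName(postDocs, "uid_123", "Caleb M.");
console.log(renameResult.writes); // 2
```

One rename costs one write per copy, so the more widely a field is duplicated, the more a frequent change will cost.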

How to Model Relationships (1-1, 1-N, N-N)

One-to-One (1-1)

Either embed the fields in the same document, or store them in a separate collection using the exact same document ID, for example users/uid_123 and privateProfiles/uid_123. This is perfect for separating public data from sensitive data that needs different security rules.

One-to-Many (1-N)

There are three main options, depending on volume and query direction:

A sub-collection (posts/post_001/comments/*) is ideal when you almost always query comments through their parent post, and volume can be large.
A root collection with a reference (comments with a postId field) is useful if you also need to query all comments by a given user, independently of the post (where("userId", "==", uid)).
Use an embedded array only if the volume stays small and bounded (see Option A above).

Many-to-Many (N-N)

This is the trickiest one in NoSQL, since there's no automatic join table like in SQL. There are three common patterns:

(1). Junction collection, the equivalent of a SQL pivot table:

// memberships/{membershipId}
{
  userId: "uid_123",
  groupId: "group_789",
  role: "admin",
  joinedAt: Timestamp
}

You can then filter with where("userId", "==", uid) to find all groups a user belongs to, or with where("groupId", "==", gid) to find all members of a group.

(2). ID arrays on both sides (cross-denormalization):

// users/uid_123      -> groupIds: ["group_789", "group_456"]
// groups/group_789   -> memberIds: ["uid_123", "uid_456"]

Fast to read from either side, but reserve this for lists that stay small: the 1 MB document limit and the cost of atomically updating long arrays both work against you at scale.

(3). Hybrid approach, which is the most common pattern in practice: an array for a lightweight relationship rarely queried from the other side (a user's favorite posts), and a junction collection for a relationship queried frequently in both directions and prone to frequent changes (team memberships).
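To see why a junction collection answers questions from both directions, here is a minimal in-memory sketch. The memberships array stands in for the memberships collection from pattern (1), and the two helper functions are hypothetical equivalents of the two equality filters. No Firebase SDK is involved.

```javascript
// Stand-in for the memberships junction collection.
const memberships = [
  { userId: "uid_123", groupId: "group_789", role: "admin" },
  { userId: "uid_123", groupId: "group_456", role: "member" },
  { userId: "uid_456", groupId: "group_789", role: "member" },
];

// Equivalent in spirit to filtering on userId == uid.
function groupsOfUser(rows, uid) {
  return rows.filter((m) => m.userId === uid).map((m) => m.groupId);
}

// Equivalent in spirit to filtering on groupId == gid.
function membersOfGroup(rows, gid) {
  return rows.filter((m) => m.groupId === gid).map((m) => m.userId);
}

console.log(groupsOfUser(memberships, "uid_123")); // [ 'group_789', 'group_456' ]
console.log(membersOfGroup(memberships, "group_789")); // [ 'uid_123', 'uid_456' ]
```

Both questions are answered from the same set of documents, and adding or removing a membership touches exactly one of them.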

Best Practices and Pitfalls to Avoid

Limit nesting depth: Firestore allows sub-collections to be nested up to 100 levels deep, but beyond two or three levels, your queries and security rules become genuinely hard to maintain. Prefer flattening the structure with references when you can.
Avoid auto-incremented document IDs: Sequential IDs (user_1, user_2, user_3...) can cause hotspotting: writes pile up on a narrow range of the index, which degrades performance at scale. Let Firestore generate random, evenly distributed IDs unless you have a specific reason not to.
Watch out for composite indexes: A query that combines an equality filter with a range filter or an orderBy() on a different field requires a composite index. Queries that use only equality filters can often be served by merging single-field indexes. Plan for these during design rather than discovering them in production (Firestore's error messages include a direct link to auto-generate the missing index).
Mind the write rate on "hot" documents: The recommended maximum sustained write rate to a single document is about 1 write per second. A document updated very frequently by many different users (a global like counter, for example) quickly becomes a bottleneck. Firestore can absorb short bursts (5, 10, even 50 writes in one second) by queuing them, but sustained traffic above ~1 write/sec will start producing contention errors. The standard fix is a sharded counter: split the count across several sub-documents and sum them at read time.
Use sub-collections deliberately: They're convenient, but they always require a separate query. If you almost always need the data together, embedding or denormalization will perform better.
Design security rules alongside your data model: Firestore's security rules (firestore.rules) should be designed at the same time as your schema, because a poorly thought-out structure usually makes precise rules much harder to write.
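The sharded counter mentioned in the list above can be sketched in a few lines. This is an in-memory illustration only: each array slot stands in for one shard document, the shard count of 10 is an arbitrary assumption, and createShardedCounter is a hypothetical name rather than a Firebase API.

```javascript
// In-memory sketch of a sharded counter; each array slot stands in for one
// shard document. The default shard count (10) is an arbitrary assumption.
function createShardedCounter(numShards = 10) {
  const shards = new Array(numShards).fill(0);
  return {
    // Each increment lands on one randomly chosen shard, spreading writes.
    increment(by = 1) {
      const i = Math.floor(Math.random() * numShards);
      shards[i] += by;
    },
    // Reading the total means summing every shard.
    total() {
      return shards.reduce((sum, n) => sum + n, 0);
    },
  };
}

const likeCounter = createShardedCounter(10);
for (let i = 0; i < 100; i++) likeCounter.increment();
console.log(likeCounter.total()); // 100
```

The trade-off is visible in the two methods: writes get cheaper because they are spread out, while each read now has to touch every shard.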

Case Study: Designing a Scalable Blog Database

Let's bring every principle from this guide together with a concrete example: a blog with posts, comments, and likes.

// posts/{postId}
{
  title: "Modeling Firestore",
  slug: "modeling-firestore",
  authorId: "uid_123",
  authorName: "Caleb",         // denormalized: avoids a second read to "users"
  content: "...",
  tags: ["firebase", "nosql"], // embedded: small, bounded list
  commentCount: 3,             // denormalized counter
  likeCount: 47,               // denormalized counter (shard it if traffic is high)
  createdAt: Timestamp
}

// posts/{postId}/comments/{commentId}  → sub-collection: queried separately, through its parent post
{
  userId: "uid_456",
  userName: "Ama",
  text: "Excellent article",
  createdAt: Timestamp
}

// likes/{likeId}  → root collection + reference
{                    // lets you quickly check if ONE user liked ONE post
  postId: "post_001",
  userId: "uid_456"
}

Each choice here answers a specific read pattern. Tags are always displayed alongside the post, so they're embedded. Comments can grow large in number and are almost always fetched through their parent post, so they live in a sub-collection. Likes need to be queried both by post and by user (to check whether this user already liked this post), so they sit in a root collection with two indexable fields.
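As a rough illustration of how the likes collection and the denormalized likeCount work together, here is an in-memory sketch. It makes one assumption that goes beyond the schema above: the like's ID is built from the post ID and user ID, so a repeated like maps onto the same entry. The maps and helper names are hypothetical. In a real app the two changes would presumably need to be applied atomically, for example in a transaction or batched write.

```javascript
// In-memory stand-ins for the posts and likes collections.
const blogPosts = new Map([
  ["post_001", { title: "Modeling Firestore", likeCount: 0 }],
]);
const likes = new Map();

// Assumption: the like's ID is derived from both keys, so a duplicate like
// maps onto the same entry instead of creating a second one.
const likeId = (postId, userId) => `${postId}_${userId}`;

// Answers "did this user like this post?" with a single lookup.
function hasLiked(postId, userId) {
  return likes.has(likeId(postId, userId));
}

// Adds the like and bumps the denormalized counter; returns false if the
// user had already liked the post (nothing changes in that case).
function likePost(postId, userId) {
  const post = blogPosts.get(postId);
  if (!post) throw new Error(`Unknown post: ${postId}`);
  if (hasLiked(postId, userId)) return false;
  likes.set(likeId(postId, userId), { postId, userId });
  post.likeCount += 1;
  return true;
}

likePost("post_001", "uid_456");
likePost("post_001", "uid_456"); // ignored: already liked
console.log(blogPosts.get("post_001").likeCount); // 1
```

The counter stays in step with the likes collection because both are only ever changed together, inside likePost.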

Conclusion

In SQL, you normalize to eliminate redundancy, and you pay for that choice at read time, via joins. In Firestore, it's the opposite: you accept controlled redundancy (denormalization) to make reads instant and cheap, at the cost of slightly heavier writes.

Modeling data in Firestore isn't about applying relational habits with a different syntax. It's a genuinely different way of thinking, centered on your app's read patterns.

Always ask "how will I read this data, and how often?" before choosing between embedding, referencing, or a sub-collection. Also, keep Firestore's concrete limits in mind (1 MB per document, composite indexes, hotspotting) from the design phase rather than discovering them in production.

That balance between read simplicity and write cost is what separates a Firestore database that scales gracefully from one you'll be rewriting six months from now.

How to Store Data Locally with Isar in Flutter

Atuoha Anthony — Fri, 19 Sep 2025 13:09:48 +0000

When building Flutter applications, managing local data efficiently is critical. You want a database that is lightweight, fast, and easy to integrate, especially if your app will work offline. Isar is one such database. It is a high-performance, easy-to-use NoSQL embedded database tailored for Flutter. With features like reactive queries, indexes, relationships, migrations, and transactions, Isar makes local data persistence both powerful and developer-friendly.

In this article, you’ll learn how to integrate Isar into a Flutter project, set up a data model, and perform the full range of CRUD (Create, Read, Update, Delete) operations. To make this practical, you’ll build a simple to-do app that allows users to create, view, update, and delete tasks.

Prerequisites
What We Are Building
How to Set Up Isar in a Flutter Project
How to Create the Task Model
How to Build the Repository for CRUD Operations
How to Integrate CRUD into the Flutter UI
Beyond CRUD: Advanced Features of Isar
Conclusion

Prerequisites

Before starting, ensure you have the following:

Flutter SDK installed (version 3.0 or above recommended).
Check your version with:
```
 flutter --version
```
Dart knowledge: Familiarity with Dart syntax, classes, and async programming.
Flutter basics: You should know how to set up a Flutter project, build widgets, and use FutureBuilder or setState for state management.
Code editor: VS Code or Android Studio is recommended.

If these are in place, we are ready to begin.

What We Are Building

We will create a Task Manager App that lets users:

Add new tasks.
View all tasks in a list.
Update existing tasks.
Delete tasks.

By the end, you will have a fully functioning CRUD app built with Flutter and Isar.

How to Set Up Isar in a Flutter Project

Step 1: Add dependencies

Open your pubspec.yaml file and add the following:

dependencies:
  flutter:
    sdk: flutter
  isar: ^3.1.0
  isar_flutter_libs: ^3.1.0
  path_provider: ^2.0.0 # imported by isar_setup.dart in Step 2

dev_dependencies:
  isar_generator: ^3.1.0
  build_runner: any

isar: The core Isar package.
isar_flutter_libs: Required for Flutter integration.
path_provider: Locates the app's documents directory, where the database file is stored.
isar_generator: Used to generate code for your models.
build_runner: Runs the code generator.

Run:

flutter pub get

Step 2: Create and initialize Isar

Create a file named isar_setup.dart. This will handle the opening of the Isar database.

import 'package:isar/isar.dart';
import 'package:path_provider/path_provider.dart';
import 'task.dart'; // we will create this model soon

late final Isar isar;

Future<void> initializeIsar() async {
  final dir = await getApplicationDocumentsDirectory();
  isar = await Isar.open(
    [TaskSchema],
    directory: dir.path,
  );
}

Explanation:

getApplicationDocumentsDirectory() provides a storage location for the database file.
Isar.open() initializes the database and registers our Task schema.
late final Isar isar; ensures we can access the database instance globally after initialization.

How to Create the Task Model

Now let’s define our data model for tasks. Create a file named task.dart.

import 'package:isar/isar.dart';

part 'task.g.dart';

@Collection()
class Task {
  Id id = Isar.autoIncrement; // auto-incrementing primary key

  late String name;

  late DateTime createdAt;

  Task(this.name) : createdAt = DateTime.now();
}

Explanation:

@Collection() tells Isar this class represents a database collection.
Id id = Isar.autoIncrement; creates a unique identifier automatically.
late String name; stores the task name.
late DateTime createdAt; stores the creation timestamp.
part 'task.g.dart'; links to the generated code, which will be created after running the code generator.

Generate the code with:

flutter pub run build_runner build

This generates task.g.dart, which contains the necessary schema code.

How to Build the Repository for CRUD Operations

Create a new file called task_repository.dart. This will house the methods for interacting with the database.

import 'package:isar/isar.dart';
import 'task.dart';
import 'isar_setup.dart';

class TaskRepository {
  Future<void> addTask(String name) async {
    final task = Task(name);
    await isar.writeTxn(() async {
      await isar.tasks.put(task);
    });
  }

  Future<List<Task>> getAllTasks() async {
    return await isar.tasks.where().findAll();
  }

  Future<void> updateTask(Task task) async {
    await isar.writeTxn(() async {
      await isar.tasks.put(task);
    });
  }

  Future<void> deleteTask(Task task) async {
    await isar.writeTxn(() async {
      await isar.tasks.delete(task.id);
    });
  }
}

Explanation:

addTask: Creates a new task and saves it.
getAllTasks: Reads all tasks from the database.
updateTask: Updates an existing task by calling .put() again.
deleteTask: Removes a task by its id.
isar.writeTxn: Ensures operations run inside a transaction for safety and consistency.

How to Integrate CRUD into the Flutter UI

Now, let’s connect everything inside main.dart.

import 'package:flutter/material.dart';
import 'isar_setup.dart';
import 'task_repository.dart';
import 'task.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await initializeIsar(); // initialize Isar before runApp
  runApp(MyApp());
}

class MyApp extends StatelessWidget {
  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: TaskListScreen(),
    );
  }
}

class TaskListScreen extends StatefulWidget {
  @override
  _TaskListScreenState createState() => _TaskListScreenState();
}

class _TaskListScreenState extends State<TaskListScreen> {
  final TaskRepository _taskRepository = TaskRepository();
  late Future<List<Task>> _tasksFuture;

  @override
  void initState() {
    super.initState();
    _tasksFuture = _taskRepository.getAllTasks();
  }

  Future<void> _addTask() async {
    await _taskRepository.addTask('New Task');
    setState(() {
      _tasksFuture = _taskRepository.getAllTasks();
    });
  }

  Future<void> _deleteTask(Task task) async {
    await _taskRepository.deleteTask(task);
    setState(() {
      _tasksFuture = _taskRepository.getAllTasks();
    });
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: Text('Isar CRUD Example')),
      body: FutureBuilder<List<Task>>(
        future: _tasksFuture,
        builder: (context, snapshot) {
          if (snapshot.connectionState == ConnectionState.waiting) {
            return Center(child: CircularProgressIndicator());
          } else if (snapshot.hasError) {
            return Center(child: Text('Error: ${snapshot.error}'));
          } else {
            final tasks = snapshot.data ?? [];
            if (tasks.isEmpty) {
              return Center(child: Text('No tasks yet.'));
            }
            return ListView.builder(
              itemCount: tasks.length,
              itemBuilder: (context, index) {
                final task = tasks[index];
                return ListTile(
                  title: Text(task.name),
                  subtitle: Text('Created at: ${task.createdAt}'),
                  trailing: IconButton(
                    icon: Icon(Icons.delete),
                    onPressed: () => _deleteTask(task),
                  ),
                );
              },
            );
          }
        },
      ),
      floatingActionButton: FloatingActionButton(
        onPressed: _addTask,
        child: Icon(Icons.add),
      ),
    );
  }
}

Explanation:

initializeIsar(): Ensures the database is ready before the app runs.
_tasksFuture: Holds a future of the list of tasks.
_addTask: Adds a new task and refreshes the list.
_deleteTask: Deletes a task and refreshes the list.
FutureBuilder: Automatically rebuilds the UI when the future completes.
ListView.builder: Displays all tasks dynamically.

This gives you a simple yet complete CRUD app using Isar.

Beyond CRUD: Advanced Features of Isar

Once you are comfortable with CRUD, Isar provides advanced tools to optimize and extend your application:

Reactive Queries:
Instead of using FutureBuilder, you can listen for changes directly.
```
 final stream = isar.tasks.where().watch(fireImmediately: true);
```

Indexes:
Improve query performance by indexing fields.

 @Collection()
 class Task {
   Id id = Isar.autoIncrement;

   @Index()
   late String name;
 }

Relations:
Link one collection to another (for example, Project with many Tasks).
Custom Queries:
Perform complex filtering, sorting, and pagination.
Migrations:
Safely evolve your schema as the app grows.
Batch Operations:
Insert or update many records in one transaction.

Conclusion

We built a simple Flutter to-do app with Isar that supports creating, reading, updating, and deleting tasks. Along the way, we learned how to:

Add Isar dependencies.
Define a model with annotations.
Generate schema code.
Implement CRUD operations in a repository.
Connect Isar to the Flutter UI.

With its performance, developer-friendly API, and advanced features, Isar is an excellent choice for local persistence in Flutter applications.

For further learning, consult the official Isar documentation.

SQL vs NoSQL: When to Use Which

Beau Carnes — Wed, 14 Sep 2022 03:38:22 +0000

When should you use a SQL database and when should you use a NoSQL database?

We just published a course on the freeCodeCamp.org YouTube channel that will teach you the differences between NoSQL and SQL databases as well as when and why to use each kind of database.

Ania Kubow developed this course. Ania is one of the most popular tutorial creators on the freeCodeCamp.org YouTube channel.

In this course, you are going to go back to basics to learn what exactly a database is and how it's defined. You'll then learn about database design and why it's important, as well as what a database management system (DBMS) is.

You'll then learn about relational databases followed by a SQL crash course. You will learn about non-relational databases and then learn the pros and cons of using relational databases versus non-relational databases. Finally, you will learn some use cases followed by a NoSQL crash course.

Here are the sections in this course:

What actually is a database
What is a database management system
Demo: Creating a database
Common Database Models
Relational databases
SQL
Non-relational databases
Pros and Cons: Comparing RDBMS and NoSQL
Wide Column Database
Document Database
Key-Value Database
Multi-Model Databases
Use cases: When to use RDBMS or NoSQL

Watch the full course on the freeCodeCamp.org YouTube channel (1.5-hour watch).

How to Start Using MongoDB – Database Setup for Beginners

valentine Gatwiri — Mon, 25 Jul 2022 21:42:56 +0000

MongoDB is an increasingly popular open source NoSQL database. And it has many advantages over traditional SQL databases.

It offers high scalability, reliability, and performance even with a huge amount of data.

This article covers the basics that you need to know to get started with MongoDB and how to use it properly.

Prerequisites

A suitable IDE such as VS Code
A terminal

What You'll Learn

What is MongoDB?
What is NoSQL?
How to install MongoDB
How to set up MongoDB
How to run MongoDB

What is a NoSQL Database?

A NoSQL database is a non-relational database that does not use the traditional table-based schema of a relational database.

NoSQL databases are often used for big data and real-time web applications. MongoDB is one of the most popular NoSQL databases. It's fast, scalable, and uses JSON documents to store data.

Why Should I Use NoSQL?

NoSQL databases are powerful tools that can help you work with large amounts of data. They're especially good at handling unstructured data, so they can be a good choice if you're dealing with a lot of data that doesn't fit into a traditional relational database.

NoSQL databases can also be more scalable than relational databases, which is important if you're expecting your data to grow over time.

How to Get Started with MongoDB – Install Guide

Install MongoDB using this link or use the instructions below if you are using Ubuntu:

Import the public key

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2930ADAE8CAF5059EE73BB4B58712A2291FA4AD5

Create a list file for Ubuntu

echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.6 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.6.list

Run the following command to update:

sudo apt-get update

Install the latest package

sudo apt-get install -y mongodb-org

Then run:

sudo service mongod start

How to Create and Populate the MongoDB Database

Once you have MongoDB installed, create a data directory where MongoDB will store its data files. By default, this is /data/db, but you can specify a different location if you prefer. Finally, start the MongoDB server by running mongod from the command line.

Make a directory for dbPath with the following command:

sudo mkdir -p /data/db 
sudo chown -R `id -un` /data/db

Then run sudo mongod --port 27017 (or simply mongod) in a different terminal.

The server prints its logs as structured JSON, which is the output format (also known as structured logging) in MongoDB 4.4+. Although the JSON format may initially seem intimidating, it is designed to work with common JSON tools and frameworks.

Enter the MongoDB shell using this command (in MongoDB 6.0 and later, the legacy mongo shell has been replaced by mongosh):

mongo

After running this command, the shell prints its startup messages and then shows a prompt where you can type commands.

How to Create a New MongoDB Database

The first step in using MongoDB is creating a new database with the command use mydatabase. You can then create collections inside this database. Finally, you can populate your new collection.

 use record
 db.users.insert({username: "myname", password: "mypassword"})

The use record command switches to the record database (MongoDB creates it when you first store data in it). The db.users.insert(...) command adds a document to the users collection within the record database.

Below is the output of the commands above:

WriteResult({ "nInserted" : 1 })

Run the following command to view the record you created in the previous step:

 db.users.find()

The db.users.find() command returns all documents in the users collection.
Your output yields the following result:

{ "_id" : ObjectId("62dd6ab4a7d1ab0948574778"), "username" : "myname", "password" : "mypassword" }

How to Add New Records to Your Database

To add new records, do the following:

 use record
 db.commerce.save({scriptname: "dygraph.min.js", version: "2.1.0"})
 db.commerce.save({scriptname: "sortable.min.js", version: "0.8.0"})

We've added two documents to the commerce collection, each with data specified by the scriptname and version fields.

You should get something like this:

WriteResult({ "nInserted" : 1 })

To view all the collections stored in your MongoDB database, run the following commands:

 use record
 show collections

You should see a similar output to the below:

commerce
users

Conclusion

MongoDB is a powerful database system you can use for a variety of applications. It is easy to set up and use, and its scalability makes it a good choice for large-scale projects.

If you are new to database systems, MongoDB is a good place to start.

Relational VS Nonrelational Databases – the Difference Between a SQL DB and a NoSQL DB

Dionysia Lemonaki — Mon, 18 Apr 2022 17:56:26 +0000

This article is an overview of relational and non-relational databases.

Besides learning the fundamental differences between the two types of databases, you will also learn how to decide which one to use for your next project by going over their strengths and weaknesses.

Here is what we'll cover:

Defining a database
1. What is SQL?
Relational databases
1. Characteristics
2. ACID properties
Non-relational databases
1. Types
2. BASE properties
Relational VS Non-relational databases
Further Learning

What Is A Database? A Definition for Beginners

When it comes to computing, data are pieces of information that come in different forms. Data can be text, numbers, images, audio snippets, or videos.

Collections of information need to be stored somewhere, processed, and interpreted.

You need a way to effortlessly search, access, extract and retrieve the saved resources whenever you need them.

This allows both computers and humans to analyze the accessed data, perform calculations and comparisons, make logical decisions, and reach a conclusion.

You can store the data in a file of some kind, using a software program like an Excel spreadsheet – and this can get the job done.

But what if there are large amounts of data, and you need to be sure they are accurate?

Or what if you need to retrieve large data sets quickly?

Or what if the data needs to have a predefined structure that it should adhere to?

Databases are a much more accessible, efficient, and organized way of storing and working with information over a long period of time.

The ability to store data logically and systematically and retrieve it for use at a later date makes databases a critical part of all web applications.

Databases power all applications. They save and store user information such as usernames, email addresses, encrypted passwords, and physical addresses.

They also store user behavior. For example, in an e-commerce store, the database saves and keeps track of the items you have marked as 'favorites'.

You'll need a Database Management System (or DBMS for short) to manage your databases.

A Database Management System is a software program that serves as an intermediary between end-users and the database itself.

It allows its users to create and manage databases. It also allows them to access, modify, and manipulate the data stored in the database by performing operations known as queries.

Users can easily store, retrieve, update, and delete data with the help of a few commands.

When it comes to Database Management Systems, there are generally two types to choose from:

Relational Databases (also known as SQL Databases)
Non-relational Databases (also known as NoSQL Databases)

What is SQL?

SQL is short for Structured Query Language.

You will likely hear it pronounced one of two ways – "S. Q. L." (ess-kew-ell), or "se-quel" (like a sequel to a movie).

Either way, SQL is a language used for dealing with databases.

Specifically, with SQL, you can write database queries to communicate with the database. These can be commands for performing any of the CRUD (Create Read Update Delete) operations.

SQL is the language of choice for Relational Database Management Systems, which you will learn all about in the following section.

What Is A Relational Database?

Relational databases (or SQL databases) have been around for a while. The relational model was first proposed in 1970, and relational databases are still popular to this day. Some commonly used ones are MySQL and PostgreSQL.

A relational database stores data in a structured and tabular way. That is, it stores information in tables, which you can think of as storage containers for the data. For example, a company could have an employees table to store data on its employees.

Relational databases have a strict, static, and pre-defined logical schema. You can think of a database schema as an organizational blueprint – a set of rules for what can and cannot enter the table and the conditions for how to configure data.

In each table, there is at least one column. These columns have a specific data type, such as INTEGER or VARCHAR. In the employees table, some columns could be employee_id, name, department, email, and salary.

The columns and the data types allowed in each column make up the schema.

             EMPLOYEES

+-------------+------+------------+-------+--------+
| employee_id | name | department | email | salary |
+-------------+------+------------+-------+--------+

A table will also have rows, or records. A record is a single entry that must adhere to the pre-defined schema. Essentially, it is a single item.

             EMPLOYEES
+-------------+------------------+------------+-----------------------+--------+
| employee_id |       name       | department |         email         | salary |
+-------------+------------------+------------+-----------------------+--------+
|           1 |  John Doe        | IT         | johndoe@company.com   |   3500 |
|           2 |  Kelly Kellinson | Marketing  | kelly@company.com     |   1500 |
|           3 |  Mike Manson     | Product    | mikekane@company.com  |   2300 |
+-------------+------------------+------------+-----------------------+--------+

And since relational databases support SQL, you can perform queries. For example, if you wanted to view the names of the employees whose monthly salary is greater than 2000 dollars, you would write the following SQL query:

SELECT name FROM employees
WHERE salary > 2000;

From the above query, you would get the following output:

+-------------+
|    name     |
+-------------+
| John Doe    |
| Mike Manson |
+-------------+
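If you want to try the query above without installing a database server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption made for convenience (the article does not name a specific engine), and the table is created in memory with the three sample rows shown earlier.

```python
import sqlite3

# In-memory SQLite database (an assumption for this sketch), so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: column names and the data type allowed in each column.
conn.execute(
    """
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        VARCHAR(50),
        department  VARCHAR(50),
        email       VARCHAR(100),
        salary      INTEGER
    )
    """
)

# The three sample records from the EMPLOYEES table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John Doe", "IT", "johndoe@company.com", 3500),
        (2, "Kelly Kellinson", "Marketing", "kelly@company.com", 1500),
        (3, "Mike Manson", "Product", "mikekane@company.com", 2300),
    ],
)

# The same query as in the article.
rows = conn.execute("SELECT name FROM employees WHERE salary > 2000").fetchall()
names = [row[0] for row in rows]
print(names)
```

Kelly Kellinson is filtered out because her salary of 1500 does not satisfy the WHERE clause.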

Characteristics of Relational Databases

So far, you know that Relational Databases:

are tabular in format,
are very organized, and the data stored is well-structured,
have a strict, rigid, and pre-defined schema,
use SQL for performing database queries and manipulating data.

Additionally, a relational database can have more than one table, and as the name of this type of Database Management System suggests, the tables are related to one another.

For example, an e-commerce company may have a products table, a users table, an emails table, and an orders table.

Since there is a link and connection between the tables and the information stored in them, you can even join tables using a few commands.

There is a primary key, which acts as an identifier and ensures that each item in the table is unique, therefore making sure there is no duplicate and redundant data in tables.

And there is a foreign key that creates those pre-established relationships between tables.

Data points in different tables can have distinct relationships:

One-to-one relationships. In such cases, a record in one table is related to only one record in another table. An example of a one-to-one relationship in an e-commerce store is that one user can have only one email address, and one email address can belong to only one user.
One-to-many relationships. In such cases, one record in one table is related to many other records in another table. For example, in an e-commerce store, a single user can make many orders, but each of those orders is made by a single user.
Many-to-many relationships. In such cases, one or more records in one table can be related to one or more records in another table. For example, in an e-commerce store, one order can have many products and a product can be ordered many times.
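To make primary keys, foreign keys, and joins more concrete, here is a small sketch of the one-to-many case (one user, many orders), again using Python's sqlite3 module as an assumed stand-in for any relational database. The table layout, names, and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies a user
        name    TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),  -- foreign key
        total    INTEGER NOT NULL
    );
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25), (11, 1, 40), (12, 2, 15);
    """
)

# Join the two tables through the foreign key and count the orders per user.
orders_per_user = conn.execute(
    """
    SELECT users.name, COUNT(orders.order_id)
    FROM users
    JOIN orders ON orders.user_id = users.user_id
    GROUP BY users.user_id
    ORDER BY users.user_id
    """
).fetchall()
print(orders_per_user)

# An order that points at a user who does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

A many-to-many relationship would typically add a third table (for example, one row per order and product pair) that holds a foreign key to each side.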

ACID Properties in Relational Databases

Relational Databases offer the ACID database consistency model.

ACID is an acronym for Atomicity, Consistency, Isolation, Durability.

Atomicity means that transactions are atomic and take an "all or nothing" approach.

For example, either the entire operation is successful and is completed from start to finish, or it is unsuccessful, and there is an entire operation "rollback".

All operations are guaranteed to end with either a success or a failure, and none are just partially successful.

Consistency is the property that ensures that the database structure remains intact from the start of a transaction to the end. It makes sure that any data entering the database follows the rules and constraints that are set in place. It is what secures and maintains the integrity of data in relational databases.

Isolation means that despite the number of transactions taking place at any moment in time, each transaction is treated as an atomic, separate unit, and transactions seem to occur in sequential order.

For example, if two transactions are happening at the same time, this property ensures that one transaction, and the changes occurring there, will not affect in any way the other transaction.

And finally, Durability means that any results and changes from the transactions are committed and thus permanent and will persist, even if there is a system failure.

The ACID model ensures that databases are reliable and secure.
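The "all or nothing" behavior of atomicity can be sketched with a money transfer. This example assumes SQLite through Python's sqlite3 module, and the accounts table, the names, and the amounts are hypothetical. A CHECK constraint plays the role of a consistency rule: balances may never go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balances
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between two accounts as a single transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds: both updates are kept
failed = transfer(conn, "alice", "bob", 500)  # debit breaks the rule: both updates are undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second transfer, the credit to bob has already been applied when the debit fails, yet the rollback removes it, so no partial result is left behind.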

What Is A Non-Relational Database?

A non-relational database is also referred to as a NoSQL database. You will often see that NoSQL stands for both "Not only SQL" and also "Non-SQL".

Either way, a non-relational database refers to a database that doesn't use the relational data model.

Although this type of database has been around for decades and the term NoSQL dates back to the late 1990s, NoSQL databases started gaining real momentum in the late 2000s, as large-scale web applications became common.

Relational databases alone could not handle the speed – along with the large amounts and size of diverse and complex data – that this rise in internet use and the newly developed web applications required and demanded.

Some of the most popular non-relational databases are MongoDB, Cassandra, and Neo4j.

A non-relational database does not store and organize data in a tabular format. There are no tables, rows, columns, or relationships between different data points.

Instead, data is stored in collections. The database is typically unstructured and uses a dynamic schema.

Types of Non-Relational Databases

There are four major types of non-relational databases:

Column-oriented databases,
Key-value data stores,
Document-oriented stores,
Graph-oriented databases.

Column-oriented databases are similar in concept to relational databases. But they use groups, or sets of columns (also known as column families) instead of rows to logically organize related data.

You can access a column family independently by using a unique row key associated with an individual column. Searching for specific data is much faster and saves significant time since there is no need to go through rows of unrelated information to find what you are searching for.

Key-value stores are one of the simplest types of non-relational databases.

Data is stored in dictionaries or hash tables in the form of key-value pair collections.

This type of database has keys that need to be unique.

Keys act as a pointer to a specific value and are associated with that value.

The value assigned to a key can be any piece of information and data type.

To retrieve and access the value, you use the unique key as a reference.
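As a rough illustration of the idea, a key-value store behaves much like a Python dictionary. The class below is a toy, in-memory sketch and not the API of any real product. The key names are invented for the example.

```python
class KeyValueStore:
    """A toy in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique: writing to an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        # The key is the only way to look a value up.
        return self._data.get(key, default)


store = KeyValueStore()

# The value can be any piece of information and any data type.
store.put("session:42", {"user": "myname", "cart": ["sku-1", "sku-2"]})
store.put("page_views", 1024)
store.put("page_views", 1025)  # same key, so the old value is overwritten

print(store.get("page_views"))
print(store.get("session:42")["cart"])
print(store.get("missing"))
```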

Document-oriented stores also store data in key-value pair fashion. But in this case, the value is a document that has a unique key as its identifier.

The document can be in any format, such as XML, YAML, or binary, but typically it is JSON.

This type of database stores data in a semi-structured way.

There is no schema or predefined structure. Because of this, it offers flexibility and the ability to re-arrange and re-work the structure of the database if the project's requirements change.

It also provides a SQL-like type of query language or an API to perform queries and CRUD operations on the data.
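Here is a small sketch of what "no predefined structure" means in practice. It models a collection as a plain Python dictionary of JSON-like documents. The find helper is a hypothetical, simplified stand-in for the query API a real document database would provide.

```python
import json

# Two documents in the same collection, each identified by a unique key.
# Their fields differ, and nothing forces them to match.
articles = {
    "a1": {"title": "Intro to NoSQL", "tags": ["nosql", "databases"]},
    "a2": {"title": "SQL Joins", "tags": ["sql"], "views": 120},
}


def find(collection, **criteria):
    """Return the documents whose top-level fields equal every criterion."""
    return [
        doc
        for doc in collection.values()
        if all(doc.get(field) == expected for field, expected in criteria.items())
    ]


# Requirements changed: add a nested field to one document only.
# No schema migration is needed and the other document is untouched.
articles["a1"]["author"] = {"name": "Sam", "country": "GR"}

popular = find(articles, views=120)
as_json = json.dumps(articles["a1"], sort_keys=True)
print(popular)
print(as_json)
```

The flexibility has a cost: because documents may lack a field, queries and application code have to cope with missing values, which is why find uses doc.get rather than direct indexing.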

Graph databases are the most complex type of non-relational database, and they can handle large sets of data.

They focus on the connections and relationships between data elements and use graph theory to store, search, and manage those relations.

They use nodes to store data and represent an individual entity or piece of data. One node is connected and linked to another node.

To represent the connections or relationships between entities, graph databases use edges.
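Nodes and edges can be sketched with a few lines of Python. This is only an in-memory illustration with made-up data. Real graph databases use their own storage engines and query languages.

```python
from collections import deque

# Nodes store the entities and their properties.
nodes = {
    "alice": {"type": "user"},
    "bob": {"type": "user"},
    "carol": {"type": "user"},
    "laptop": {"type": "product"},
}

# Edges store the relationships as (source, relation, target).
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "BOUGHT", "laptop"),
]


def neighbors(node, relation=None):
    """Nodes directly connected from `node`, optionally filtered by relation."""
    return [
        target
        for source, rel, target in edges
        if source == node and (relation is None or rel == relation)
    ]


def reachable(start):
    """Breadth-first traversal: every node reachable from `start`, nearest first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(neighbors("alice", "FOLLOWS"))
print(reachable("alice"))
```

Questions such as "what did the people followed by the people I follow buy?" become a traversal along edges rather than a chain of joins.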

BASE Properties in Non-relational Databases

Non-relational databases offer the BASE database consistency model. This model is not as rigid as the ACID model of relational databases.

BASE is an acronym for:

Basic Availability. This model does not focus on the immediate consistency of data. However, the system appears to be continuously working and guarantees the availability of data at all times.
Soft state. Because of the lack of immediate consistency, the state of the system may change over time. A soft state means the system doesn't need to be write-consistent.
Eventual consistency. The main priority is the constant availability of data and not that of data consistency. However, eventually and at some point, you can expect data to be consistent. This may occur when the system stops receiving input.
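The three properties above can be illustrated with a deliberately simplified simulation. The classes below are hypothetical and do not model any particular database: a write is acknowledged as soon as one replica has it, and the other replicas catch up later.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on one replica first and spread to the rest later."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        # Basic availability: accept the write right away on the first replica.
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # Soft state: the answer depends on which replica you ask, and when.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Eventual consistency: once updates are delivered, all replicas agree.
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore(replica_count=3)
store.write("stock", 5)

fresh = store.read("stock", replica_index=0)
stale = store.read("stock", replica_index=2)  # this replica has not caught up yet
store.sync()
after_sync = store.read("stock", replica_index=2)
print(fresh, stale, after_sync)
```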

How to Choose Between SQL and NoSQL Databases

After learning the basics of SQL and NoSQL databases, you might be wondering which one of the two to choose for your project.

Well, there isn't a clear answer to that question.

Both databases have advantages and disadvantages, and it largely depends on the type of application you are building, the kind of data you will be working with, and your future goals.

It is common for companies to use both types of databases for their products.

Below is a quick summary of their characteristics to help you decide which one might be the right fit for you.

When to use an SQL database:

You need highly structured data distributed across multiple tables. You need your data to adhere to a strict, predictable, predefined, and already planned schema.
Your data will remain relatively the same. SQL databases are convenient if you don't plan on frequently changing the structure of the database and don't need to regularly update items. Keep in mind that they offer little flexibility.
You need consistent data.
Data integrity and security are a priority.
You want accurate results for complex queries.

A disadvantage of SQL databases is that they scale vertically.

You will need to upgrade the hardware and computing power of your current machine as you gather and store more data.

This can be costly.

An increase in processing power and memory storage is needed to handle an increase in load to improve performance.

When to use a NoSQL database:

You are working in a fast development environment that requires frequent adaptations of requirements and constant changes to the database structure.
You are working with large amounts of data that are diverse in nature but do not require a lot of structure or accuracy.
You are working with data that needs frequent updates. NoSQL databases offer a loose, flexible, and dynamic schema that allows for regular changes to the data.
You want speedy query results and continuous availability of the system.
You don't want to perform any upfront planning, preparing, or designing of the database, but want to immediately start building instead.

A big advantage of NoSQL databases is that they scale horizontally.

They are designed so that more machines (such as cloud servers) can be added alongside the existing ones. This is often preferable to vertical scaling, which requires adding CPU (Central Processing Unit) or RAM (Random Access Memory) resources to a single machine.

But of course, a disadvantage of many NoSQL databases is that they do not guarantee the same level of data integrity and consistency as relational databases.

Further Learning

This article has just scratched the surface, and the best way to learn is by doing.

Here are some learning resources to learn more about databases and SQL:

Learn SQL – Free Relational Database Courses for Beginners. Bookmark this article for a list of free SQL courses.
freeCodeCamp's Relational Database Certification. In this course, you will learn the necessary developer tools. Then you will learn how to use a code editor, the command line, and Git. You will also learn to work with PostgreSQL (a relational database management system) and SQL – its query language.
Learn About NoSQL Databases in This 3-hour Course. In this course, you will learn about the four different NoSQL database types. Besides just learning the theory, you will also practice building all four of them.

Conclusion

You have made it to the end of the article!

Hopefully, it has helped you understand the primary differences between Relational and Non-Relational databases. You also have some extra resources to start learning and to put your new skills to practice.

Thanks for reading, and happy coding!

AWS DynamoDB – NoSQL Database Guide for Beginners

Manish Shivanandhan — Tue, 11 Jan 2022 16:50:00 +0000

What is DynamoDB?

DynamoDB is a fully managed NoSQL database from AWS. DynamoDB is similar to other NoSQL databases like MongoDB, except for the fact that you don’t have to do any maintenance or scaling on your part.

DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second — via AWS Documentation.

DynamoDB offers built-in security, on-demand and point-in-time backups, cross-region replication, in-memory caching, and many other features that support business-critical workloads.

Most importantly, DynamoDB works seamlessly with other AWS applications like S3 and Lambda.

But before we get into the article, it's important that you understand the concept of NoSQL databases.

What are NoSQL Databases?

NoSQL stands for “not only SQL”. Simply put, NoSQL databases store documents in a format similar to JSON, while relational databases store data in the form of a table.

NoSQL offers more flexibility in terms of data modeling and does not force you to have a schema to store documents.

A few types of NoSQL databases include pure document databases (like MongoDB), key-value stores (like DynamoDB), wide-column databases (like Cassandra), and graph databases (like Neo4j). Learn more about NoSQL databases here.

Great. Now let’s look at some of the features of DynamoDB.

Core Features of DynamoDB

Autoscaling

Probably the most important feature of DynamoDB is that it delivers automatic scaling of throughput and storage based on the performance or usage of your application.

In a typical database server, the sysadmin takes care of scaling when the application encounters higher than usual traffic.

With DynamoDB, you can create database tables that can store and retrieve any amount of data, and the scaling is automatically managed by AWS. This includes scaling up for higher traffic and scaling down for lower traffic, so you only pay for what you use.

Data Models

DynamoDB supports both key-value and document data models. This enables you to have a flexible schema, so each row can have any number of columns at any point in time. This is crucial for growing businesses that have ever-changing requirements.

Re-defining database schema is a nightmare that many developers/database admins go through in a growing application. This data model flexibility offers a robust database solution for small as well as large businesses.

Replication

AWS takes care of DynamoDB table replication automatically based on your choice of AWS regions (cross-region replication). Even distributed applications can have single-digit millisecond read and write performance using DynamoDB.

With replication in place, you don't have to worry about data availability. In the event of the primary source failure, you can easily access the data from a secondary reserve, reducing the probability of application downtime.

Backups & Recovery

DynamoDB provides on-demand backups for your tables that you can enable within the AWS console. You can also enable automatic backup and archiving of your data to other AWS solutions like S3.

DynamoDB also offers Point-in-time recovery. This protects your data from accidental write/delete operations.

With Point-in-time recovery, you can restore your database to any point in time for the last 35 days. Point-in-time recovery is achieved by storing incremental backups of your database and that is managed automatically by AWS.

Security

DynamoDB encrypts data at rest by default and also in transit using the keys stored in AWS Key Management Service (or customer-provided keys).

With encryption in place, you can build security-sensitive applications that meet compliance and regulatory requirements. DynamoDB also provides access control via AWS IAM roles.

Monitoring

Monitoring is crucial to any business-critical application. It helps maintain reliability and also notifies personnel in case of an event or failure.

AWS offers detailed monitoring tools like CloudWatch Logs, CloudWatch Events, and CloudTrail Logs that will help you to watch, notify, and debug all types of events in DynamoDB. You can also set custom triggers based on metrics like system errors, capacity usage, and so on.

Now let’s compare DynamoDB with two of the popular database alternatives — MySQL and MongoDB.

DynamoDB vs MySQL

There is a major difference between MySQL and DynamoDB because MySQL is a relational database. In terms of benefits, I think MySQL is limited because of the requirement of having a schema before you can start pushing data.

But MySQL is great for many use cases as well. It is often called “The world’s most popular open-source database” and it delivers a fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server.

But being a NoSQL database gives DynamoDB much more flexibility in terms of data modeling.

Even though AWS provides managed services for MySQL and other relational databases, DynamoDB is a database designed by AWS and not just a hosted database solution. So this offers more improvements and features that MySQL and other relational databases can’t.

DynamoDB vs MongoDB

DynamoDB and MongoDB are closely related to each other since both are NoSQL databases. But since DynamoDB is built and maintained by AWS it offers many more features and integrations, especially with other Amazon services like S3, compared to MongoDB.

If I were running a growing company I would prefer using DynamoDB solely because of its scalability and cross-region replication features. AWS does not offer a managed MongoDB service but if you are looking for one, MongoDB Atlas would be a great alternative.

Another important advantage of DynamoDB over MongoDB is that MongoDB is not secure by default and you have to configure security yourself. DynamoDB is secure by default, so it might be a better option if security is a deal-breaker for you.

Wrapping Up

AWS DynamoDB is a fully managed NoSQL database that can scale in and scale out based on demand. AWS takes care of typical functions including software patching, replication, and maintenance.

DynamoDB also offers encryption at rest, point-in-time snapshots, and powerful monitoring capabilities. In a nutshell, it is a great option when you are building an application that needs a high-performance scalable NoSQL database.

Learn About NoSQL Databases in This 3-hour Course

ania kubow — Mon, 29 Nov 2021 15:47:00 +0000

NoSQL Databases can sometimes seem confusing and overwhelming, partly because of their flexibility.

This is why we have put together a 3-hour video course to help you understand exactly what a NoSQL Database is, as well as the different types available to you.

By the end of this course, you will have built 4 databases based on the 4 main types, and you'll have practised your learnings by building out projects.

But first, let's start with the basics.

What is NoSQL?

So the first thing you need to know is that NoSQL is an approach to database management.

It’s considered to be super flexible as it allows for a variety of data models, such as 'key-value', 'document', 'wide-column or tabular' and 'graph' formats.

These are the 4 we will be looking at closely in the video course, as well as the new emerging trend of Multi Model Databases.

With each deep-dive on the 4 NoSQL database types, we will be approaching each learning as an explanation, example, and exercise – so the 3 E’s – in order to fully grasp the topic we are discussing.

How do Databases Work?

Databases have multiple layers. The first layer is an interface, or in other words a visual platform where you can visit and interact with data. This is where you'll find the format, the language, and the transport.

In this video course, the interface we are going to use is DataStax Astra DB. This is where we will be creating all 4 of our database types for the example and exercise parts.

DataStax Astra DB is an autoscaling database-as-a-service built on Apache Cassandra, designed to simplify cloud-native application development.

Because it is built on Apache Cassandra, you will see us using the Cassandra Query Language, or CQL, a few times in this course. CQL offers a model close to SQL in the sense that data is put in tables containing rows of columns. These languages are how we interact with the data in our database.

The next layer of a database is the execution layer. This is where we parse the incoming queries, coming from our interface. It is also used as an analyzer and a dispatcher.

And finally we have the storage layer, where the indexing of data happens.

Using DataStax Astra will allow us to create all 4 database types for this tutorial, so I won’t have to sign up to separate database management systems for each section. But you don't have to use it. There are literally dozens and dozens to choose from, so feel free to take your pick.

Let's get to it!

Now that you know which NoSQL database types we will be learning about, as well as how databases work, let's get to learning more about each one in detail.

Here are the topics this course will cover:

What is NoSQL?
Why use NoSQL?
SQL vs NoSQL
How to set up our Database
Tabular Type
Document Type
Key-value Type
Graph Type
Multi-Model Type explained
Project – How to use the Document API
Project – How to use the GraphQL API
Where to go next

Watch the course below or on the freeCodeCamp.org YouTube channel (3-hour watch).

The Apache Cassandra Beginner Tutorial

freeCodeCamp — Thu, 15 Jul 2021 13:13:02 +0000

By Sebastian Sigl

There are lots of data-storage options available today. You have to choose between managed or unmanaged, relational or NoSQL, write- or read-optimized, proprietary or open-source — and it doesn't end there.

Once you begin your search, you will end up in the universe that is database marketing. All of the vendors will tell you why their database is fantastic.

Unfortunately, it's difficult to find out when not to use a specific database, because this is not an attractive selling point.

If you know what questions to ask, you will eventually understand all the essential properties of a given system. In the end, your choice will depend on your expertise and your requirements.

In this tutorial I will introduce you to Apache Cassandra, a distributed, horizontally scalable, open-source database. Or as Cassandra users like to describe Cassandra: "It's a database that puts you in the driver seat."

I will share the essential gotchas and provide references to documentation. I’ll also provide insights based on my experience of running Cassandra on a large scale at work, with executable examples wherever possible.

Along the way, you will learn to ask fundamental questions that will help you to choose a database that suits your needs. You'll also learn about other popular databases like Spanner, Cockroach, or FaunaDB, and how they can serve different use-cases.

Here's an overview of everything you'll learn:

How to Set Up a Cassandra Cluster
Cassandra Architecture
Data Modeling
Running a Cluster
- Fully Managed Cassandra
- Self-Managed Cassandra
Other Learnings
Conclusion
References

How to Set Up a Cassandra Cluster

To execute the examples of this tutorial, you'll need a running Cassandra cluster. You can get this up and running quickly by using Docker.

Required Docker settings

Your device should have a minimum of 8GB of memory and at least 8GB of free disk space. Your Docker settings should be updated to be able to use at least 6GB of memory, or better, 8GB.

To apply these suggestions, open your Docker preferences, go to Resources, and increase your memory threshold.

Cassandra is built for scale, and some features only work on a multi-node Cassandra cluster, so let’s start one locally.

For Linux and Mac, run the following commands:

# Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2 cassandra:3.7
INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)

For Windows, run the following commands in PowerShell:

# Run the first node and keep it in background up and running
docker run --name cassandra-1 -p 9042:9042 -d cassandra:3.7
$INSTANCE1=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-1)
echo "Instance 1: ${INSTANCE1}"

# Run the second node
docker run --name cassandra-2 -p 9043:9042 -d -e CASSANDRA_SEEDS=$INSTANCE1 cassandra:3.7
$INSTANCE2=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-2)
echo "Instance 2: ${INSTANCE2}"

echo "Wait 60s until the second node joins the cluster"
sleep 60

# Run the third node
docker run --name cassandra-3 -p 9044:9042 -d -e "CASSANDRA_SEEDS=$INSTANCE1,$INSTANCE2" cassandra:3.7
$INSTANCE3=$(docker inspect --format="{{ .NetworkSettings.IPAddress }}" cassandra-3)

The startup process can take a few minutes.

You can verify if everything is done and ready by executing a Cassandra utility tool called nodetool via docker exec on a node:

$ docker exec cassandra-3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.17.0.3  112.69 KiB  256          68.7%             bb5ef231-0dd2-4762-a447-806a45f710ac  rack1
UN  172.17.0.2  107.96 KiB  256          68.3%             d7392374-8daa-4292-b724-cb790b0ee6ad  rack1
UN  172.17.0.4  93.93 KiB  256          63.0%             386d094f-5483-4945-a1a7-2bb3975d6167  rack1

UN means Up and Normal. Here, all 3 nodes are running and healthy.

In this tutorial we will send lots of queries to Cassandra. I recommend starting a new shell and connecting to one node using cqlsh. Here's how to start a cqlsh shell in Docker:

$ docker exec -it cassandra-1 cqlsh

Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

And to execute your first query:

cqlsh> DESCRIBE keyspaces;

system_traces  system_schema  system_auth  system  system_distributed

The response shows all the existing keyspaces. Keyspaces group tables and are similar to a database in a traditional relational database system. In other systems, groups of certain items are also known as namespaces.

Before you begin creating tables and inserting data, first create a keyspace in your local datacenter, which should replicate data 3 times:

cqlsh> CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };

A keyspace with a replication factor of 3 using the NetworkTopologyStrategy was created. The strategy defines how data is replicated in different datacenters. This is the recommended strategy for all user created keyspaces.

Why should you start with 3 nodes?

It’s recommended to run at least 3 nodes. One reason is that strong consistency requires confirmed data from at least 2 nodes. Another is availability: if 1 node goes down, your cluster stays available because the 2 remaining nodes are still up and running.

You don’t need to fully understand this yet. After reading through the rest of this tutorial, things should be more clear.

Now, all the nodes are up and healthy. You have a 3-node Cassandra setup listening on ports 9042, 9043, and 9044 for client requests. This is a realistic setup for a small cluster.

In production, the instances would run on different machines to maximize performance.

Before you start creating tables, reading, and writing data, it's helpful to understand the basics of designing tables for scalability.

In this tutorial, you will create tables with different settings for a to-do list application. If you want to get your hands dirty straight away, you can jump directly to the next cqlsh example.

Cassandra Architecture

Cassandra is a decentralized multi-node database that can physically span separate locations and uses replication and partitioning to scale reads and writes horizontally.

Decentralization

Cassandra is decentralized because no node is superior to other nodes, and every node acts in different roles as needed without any central controller. We'll get into examples of decentralization a bit later in this section.

Cassandra's decentralized property is what allows it to handle situations easily in case one node becomes unavailable or a new node is added.

Every Node Is a Coordinator

Data is replicated to different nodes. If certain data is requested, a request can be processed from any node.

This initial request receiver becomes the coordinator node for that request. If other nodes need to be checked to ensure consistency, the coordinator requests the required data from the replica nodes.

The coordinator can calculate which node contains the data using a so-called consistent hashing algorithm.

Every node can be a coordinator

The coordinator is responsible for many things, such as request batching, repairing data, or retries for reads and writes.
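To make the idea of consistent hashing more concrete, here is a minimal Python sketch of a hash ring. It is an illustration under simplifying assumptions, not Cassandra's implementation: it uses MD5 as a stand-in hash function and a single token per node, whereas the cluster above shows 256 tokens per node. The class and function names are made up for this example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A toy consistent-hash ring with one token per node."""

    def __init__(self, nodes):
        # Sort nodes by ring position so the ring can be binary-searched.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, partition_key: str, replication_factor: int = 1):
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(partition_key))
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["cassandra-1", "cassandra-2", "cassandra-3"])
# Every node can run the same calculation, so every node can act as coordinator.
print(ring.replicas_for("UK", replication_factor=2))
```

Because the result depends only on the key and the list of nodes, no central lookup service is needed to find the data.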

Data Partitioning

“[Partitioning] is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store larger datasets and handle additional requests.”

How Sharding Works by Jeeyoung Kim

As with many other databases, you store data in Cassandra in a predefined schema. You need to define a table with columns and types for each column.

Additionally, you need to think about the primary key of your table. A primary key is mandatory and ensures data is uniquely identifiable by one or multiple columns.

The concept of primary keys is more complex in Cassandra than in traditional databases like MySQL. In Cassandra, the primary key consists of 2 parts:

a mandatory partition key and
an optional set of clustering columns.

You will learn more about the partition key and clustering columns in the data modeling section.

For now, let's focus on the partition key and its impact on data partitioning.

Consider the following table:

Table Users | Legend: p - Partition-Key, c - Clustering Column

country (p) | user_email (c)  | first_name | last_name | age
----------------------------------------------------------------
US          | john@email.com  | John       | Wick      | 55  
UK          | peter@email.com | Peter      | Clark     | 65  
UK          | bob@email.com   | Bob        | Sandler   | 23 
UK          | alice@email.com | Alice      | Brown     | 26

Together, the columns user_email and country make up the primary key.

The country column is the partition key (p). The CREATE-statement for the table looks like this:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
);

The first group of the primary key defines the partition key. All other elements of the primary key are clustering columns.

Let’s fill the table with some data:

cqlsh> 
INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('US', 'john@email.com', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',65);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_country (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',26);

If you’re used to designing traditional relational database tables the way it’s taught in school or university, you might be surprised. Why would you use country as an essential part of the primary key?

This example will make sense after you understand the basics of partitioning in Cassandra.

Partitioning is the foundation for scalability, and it is based on the partition key. In this example, partitions are created based on country. All rows with the country US are placed in a partition. All other rows with the country UK will be stored in another partition.

In the context of partitioning, the words partition and shard can be used interchangeably.

Partitions are created and filled based on partition key values. They are used to distribute data to different nodes. By distributing data to other nodes, you get scalability. You read and write data to and from different nodes by their partition key.
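As a rough illustration of how rows end up in partitions, the following Python sketch groups the four example rows by their partition key value. It only models the grouping step. Assigning partitions to nodes is a separate concern, and the function name is invented for this example.

```python
from collections import defaultdict

# The four example rows: (country, user_email, first_name, last_name, age).
rows = [
    ("US", "john@email.com", "John", "Wick", 55),
    ("UK", "peter@email.com", "Peter", "Clark", 65),
    ("UK", "bob@email.com", "Bob", "Sandler", 23),
    ("UK", "alice@email.com", "Alice", "Brown", 26),
]


def partition_rows(rows, partition_key_index=0):
    """Group rows by the value of their partition key column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key_index]].append(row)
    return dict(partitions)


by_country = partition_rows(rows)
print({country: len(members) for country, members in by_country.items()})
# {'US': 1, 'UK': 3}
```

Reading all users of one country then touches exactly one partition, which is what makes a query by partition key cheap.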

The distribution of data is a crucial point to understand when designing applications that store data based on partitions. It may take a while to get fully accustomed to this concept, especially if you are used to relational databases.

Instead of starting from a normalized relational schema, think about how you read and write data and how it should be partitioned to scale horizontally.

What does horizontal scaling mean?

Horizontal scaling means you can increase throughput by adding more nodes. If your data is distributed to more servers, then more CPU, memory, and network capacity is available.

You might ask, then why do you even need email in the primary key?

The answer is that the primary key defines what columns are used to identify rows. You need to add all columns that are required to identify a row uniquely to the primary key. Using only the country would not identify rows uniquely.

The partition key is vital to distribute data evenly between nodes and essential when reading the data. The previously defined schema is designed to be queried by country because country is the partition key.

A query that selects rows by country performs well:

cqlsh> 
  SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

In your cqlsh shell, you will send a request only to a single Cassandra node by default. This is called a consistency level of one, which enables excellent performance and scalability.

If you access Cassandra differently, the default consistency level might not be one.

What does consistency level of one mean?

A consistency level of one means that only a single node is asked to return the data. With this approach, you will lose strong consistency guarantees and instead experience eventual consistency.

We’ll dive deeper into consistency levels later on.

Let's create another table. This one has a partition defined only by the user_email column:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_email (
    user_email text,
    country text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY (user_email)
);

Now let’s fill this table with some records:

cqlsh> 
INSERT INTO learn_cassandra.users_by_email (user_email, country,first_name,last_name,age)
  VALUES('john@email.com', 'US', 'John','Wick',55);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('peter@email.com', 'UK', 'Peter','Clark',65); 

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('bob@email.com', 'UK', 'Bob','Sandler',23);

INSERT INTO learn_cassandra.users_by_email (user_email,country,first_name,last_name,age)
  VALUES('alice@email.com', 'UK', 'Alice','Brown',26);

This time, each row is put in its own partition.

This is not bad, per se. If you want to optimize for getting data by email only, it's a good idea:

cqlsh> 
  SELECT * FROM learn_cassandra.users_by_email WHERE user_email='alice@email.com';

If you set up your table with a partition key for user_email and want to get all users by age, you would need to get the data from all partitions because the partitions were created by user_email.

Talking to all nodes is expensive and can cause performance issues on a large cluster.

Cassandra tries to avoid harmful queries. If you want to filter by a column that is not a partition key, you need to tell Cassandra explicitly that you want to filter by a non-partition key column:

cqlsh> 
SELECT * FROM learn_cassandra.users_by_email WHERE age=26 ALLOW FILTERING;

Without ALLOW FILTERING, Cassandra refuses to execute the query, which protects the cluster from accidentally running expensive queries. Queries without conditions (for example, without a WHERE clause) or with conditions that don’t use the partition key are costly and should be avoided to prevent performance bottlenecks.

But how do you get all the rows from the table in a scalable way?

If you can, partition by a value like country. If you know all the countries, you can then iterate over all available countries, send a query for each one, and collect the results in your application.

In terms of scalability, simply selecting all rows is worse: when you use a table partitioned by user_email, all the data is collected in 1 request by a single coordinator.

This is OK as long as you have no performance issues.

By comparison, sending multiple requests by country distributes the effort to different coordinator nodes, which scales a lot better.
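The difference between the two access patterns can be sketched in a few lines of Python. This is a toy model, not driver code: a dictionary stands in for the table, and the helper names are hypothetical. It shows that collecting all rows via one small query per known partition key only ever reads one partition at a time, while filtering by a non-key column has to scan every row.

```python
# Toy model of a table: partition key -> rows stored in that partition.
partitions = {
    "US": [{"user_email": "john@email.com", "age": 55}],
    "UK": [
        {"user_email": "peter@email.com", "age": 65},
        {"user_email": "bob@email.com", "age": 23},
        {"user_email": "alice@email.com", "age": 26},
    ],
}


def select_by_country(country):
    """Stands in for a query restricted by the partition key."""
    return partitions.get(country, [])


def select_all(known_countries):
    """Collect every row by sending one small query per partition key."""
    rows = []
    for country in known_countries:
        rows.extend(select_by_country(country))
    return rows


def select_by_age(age):
    """Stands in for ALLOW FILTERING: every partition has to be scanned."""
    scanned = 0
    matches = []
    for partition in partitions.values():
        for row in partition:
            scanned += 1
            if row["age"] == age:
                matches.append(row)
    return matches, scanned


print(len(select_all(["US", "UK"])))  # 4
print(select_by_age(26)[1])           # 4 rows scanned to find 1 match
```

In a real application, the per-country queries could also be sent concurrently, so different coordinators share the work.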

If you still need access to all of the data, there is an excellent integration between Spark and Cassandra that allows efficient reads and writes for massive datasets. The Spark connector for Cassandra groups your data by partition key and can execute queries very efficiently.

Replication

Scalability using partitioning alone is limited.

Consider a lot of write requests arriving for a single partition. All requests would be sent to a single node with technical limitations such as CPU, memory, and bandwidth. Additionally, you want to handle read and write requests if this node is not available.

That is where the concept of replication comes in. By duplicating data to different nodes, so-called replicas, you can serve more data simultaneously from other nodes to improve latency and throughput. It also enables your cluster to perform reads and writes in case a replica is not available.

In Cassandra, you need to define a replication factor for every keyspace. At the beginning of our example, you created a keyspace with a replication factor of 3 for our default datacenter:

cqlsh> CREATE KEYSPACE learn_cassandra
  WITH REPLICATION = { 
   'class' : 'NetworkTopologyStrategy',
   'datacenter1' : 3 
  };

A replication factor of one means there’s only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved.

A replication factor of two means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.

As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.

Usually, it's recommended to use a replication factor of 3 for production use cases. It makes sure your data is very unlikely to get lost or become inaccessible because there are three copies available. Also, if data is not consistent between replicas at any point in time, you can ask what information state is held by the majority.

In your local cluster setup, the majority means 2 out of 3 replicas. This allows us to use some powerful query options that you will see in the next section.
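The majority for a given replication factor is simple arithmetic. The following Python sketch, with function names chosen for this example, computes it along with the number of replicas that may be unavailable while a majority can still answer.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a majority is still reachable."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerated failures={tolerated_failures(rf)}")
```

Note that a replication factor of 2 tolerates no failures for majority reads and writes, which is one reason 3 is the usual starting point.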

Consistency Level

Now that you know about partitioning and replication, you are ready to think about consistency levels. Cassandra has a truly outstanding feature called tunable consistency.

You can define the consistency level of your read and write queries. You can check the Cassandra docs for all available settings.

Let’s focus on the most popular settings and try to understand when to choose each consistency level.

Let’s assume you have 3 replicas defined.

The first question you need to answer is, do you need strong consistency?

What does strong consistency mean?

In contrast to eventual consistency, strong consistency means only one state of your data can be observed at any time in any location.

For example, when consistency is critical, like in a banking domain, you want to be sure that everything is correct. You would rather accept a decrease in availability and increase of latency to ensure correctness.

It all comes down to the CAP theorem. You cannot be both available and consistent when there are connection issues between the nodes of your cluster.

Let's think through the following example:

You want to write a single value to a table. The data is replicated in 2 nodes, and the connection between the nodes is interrupted. First, a write-request is sent to node 1. Then, data is read from node 2.

How do you manage this situation?

Disallow writes to all nodes to ensure consistency. This sacrifices availability to guarantee consistency and correctness.
Accept the write to node 1 and keep serving reads from both nodes. This keeps the system available, but the answer differs depending on which node you read from, which means sacrificing consistency for availability.

You can simplify the problem to make crucial decisions for your application: Do you want consistency or availability?

Another factor is latency. By talking to more nodes to ensure consistency, you need to wait longer to receive all nodes’ responses.

Tune for Consistency by Setting up a Strong Consistency Application

There is a very important formula that, if true, guarantees strong consistency:

[read-consistency-level] + [write-consistency-level] > [replication-factor]

What does consistency level mean?

Consistency level means how many nodes need to acknowledge a read or a write query.

You can shift the read and write consistency levels in your favor and still keep strong consistency. Or you can give up strong consistency for better performance and accept eventual consistency instead.

For a read-heavy system, it’s recommended to keep read consistency low because reads happen more often than writes. Let's say you have a replication factor of 3. The formula would look like this:

1 + [write-consistency-level] > 3

Therefore, the write consistency has to be set to 3 to have a strongly consistent system.

For a write-heavy system, you can do the same. Set the write consistency level to 1 and the read consistency level to 3.

You either check every node for a read to ensure all nodes have received the last updated state, or, for a write, you ensure that all nodes have written the update to their local storage. Both will make sure that data for reading and writing is correct.
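The formula is easy to check mechanically. As a small sketch (the function names are invented for this example), the following Python code lists every combination of read and write consistency levels that satisfies it for a given replication factor.

```python
def is_strongly_consistent(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when every set of read replicas must overlap every set of written replicas."""
    return read_cl + write_cl > replication_factor


def strong_combinations(replication_factor: int):
    """All (read, write) consistency levels that satisfy the formula."""
    levels = range(1, replication_factor + 1)
    return [
        (read_cl, write_cl)
        for read_cl in levels
        for write_cl in levels
        if is_strongly_consistent(read_cl, write_cl, replication_factor)
    ]


print(strong_combinations(3))
```

For a replication factor of 3, the list includes (1, 3) and (3, 1) from the examples above, and also (2, 2), which spreads the cost evenly between reads and writes.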

This decision needs to be reflected in all the applications that access your Cassandra data because, on a query level, you need to set the required consistency level.

You set a replication factor of 3. Therefore, you can use a consistency level of ALL or THREE:

cqlsh> 
   CONSISTENCY ALL;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

If just one of your applications violates the required consistency strategy, you quickly risk either losing consistency or putting more pressure on the cluster than required.

Tune for Performance by Using Eventual Consistency

If you don't need to be strongly consistent, you can reduce the consistency level for queries to 1 to gain performance:

cqlsh> 
   CONSISTENCY ONE;
   SELECT * FROM learn_cassandra.users_by_country WHERE country='US';

Eventually, the data will be spread to all replicas and this will ensure eventual consistency. How fast data will be made consistent depends on different mechanics that sync data between nodes.

Various features can be tuned in Cassandra, like read-repairs and external processes that repair data continuously.
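To illustrate what a stale read and a later repair look like, here is a toy Python model. It assumes the newest write timestamp wins when replicas disagree, and it simply contacts the first replicas in the list, which is a simplification of how a coordinator actually picks replicas. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # write time; the newest timestamp wins


def read(replicas, consistency_level):
    """Ask the first `consistency_level` replicas and return the newest cell."""
    contacted = replicas[:consistency_level]
    return max(contacted, key=lambda cell: cell.timestamp)


def repair(replicas):
    """Copy the newest cell to every replica, as a repair process would."""
    newest = max(replicas, key=lambda cell: cell.timestamp)
    return [Cell(newest.value, newest.timestamp) for _ in replicas]


# Replica 0 missed the latest write and still holds the old value.
replicas = [Cell("old", 1), Cell("new", 2), Cell("new", 2)]

print(read(replicas, 1).value)          # old: a read at ONE may return stale data
print(read(replicas, 3).value)          # new: a read at ALL sees the latest write
print(read(repair(replicas), 1).value)  # new: after repair, ONE is enough
```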

Optimize Data Storage for Reading or Writing

Writes are cheaper than reads in Cassandra due to its storage engine. Writing data means simply appending something to a so-called commit-log.

Commit-logs are append-only logs of all mutations local to a Cassandra node and reduce the required I/O to a minimum.

Reading is more expensive, because it might require checking different disk locations until all the query data is eventually found.

But this does not mean Cassandra is terrible at reading. Instead, Cassandra's storage engine can be tuned for reading performance or writing performance.
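The asymmetry between writes and reads can be sketched with a heavily simplified model of a single node. Besides the commit log, the sketch uses an in-memory table of recent writes (a memtable) that is flushed to immutable on-disk files (SSTables). The flush threshold, class name, and method names are assumptions made for illustration; real nodes are far more sophisticated.

```python
class ToyNode:
    """A heavily simplified model of one node's write and read path."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only, for durability
        self.memtable = {}    # in-memory, holds recent writes
        self.sstables = []    # immutable files, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # cheap sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a new immutable SSTable."""
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        """Return (value, places_checked); newer data shadows older data."""
        checked = 1
        if key in self.memtable:
            return self.memtable[key], checked
        for sstable in reversed(self.sstables):
            checked += 1
            if key in sstable:
                return sstable[key], checked
        return None, checked


node = ToyNode()
for i in range(5):
    node.write(f"k{i}", i)

print(node.read("k4"))  # (4, 1): still in the memtable
print(node.read("k0"))  # (0, 3): found only after checking two SSTables
```

Every write costs one append and one in-memory update, while a read of older data may have to look in several places.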

Understanding Compaction

For every write operation, data is written to disk to provide durability. This means that if something goes wrong, like a power outage, data is not lost.

SSTables are the foundation for storing data. They are immutable data files that Cassandra uses to persist data on disk.

You can set various strategies for a table that define how data should be merged and compacted. These strategies affect read and write performance:

SizeTieredCompactionStrategy is the default and performs especially well if you have more writes than reads.
LeveledCompactionStrategy optimizes for reads over writes. This optimization can be costly and needs to be tried out carefully in production.
TimeWindowCompactionStrategy is for time-series data.

By default, tables use the SizeTieredCompactionStrategy:

cqlsh> 
   DESCRIBE TABLE learn_cassandra.users_by_country;

CREATE TABLE learn_cassandra.users_by_country (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

Although you can alter the compaction strategy of an existing table, I would not suggest doing so, because all Cassandra nodes start this migration simultaneously. This will lead to significant performance issues in a production system.

Instead, define the compaction strategy explicitly during table creation of your new table:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), user_email)
) WITH
  compaction = { 'class' :  'LeveledCompactionStrategy'  };

Let’s check the result:

cqlsh> 
   DESCRIBE TABLE learn_cassandra.users_by_country_with_leveled_compaction;

CREATE TABLE learn_cassandra.users_by_country_with_leveled_compaction (
    country text,
    user_email text,
    age smallint,
    first_name text,
    last_name text,
    PRIMARY KEY (country, user_email)
) WITH CLUSTERING ORDER BY (user_email ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

The strategies define when and how compaction is executed. Compaction means rearranging data on disk to remove old data and keep performance as good as possible when more data needs to be stored.
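Conceptually, compaction merges several immutable files into one and drops data that has been overwritten. The following Python sketch shows only that core idea, assuming each SSTable is a dictionary of key to (timestamp, value). It ignores deletions and everything that distinguishes the individual strategies.

```python
def compact(sstables):
    """Merge SSTables into one, keeping only the newest version of each key.

    Each SSTable is a dict of key -> (timestamp, value).
    """
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


sstables = [
    {"alice": (1, "todo v1"), "bob": (1, "todo a")},
    {"alice": (2, "todo v2")},
    {"carol": (3, "todo x")},
]
compacted = compact(sstables)
print(compacted["alice"])  # (2, 'todo v2'): the older version is gone
```

After compaction, a read for any key has to check one file instead of three, which is where the read performance gain comes from.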

Check out the excellent DataStax documentation about compaction for details. Strategies that perform even better for your use case may be added in the future.

Presorting Data on Cassandra Nodes

A table always requires a primary key. A primary key consists of 2 parts:

At least one column as the partition key and
Zero or more clustering columns for ordering rows within a partition.

All columns of the partition key together are used to identify partitions. All primary key columns, meaning partition key and clustering columns, identify a specific row within a partition.

In Cassandra, data is already sorted on disk. So if you want to avoid sorting data later, you can make sure sorting is applied as needed. This can be ensured on the table level and avoids having to sort data in the client applications that query Cassandra.

In our users_by_country table, you can define age as another clustering column to sort stored data:

cqlsh> 
CREATE TABLE learn_cassandra.users_by_country_sorted_by_age_asc (
    country text,
    user_email text,
    first_name text,
    last_name text,
    age smallint,
    PRIMARY KEY ((country), age, user_email)
) WITH CLUSTERING ORDER BY (age ASC);

Let’s add the same users again, this time with different ages:

cqlsh> 
INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('US','john@email.com', 'John','Wick',10);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'peter@email.com', 'Peter','Clark',30);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'bob@email.com', 'Bob','Sandler',20);

INSERT INTO learn_cassandra.users_by_country_sorted_by_age_asc (country,user_email,first_name,last_name,age)
  VALUES('UK', 'alice@email.com', 'Alice','Brown',40);

And get the data by country:

cqlsh> 
      SELECT * FROM learn_cassandra.users_by_country_sorted_by_age_asc WHERE country='UK';

 country | age | user_email      | first_name | last_name
---------+-----+-----------------+------------+-----------
      UK |  20 |   bob@email.com |        Bob |   Sandler
      UK |  30 | peter@email.com |      Peter |     Clark
      UK |  40 | alice@email.com |      Alice |     Brown

(3 rows)

In this example, the clustering columns are age and user_email. So the data is first sorted by age and then by user_email. At its core, Cassandra is still like a key-value store. Therefore, you can only query the table by:

country
country and age
country, age, and user_email

But never by country and user_email.
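One way to see why the order of clustering columns matters is to model a partition as a list sorted by (age, user_email). The Python sketch below is an analogy rather than a description of Cassandra's internals: a lookup on the leading column can use a binary search, while a lookup that skips it has nothing to search on and must check every row.

```python
import bisect

# One partition (country = 'UK'), kept sorted by its clustering columns (age, user_email).
uk_partition = sorted([
    (30, "peter@email.com", "Peter", "Clark"),
    (20, "bob@email.com", "Bob", "Sandler"),
    (40, "alice@email.com", "Alice", "Brown"),
])


def rows_with_age(partition, age):
    """Prefix lookup on the first clustering column: binary search finds the slice."""
    lo = bisect.bisect_left(partition, (age,))
    hi = bisect.bisect_left(partition, (age + 1,))
    return partition[lo:hi]


def rows_with_email(partition, user_email):
    """Skipping `age` means the sort order does not help: every row is checked."""
    return [row for row in partition if row[1] == user_email]


print(rows_with_age(uk_partition, 30))
print(rows_with_email(uk_partition, "alice@email.com"))
```

With three rows the difference is invisible, but in a partition with millions of rows the second function is the kind of work Cassandra refuses to do implicitly.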

After learning about partitioning, replication and consistency levels, let's head into data modeling and have more fun with the Cassandra cluster.

Data Modeling

You've already learned a lot about the fundamentals of Cassandra.

Let's put your knowledge into practice and design a to-do list application that receives many more reads than writes.

The best approach is to analyze some user stories you want to fulfill with your table design:

As a user, I want to create a to-do element

Note: This is only about creating data. For now, you can delay some decisions because you want to focus on how data is read.

As a user, I want to list all my to-do elements in ascending order

First, you need to query by user_email. Create a table called todo_by_user_email.

You need 1 table that contains all the information of a to-do element of a user. Data should be partitioned by user_email for efficient read and writes by user_email.

Also, the oldest records should be displayed first, which means using the creation date as a clustering column. The creation_date also ensures uniqueness:

cqlsh> 
CREATE TABLE learn_cassandra.todo_by_user_email (
    user_email text,
    name text,
    creation_date timestamp,
    PRIMARY KEY ((user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date ASC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

As a user, I want to share a to-do element with another user

To get all the to-dos shared with a user, you need to create a table called todos_shared_by_target_user_email to display all shared to-dos for the target user.

The table contains the to-do name to display it.

But the user also wants to see the to-dos they shared with other users. This is another table, todos_shared_by_source_user_email.

Both tables have, according to the use-case, the required user_email as partition keys to allow efficient queries. Also, creation_date is added as a clustering column for sorting and uniqueness:

cqlsh> 
CREATE TABLE learn_cassandra.todos_shared_by_target_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((target_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

CREATE TABLE learn_cassandra.todos_shared_by_source_user_email (
    target_user_email text,
    source_user_email text,
    creation_date timestamp,
    name text,
    PRIMARY KEY ((source_user_email), creation_date)
) WITH CLUSTERING ORDER BY (creation_date DESC)
AND compaction = { 'class' :  'LeveledCompactionStrategy'  };

This type of modeling is different from thinking about foreign keys and primary keys that you might know from traditional databases. In the beginning, it's all about defining tables and thinking about what values you want to filter and need to display.

You need to set a partition key to ensure the data is organized for efficient read and write operations. Also, you need to set clustering columns to ensure uniqueness, sort order, and optional query parameters.

Keep Data in Sync Using `BATCH` Statements

Due to the duplication, you need to take care to keep data consistent. In Cassandra, you can do that by using BATCH statements that give you an all-at-once guarantee, also called atomicity.

This might sound like a lot of work, and yes, it is a lot of work! If you have a table schema with many relationships, you will have more work compared to a normalized table schema.

What is a normalized table schema?

A normalized table schema is optimized to contain no duplications. Instead, data is referenced by ID and needs to be joined later.

In Cassandra, you try to avoid normalized tables. It is not even possible to write a query that contains a join.

Batch statements are cheap on a single partition, but dangerous when you execute them on different partitions, because:

Data mutations are not applied to all partitions at the same time, and there is no isolation between them
It is expensive for the coordinator node, because it has to talk to multiple nodes and prepare for a rollback if something goes wrong
There is a batch size limit of 50 KB to avoid overloading the coordinator. This limit can be increased, but this is not recommended

In general, batches are costly.

There are other ways to apply changes eventually. If you need to execute them very often, consider using async queries instead with a proper retry mechanism.

Depending on the way you access your Cassandra, the driver might already offer you retry capabilities.

Still, this approach requires thinking about what will happen if a query is never executed. If every query really needs to be executed eventually, how can you make sure that it does not get lost if your service goes down?
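A retry mechanism can be as small as the following Python sketch. It is not tied to any Cassandra driver: `execute` is a hypothetical zero-argument callable that runs one query and raises an exception on failure, and the attempt count and delays are arbitrary example values. Retrying like this is only safe when running the same query twice leaves the same result.

```python
import time


def execute_with_retry(execute, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `execute()` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except Exception:
            if attempt == max_attempts:
                raise  # give up; the caller decides what happens to the write
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Usage with a stand-in query that fails twice before succeeding.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated timeout")
    return "applied"


print(execute_with_retry(flaky_query))  # applied
```

The sketch still loses the query if the process dies between attempts, which is exactly the open question raised above.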

The topic itself needs much more time to explain, and might be the main topic of another Cassandra tutorial.

The key learning here is:

Single partition batches are cheap and should be used
Batches that include different partitions are expensive, and if there are a lot of reads/writes, this might be the reason why a Cassandra cluster is exhausted.
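Before sending a batch, it can help to check how many partitions it would touch. The Python sketch below does this for a hypothetical in-memory description of the statements. As a simplification, it compares partition key values only, on the assumption that rows with the same partition key value in the same keyspace are stored on the same replicas.

```python
from collections import namedtuple

# Hypothetical description of one statement in a batch.
Statement = namedtuple("Statement", ["table", "partition_key"])


def partition_keys_touched(statements):
    """Return the distinct partition key values a batch would write to."""
    return {statement.partition_key for statement in statements}


def is_single_partition(statements):
    return len(partition_keys_touched(statements)) == 1


two_todos_for_alice = [
    Statement("todo_by_user_email", "alice@email.com"),
    Statement("todo_by_user_email", "alice@email.com"),
]
share_todo_with_bob = [
    Statement("todo_by_user_email", "alice@email.com"),
    Statement("todos_shared_by_target_user_email", "bob@email.com"),
    Statement("todos_shared_by_source_user_email", "alice@email.com"),
]

print(is_single_partition(two_todos_for_alice))  # True
print(is_single_partition(share_todo_with_bob))  # False
```

A check like this could run in tests or code review to flag batches that span many partitions before they reach production.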

Let’s create a BATCH statement that contains a to-do element that is shared with a user:

cqlsh> 

BEGIN BATCH
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('alice@email.com', toTimestamp(now()), 'My first todo entry')

  INSERT INTO learn_cassandra.todos_shared_by_target_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com',toTimestamp(now()), 'My first todo entry')

  INSERT INTO learn_cassandra.todos_shared_by_source_user_email (target_user_email, source_user_email,creation_date,name) VALUES('bob@email.com', 'alice@email.com', toTimestamp(now()), 'My first todo entry')

APPLY BATCH;

Let’s look into one of the tables:

cqlsh>          
 SELECT * FROM learn_cassandra.todos_shared_by_target_user_email WHERE target_user_email='bob@email.com';

 target_user_email | creation_date  | name                | source_user_email
-------------------+----------------+---------------------+-------------------
     bob@email.com | 2021-05-24 ... | My first todo entry |   alice@email.com

All the data exists and can be accessed in a performant way using all the defined tables.

Use Foreign Keys Instead of Duplicating Data in Cassandra

You might consider using foreign keys instead of duplicating data.

Traditionally, a foreign key is an ID reference to an entity that is stored in another table. In a relational database, it guarantees that the referenced ID exists.

In Cassandra, this might feel good because you have less duplicated data. At this point, think again about why you use Cassandra. Usually, the answer is high traffic and scalability.

Cassandra can scale enormously and comes with top performance when used correctly.

Normalizing tables is against a lot of principles in Cassandra. You can reference data by ID, but keep in mind this means you need to join the data yourself. This also means reading and writing data to multiple partitions at once.

Cassandra is built for scale. If you start normalizing your schema to reduce duplication, then you sacrifice horizontal scalability.

If you still want to use foreign keys instead of data duplication, you might want to use another database. But, everything comes with trade-offs.

Instead of using Cassandra, you could use a database that sacrifices performance and availability, and gives more consistency guarantees. In cases like this, I can recommend Cloud Spanner or CockroachDB for a scalable relational database.

Indexes in Cassandra

There are index-like features in Cassandra that can reduce the number of tables you need to maintain on your own. One feature is called secondary indexes.

I cannot recommend them because they only operate locally to a node.

Using a secondary index means talking to all nodes because the coordinator doesn’t know which nodes contain the data if you use other columns to query data than the actual partition key.

Materialized Views

Materialized views were designed with scalability in mind.

They make it easier to duplicate tables with different partition keys so you can query data by different column combinations. They also simplify the process of creating a new table and ensuring data integrity for mutations.

There is only one drawback: the materialized view's primary key must contain the source table's full primary key, plus optionally one other column.

The columns that act as partition keys can be different.

Running a Cluster

Running a Cassandra cluster can be intense. It contains your business-critical data and is usually under heavy pressure.

I won't go into details because I am more of a Cassandra user than an expert in cluster maintenance. Still, I want to share what I know.

Fully Managed Cassandra

DataStax started a fully managed Cassandra product called Astra. They promise a lot:

Start in minutes with a free tier, no credit card needed.

Eliminate the overhead to install, operate, and scale Cassandra clusters.

Build faster with REST, GraphQL, CQL, and JSON/Document APIs.

Built on open-source Apache Cassandra™, used by the best of the internet.

Scale elastically — apps are viral ready from Day 1.

Deploy multi-cloud, multi-tenant or dedicated clusters on AWS, Azure, or GCP.

Ensure enterprise-level reliability, security, and management.

Quoted from the Astra docs

I have no experience with their offering. But I would give it a try! Their pricing sounds reasonable.

Self-Managed Cassandra

Cassandra is built with Java. So knowing the basics of running JVM applications is very beneficial.

If you run Kubernetes, then definitely check out K8ssandra. It bundles all the helpful tools around Cassandra like:

Stargate.io for REST, GraphQL, and API documentation
Reaper for easier repair management
Medusa for backups
Metrics collector for monitoring
Traefik for ingress

This stack of tools is fully open source and can be used without any additional monetary costs.

For developers, there is one very beneficial tool called nodetool. It can inspect and provide insights into how many nodes are up, what size certain tables have, how many SSTables and tombstones exist. Nodetool can also repair your data to enforce eventual consistency.

Other Learnings

Even after years of using Cassandra, there are still things to learn that let you use Cassandra more efficiently. In this section, I want to share various topics that you will experience eventually.

Data Migrations

If you have worked with other databases before, you might know database migration tools like Flyway or Liquibase. Since version 4.0 RC-1, there is basic Liquibase support.

Additionally, the community worked on something similar with Cassandra-migration. It already supports advanced features such as leader election, for when multiple services start at the same time.

Any type of export and import can be done using DSBulk, which loads and unloads data to and from Cassandra in CSV and JSON formats.

Tombstones

Cassandra is a multi-node cluster that contains replicated data on different nodes. Therefore, a delete cannot simply remove a particular record.

For a delete operation, a new entry is added to the commit-log like for any other insert and update mutation. These deletes are called tombstones, and they flag a specific value for deletion.

Tombstones exist only on disk and can be analyzed and traced as described in this blog post: About Deletes and Tombstones in Cassandra.

In Cassandra, you can set a time to live on inserted data. After the time passed, the record will be automatically deleted. When you set a time to live (TTL), a tombstone is created with a date in the future.

A regular delete query works the same way, except that the timestamp of the tombstone is set to the moment the delete is executed.

Let’s create a tombstone by setting a TTL in seconds, which basically functions as a delayed delete:

cqlsh>     
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'This entry should be removed soon') USING TTL 60;

And the data is stored like regular data:

cqlsh>      
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

  user_email    | creation_date | name
----------------+---------------+--------------------
 john@email.com | 2021-05-30... | This entry should be removed soon

(1 rows)

You can also read the TTL from the database for a given column:

cqlsh> 
 SELECT TTL(name) FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 ttl(name)
-----------
        43

(1 rows)

After 60 seconds, the row is gone.

cqlsh>  
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                  

 user_email | creation_date | todo_uuid | name
-----------+---------------+-----------+------

(0 rows)

Setting a TTL is one of many ways to create and execute tombstones.

Unfortunately, there are also others.

For example, when you insert a null value, a tombstone is created for the given cell. And as mentioned for delete requests, different types of tombstones are stored.

By default, after 10 days, data that is marked by a tombstone is freed during a compaction. This grace period can be configured and reduced per table using the gc_grace_seconds option.

When is a compaction executed?

When the operation is executed depends mainly on the selected strategy. In general, a compaction execution takes SSTables and creates new SSTables out of it.

The most common executions are:

An automatic compaction, triggered when data is inserted and the conditions of the selected strategy are met

A manually executed major compaction using the nodetool

Sometimes, tombstones are not deleted for the following reasons:

Null values mark values to be deleted and are stored as tombstones. This can be avoided by either replacing null with a static value, or not setting the value at all if the value is null
Empty lists and sets are similar to null for Cassandra and create a tombstone, so don’t insert them if they’re empty. Take care to avoid null pointer exceptions when storing and retrieving data in your application
Updated lists and sets create tombstones. If you update an entity and the list or set does not change, it still creates a tombstone to empty the list and set the same values. Therefore, only update necessary fields to avoid issues. The good thing is, they are compacted due to the new values
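The first two points can be handled in application code before a row is written. The following JavaScript sketch strips null values and empty collections from an object so that only real values end up in the INSERT. The helper name withoutTombstoneFields is hypothetical and not part of any Cassandra driver.

```javascript
// Drop fields that Cassandra would store as tombstones: null/undefined values
// and empty arrays or Sets. Everything else is kept unchanged.
function withoutTombstoneFields(row) {
  const isEmptyCollection = (value) =>
    (Array.isArray(value) && value.length === 0) ||
    (value instanceof Set && value.size === 0);

  return Object.fromEntries(
    Object.entries(row).filter(
      ([, value]) => value !== null && value !== undefined && !isEmptyCollection(value)
    )
  );
}

const todo = { user_email: "john@email.com", name: "Clean up", tags: [], note: null };
console.log(withoutTombstoneFields(todo));
// Only user_email and name remain, so only those columns would be written.
```

When reading such rows back, the application has to treat a missing column as "no value", which is where the null pointer caveat above comes in.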

If you have many tombstones, you might run into another Cassandra issue that prevents a query from being executed.

This happens when the tombstone_failure_threshold is reached, which is set by default to 100,000 tombstones. This means that, when a query has iterated over more than 100,000 tombstones, it will be aborted.

The issue here is, once a query stops executing, it’s not easy to tidy things up because Cassandra will stop even when you execute a delete, as it has reached the tombstone limit.

Usually you would never have that many tombstones. But mistakes happen, and you should take care to avoid this case.

There is a handy operation metric that you should observe called TombstoneScannedHistogram to avoid unexpected issues in production.

`UPDATE`s Are Just `INSERT`s, and Vice Versa

In Cassandra, everything is append-only. There is no difference between an update and insert.

You already learned that a primary key defines the uniqueness of a row. If there is no entry yet, a new row will appear, and if there is already an entry, the entry will be updated. It does not matter whether you execute an update or an insert query.

The primary key in our example consists of user_email and creation_date, which together define the uniqueness of a record.

Let’s insert a new record:

cqlsh>      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email, creation_date, name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query');

And execute an update with a new creation_date:

cqlsh>    
  UPDATE learn_cassandra.todo_by_user_email SET 
    name = 'Update query'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:10:19.622+0000';

Two new rows appear in our table:

cqlsh>    
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';                                                                                                            

  user_email     | creation_date                   | name
----------------+---------------------------------+--------------
 john@email.com | 2021-03-14 16:10:19.622000+0000 | Update query
 john@email.com | 2021-03-14 16:07:19.622000+0000 | Insert query

(2 rows)

So you inserted a row using an update, and you can also use an insert to update:

cqlsh>       
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query updated');

Let’s check our updated row:

cqlsh>   
 SELECT * FROM learn_cassandra.todo_by_user_email WHERE user_email='john@email.com';

 user_email     | creation_date            | name
----------------+--------------------------+----------------------
 john@email.com | 2021-03-14 16:10:19.62   |         Update query
 john@email.com | 2021-03-14 16:07:19.62   | Insert query updated


(2 rows)

So UPDATE and INSERT are technically the same. Don’t think that an INSERT fails if there is already a row with the same primary key.

The same applies to an UPDATE — it will be executed, even if the row doesn’t exist.

The reason is that, by design, Cassandra rarely reads before writing, which keeps performance high. The only exceptions are described in the next section about lightweight transactions.
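As a mental model only, the behavior can be imitated with a JavaScript Map keyed by the primary key. This is a toy sketch, not how Cassandra stores data internally: the names table, keyOf, and write are hypothetical, and the sketch merges columns eagerly, whereas Cassandra appends the write and reconciles it later.

```javascript
// Toy model of a table: rows keyed by their primary key (user_email + creation_date).
const table = new Map();
const keyOf = (row) => `${row.user_email}|${row.creation_date}`;

// INSERT and UPDATE both end up here: set the given columns for the key,
// creating the row if it does not exist yet.
function write(row) {
  table.set(keyOf(row), { ...table.get(keyOf(row)), ...row });
}
const insert = write;
const update = write;

insert({ user_email: "john@email.com", creation_date: "2021-03-14 16:07", name: "Insert query" });
// An "update" for a key that does not exist yet creates a row.
update({ user_email: "john@email.com", creation_date: "2021-03-14 16:10", name: "Update query" });
// An "insert" for an existing key overwrites the row.
insert({ user_email: "john@email.com", creation_date: "2021-03-14 16:07", name: "Insert query updated" });

console.log(table.size); // 2
```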

But there are restrictions on which actions you can execute with an update or an insert:

Counters can only be changed with UPDATE, not with INSERT
IF NOT EXISTS can only be used in combination with an INSERT
IF EXISTS can only be used in combination with an UPDATE

You will learn more about conditions in queries within the next section.

Lightweight Transactions

You can use conditions in queries using a feature called lightweight transactions (LWTs), which execute a read to check a certain condition before executing the write.

Let’s only update if an entry already exists, by using IF EXISTS:

cqlsh>     
  UPDATE learn_cassandra.todo_by_user_email SET
    name = 'Update query with LWT'
  WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14 16:07:19.622+0000' IF EXISTS;

 [applied]
-----------
      True

The same works for an insert query using IF NOT EXISTS:

cqlsh>      
  INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name) VALUES('john@email.com', toTimestamp(now()), 'Yet another entry') IF NOT EXISTS;

 [applied]
-----------
      True

Those executions are expensive compared to simple UPDATE and INSERT queries. Still, if it’s business-critical, they are an excellent way to achieve transactional safety.
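The semantics of the two conditions can be sketched in a few lines of JavaScript. This is a single-process toy model with hypothetical function names; it only illustrates the "read, check, then write" behavior and the applied flag, not the coordination between replicas that makes real lightweight transactions expensive.

```javascript
const rows = new Map();

// IF NOT EXISTS: write only when no row with this key is present.
function insertIfNotExists(key, value) {
  if (rows.has(key)) return { applied: false };
  rows.set(key, value);
  return { applied: true };
}

// IF EXISTS: write only when a row with this key is already present.
function updateIfExists(key, value) {
  if (!rows.has(key)) return { applied: false };
  rows.set(key, { ...rows.get(key), ...value });
  return { applied: true };
}

console.log(updateIfExists("john@email.com", { name: "Too early" })); // { applied: false }
console.log(insertIfNotExists("john@email.com", { name: "First entry" })); // { applied: true }
console.log(insertIfNotExists("john@email.com", { name: "Duplicate" })); // { applied: false }
```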

Conclusion

I hope you enjoyed the article.

If you liked it and feel the need to give me a round of applause, or just want to get in touch, follow me on Twitter.

I work at eBay Kleinanzeigen, one of the world’s biggest classified companies. By the way, we are hiring!

Special thanks goes to Roger Sheen, Michael de la Fontaine, Christian Baer, Thomas Uebel and Swen Fuhrmann for excellent feedback and proof-reading.

How to Create a NoSQL Database with RavenDB

freeCodeCamp — Fri, 09 Jul 2021 15:06:24 +0000

By Nahla Davies

If you look at any website or application today, somewhere under the hood there is a database. After all, we live in the world of Big Data. And the volume of data is growing exponentially.

With so much data at hand, we need ever more sophisticated ways to store it and process it.

So job markets continue to be strong for most computer professionals working remotely from home, including database architects and database administrators.

There are even more opportunities in data science and analytics. But you need a solid foundation in database programming to take advantage of these opportunities.

In this article, I'll introduce you to the RavenDB database management system. We'll review some essential RavenDB features and after that I'll walk you through setting up your first RavenDB database.

What is RavenDB?

RavenDB is a cross-platform, distributed, ACID-compliant, document-based, NoSQL database that offers high performance while remaining fairly easy to use.

Knowledge of data programming is also crucial for web and software development, which has become one of the most lucrative remote working jobs in the United States today.

RavenDB Features

To use RavenDB effectively, you should understand how each of its features works and why they're important.

Cross-platform

RavenDB is available for Windows, Linux, and Raspberry Pi. Mac users can run RavenDB within the Docker container system.

This gives developers great flexibility when developing databases and associated applications.

Distributed database

Generally speaking, a distributed database hosts data in multiple physical locations (for example, different sites or computers).

While the specifics of RavenDB's distributed architecture are beyond the scope of this article, you should understand two of its fundamental elements: clusters and nodes.

Clusters are collections of an odd number of machines, with a minimum of three. Each machine in the cluster is a node. Databases can spread across one or more nodes in the cluster. In some instances, an entire database may be present on each node in a cluster.

In addition to data distribution, clusters self-manage distribution of work, along with failure and recovery efforts.

Distributed database architecture allows for high transaction throughput, that is, high performance. RavenDB can handle up to 150,000 writes and 1 million reads per second.

Compared to traditional relational databases, a distributed architecture is also more resilient when failures occur.

The distributed architecture of NoSQL databases (see below) makes them useful for developing mobile applications. Still, you should remain vigilant against mobile security risks, as 89% of mobile device vulnerabilities do not require physical access to the mobile device.

ACID-compliant

ACID is an acronym for a set of database properties that help ensure the reliable processing of database transactions:

Atomicity ensures that every database transaction is treated as a single unit, no matter how many statements the transaction includes. Atomicity prevents problematic partial updates. During processing, transactions either succeed or fail as units. If a single statement within the transaction fails, the entire transaction fails. Other database clients can never perceive a transaction to be partially resolved.
Consistency ensures that transactions comply with all data validation rules in the database. If a transaction generates non-compliant data, the database rolls back to the prior valid version.
Isolation ensures that when multiple transactions take place concurrently, the transactions do not affect each other and do not attempt to use data from an in-process transaction. The final database update for a set of concurrent transactions is the same as if each transaction was processed in series.
Durability prevents the loss of completed transaction data, even in the event of post-processing system failures. Completed transaction data becomes permanent in the database system, typically in non-volatile memory.
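Atomicity is the easiest of the four properties to illustrate in code. The JavaScript sketch below is a simplified in-memory model with hypothetical names (runAtomically, debit, credit), not RavenDB's API: every operation runs against a private copy, and the copy replaces the original state only if no operation fails.

```javascript
// Apply all operations to a private copy; hand the copy back only if none of them throws.
function runAtomically(state, operations) {
  const draft = JSON.parse(JSON.stringify(state));
  for (const op of operations) op(draft);
  return draft; // "commit": the caller replaces its state with the draft
}

const debit = (name, amount) => (s) => {
  if (s[name] < amount) throw new Error("insufficient funds");
  s[name] -= amount;
};
const credit = (name, amount) => (s) => {
  s[name] += amount;
};

let accounts = { alice: 100, bob: 0 };
try {
  // The credit succeeds on the draft, the debit fails, so nothing is committed.
  accounts = runAtomically(accounts, [credit("bob", 150), debit("alice", 150)]);
} catch (err) {
  // The failed transaction left no partial update behind.
}
console.log(accounts); // { alice: 100, bob: 0 }
```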

Most NoSQL databases are not ACID-compliant. RavenDB is an exception, using ACID principles to drive high performance while also ensuring data integrity and reliability.

NoSQL

The value of NoSQL versus SQL is often debated. For our purposes, we can simplify the difference.

In traditional relational databases, SQL programming dominates. In non-relational, distributed databases, NoSQL reigns.

SQL databases rely on tables. NoSQL databases can use other bases, including documents (as RavenDB does), dynamic tables, key-value pairs, and more.

NoSQL databases rely on distributed architecture to scale horizontally. As the database size increases, it is split among several different nodes in a cluster. SQL databases scale vertically – more data requires larger servers.

Searches are also frequently faster in NoSQL databases. Whereas SQL database queries rely on joins or combinations of data from multiple tables into a new table, NoSQL queries typically do not need joins.

Since many NoSQL implementations are cloud-based, developers must always keep encryption of their databases and applications front of mind for security purposes.

Document-based

Document-based does not mean that Raven only stores PDFs or word processing documents. For the purposes of NoSQL databases, a document is a collection of structured (actually semi-structured) self-contained data.

You can use one of several languages to code the documents that will eventually reside in the NoSQL database, including Extensible Markup Language (XML) and JavaScript Object Notation (JSON). RavenDB primarily uses JSON documents.

Document-based databases are generally more efficient than their relational counterparts because they store all information about an object in a single document instance rather than spread across multiple tables. This structure increases database efficiency, as it does not require object-relational mapping.

How to Create a New RavenDB Database

It's relatively simple to create a new RavenDB database. But before creating a database, you first need to install the RavenDB system.

You can download RavenDB on its website depending on your chosen operating system (Windows, Linux, or Raspberry Pi), and there's a Docker version for Mac users.

Installation is quick and easy. You must select whether you want to use a secure or non-secure version.

Secure versions require you to either have or obtain a security certificate, but getting one through RavenDB is also painless. Free certificate licenses are available for the entry-level version of RavenDB.

Once you have installed RavenDB, only a few steps remain before you are working in your first database:

Log in to your RavenDB application and go to your dashboard.
You will see a menu item for Databases on the dashboard, which you will click to start the process.

The window that opens includes a dropdown to search for existing databases, a search box, and a New database button. Click the New database button.
Once you have opened the new database, you must name it. Names may be as long as 128 characters, including letters, numbers and a limited selection of special characters (“-”, “_”, “.”).
After naming your database, you must assign a replication factor, which specifies distribution of your data across nodes. A replication factor of one means all data is in a single node. For settings above 1, you can choose between dynamic distribution or manual replication node setting (with the appropriate license).
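If you create databases from a script or a form, you may want to check a name before submitting it. The sketch below encodes only the naming rule stated above (up to 128 characters; letters, numbers, "-", "_", "."). The function name isValidDatabaseName is hypothetical, it assumes ASCII letters, and RavenDB may enforce additional rules that are not covered here.

```javascript
// Rules taken from the text: 1 to 128 characters; letters, digits, "-", "_" and ".".
function isValidDatabaseName(name) {
  return /^[A-Za-z0-9._-]{1,128}$/.test(name);
}

console.log(isValidDatabaseName("my-first_db.v1")); // true
console.log(isValidDatabaseName("no spaces allowed")); // false
```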

After completing these steps, you will return to the main database window. All that is left to do is click on the database name, and you are ready to begin creating documents.

For true beginners, RavenDB offers users the option to populate an empty database with sample data so that you can get a better feel for how to work in the database.

Wrap-up

RavenDB is a powerful, robust, easy-to-use and easy-to-learn NoSQL database system.

For users looking to improve their database design and administration skills, RavenDB is a user-friendly training ground.

How to Use Transactions in MongoDB to Prevent Inconsistencies in Your Java Code

freeCodeCamp — Tue, 12 Jan 2021 06:33:00 +0000

By Haritha Yahathugoda

MongoDB 4.0 introduced multi-document transactions, and version 4.2 extended them to sharded clusters. This was a key feature that was missing from most NoSQL databases (and which SQL DBs bragged about).

A transaction, which can be composed of one or more operations, acts as an atomic operation. If all sub-operations succeed, that transaction is considered to be completed. Otherwise it fails.

This is called atomicity. This is an important concept to understand to keep your data consistent when reading/writing data concurrently.

Article Scope And Goals

The goal of this article is to present you with a real life example where data inconsistencies occur without transactions. Then we will build a solution in Java using MongoDB Transactions to prevent them.

By doing so, you will learn to:

Avoid Race Conditions that could result in data inconsistencies
Build more resilient applications by using Mongo's built-in Retryable Writes

Also, I added one wrapper function, static <R> R withTransaction(final Function<ClientSession, R> executeFn);, that you can use to improve code readability.

Example: How to Handle Concurrent Transactions Against the Same Bank Account

Assume you and your spouse share a joint bank account. Each of you goes to the ATM at the same time and starts withdrawing money.

t1 -> You: Press check balance. ATM shows 100 dollars
t2 -> Spouse: Press check balance. ATM shows 100 dollars
t3 -> You & Spouse: withdraw 10 dollars
t4 -> Bank: initializes P1 and P2 to handle your and your spouse's requests.
t5 -> P1 and P2 checked the balance and saw 100 dollars
t6 -> P1 and P2 subtracted 10 dollars from the balance
t7 -> P1 updated the DB with the new balance of 90
t8 -> P2 updated the DB with the new balance of 90

In the above example, operations did not occur sequentially. The bank's process P2 did not wait for P1 to complete its tasks. If P2 had waited for P1 to finish reading the balance, calculating the new balance, and writing the updated balance back to the DB before reading the balance itself, the bank wouldn't have lost 10 dollars.
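The timeline can be replayed in a few lines of JavaScript. This is a deterministic toy simulation with hypothetical names (db, startWithdrawal), not MongoDB code: each process is split into its read step and its write step so that the interleaving from t5 to t8 can be reproduced exactly.

```javascript
const db = { balance: 100 };

// Each "process" reads the balance first and writes later, based on what it read.
function startWithdrawal(amount) {
  const seen = db.balance; // t5: read
  return () => {
    db.balance = seen - amount; // t7/t8: write based on a possibly stale read
  };
}

const p1Write = startWithdrawal(10);
const p2Write = startWithdrawal(10); // reads 100 before P1 has written
p1Write();
p2Write();

console.log(db.balance); // 90, although 20 dollars were withdrawn
```

If P2 only started after p1Write() had run, it would read 90 and write 80, which is the outcome a transaction enforces.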

The solution to this problem is transactions. You can think of them as somewhat similar to Locks, Semaphores, and Synchronized blocks in Java. In Java, a lock guarantees that only the lock holder executes the code protected by it.

How to Set Up Helper Functions

Now let's get to the coding part. I'm going to assume you already have a MongoClient setup. You will need Java Mongo Driver 3.8 or higher.

final static MongoClient client; // assumed you initialized this somewhere

public static ClientSession getNewClientSession() {
    return client.startSession();
}

public static TransactionOptions getTransactionOptions() {
    return TransactionOptions.builder()
        .readPreference(ReadPreference.primary())
        .readConcern(ReadConcern.LOCAL)
        .writeConcern(WriteConcern.MAJORITY)
        .build();
}

getNewClientSession simply returns a session for a transaction. ClientSession is an identifier for a particular transaction. This is an important piece of data that you pass into all following Mongo operations so that it can isolate the operations.

getTransactionOptions provides options for the Transaction. ReadPreference.primary() gives us the most up to date info on a cluster when we are reading data. WriteConcern.MAJORITY results in the DB acknowledging a commit after it successfully writes to the majority of the servers.

Instead of creating client sessions and transaction options everywhere, we should do it in a single method and just pass in the functions that need atomicity.

static <R> R withTransaction(final Function<ClientSession, R> executeFn) {
    final ClientSession clientSession = getNewClientSession();
    final TransactionOptions txnOptions = getTransactionOptions();

    // The body the driver executes inside the transaction
    TransactionBody<R> txnBody = new TransactionBody<R>() {
        @Override
        public R execute() {
            return executeFn.apply(clientSession);
        }
    };

    try {
        return clientSession.withTransaction(txnBody, txnOptions);
    } catch (RuntimeException e) {
        e.printStackTrace();
    } finally {
        clientSession.close();
    }
    return null; // signals to the caller that the transaction failed
}

The above function runs operations inside a passed-in function, the executeFn argument, as an atomic operation or a transaction. Let's implement our money drawing function using transactions.

Note that I am returning null. You could just throw a new exception to let the caller know that the transaction has failed. For the sake of this example, returning null implies transaction failure.

Bank Account Example In Java

public class Account {
    @BsonId
    ObjectId _id;
    int balance;

    ... getters and setters
}

public class AccountService {
    public MongoCollection<Account> getAccounts() {
        return dbClient.getCollection("account", Account.class);
    }

    // Reads the account within the given session, and therefore within its transaction
    private Account findAccount(ClientSession session, ObjectId accountId) {
        return getAccounts().find(session, Filters.eq("_id", accountId)).first();
    }

    private int currentBalance(ClientSession session, ObjectId accountId) {
        return findAccount(session, accountId).balance;
    }

    // Writes the new balance and returns the updated account
    private Account updateBalance(ClientSession session, ObjectId accountId, int newBalance) {
        return getAccounts().findOneAndUpdate(
            session,
            Filters.eq("_id", accountId),
            Updates.set("balance", newBalance),
            new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));
    }

    public Account drawCash(ClientSession session, ObjectId accountId, int amount) {
        int currentBalance = currentBalance(session, accountId);
        int newBalance = currentBalance - amount;
        return updateBalance(session, accountId, newBalance);
    }
}

In the above code snippet, the Account class is a plain Java class model for the user's account. AccountService is a database accessor for the accounts collection. The drawCash method completes the set of operations executed by a single process (P1 or P2) described in the first example to dispense money to either you or your spouse.

Now we use the withTransaction function to call drawCash:

... Some REST API 
AccountService accountService = ...; // Dependency injected

@Path("/account/withdraw") // Endpoint to withdraw money
public String withdrawMoney() {
    ObjectId accountId = ...// some method to get current users account ID
    Account account = withTransaction(new Function<ClientSession, Account>() {
        @Override
        public Account apply(ClientSession clientSession) {
            // Everything inside this block runs within the same transaction as long as you pass the clientSession argument to Mongo
            return accountService.drawCash(clientSession, accountId, 10);
        }
    });

    if(Objects.isNull(account)){
        return "Failed to withdraw money";
    }
    return "New account balance is " + account.balance;
}

Now if you call this endpoint twice, concurrently, one user will see the final balance as 90 and the second one will see 80.

You might have guessed that the second user's transaction should have failed. Yes, it did. But MongoDB has a built-in retry mechanism: it automatically retried the second transaction, which then succeeded.

A Real-World Example Use Case

We use transactions on our PS2PDF.com online video converter to prevent one thread from overriding process states updated by another.

For example, for each video convert process, we create a document called Job on the DB. It has a status field which can take values such as STARTED, IN_PROGRESS, and COMPLETED.

Once the thread has updated the Job.status on the DB to COMPLETED, we don't want any slow thread reverting that message to IN_PROGRESS. Once a job has completed, it cannot be changed.

We use the above-mentioned withTransaction method to guarantee that no operation overrides the COMPLETED status.
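The check that has to run inside such a transaction can be sketched as a small pure function. The JavaScript below is an illustration under assumptions, not the PS2PDF implementation: the function name nextStatus is hypothetical, and it assumes the three statuses form a strict order in which a job may only move forward.

```javascript
const STATUS_ORDER = ["STARTED", "IN_PROGRESS", "COMPLETED"];

// Returns the status the job should have after a requested change:
// moving forward is allowed, moving backward is ignored.
function nextStatus(current, requested) {
  return STATUS_ORDER.indexOf(requested) > STATUS_ORDER.indexOf(current)
    ? requested
    : current;
}

console.log(nextStatus("IN_PROGRESS", "COMPLETED")); // "COMPLETED"
console.log(nextStatus("COMPLETED", "IN_PROGRESS")); // "COMPLETED"
```

The function alone does not prevent the race. It only helps when the read of the current status and the write of the new one happen inside the same transaction.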

Conclusion

I hope you can now use transactions to avoid race conditions in your applications. Plus, use the built-in retryWrites and retryReads options to improve fault tolerance.

I should point out that MongoDB Transactions are pretty new, and there are articles out there that identify some inconsistencies that occur in special circumstances. But it is highly unlikely that you will run into these issues.

The JavaScript + Firestore Tutorial for 2020: Learn by Example

Reed — Thu, 16 Jul 2020 13:00:00 +0000

Cloud Firestore is a blazing-fast, serverless NoSQL database, perfect for powering web and mobile apps of any size. Grab the complete guide to learning Firestore, created to show you how to use Firestore as the engine for your own amazing projects from front to back.

Getting Started with Firestore

What is Firestore? Why Should You Use It?
Setting Up Firestore in a JavaScript Project
Firestore Documents and Collections
Managing our Database with the Firebase Console

Fetching Data with Firestore

Getting Data from a Collection with .get()
Subscribing to a Collection with .onSnapshot()
Difference between .get() and .onSnapshot()
Unsubscribing from a collection
Getting individual documents

Changing Data with Firestore

Adding document to a collection with .add()
Adding a document to a collection with .set()
Updating existing data
Deleting data

Essential Patterns

Working with subcollections
Useful methods for Firestore fields
Querying with .where()
Ordering and limiting data

Note: you can download a PDF version of this tutorial so you can read it offline.

What is Firestore? Why Should You Use It?

Firestore is a very flexible, easy to use database for mobile, web and server development. If you're familiar with Firebase's realtime database, Firestore has many similarities, but with a different (arguably more declarative) API.

Here are some of the features that Firestore brings to the table:

⚡️Easily get data in realtime

Like the Firebase realtime database, Firestore provides useful methods such as .onSnapshot() which make it a breeze to listen for updates to your data in real time. It makes Firestore an ideal choice for projects that place a premium on displaying and using the most recent data (chat applications, for instance).

Flexibility as a NoSQL Database

Firestore is a very flexible option for a backend because it is a NoSQL database. NoSQL means that the data isn't stored in tables and columns as a standard SQL database would be. It is structured like a key-value store, as if it was one big JavaScript object.

In other words, there's no schema or need to describe what data our database will store. As long as we provide valid keys and values, Firestore will store it.

↕️ Effortlessly scalable

One great benefit of choosing Firestore for your database is the very powerful infrastructure it builds upon, which lets you scale your application very easily, both vertically and horizontally. Whether you have hundreds or millions of users, Google's servers will be able to handle whatever load you place on them.

In short, Firestore is a great option for applications both small and large. For small applications it's powerful because we can do a lot without much setup and create projects very quickly with it. Firestore is well-suited for large projects due to its scalability.

Setting Up Firestore in a JavaScript Project

We're going to be using the Firestore SDK for JavaScript. Throughout this cheatsheet, we'll cover how to use Firestore within the context of a JavaScript project. In spite of this, the concepts we'll cover here are easily transferable to any of the available Firestore client libraries.

To get started with Firestore, we'll head to the Firebase console. You can visit that by going to firebase.google.com. You'll need to have a Google account to sign in.

Once we're signed in, we'll create a new project and give it a name.

Once our project is created, we'll select it. After that, on our project's dashboard, we'll select the code button.

This will give us the code we need to integrate Firestore with our JavaScript project.

Usually if you're setting this up in any sort of JavaScript application, you'll want to put this in a dedicated file called firebase.js. If you're using any JavaScript library that has a package.json file, you'll want to install the Firebase dependency with npm or yarn.

# with npm
npm i firebase

# with yarn
yarn add firebase

Firestore can be used either on the client or server. If you are using Firestore with Node, you'll need to use the CommonJS syntax with require. Otherwise, if you're using JavaScript in the client, you'll import firebase using ES Modules.

// with CommonJS syntax (if using Node)
const firebase = require("firebase/app");
require("firebase/firestore");

// with ES Modules (if using client-side JS, like React)
import firebase from 'firebase/app';
import 'firebase/firestore';

var firebaseConfig = {
  apiKey: "AIzaSyDpLmM79mUqbMDBexFtOQOkSl0glxCW_ds",
  authDomain: "lfasdfkjkjlkjl.firebaseapp.com",
  databaseURL: "https://lfasdlkjkjlkjl.firebaseio.com",
  projectId: "lfasdlkjkjlkjl",
  storageBucket: "lfasdlkjkjlkjl.appspot.com",
  messagingSenderId: "616270824980",
  appId: "1:616270824990:web:40c8b177c6b9729cb5110f",
};
// Initialize Firebase
firebase.initializeApp(firebaseConfig);

Firestore Collections and Documents

There are two key terms that are essential to understanding how to work with Firestore: documents and collections.

Documents are individual pieces of data in our database. You can think of documents as being much like simple JavaScript objects. They consist of key-value pairs, which we refer to as fields. The values of these fields can be strings, numbers, Booleans, objects, arrays, and even binary data.

document -> { key: value }

Sets of these documents are known as collections. Collections are very much like arrays of objects. Within a collection, each document is linked to a given identifier (id).

collection -> [{ id: doc }, { id: doc }]

Managing our database with the Firestore Console

Before we can actually start working with our database we need to create it.

Within our Firebase console, go to the 'Database' tab and create your Firestore database.

Once you've done that, we will start in test mode and enable all reads and writes to our database. In other words, we will have open access to get and change data in our database. If we were to add Firebase authentication, we could restrict access only to authenticated users.

After that, we'll be taken to our database itself, where we can start creating collections and documents. The root of our database will be a series of collections, so let's make our first collection.

We can select 'Start collection' and give it an id. Every collection is going to have an id or a name. For our project, we're going to keep track of our users' favorite books. We'll give our first collection the id 'books'.

Next, we'll add our first document to our newly created 'books' collection.

Each document is going to have an id as well, linking it to the collection in which it exists.

In most cases we're going to use an option to give it an automatically generated ID. So we can hit the button 'auto id' to do so, after which we need to provide a field, give it a type, as well as a value.

For our first book, we'll make a 'title' field of type 'string', with the value 'The Great Gatsby', and hit save.

After that, we should see our first item in our database.

Getting data from a collection with .get()

To access Firestore and use all of the methods it provides, we call firebase.firestore(). This method needs to be executed every time we want to interact with our Firestore database.

I would recommend creating a dedicated variable to store a single reference to Firestore. Doing so helps to cut down on the amount of code you write across your app.

const db = firebase.firestore();

In this cheatsheet, however, I'm going to stick to using the firestore method each time to be as clear as possible.

To reference a collection, we use the .collection() method and provide a collection's id as an argument. To get a reference to the books collection we created, just pass in the string 'books'.

const booksRef = firebase.firestore().collection('books');

To get all of the document data from a collection, we can chain on the .get() method.

.get() returns a promise, which means we can resolve it either using a .then() callback or we can use the async-await syntax if we're executing our code within an async function.

Once our promise is resolved in one way or another, we get back what's known as a snapshot.

For a collection query that snapshot is going to consist of a number of individual documents. We can access them by saying snapshot.docs.

From each document, we can get the id as a separate property, and the rest of the data using the .data() method.

Here's what our entire query looks like:

const booksRef = firebase
  .firestore()
  .collection("books");

booksRef
  .get()
  .then((snapshot) => {
    const data = snapshot.docs.map((doc) => ({
      id: doc.id,
      ...doc.data(),
    }));
    console.log("All data in 'books' collection", data); 
    // [ { id: 'glMeZvPpTN1Ah31sKcnj', title: 'The Great Gatsby' } ]
  });
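
The same query can also be written with the async-await syntax mentioned above. The following is a minimal sketch, not code from the Firestore docs: getAllDocs is a hypothetical helper name, and fakeBooksRef is a hand-written stand-in that mimics only the .get() call so the snippet runs without a Firebase project. In a real app you would pass in firebase.firestore().collection("books") instead.

```javascript
// Hypothetical helper: read every document in a collection with async-await.
// `collectionRef` is anything with a .get() method that resolves to a snapshot.
async function getAllDocs(collectionRef) {
  const snapshot = await collectionRef.get();
  return snapshot.docs.map((doc) => ({
    id: doc.id,
    ...doc.data(),
  }));
}

// Stand-in for firebase.firestore().collection("books"), for illustration only.
const fakeBooksRef = {
  get: async () => ({
    docs: [
      { id: "glMeZvPpTN1Ah31sKcnj", data: () => ({ title: "The Great Gatsby" }) },
    ],
  }),
};

getAllDocs(fakeBooksRef).then((data) => {
  console.log("All data in 'books' collection", data);
});
```

Taking the collection reference as a parameter keeps the helper reusable for any collection and easy to exercise with a stand-in like the one above.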

Subscribing to a collection with .onSnapshot()

The .get() method simply returns all the data within our collection.

To leverage some of Firestore's realtime capabilities we can subscribe to a collection, which gives us the current value of the documents in that collection, whenever they are updated.

Instead of using the .get() method, which is for querying a single time, we use the .onSnapshot() method.

firebase
  .firestore()
  .collection("books")
  .onSnapshot((snapshot) => {
    const data = snapshot.docs.map((doc) => ({
      id: doc.id,
      ...doc.data(),
    }));
    console.log("All data in 'books' collection", data);
  });

In the code above, we're using what's known as method chaining instead of creating a separate variable to reference the collection.

What's powerful about using Firestore is that we can chain a bunch of methods one after another, making for more declarative, readable code.

Within onSnapshot's callback, we get direct access to the snapshot of our collection, both now and whenever it's updated in the future. Try manually updating our one document and you'll see that .onSnapshot() is listening for any changes in this collection.

Difference between .get() and .onSnapshot()

The difference between the get and the snapshot methods is that get returns a promise, which needs to be resolved, and only then do we get the snapshot data.

.onSnapshot(), however, takes a callback function, which gives us direct access to the snapshot each time it fires. It does not return a promise; instead, it returns the unsubscribe function covered in the next section.

This is important to keep in mind when it comes to these different methods: we have to know which of them return a promise and which take a callback.

Unsubscribing from a collection with unsubscribe()

Note additionally that .onSnapshot() returns a function which we can use to unsubscribe and stop listening on a given collection.

This is important in cases where the user, for example, goes away from a given page where we're displaying a collection's data. Here's an example, using the library React, where we call unsubscribe within the useEffect hook.

When we do so, this makes sure that when our component is unmounted (no longer displayed within the context of our app), we're no longer listening on the collection data that we're using in this component.

function App() {
  const [books, setBooks] = React.useState([]);

  React.useEffect(() => {
    const unsubscribe = firebase
      .firestore()
      .collection("books")
      .onSnapshot((snapshot) => {
        const data = snapshot.docs.map((doc) => ({
          id: doc.id,
          ...doc.data(),
        }));
        setBooks(data);
      });

    // Stop listening when the component unmounts
    return () => unsubscribe();
  }, []);

  return books.map(book => <BookList key={book.id} book={book} />)
}
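
Outside of React, the same idea applies: keep the function that .onSnapshot() returns and call it when you are done. Below is a minimal sketch under stated assumptions. watchCollection and createFakeCollection are hypothetical names, and the fake collection (including its emit method) is not part of the Firestore SDK. It only imitates the one behavior the text describes, namely that onSnapshot() returns a function which stops further updates.

```javascript
// Hypothetical helper: start listening and hand back the unsubscribe function.
function watchCollection(collectionRef, onData) {
  return collectionRef.onSnapshot((snapshot) => {
    onData(snapshot.docs.map((doc) => ({ id: doc.id, ...doc.data() })));
  });
}

// Stand-in for a collection reference, for illustration only.
function createFakeCollection() {
  const listeners = new Set();
  return {
    onSnapshot(callback) {
      listeners.add(callback);
      // Like Firestore, return a function that removes this listener
      return () => listeners.delete(callback);
    },
    // Not a Firestore method: pushes a new snapshot to all current listeners
    emit(docs) {
      listeners.forEach((callback) => callback({ docs }));
    },
  };
}

const fakeBooks = createFakeCollection();
const received = [];
const unsubscribe = watchCollection(fakeBooks, (books) => received.push(books));

fakeBooks.emit([{ id: "a", data: () => ({ title: "The Great Gatsby" }) }]);
unsubscribe(); // for example, when the user leaves the page
fakeBooks.emit([{ id: "b", data: () => ({ title: "Of Mice and Men" }) }]);

console.log(received.length); // 1, the second update arrived after unsubscribing
```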

Getting Individual Documents with .doc()

When it comes to getting a document within a collection, the process is just the same as getting an entire collection: we first create a reference to that document, and then use the get method to grab it.

To create that reference, we chain the .doc() method onto the collection method and pass it the document's id, which we need to grab from the database if it was auto generated. After that, we can chain on .get() and resolve the promise.

const bookRef = firebase
  .firestore()
  .collection("books")
  .doc("glMeZvPpTN1Ah31sKcnj");

bookRef.get().then((doc) => {
  if (!doc.exists) return;
  console.log("Document data:", doc.data());
  // Document data: { title: 'The Great Gatsby' }
});

Notice the conditional if (!doc.exists) return; in the code above.

Once we get the document back, it's essential to check to see whether it exists.

If it doesn't, doc.data() returns undefined, and any code that tries to read fields from it will fail. The way to check whether our document exists is with doc.exists, which is either true or false.

If doc.exists is false, we want to return from the function or maybe throw an error. If doc.exists is true, we can get the data from doc.data().
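
As a sketch of the "throw an error" option, the helper below wraps the existence check so callers either receive the document's data or a rejected promise. getDocOrThrow is a hypothetical name, and existingRef and missingRef are hand-written stand-ins that mimic only .get(), exists, id and data() so the snippet runs without a Firebase project.

```javascript
// Hypothetical helper: resolve with the document's data, or throw if it is missing.
async function getDocOrThrow(docRef) {
  const doc = await docRef.get();
  if (!doc.exists) {
    throw new Error("Document not found");
  }
  return { id: doc.id, ...doc.data() };
}

// Stand-ins for document references, for illustration only.
const existingRef = {
  get: async () => ({
    id: "glMeZvPpTN1Ah31sKcnj",
    exists: true,
    data: () => ({ title: "The Great Gatsby" }),
  }),
};
const missingRef = {
  get: async () => ({ id: "missing-id", exists: false, data: () => undefined }),
};

getDocOrThrow(existingRef).then((book) => console.log("Found:", book));
getDocOrThrow(missingRef).catch((error) => console.log("Error:", error.message));
```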

Adding a document to a collection with .add()

Next, let's move on to changing data. The easiest way to add a new document to a collection is with the .add() method.

All you need to do is select a collection reference (with .collection()) and chain on .add().

Going back to our definition of documents as being like JavaScript objects, we need to pass an object to the .add() method and specify all the fields we want to be on the document.

Let's say we want to add another book, 'Of Mice and Men':

firebase
  .firestore()
  .collection("books")
  .add({
    title: "Of Mice and Men",
  })
  .then((ref) => {
    console.log("Added doc with ID: ", ref.id);
    // Added doc with ID:  ZzhIgLqELaoE3eSsOazu
  });

The .add() method returns a promise, and from this resolved promise we get back a reference to the created document, which gives us information such as the created id.

The .add() method auto generates an id for us. Note that this ref doesn't contain the document's data itself. To read the data, we need to run another query, for example by calling .get() on the ref.

Adding a document to a collection with .set()

Another way to add a document to a collection is with the .set() method.

Where .set() differs from .add() is in the need to specify our own id when adding the data.

This requires chaining on the .doc() method with the id that you want to use. Also, note how when the promise is resolved from .set(), we don't get a reference to the created document:

firebase
  .firestore()
  .collection("books")
  .doc("another book")
  .set({
    title: "War and Peace",
  })
  .then(() => {
    console.log("Document created");
  });

Additionally, when we use .set() with an existing document, it will, by default, overwrite that document.

If we want to merge an old document with a new document instead of overwriting it, we need to pass an additional argument to .set() and provide the property merge set to true.

// use .set() to merge data with existing document, not overwrite

const bookRef = firebase
  .firestore()
  .collection("books")
  .doc("another book");

bookRef
  .set({
    author: "Lev Nikolaevich Tolstoy"
  }, { merge: true })
  .then(() => {
    console.log("Document merged");

    bookRef.get().then((doc) => {
      console.log("Merged document: ", doc.data());
      // Merged document:  { title: 'War and Peace', author: 'Lev Nikolaevich Tolstoy' }
    });
  });

Updating existing data with .update()

When it comes to updating data, we use the .update() method. Like .add() and .set(), it returns a promise.

What's helpful about using .update() is that, unlike .set(), it won't overwrite the entire document. Also like .set(), we need to reference an individual document.

When you use .update(), it's important to use some error handling, such as the .catch() callback, because the returned promise is rejected if the document doesn't exist.

const bookRef = firebase.firestore().collection("books").doc("another book");

bookRef
  .update({
    year: 1869,
  })
  .then(() => {
    console.log("Document updated"); // Document updated
  })
  .catch((error) => {
    console.error("Error updating doc", error);
  });

Deleting data with .delete()

We can delete a given document from a collection by referencing it by its id and executing the .delete() method, simple as that. It also returns a promise.

Here is a basic example of deleting a book with the id "another book":

firebase
  .firestore()
  .collection("books")
  .doc("another book")
  .delete()
  .then(() => console.log("Document deleted")) // Document deleted
  .catch((error) => console.error("Error deleting document", error));

Note that the official Firestore documentation does not recommend deleting entire collections from a web client, only individual documents.

Working with Subcollections

Let's say that we made a misstep in creating our application, and instead of just adding books we also want to connect them to the users that made them.

The way that we want to restructure the data is by making a collection called 'users' in the root of our database, and have 'books' be a subcollection of 'users'. This will allow users to have their own collections of books. How do we set that up?

References to the subcollection 'books' should look something like this:

const userBooksRef = firebase
  .firestore()
  .collection('users')
  .doc('user-id')
  .collection('books');

Note additionally that we can write this all within a single .collection() call using forward slashes.

The above code is equivalent to the following. Note that a collection path must have an odd number of segments; if not, Firestore will throw an error.

const userBooksRef = firebase
  .firestore()
  .collection('users/user-id/books');
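
To make the odd-number rule concrete, here is a small sketch that counts the segments of a slash-separated path. describePath is a hypothetical helper written for this illustration, not a Firestore function. It relies only on the alternation described above, where paths go collection, document, collection, and so on.

```javascript
// Hypothetical helper: classify a slash-separated path by counting its segments.
// Odd counts point at a collection, even counts point at a document.
function describePath(path) {
  const segments = path.split("/").filter((segment) => segment.length > 0);
  if (segments.length === 0) {
    throw new Error("Path must contain at least one segment");
  }
  return segments.length % 2 === 1 ? "collection" : "document";
}

console.log(describePath("users")); // collection
console.log(describePath("users/user-id")); // document
console.log(describePath("users/user-id/books")); // collection
```

A check like this can be handy before passing a dynamically built path to .collection(), since a path with an even number of segments refers to a document instead.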

To create the subcollection itself, with one document (another Steinbeck novel, 'East of Eden'), run the following.

firebase.firestore().collection("users/user-1/books").add({
  title: "East of Eden",
});

Then, getting that newly created subcollection would look like the following based off of the user's ID.

firebase
  .firestore()
  .collection("users/user-1/books")
  .get()
  .then((snapshot) => {
    const data = snapshot.docs.map((doc) => ({
      id: doc.id,
      ...doc.data(),
    }));
    console.log(data); 
    // [ { id: 'UO07aqpw13xvlMAfAvTF', title: 'East of Eden' } ]
  });

Useful methods for Firestore fields

There are some useful tools that we can grab from Firestore that enable us to work with our field values a little more easily.

For example, we can generate a timestamp for whenever a given document is created or updated with the following helper from the FieldValue property.

We can of course create our own date values using JavaScript, but using a server timestamp lets us know exactly when data is changed or created from Firestore itself.

firebase
  .firestore()
  .collection("users")
  .doc("user-2")
  .set({
    created: firebase.firestore.FieldValue.serverTimestamp(),
  })
  .then(() => {
    console.log("Added user"); // Added user
  });

Additionally, say we have a field on a document which keeps track of a certain number, say the number of books a user has created. Whenever a user creates a new book we want to increment that by one.

An easy way to do this, instead of having to first make a .get() request, is to use another field value helper called .increment():

const userRef = firebase.firestore().collection("users").doc("user-2");

userRef
  // merge: true keeps the user's other fields (such as 'created') intact
  .set({
    count: firebase.firestore.FieldValue.increment(1),
  }, { merge: true })
  .then(() => {
    console.log("Updated user");

    userRef.get().then((doc) => {
      console.log("Updated user data: ", doc.data());
    });
  });

Querying with .where()

What if we want to get data from our collections based on certain conditions? For example, say we want to get all of the users that have submitted one or more books?

We can write such a query with the help of the .where() method. First we reference a collection and then chain on .where().

The where method takes three arguments: first, the field that we're searching on; second, an operator; and third, the value on which we want to filter our collection.

We can use any of the following operators and the fields we use can be primitive values as well as arrays.

<, <=, ==, >, >=, array-contains, in, or array-contains-any

To fetch all the users who have submitted at least one book, we can use the following query.

After .where() we need to chain on .get(). Upon resolving our promise we get back what's known as a querySnapshot.

Just like getting a collection, we can iterate over the querySnapshot with .map() to get each document's id and data (fields):

firebase
  .firestore()
  .collection("users")
  .where("count", ">=", 1)
  .get()
  .then((querySnapshot) => {
    const data = querySnapshot.docs.map((doc) => ({
      id: doc.id,
      ...doc.data(),
    }));
    console.log("Users with >= 1 book: ", data);
    // Users with >= 1 book:  [ { id: 'user-1', count: 1 } ]
  });

Note that you can chain on multiple .where() methods to create compound queries.
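
As a rough sketch of what chaining several .where() calls looks like, the helper below applies a list of filters one after another. applyFilters and createFakeQuery are hypothetical names, the 'country' field is made up for the example, and the fake query only records the calls it receives so the snippet can run without a Firebase project. With a real query you would start from firebase.firestore().collection("users") and finish with .get(). Firestore may also ask you to create a composite index for some filter combinations.

```javascript
// Hypothetical helper: chain one .where() call per [field, operator, value] filter.
function applyFilters(queryRef, filters) {
  return filters.reduce(
    (query, [field, operator, value]) => query.where(field, operator, value),
    queryRef
  );
}

// Stand-in for a query, for illustration only: it records each .where() call
// and returns a new object so that calls can be chained.
function createFakeQuery(recorded = []) {
  return {
    recorded,
    where(field, operator, value) {
      return createFakeQuery([...recorded, [field, operator, value]]);
    },
  };
}

const compoundQuery = applyFilters(createFakeQuery(), [
  ["count", ">=", 1],
  ["country", "==", "US"], // 'country' is a made-up field for this example
]);

console.log(compoundQuery.recorded.length); // 2
```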

Limiting and ordering queries

Another method for effectively querying our collections is to limit them. Let's say we want to limit a given query to a certain amount of documents.

If we only want to return a few items from our query, we just need to add on the .limit() method, after a given reference.

If we wanted to do that through our query for fetching users that have submitted at least one book, it would look like the following.

const usersRef = firebase
  .firestore()
  .collection("users")
  .where("count", ">=", 1);

usersRef.limit(3);

Another powerful feature is to order our queried data according to document fields using .orderBy().

If we want to order our created users by when they were first made, we can use the orderBy method with the 'created' field as the first argument. For the second argument, we specify whether it should be in ascending or descending order.

To get all of the users ordered by when they were created from newest to oldest, we can execute the following query:

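// No .where("count", ...) filter here: we want all users, and combining a range
// filter on one field with a first orderBy() on another field can be rejected
const usersRef = firebase.firestore().collection("users");

usersRef.orderBy("created", "desc").limit(3);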

We can chain .orderBy() with .limit(). For this to work properly, .limit() should be called last and not before .orderBy().

Want your own copy?

If you would like to have this guide for future reference, download a cheatsheet of this entire tutorial here.

https://reedbarger.com/resources/javascript-firestore-2020/

How looking back can help us move forward: a retrospective on software gems and fads

freeCodeCamp — Fri, 30 Aug 2019 18:30:49 +0000

By Pakal de Bonchamp

Maybe one of the most important qualities of a developer is the ability to pick the right tool for the right job, without hopping onto bandwagons or reinventing the wheel. This might require a bit of technology analysis, but even more, a touch of critical thinking.

Here is a review of a few exaggerated trends and underrated niceties, in different areas of the marvelous world of computer science: databases, asynchronicity, cryptocurrency, and data formats. I won't touch on the subject of REST webservices, which I already ranted about at great length.

As usual, your feedback is more than welcome if any factual errors slipped into this (not entirely unbiased) article.

Databases: NoSQL & ZODB

Few moments, in the history of computer science, were as ironically lit as the arrival of NoSQL databases, around 2009. A tidal wave struck the shores of backend development and system administration: SQL databases were too rigid, too slow, too hard to replicate.

So new projects massively ditched them in favor of key-value stores like Redis, document-oriented databases like MongoDB/CouchDB, or graph-oriented databases like Neo4j. And we must acknowledge one thing: these new databases shone in benchmarks; they shone about as much.... as would shine any SQL database dropping all its ACID constraints and query language flexibility.

But the horizon was grim for numerous programmers. They learned, the hard way, that data persistence was not a minor concern. And that they needed, for example, to explicitly activate "Write Concerns" in MongoDB, to ensure that data would not get lost before reaching disk oxide.

They learned that "eventual consistency" was a pretty word for "temporary inconsistency", opening the door to nasty, silent, hard-to-reproduce bugs in production. And that transactions - and their implicit locking - were precious features, and that mimicking them by hand, with awkward flags stuffed into documents, was anything but easy and robust.

And they learned that data schemas, and referential integrity, were more than welcome to prevent databases from becoming heaps of incoherent objects. And that the lack of advanced indexing capabilities (on multiple keys, on deep document fields) in key-value stores could become quite embarrassing.

Thus, people began reinventing SQL features on top of NoSQL databases, by mimicking data schemas, foreign keys, advanced aggregation, in language-specific "ORM" libraries (mongoengine, mongoid, mongomapper...). In this context, this "Object-Relational Mapper" acronym should have, by itself, been a hint that something had gone wild.

There was something surreal in watching NoSQL databases, which were honed for specific use cases (highly replicated or heterogeneous data, capped-size collections or TTLs, pub/sub systems...), be used just to store a bunch of same-shape objects in a single server instance.

A standard SQL database would completely have done the job, and offered many more tooling options and plugins (different storage engines, Percona toolkit scripts, IDEs like HeidiSql or Mysql Workbench, DB schema migration processes integrated into web frameworks...). Even if it meant stuffing extra unstructured data into a serialized Text Field (or, nowadays, dedicated PostgreSQL Json Fields).

With time, NoSQL databases themselves improved a lot, among other things by borrowing features from the SQL world. But reinventing SQL is not an easy task. Relational databases deal with query language parsing, character sets and collations, data aggregation and conversion, transactions and isolation levels, views and query caches, triggers, embedded procedures, GIS, fine-grained permissions, replication and clustering... complex and sensitive features, driven by hundreds of settings spread on multiple levels (per database, per table, per connection).

So despite their great progress (multi-document transactions, better data aggregation, stored JavaScript functions, pluggable storage, role-based access control in MongoDB), NoSQL DBs still have trouble challenging major SQL databases, purely feature-wise.

Luckily, most projects only need a tiny subset of these SQL database features: a few schema validations, a few proper indices, and business can get rolling; so for teams lacking SQL expertise, the relative simplicity of many NoSQL DBs could indeed be, to be honest, a relevant factor.

The wave seems to have faded by now, and projects seem more inclined to combine different databases according to actual needs. They thus separate user accounts, job queues and similar caches, logging and stats data... each into the most relevant storage.

All these cited NoSQL databases, and their countless alternatives, are shining in their intended use cases. But I'd like to mention a too-little-known, too-little-used gem of the Python ecosystem. Have you already wanted to persist your data in a really, reaaaalllly easy way? Then I forward you to the ZODB. You open it like a dictionary, you push whatever data you want into it, you commit the transaction, and you're good to go.

Example of simple local ZODB instance:

from ZODB import FileStorage, DB
import transaction

storage = FileStorage.FileStorage('mydatabase.fs')
root = DB(storage).open().root()
print("ROOT:", root)
root['employees'] = ['Mary', 'Jo', 'Bob']
transaction.commit()

Graphs of data are handled gracefully (no recursion error), objects are lazily loaded on access, special "bucket tree" types are provided to browse huge amounts of data while keeping memory low, and several storage backends exist, including relstorage which leverages the power of SQL databases. Perfect, isn't it?

Alright, I'm lying, there are a few gotchas. There is no built-in indexing system (one must use ZCatalog or the like instead). Using dedicated "persistent" types is highly advised, to automatically detect and persist mutations of objects. The overall tooling is quite limited compared to mainstream databases. And the concurrency model based on "optimistic locking" might force you, under heavy load, to retry an operation several times until it manages to get applied.

The extreme amount of integration with the Python language has an additional drawback: if you introduce breaking changes into your data model, your database might not load anymore, so you must handle schema migrations carefully.

But context is everything: ZODB is not meant for long-term and interoperable data persistence, but for effortless storage of (possibly very heterogeneous) Python objects. It can make long-running scripts able to resume after interruption, it can store player data of online game sessions... if you really want to store blog articles or personal accounts in ZODB, you had better limit yourself to native Python types, and implement your own sanity checks. But whatever happens, do not use a very limited stdlib shelf, if you can have a nifty ZODB at hand to store your work-in-progress data.

Asynchronicity: Asyncio, Trio and Green Threads

There has been an immemorial challenge between synchronous and asynchronous programming models, in all IO-bound programs. Kernels have provided asynchronous modes for disk operations, with more or less success (overlapped non-blocking IO on Windows, limited io_submit() API on Linux...).

Networking code has made the issue still more acute, with the need for huge numbers of long-term connections, each performing only minor CPU operations.

Some languages, like Erlang, confronted this by being asynchronous from the start, and letting different tasks communicate by message passing (a.k.a. Actor Model).

In other languages, several design patterns emerged to tackle the problem:

callbacks
async/await syntax
lightweight threads

Callbacks were previously the major solution in mainstream frameworks. For example in jQuery or Twisted, the developer would provide callables as arguments or as instance methods, and these would be called on IO completion/cancellation, in a pattern called Inversion of Control. It works, for sure, but it makes program flows quite hard to predict and debug, hence the term "callback soup" often used in this context.

For the last few years, the async/await syntax has become highly trendy, especially in the Python world. But there is a problem: like Inversion of Control, it's a whole new way of programming, almost a new language. The vast amount of packages currently available, made of modules, classes and methods, just does NOT work with async/await.

Any IO, any expensive operation, hidden deep inside a subdependency, could ruin your day. So we're currently gazing at thousands of great modules being happily reimplemented, with a whole new world of bugs and missing features.

Is it all worth it? Python developers have massively jumped onto the train of the asyncio package, which has become part of the stdlib. But this technology has scary issues, like the difficulty of socket backpressure, the fragile handling of exceptions and ctrl-C, the unsafe cancellation of (leaking) tasks, and the steep learning curve of an API full of gotchas and redundant concepts. Other frameworks, like Trio/Curio, seemed much more careful on these subjects.

If we have to recode tons of existing libraries, why base new versions on an engine that some developers have - not without arguments - called a dumpster fire of bad design? But the network effect is huge in such cases, and alternative async/await-based frameworks will have a hard time challenging the standard.

And what about the third pattern quoted above, lightweight threads? Long before this async/await trend, Python developers thought: we already have some perfectly fine synchronous business code, so let's change the way it is run, not the way it is written. Thus appeared lightweight threads, or "greenlets". They work like a bunch of tiny tasks scheduled on top of a few native threads, tasks which yield control to each other only when they block on IO or explicitly do so; and with much greater performance than native threads, in terms of memory usage and switching delay.

In the end, this system can quickly boost just about any existing codebase so that it supports thousands of long-term concurrent tasks. And this is not an isolated mad experiment: Python lightweight threads were originally used in the Eve Online game (via Stackless Python), and have since been successfully ported to CPython (Gevent, Eventlet...) and PyPy. And they have actually existed for a long time in lots of programming languages, under different names (green processes, green threads, fibers...).

The drawbacks of this system?

Libraries must play nice with green threads, by yielding control instead of blocking on IOs, and launching green threads instead of native threads. In Python, main libraries (socket, time.sleep(), threading) are forcibly made green-friendly via monkey-patching; but compiled extensions must be especially checked, since they can bypass these patches and block on their own system calls.
No heavy computation, or otherwise time-consuming tasks, must be performed, else all other tasks get impacted by the delay. For such needs, just delegate work to a pool of native threads (or a celery-like worker queue).

As we see, these drawbacks are similar to those of async/await, except that you almost don't have to touch the original, synchronous code. An "except" which can mean months or years of work avoided; your CTO and CEO should be highly pleased about this.

Now, you'll sometimes hear strange rationalizations from people who ditched lightweight threads in favor of a whole async/await reimplementation. Something along the lines of "Explicit is better than implicit, and all these awaits show me exactly where my code could switch context, whereas green threads might switch discreetly if a third-party function performs any kind of IO or explicit switch".

But the thing is...

FIRST, why do you need to know at which points exactly the program will switch to another task? For all the past years, with native (preemptive) threads, a switch could happen anywhere, anytime, even right in the middle of a simple increment.

But we learned to deal with this invisible threat properly, by protecting critical sections with locks and other synchronization primitives (Recursive Locks, Event, Condition, Semaphore...), keeping a proper order when nesting locks, and using thread-safe data structures (Queues and the likes) which handle concurrency for us.

Green threads are a middle ground between (implicit) preemptive threads and (explicit) async/await, but all of these technologies had better stick to the good old way of protecting concurrent operations.

Locks can be dangerous when misused (especially since most implementations stall, instead of detecting deadlocks and reporting them as exceptions), but they are cheap and robust. What is the point of attempting to do lock-less concurrency, by checking the position of each potentially switch-triggering call, when you could anytime have to add a new operation (even a simple logging output) in the middle of your carefully crafted lock-less sequence, and thus ruin its safety?

This naive code shows how a recently added call to log_counter_value() breaks otherwise safe asynchronous code.

async def increment_counter(counter):
    current = counter.current_value
    await log_counter_value(current)  # Unwanted context switch happens here
    counter.current_value = current + 1
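To make the hazard concrete, here is a minimal runnable sketch. The Counter class, the sleep-based log_counter_value() stub and the run() helper are hypothetical stand-ins of mine, not code from any real project. It runs 100 concurrent increments, first with the naive coroutine, then with the same sequence protected by an asyncio.Lock.

```python
import asyncio


class Counter:
    """Hypothetical shared counter, standing in for any in-process state."""

    def __init__(self):
        self.current_value = 0
        self.lock = asyncio.Lock()


async def log_counter_value(value):
    # Stand-in for real logging IO: awaiting it lets the event loop switch tasks.
    await asyncio.sleep(0)


async def increment_counter(counter):
    current = counter.current_value
    await log_counter_value(current)     # other tasks may run here
    counter.current_value = current + 1  # may overwrite their updates


async def increment_counter_locked(counter):
    # The read-modify-write sequence becomes a protected critical section.
    async with counter.lock:
        current = counter.current_value
        await log_counter_value(current)
        counter.current_value = current + 1


async def run(worker, tasks=100):
    counter = Counter()
    await asyncio.gather(*(worker(counter) for _ in range(tasks)))
    return counter.current_value


unsafe_total = asyncio.run(run(increment_counter))
safe_total = asyncio.run(run(increment_counter_locked))
print("without lock:", unsafe_total)
print("with lock:", safe_total)
```

Without the lock, increments get lost and the total ends up below 100; with the lock, adding or removing awaits inside the critical section no longer matters.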

SECOND, do you really have to deal with synchronization? In the web world especially, where HTTP requests are not supposed to interact, we want parallelization, not concurrency. Persistent data (and transactions) are supposed to be handled by external databases and caches, not in the process's memory heap.

So usual thread-safety good practices (using thread-safe initialization of the process via locks, read-only structures for global data, and read-write data only local to stack frames) are enough to make the whole system "thread/greenlet/asynctask safe".

If one day you need to implement highly concurrent algorithms inside a process, you'll choose the best tool for that, but there is no need for hammer-building factories if all you have to do is drive in one nail.

Money: Bitcoins & Alternatives

Let's ponder for a moment. What are the biggest challenges of our 21st century? Climate change? Tax evasion? Legitimacy of state power? Candid minds might think that energy sobriety, financial traceability, and (truly) democratic organizations would be goals to pursue.

But a group of smart hackers decided that existing currencies were a major issue, and came up with Bitcoins: an energy-devouring "proof of work" system, easy anonymity for money holders, and fuzzy (to say the least) governance.

With such a fit between actual needs and what was offered, it's no wonder that Bitcoins became what they became: a product of (almost) pure speculation, prized by ransomware gangs and miscellaneous mafias, mass-mined by factories of graphics cards, with an especially high appetite for being stolen (or lost).

This money, and its soon-emerged siblings, have a history already full of bewildering moments, with accidental chain splits, soft forks blocked for political reasons, hard forks quite arbitrarily decided by miscellaneous people (or forced by cyber attacks), and endless battles between different currencies, or different versions of the same currency (Bitcoin Core, Cash, Gold, SV...). Algorithms (cryptography, consensus, transaction code...) were praised as the foundations of a bullet-proof and self-governing system, but some actors had to hack their own users to protect them from theft, while even the so glorified "smart contracts" showed loads of scary security weaknesses, and not as many use cases as some expected.

Let's make it clear: the blockchain, a public ledger based on Merkle trees, is far from a bad idea. But when decisions are not based on the needs of society, and carefulness regarding bugs, but on ideology and greed, the outcome can be predicted. And the decline in hype is proportional to unduly invested hopes.

What is the "better" counterpart of Bitcoin, Ethereum, and the like? Lots of alternative cryptocurrencies exist, with lighter forms of authorization, with different crypto algorithms, with different privacy settings, with different adoption rates too... But if you ask me, what we would really need is "an easily traceable money for State finances and NGOs"; a public ledger designed so that any citizen could easily audit how public money is used, from the moment it's gathered via taxes and donations, to the moment it gets back into private circuits by paying goods or employee salaries. Does anything like this exist yet, anyone? Couldn't find it...

One could also mention non-cryptographic, local currencies (e.g. the "Gonette" in Lyon, France), kept at parity with national currencies, which have the advantage of favoring local businesses and thus lowering the collateral damage of international trade.

Data Formats: Text and Binary

A witty passerby once defined XML as "the readability of binary data with the efficiency of text". Indeed XML parsers tend to be sluggish, and to clutter memory (when in DOM mode), compared to binary data loaders; and editing XML configurations and documents by hand is not the best user experience one might have.

We easily understand why XML, as a metalanguage that allows creating new tags and properties for all kinds of uses, needs to be so verbose. But why such enthusiasm for text-based formats, when the goal is to transmit information between servers using well-defined data types?

Parsing HTTP payloads into an internal representation, and then parsing, for example, the JSON body, ends up adding significant overhead to webservice requests. For what gain? Binary formats like BSON would make serialization and deserialization much more performant; and semantically equivalent text formats could be used for debugging (auto-converted by web browser dev tools, Wireshark, curl and the like), and for manually crafting test payloads.

For sure, handling these dual representations of the same data would add a bit of complexity to the system, but in an era when startups love exposing webservices to thousands of simultaneous clients, the performance boost can be real, without much effort.
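As a rough illustration of this dual representation, here is a small sketch using only Python's standard library. The sensor reading and its field layout are invented for the example, and struct stands in for a real binary format such as BSON. It only compares payload sizes; it is not a benchmark of parsing speed.

```python
import json
import struct

# Hypothetical message with well-defined types:
# an unsigned 32-bit id, a 64-bit float, and an unsigned 64-bit timestamp.
reading = {"sensor_id": 4242, "temperature": 21.5, "timestamp_ms": 1565715600000}

# Text representation: self-describing, but field names travel with every message.
text_payload = json.dumps(reading).encode("utf-8")

# Binary representation: both sides agree on the layout beforehand.
# "<" = little-endian without padding, I = uint32, d = float64, Q = uint64.
LAYOUT = struct.Struct("<IdQ")
binary_payload = LAYOUT.pack(
    reading["sensor_id"], reading["temperature"], reading["timestamp_ms"]
)


def binary_to_text(payload):
    """Convert the binary form back to JSON, as a debugging tool could do."""
    sensor_id, temperature, timestamp_ms = LAYOUT.unpack(payload)
    return json.dumps(
        {"sensor_id": sensor_id, "temperature": temperature, "timestamp_ms": timestamp_ms}
    )


print(len(text_payload), "bytes as JSON")
print(len(binary_payload), "bytes as packed binary")
print(binary_to_text(binary_payload))
```

The binary form travels on the wire, while binary_to_text() plays the role of the auto-conversion that dev tools could offer for humans.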

Conclusion

What's the moral of all this? Always the same: "use the right tool for the right job, and beware of irrational fads". It can take lots of reading before one has sufficient depth of view on a specific matter to make educated decisions; but this investment quickly pays off.

Guessing how well a framework will be supported in the long term, or which protocol or format will win a standardization war, is a different problem, but at least we can have our opinions firmly founded when it comes to purely technical aspects, and this is gold.

Powerful tools for Elasticsearch data visualization & analysis

freeCodeCamp — Tue, 13 Aug 2019 17:00:00 +0000

By Veronika Rovnik

The goal is to turn data into information, and information into insight.

―Carly Fiorina

About Kibana

Kibana is a piece of data visualization software that provides a browser-based interface for exploring Elasticsearch data and navigating the Elastic Stack — a collection of open-source products (Elasticsearch, Logstash, Beats, and others).

While Logstash and Beats deliver data to Elasticsearch, Kibana opens the window into the Elastic Stack, allowing you to track the health of your cluster, perform log and time-series analysis, detect anomalies in the data with unsupervised machine learning, discover relationships using graphs and, most importantly, extract insights from the Elasticsearch data with visualizations that can be combined in a custom interactive dashboard.

Today I’d like to show you how to create a stunning dashboard and a tabular report based on the Elasticsearch data.

Roll up your sleeves and let’s start!

Where to start

The Home page is the place where everything starts.

Here you can decide which actions to take next. The available functionality can be divided into two logical sections:

Visualizing and exploring the data. Here you can create a new dashboard, visualization or presentation, build a machine learning model, analyze relationships in your data using graphs, and more.
Managing the Elastic Stack: configure your spaces, analyze logs of an application, configure security settings, etc.

We’ll focus on the process of creating visualizations and adding them to the dashboard.

How to create a dashboard in Kibana

Let me give you a feel for how easy it is to set up a rich dashboard and start reporting.

The first essential step to take is to import your data into Kibana. Multiple options for adding data are at your disposal — you can choose the one that works best for you:

For demonstration purposes, I’ve selected the sample data.

To design your first data visualizations and combine them into the dashboard, open the Visualize page. Here you can create, modify and view the existing visualizations.

What will strike you at once is the abundance of visualization types you can choose from.

After you’ve selected the one you need, choose an index pattern as a source so as to inform Kibana about your index. Let’s choose kibana_sample_data_flights and start creating a horizontal bar chart.

Now you can apply a metric aggregation for the Y-axis and a bucket aggregation for the X-axis. Here is a list of all available aggregations for charts.
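Behind the scenes, a chart like this corresponds to an Elasticsearch aggregation request. As a rough sketch (not the exact request Kibana sends), the body below nests a metric aggregation inside a bucket aggregation. The field names are assumed to exist in the flights sample data, and the aggregation names by_destination and avg_price are arbitrary labels chosen for this example.

```python
import json

# Assumed field names from the flights sample data set; adjust to your index.
BUCKET_FIELD = "DestCountry"     # one bar per destination country
METRIC_FIELD = "AvgTicketPrice"  # averaged within each bucket

search_body = {
    "size": 0,  # return only aggregation results, no individual documents
    "aggs": {
        "by_destination": {  # bucket aggregation: groups the documents
            "terms": {"field": BUCKET_FIELD, "size": 5},
            "aggs": {
                "avg_price": {  # metric aggregation: one number per bucket
                    "avg": {"field": METRIC_FIELD}
                }
            },
        }
    },
}

print(json.dumps(search_body, indent=2))
```

A body like this could be sent to the index's _search endpoint to get the same numbers the chart displays.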

Creating a horizontal bar chart in Kibana

Optionally, you can customize the colors of the visualization.

Filtering is another mighty feature of Elasticsearch and Kibana. It provides a way to visualize only a selected subset of documents.

See how you can apply filters to the fields based on logical conditions:

As you see, Kibana provides a straightforward way of filtering the data via the comfy interface. Along with that, you can choose how to filter the data — either by using the Kibana Query Language (a simplified query syntax) or Lucene.

To allow end-users to filter the data interactively, you can add control widgets — special elements of the dashboard which allow filtering the data simply by clicking them.

Another feature I’d like to highlight is the advanced filtering by dates and the ability to set time intervals for refreshing the data in the dashboard.

The good thing is that visualizations are reusable. After creating one, you can save it, add it to a dashboard at any time, and share it with your colleagues, provided they have access to your Kibana instance.

Saving a visualization in Kibana

After arranging all the visualization elements on a single page, you can export the final dashboard to PNG or PDF format. This is what makes the dashboards portable — it’s easy to share them across departments in no time.

Let’s look at an example of the dashboard you can create:

Interacting with the dashboard in Kibana

To my mind, the principal features which make each dashboard special are interactivity and expressiveness. With them, you can communicate business metrics efficiently.

Personal impression

The visualizations in Kibana perform the tasks they are designed for very well. What is more, all the visualizations are eye-catching and you can tailor them according to your design ideas. The entire process of creating a dashboard in Kibana is meant to be fast and efficient, and it is, thanks to Kibana's user-friendly and intuitive interface.

On the other hand, I’ve felt that some functionality is missing here.

When working with data, one of the effective exploratory techniques you can apply is slicing and dicing your data before getting to know which aspects of the data to pay attention to. To my mind, the data table widget isn’t the best option — it presents the data in a flat table which doesn’t support a multi-dimensional view of the data. But playing with data should be done interactively and fast.

And this is where a pivot table control comes into play. After searching for available solutions, my choice fell on one open-source plugin called Flexmonster. It handles connecting to the Elasticsearch index and allows creating tabular reports based on the data from its documents. Along with that, integrating with Kibana is smooth — the only thing required to get started is to install a plugin by running one line of code in the command line. You can find more details on GitHub. Before using it, I recommend making sure that your Kibana and Elasticsearch instances are of the same version.

Once you have set up the tool, you are ready to use all the available features to dig for in-depth insights.

Features for analytics and reporting

Flexmonster Pivot provides fast access to the most essential reporting functionality. Its toolbar allows connecting to the data source, loading previously saved reports, and exporting reports to PDF, Excel, HTML, CSV, and images. Besides, I've managed to quickly switch between two different modes: the grid and the charts. Cell formatting options include conditional and number formatting. The field list deserves particular attention: here you can assign hierarchies to rows, columns, measures, and report filters. There is also a search input field, which is helpful if the index has a long list of fields.

One of the features I’d like to highlight is the ability to drag and drop the hierarchies right on the grid. Thereby, you can change the slice completely via the UI.

Another one is the drill-through feature — it helps to know which records stand behind the aggregated values.

Working with a pivot table

Let me show you how to create a report based on the Elasticsearch data:

While testing the tool, I’ve managed to aggregate and filter the data, sort the values on the grid and save the results to continue working with the report later. Plus, exporting works well — it’s easy to share the reports with teammates.

Bringing it all together

Today I’ve covered the benefits Kibana provides for visualizing Elasticsearch data. You’ve seen how dashboards can empower the analysis process.

To my mind, a pivot table is a good tool which enables you to benefit from exploring data before teasing out the answers to complex questions.

Flexmonster nicely complements the available functionality of Kibana: the reports you create with it are insightful, customizable, and easy to share across departments.

Working together, both tools have all the potential to boost your storytelling.

I encourage you to give such a combination a try.


How to use the Xodus database in Kotlin applications

freeCodeCamp — Wed, 10 Apr 2019 16:12:41 +0000

By Mariya Davydova

I want to show you how to use one of my favorite database choices for Kotlin applications. Namely, Xodus. Why do I like using Xodus for Kotlin applications? Well, here are a couple of its selling points:

Transactional
Embedded
Schema-less
Pure JVM-based
Has an additional Kotlin DSL — Xodus-DNQ.

What does this mean to you?

ACID on-board — all database operations are atomic, consistent, isolated, and durable.
No need to manage an external database — everything is inside your application.
Painless refactorings — if you need to add a couple of properties you won’t have to then rebuild the tables.
Cross-platform database — Xodus can run on any platform that can run a Java virtual machine.
Kotlin language benefits — take the best from using types, nullable values and delegates for properties declaration and constraints description.

Xodus is an open-source product from JetBrains. Originally it was developed for internal use, but it was subsequently released to the public in July 2016. The YouTrack issue tracker and the Hub team tool use it as their data storage. If you are curious about the performance, you can check out the benchmarks. As for a real-life example, take a look at the JetBrains YouTrack installation, which at the time of writing has over 1.6 million issues, not even counting all the comments and time tracking entries stored there.

Xodus-DNQ is a Kotlin library that contains the data definition language and queries for Xodus. It was also developed first as a part of the product and then later released publicly. YouTrack and Hub both use it for persistent layer definition.

Setup

Let’s write a small application which stores books and their authors.

I will use Gradle as a build tool, as it helps simplify all the dependencies management and project compilation stuff. If you have never worked with Gradle, I recommend taking a look at the official guides they have on installation and creating new builds.

So first, we need to start by creating a new directory for our example, and then run gradle init there. This will initialize the project structure and add some directories and build scripts.

Now, create a bookstore.kt file in src/main/kotlin directory. Fill it with the never-going-out-of-fashion classics:

fun main() {
  println("Hello World")
}

Then, update the build.gradle file using code similar to this:

plugins {
  id 'application'
  id 'org.jetbrains.kotlin.jvm' version '1.3.21'
}
group 'mariyadavydova'
version '1.0-SNAPSHOT'
sourceCompatibility = 1.8
targetCompatibility = 1.8
tasks.withType(org.jetbrains.kotlin.gradle.tasks.KotlinCompile).all {
  kotlinOptions {
    jvmTarget = "1.8"
  }
}
repositories {
  mavenCentral()
}
dependencies {
  implementation 'org.jetbrains.kotlin:kotlin-stdlib-jdk8:1.3.21'
  implementation 'org.jetbrains.xodus:dnq:1.2.420'
}
mainClassName = 'BookstoreKt'

There are a few things that are happening here:

We add the Kotlin plugin and claim that the compilation output is targeted for JVM 1.8.
We add dependencies to the Kotlin standard library and Xodus-DNQ.
We also add the application plugin and define the main class. In the case of the Kotlin application, we do not have a class with a static method main, like in Java. Instead, we have to define a standalone function main. However, under the hood, Kotlin still makes a class containing this function, and the name of the class is generated from the name of the file. For example, ‘bookstore.kt’ makes ‘BookstoreKt’.

We can actually safely remove settings.gradle, as we don’t need it in this example.

Now, execute ./gradlew run; you should see “Hello World” in your console:

> Task :run
Hello World

Data definition

Xodus offers three ways to work with data: Environments, Entity Stores and the Virtual File System. However, Xodus-DNQ supports only the Entity Stores, which describe a data model as a set of typed entities with named properties (attributes) and named entity links (relations). An entity is similar to a row in an SQL database table.

As my goal is to demonstrate how simple it is to operate Xodus via Kotlin DSL, I’ll stick to the entity types API for this story.

Let’s start with an XdAuthor:

class XdAuthor(entity: Entity) : XdEntity(entity) {
  companion object : XdNaturalEntityType<XdAuthor>()

  var name by xdRequiredStringProp()
  var countryOfBirth by xdStringProp()
  var yearOfBirth by xdRequiredIntProp()
  var yearOfDeath by xdNullableIntProp()
  val books by xdLink0_N(XdBook::authors)
}

From my point of view, this declaration looks pretty natural: we say that our authors always have names and year of birth, may have country of birth and year of death (the latter is irrelevant for the currently living authors); also, there could be any number of books from each author in our bookstore.

There are several things worth mentioning in this code snippet:

The companion object declares the entityType property for each class (which is used by the database engine).
The data fields are declared with the help of the delegates, which encapsulate the types, properties, and constraints for these fields.
Links are values, not variables; that is, you don’t set them with =, but access them as a collection. (Pay attention to val books versus var name; I spent quite a bit of time trying to figure out why the compilation with var books kept failing.)

The second type is an XdBook:

class XdBook(entity: Entity) : XdEntity(entity) {
  companion object : XdNaturalEntityType<XdBook>()

  var title by xdRequiredStringProp()
  var year by xdNullableIntProp()
  val genres by xdLink1_N(XdGenre)
  val authors : XdMutableQuery<XdAuthor> by xdLink1_N(XdAuthor::books)
}

The thing to pay attention to here is the declaration of the authors’ field:

Notice that we write down the type explicitly (XdMutableQuery<XdAuthor>). For the bidirectional link, we have to help the compiler to resolve the types by leaving a hint on one of the link ends.

Also, notice that XdAuthor::books references XdBook::authors and vice versa. We have to add these references if we want the link to be bidirectional; so if you add an author to the book, the book will appear in the list of the books of this author, and vice versa.

The third entity type is an XdGenre enumeration, which is pretty trivial:

class XdGenre(entity: Entity) : XdEnumEntity(entity) {
  companion object : XdEnumEntityType<XdGenre>() {
    val FANTASY by enumField {}
    val ROMANCE by enumField {}
  }
}

Database initialization

Now that we have declared the entity types, we have to initialize the database:

fun initXodus(): TransientEntityStore {
  XdModel.registerNodes(
      XdAuthor,
      XdBook,
      XdGenre
  )
  val databaseHome = File(System.getProperty("user.home"), "bookstore")
  val store = StaticStoreContainer.init(
      dbFolder = databaseHome,
      environmentName = "db"
  )
  initMetaData(XdModel.hierarchy, store)
  return store
}
fun main() {
  val store = initXodus()
}

This code shows the most basic setup:

We define the data model. Here we list all entity types manually, but it is possible to auto scan the classpath as well.
We initialize the database store in {user.home}/bookstore folder.
We link the metadata with the store.

Filling the data in

To make the content of the database easy to print, we first override toString() in each entity class:

class XdAuthor(entity: Entity) : XdEntity(entity) {
  ...
  override fun toString(): String {
    val bibliography = books.asSequence().joinToString("\n")
    return "$name ($yearOfBirth-${yearOfDeath ?: "???"}):\n$bibliography"
  }
}

class XdBook(entity: Entity) : XdEntity(entity) {
  ...
  override fun toString(): String {
    val genres = genres.asSequence().joinToString(", ")
    return "$title (${year ?: "Unknown"}) - $genres"
  }
}

class XdGenre(entity: Entity) : XdEnumEntity(entity) {
  ...
  override fun toString(): String {
    return this.name.toLowerCase().capitalize()
  }
}

Notice the books.asSequence().joinToString("\n") and genres.asSequence().joinToString(", ") instructions: here we use the asSequence() method to convert an XdQuery into a Kotlin sequence.

Right, let’s now add several books from our collection inside the main function. We perform all database operations (creating, reading, updating and removing entities) inside transactions: atomic database modifications that guarantee consistency is preserved.

In the case of our bookstore, there are plenty of ways to fill it with stuff:

Add an author and a book separately:

 val bronte = store.transactional {
   XdAuthor.new {
     name = "Charlotte Brontë"
     countryOfBirth = "England"
     yearOfBirth = 1816
     yearOfDeath = 1855
   } 
 }
 store.transactional {
   XdBook.new {
     title = "Jane Eyre"
     year = 1847
     genres.add(XdGenre.ROMANCE)
     authors.add(bronte)
   }
 }

Add an author and put several books in their list:

 val tolkien = store.transactional {
   XdAuthor.new {
     name = "J. R. R. Tolkien"
     countryOfBirth = "England"
     yearOfBirth = 1892
     yearOfDeath = 1973
   }
 }
 store.transactional {
   tolkien.books.add(XdBook.new {
     title = "The Hobbit"
     year = 1937
     genres.add(XdGenre.FANTASY)
   })
   tolkien.books.add(XdBook.new {
     title = "The Lord of the Rings"
     year = 1955
     genres.add(XdGenre.FANTASY)
   })
 }

Add an author with books:

 store.transactional {
   XdAuthor.new {
     name = "George R. R. Martin"
     countryOfBirth = "USA"
     yearOfBirth = 1948
     books.add(XdBook.new {
       title = "A Game of Thrones"
       year = 1996
       genres.add(XdGenre.FANTASY)
     })
   }
 }

To check that everything is created, all we need to do is to print the content of our database:

 store.transactional(readonly = true) {
   println(XdAuthor.all().asSequence().joinToString("\n***\n"))
 }

Now, if you execute ./gradlew run, you should see the following output:

Charlotte Brontë (1816-1855):
Jane Eyre (1847) - Romance
***
J. R. R. Tolkien (1892-1973):
The Hobbit (1937) - Fantasy
The Lord of the Rings (1955) - Fantasy
***
George R. R. Martin (1948-???):
A Game of Thrones (1996) - Fantasy

Constraints

As mentioned, the transactions guarantee data consistency. One of the operations which Xodus does before saving the changes is checking the constraints. In the DNQ, some of them are encoded in the name of the delegate which provides a property of a given type. For example, xdRequiredIntProp has to always be set to some value, whereas xdNullableIntProp can remain empty.

Beyond this, Xodus-DNQ allows defining more complex constraints, which are described in the official documentation. I have added several examples to the XdAuthor entity type:

  var name by xdRequiredStringProp { containsNone("?!") }
  var countryOfBirth by xdStringProp {
    length(min = 3, max = 56)
    regex(Regex("[A-Za-z.,]+"))
  }
  var yearOfBirth by xdRequiredIntProp { max(2019) }
  var yearOfDeath by xdNullableIntProp { max(2019) }

You may be wondering why I have limited the countryOfBirth property length to 56 characters. Well, the longest official country name which I found is “The United Kingdom of Great Britain and Northern Ireland” — precisely 56 characters!

Queries

We have already used database queries above. Do you remember? We printed the list of authors using XdAuthor.all().asSequence(). As you may guess, the all() method returns all the entries of a given entity type.

More often than not though, we will prefer filtering data. Here are some examples:

store.transactional(readonly = true) {
  val fantasyBooks = XdBook.filter { 
    it.genres contains XdGenre.FANTASY }
  val booksOf20thCentury = XdBook.filter { 
    (it.year ge 1900) and (it.year lt 2000) }
  val authorsFromEngland = XdAuthor.filter { 
    it.countryOfBirth eq "England" }

  val booksSortedByYear = XdBook.all().sortedBy(XdBook::year)
  val allGenres = XdBook.all().flatMapDistinct(XdBook::genres)
}

Again, there are plenty of options for building data queries, so I strongly recommend taking a look at the documentation.

I hope this story is as useful for you as it was for me when I wrote it :) Any feedback is highly appreciated!

You can find the source code for this tutorial here.

The basics of NoSQL databases — and why we need them

freeCodeCamp — Thu, 31 Jan 2019 18:33:53 +0000

By Nandhini Saravanan

A beginner’s guide to the NoSQL world

Organizing data is a very difficult task. When we say organise, we are actually categorising stuff depending on its type and function.

[Source](https://bitnine.net/wp-content/uploads/2016/12/SQL-vs.-NoSQL-Comparative-Advantages-and-Disadvantages.jpg)

One option is an RDBMS (relational database management system). An RDBMS is like an Excel sheet: you categorise data in the form of tables, and you can form relationships between the tables.

A query questions the database, which gives you a relevant answer in return. This querying language is SQL or Structured Query Language.

For example,

select * from Employee_Data;

selects all the Employee Data from the Employee_Data table.
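To try this query without installing a database server, here is a small sketch using the sqlite3 module from Python's standard library. The table columns and the two sample employees are made up for the illustration; only the table name comes from the query above.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is the blueprint every row must follow.
conn.execute(
    "CREATE TABLE Employee_Data (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO Employee_Data (name, department) VALUES (?, ?)",
    [("Asha", "Engineering"), ("Ben", "Sales")],
)

# The query from the text: every column of every row.
rows = conn.execute("select * from Employee_Data;").fetchall()
for row in rows:
    print(row)

conn.close()
```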

Relational databases follow a schema, a detailed blueprint of how your tables work.

You use Amazon, Facebook and so many networking applications. They release updates, add new functionalities and even extra modules. So how does one change the schema each time? Isn’t it time consuming for such huge companies to devote their time and labour to changing the schema?

This is where relational databases fall short.

The Cons of RDBMS

Relational databases aren’t as bad as people portray these days. They are still in use by plenty of organisations. The introduction of NoSQL into the picture is to fill up the spaces where RDBMS can’t be of use anymore.

I am going to show you examples so that you have a clear understanding.

1. RDBMS cannot handle ‘Data Variety’.

The amount of unstructured data continues to increase yearly and managing it is hard. RDBMS can’t force all types of data under a unified schema of tables.

Data Silos are also a problem for developers.

According to Tech Target, a data silo is a repository of data that remains under the control of one department. It is isolated from the rest of the organisation.

This means that when more silos exist for the same data, their contents are likely to differ. It creates confusion on which repository represents the most up-to-date version.

The increase of data from the year 2013 to 2020 is visible in the image below.

About 44 zettabytes of data will be generated in the year 2020.

Handling such diverse data, which isn’t related to each other, could be much harder in an RDBMS.

[Source](https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)

Example: It is difficult to store the details of a patient, who has varying body conditions. Categorisation of such diverse data is difficult in RDBMS.

2. Difficult to change tables and relationships.

Alteration of the relationships between tables or addition of a new table could affect the existing relations. This means changing the schema.

Change of the schema would be like eliminating the existing one and devising a new schema.

Addition of a new functionality would need all the elements to support the new structure. Change is inevitable.

Example: Each extra column needs all the prior rows to have values for that column, whereas in Cassandra (a NoSQL database), you can add a column to specific row partitions.

In an RDBMS, every entry should have the same number of columns. But in Cassandra, each row can have a different number of columns. As you can see, 104 has name only whereas 103 has email, name, tel and tel2. Source: [Markus Klems](https://www.slideshare.net/yellow7?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview)

3. RDBMS follows the ACID properties of the database.

The ACID properties of a database are Atomicity, Consistency, Isolation and Durability.

Atomicity — An “all or nothing” approach. If any statement in the transaction fails, the entire transaction is rolled back.

Consistency — The transaction must meet all protocols defined by the system. No half completed transactions.

Isolation — No transaction has access to any other transaction that is in an intermediate or unfinished state. Each transaction is independent.

Durability — Ensures that once a transaction commits to the database, it is preserved through the use of backups and transaction logs.
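Atomicity is the easiest of the four to see in action. The sketch below uses SQLite through Python's standard sqlite3 module; the accounts table, the names and the transfer() helper are invented for the example. A transfer that would overdraw an account fails halfway through, and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # the debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(ok, failed, balances)
```

In the failed transfer, the credit to bob had already been applied when the debit was rejected; the rollback undoes it, so no money appears out of nowhere.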

The ACID properties aren’t flexible.

For example, an RDBMS follows normalization, or the single source of truth concept. For every change you make, you should ensure strict ACID properties. The entity integrity and referential integrity rules also apply.

The CAP Theorem

According to Wikipedia, the CAP theorem (Brewer’s theorem) states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency: Every read receives the most recent write or an error. (Note that this is not the same as the C in ACID.)

Availability: Resources should always be available. Every request should receive a non-error response.

Partition tolerance: The system keeps working even when the network between nodes drops or delays messages.

Since all three conditions cannot be guaranteed at once, one must compromise between them.

[Source](https://www.dummies.com/wp-content/uploads/423504.image0.jpg)

BASE to the rescue!

NoSQL relies upon a softer model known as the BASE model: Basically Available, Soft state, Eventual consistency.

Basically Available: Guarantees the availability of the data. There will be a response to any request (it can be a failure too).

Soft state: The state of the system could change over time.

Eventual consistency: The system will eventually become consistent once it stops receiving input.

NoSQL databases give up the A, C and/or D requirements, and in return they improve scalability.

NoSQL

This is when NoSQL came to the rescue. The name stands for “Not Only SQL” or “non-relational” databases.

Characteristics of NoSQL:

Schema free
Eventually consistent (as in the BASE property)
Replication of data stores to avoid Single Point of Failure.
Can handle Data variety and huge amounts of data.

Types of NoSQL databases

NoSQL databases fall into four main categories:

Key value Stores — Riak, Voldemort, and Redis

Wide Column Stores — Cassandra and HBase.

Document databases — MongoDB

Graph databases — Neo4J and HyperGraphDB.

The names on the right-hand side are examples of each type of NoSQL database.

[Source](https://s3.amazonaws.com/dev.assets.neo4j.com/wp-content/uploads/nosql-quadrant.jpg)

1. Key Value Stores

A key value store uses a hash table in which there exists a unique key and a pointer to a particular item of data.

Imagine key value stores to be like a phone directory where the names of the individual and their numbers are mapped together.

Key value stores have no default query language. You store, retrieve, and remove data with simple put, get, and delete commands, which is one reason they perform so well.

Applications: Useful for storing comments and session information. Pinterest uses Redis to store lists of users, followers, unfollowers, and boards.
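The put, get, and delete interface described above can be sketched with a JavaScript `Map`, which is itself a hash table. The `KeyValueStore` class and the `session:42` key are hypothetical names for this illustration, not the API of Redis or any other product.

```javascript
// A minimal in-memory key value store backed by a Map (a hash table).
class KeyValueStore {
  constructor() {
    this.items = new Map();
  }

  // Store a value under a unique key, overwriting any previous value.
  put(key, value) {
    this.items.set(key, value);
  }

  // Look up a value by its key; returns undefined when the key is absent.
  get(key) {
    return this.items.get(key);
  }

  // Remove a key; returns true if the key existed.
  delete(key) {
    return this.items.delete(key);
  }
}

// Usage: storing session information, one of the applications named above.
const sessions = new KeyValueStore();
sessions.put("session:42", { user: "alice", loggedIn: true });
console.log(sessions.get("session:42").user);
sessions.delete("session:42");
console.log(sessions.get("session:42"));
```

Notice that there is no way to ask "which sessions belong to alice?" without knowing the key. Every operation goes through the key, which is what keeps lookups fast.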

2. Wide column stores

In a wide column store, each row holds its own set of columns.

A column family is a container of rows, comparable to a table in an RDBMS. The row key identifies a row, which consists of multiple columns.

Rows do not need to have the same number of columns. Columns can be added to any row at any time without having to add them to other rows. It is a partitioned row store.

_[Source](https://studio3t.com/wp-content/uploads/2017/12/cassandra-column-family-example.png)_
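The idea that rows in the same column family can carry different columns can be modeled as a map of row keys to per-row column maps. This is a conceptual sketch only: the `ColumnFamily` class and the `user_profiles` data are assumptions for illustration and say nothing about how Cassandra or HBase lay out data on disk.

```javascript
// A column family modeled as: row key -> (column name -> value).
class ColumnFamily {
  constructor(name) {
    this.name = name;
    this.rows = new Map();
  }

  // Add or update one column on one row; other rows are left untouched.
  setColumn(rowKey, column, value) {
    if (!this.rows.has(rowKey)) {
      this.rows.set(rowKey, new Map());
    }
    this.rows.get(rowKey).set(column, value);
  }

  // Return a row as a plain object, or an empty object if the row is absent.
  getRow(rowKey) {
    return Object.fromEntries(this.rows.get(rowKey) ?? []);
  }
}

const profiles = new ColumnFamily("user_profiles");
profiles.setColumn("user1", "name", "Alice");
profiles.setColumn("user1", "email", "alice@example.com");
profiles.setColumn("user2", "name", "Bob");
profiles.setColumn("user2", "country", "NZ"); // only user2 has this column

console.log(profiles.getRow("user1"));
console.log(profiles.getRow("user2"));
```

Adding the `country` column to `user2` required no change to `user1`, which is the flexibility the section describes.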

[Figure: How columnar stores store data]

Applications: Spotify uses Cassandra to store user profile attributes and metadata.

3. Document Databases

Document stores use JSON, XML, or BSON (a binary encoding of JSON) documents to store data.

It is like a key-value database, but a document store consists of semi-structured data.

A single document stores a record together with all of its data.

Document stores typically do not enforce relations or support joins the way an RDBMS does.

_An example of a JSON document. [Source](https://webassets.mongodb.com/_com_assets/cms/JSON_Example_Python_MongoDB-mzqqz0keng.png)_

If we want to store the customer details and their orders, we can use document stores to do it.

_The Customer database is stored as a set of documents (which can be JSON) that is mapped to the Orders database. Source: [MSDN Microsoft Blog](https://blogs.msdn.microsoft.com/usisvde/2012/04/05/getting-acquainted-with-nosql-on-windows-azure/)_
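As one possible layout for the customer and orders example, the sketch below keeps a customer's orders inside the customer document itself, so a single read returns everything. All field names, IDs, and amounts are invented for illustration. Prices are stored as integer cents to avoid floating-point rounding.

```javascript
// A hypothetical customer document with its orders nested inside it.
const customerDocument = {
  _id: "customer-1001",
  name: "Alice Example",
  email: "alice@example.com",
  orders: [
    {
      orderId: "order-1",
      totalCents: 2599,
      items: [{ sku: "book-nosql", quantity: 1 }],
    },
    {
      orderId: "order-2",
      totalCents: 1250,
      items: [{ sku: "sticker-pack", quantity: 5 }],
    },
  ],
};

// No join is needed: the orders travel with the customer record.
function totalSpentCents(customer) {
  return customer.orders.reduce((sum, order) => sum + order.totalCents, 0);
}

// Documents are semi-structured, so they serialize directly to JSON.
const asJson = JSON.stringify(customerDocument);
const restored = JSON.parse(asJson);

console.log(totalSpentCents(restored)); // 3849
```

The alternative is to keep orders in their own collection and store only the customer ID on each order. That layout trades the single read for smaller documents, and the application has to do the lookup that a SQL join would have done.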

Applications: SEGA uses MongoDB to handle 11 million in-game accounts.

4. Graph databases

Nodes and relationships are the essential constituents of graph databases. A node represents an entity. A relationship represents how two nodes are associated.

In an RDBMS, adding another relation often results in a lot of schema changes.

A graph database only needs to store each piece of data once, as a node. The different types of relationships (edges) are then attached to the stored data.

The relationships between the nodes are predetermined. That is, they are stored with the data rather than computed at query time.

Traversing persisted relationships is faster.

It is difficult to change a relationship between two nodes once it is stored, because the change can ripple through the rest of the database.

Example: The image below shows how a relational database such as MySQL has to perform many operations to find the correct result for Alice.

_[Source](https://s3.amazonaws.com/dev.assets.neo4j.com/wp-content/uploads/from_relational_model.png)_

A graph database, by contrast, stores the relationships ahead of time:

_[Source](https://s3.amazonaws.com/dev.assets.neo4j.com/wp-content/uploads/relational_to_graph.png)_
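The difference can be sketched with a small adjacency list, where each node keeps its own list of typed relationships. The `Graph` class, the `FRIEND` relationship type, and the people other than Alice are assumptions made for this example. It is not the storage model or query language of Neo4j or any other graph database.

```javascript
// Nodes are stored once; each node keeps a list of its outgoing relationships.
class Graph {
  constructor() {
    this.edges = new Map(); // node name -> [{ type, to }]
  }

  addNode(name) {
    if (!this.edges.has(name)) {
      this.edges.set(name, []);
    }
  }

  // Persist a typed, directed relationship between two nodes.
  addRelationship(from, type, to) {
    this.addNode(from);
    this.addNode(to);
    this.edges.get(from).push({ type, to });
  }

  // Follow stored relationships of one type; no table scan or join is needed.
  neighbors(node, type) {
    return (this.edges.get(node) ?? [])
      .filter((edge) => edge.type === type)
      .map((edge) => edge.to);
  }
}

// Two-hop traversal: friends of friends who are not already direct friends.
function friendsOfFriends(graph, start) {
  const direct = new Set(graph.neighbors(start, "FRIEND"));
  const result = new Set();
  for (const friend of direct) {
    for (const candidate of graph.neighbors(friend, "FRIEND")) {
      if (candidate !== start && !direct.has(candidate)) {
        result.add(candidate);
      }
    }
  }
  return [...result];
}

const social = new Graph();
social.addRelationship("Alice", "FRIEND", "Bob");
social.addRelationship("Alice", "FRIEND", "Dave");
social.addRelationship("Bob", "FRIEND", "Carol");
social.addRelationship("Dave", "FRIEND", "Carol");
social.addRelationship("Dave", "FRIEND", "Bob");

console.log(friendsOfFriends(social, "Alice")); // Carol is the only result
```

Each hop is a direct lookup on the current node. In a relational schema, the same question would usually need a self-join on a friendships table for every hop.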

This is some of the basic information you will need to start exploring NoSQL. New databases are being invented for specific uses.

Learn the type of data your application generates, and then it is easy to choose the right database.

