Anant Chowdhary - freeCodeCamp.org

How Message Queues Help Make Distributed Systems More Reliable

Anant Chowdhary — Mon, 28 Oct 2024 13:41:21 +0000

Reliable systems consistently perform their intended functions under various conditions while minimizing downtime and failures.

As internet users, we tend to take for granted that the systems that we use daily will operate reliably. In this article, we’ll explore how message queues enhance flexibility and fault tolerance. We’ll also discuss some challenges that we may face while using them.

After reading through, you’ll know how to implement reliable systems and what key performance factors to keep in mind.

Prerequisites

Before diving into this article, you should have a foundational understanding of cloud computing. Here are the key concepts:

Basic principles of Cloud Computing
Availability in Distributed Systems
An understanding of the CAP theorem.

Reliability in Distributed Systems
What Makes Software Reliable?
What is a Message Queue?
How Message Queues Help Make Distributed Systems More Reliable
Challenges with Message Queues
Summary

What Does Reliability Mean in the Context of Distributed Systems?

Reliability, according to the OED, is “the quality of being trustworthy or of performing consistently well”. We can translate this definition to the following in the context of distributed systems:

The ability of a technological system, device, or component to consistently and dependably perform its intended functions under various conditions over time. For instance, in the context of online banking, reliability refers to the consistent and secure processing of transactions. Users expect to complete transfers and access their accounts without errors or outages.
The system being resilient to unexpected or erroneous interactions by users / other systems interacting with it. For instance, if a user tries to access a deleted file on a cloud storage system, the system can gracefully notify them and suggest alternatives, rather than crashing.
The system performs satisfactorily under its expected conditions of operation, as well as in the case of unexpected load and/or disruptions. An example of this is a video streaming service during a major sporting event. The system is designed to perform well under normal traffic but must also handle sudden spikes in users when a popular game starts.

This is quite a general view of what reliability is, and the definition changes with time, as systems change with changing technology.

What Makes Software Reliable?

Several key techniques are used across the industry to make large-scale distributed software reliable.

Data Replication

Data replication is a fundamental concept in system design where data is intentionally duplicated and stored in multiple locations or servers.

This redundancy serves several critical purposes, including enhancing data availability, improving fault tolerance, and enabling load balancing.

By replicating data across different nodes or data centers, we may be able to ensure that, in the event of a hardware failure or network issue, the data remains accessible. This reduces downtime and enhances system reliability.

It's essential to implement replication strategies carefully, considering factors like consistency, synchronization, and conflict resolution to maintain data integrity and reliability in distributed systems.

Let’s look at a concrete example. With a primary-secondary database model such as one used with e-commerce websites, we may have the following:

Replication: The primary database handles all the write operations, whereas the secondary databases handle the reads. This ensures that reads are spread out across multiple databases, enhancing performance and lowering the probability of a crash.
Consistency: The system may use eventual consistency to maintain integrity, ensuring that all replicas eventually reflect the same data. But during high-traffic periods, the website may temporarily allow for slight inconsistencies, such as showing outdated inventory levels.
Conflict Resolution: If two users attempt to buy a single available item at the same time, a conflict resolution strategy may be used. For instance, the system could use timestamps to determine the customer who gets assigned the product, and this may dictate database updates eventually.
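
The timestamp-based strategy described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular database resolves conflicts: `PurchaseAttempt` and `resolve_conflict` are hypothetical names, and the sketch assumes every attempt carries a timestamp that is comparable across servers, with the earliest attempt winning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseAttempt:
    customer_id: str
    timestamp: float  # seconds since the epoch, assumed comparable across servers


def resolve_conflict(attempts):
    """Return the earliest attempt; ties are broken by customer_id so the result is deterministic."""
    return min(attempts, key=lambda attempt: (attempt.timestamp, attempt.customer_id))


attempts = [
    PurchaseAttempt("customer-b", 1700000000.250),
    PurchaseAttempt("customer-a", 1700000000.100),
]
winner = resolve_conflict(attempts)
print(winner.customer_id)  # customer-a
```

In practice, clocks on different machines drift, so a real system would need to account for that before trusting timestamps alone.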

Load Distribution Across Machines

Load distribution involves distributing computational tasks and network traffic across multiple servers or resources to optimize performance and ensure system scalability.

By intelligently spreading workloads, load distribution prevents any single server from becoming overwhelmed, reducing the risk of bottlenecks and downtime.

Some very commonly used load distribution mechanisms are:

Using Load Balancers: A load balancer can evenly distribute incoming traffic across multiple servers, preventing any single server from becoming a bottleneck.
Dynamic Scaling: Dynamic or auto-scaling can be used to automatically adjust the number of active servers based on current demand, adding more resources during peak times and scaling down during low traffic.
Caching: Caching layers can be used to store frequently accessed data, reducing the load on backend servers by serving requests directly from the cache.
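
To make the load balancer idea concrete, here is a minimal round-robin sketch. Round-robin is only one of several possible strategies, and the `RoundRobinBalancer` class and server names are hypothetical, chosen purely for illustration.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation so that no single server gets every request."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(list(servers))

    def next_server(self):
        return next(self._cycle)


balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2', 'server-3']
```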

Capacity Planning

Capacity planning entails analyzing factors such as expected user growth, data storage requirements, and processing capabilities to ensure that the system can handle increased loads without performance degradation or downtime.

By accurately forecasting resource needs and scaling infrastructure accordingly, such planning helps optimize costs, maintain reliability, and provide a seamless user experience. Being proactive can help ensure a system is well-prepared to adapt to changing requirements and remains robust and efficient throughout its lifecycle.

A lot of modern systems can scale automatically with projected loads. When traffic or processing requirements increase, such auto scaling automatically provisions additional resources to handle the load. Conversely, when demand decreases, it scales down resources to optimize cost efficiency.
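
The scaling decision described above can be approximated with some simple arithmetic. The sketch below is an assumption-laden illustration, not any cloud provider's algorithm: `desired_server_count`, the per-server capacity, and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_server_count(requests_per_second, capacity_per_server, min_servers=2, max_servers=20):
    """Return how many servers are needed for the current load, clamped to the allowed range."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(requests_per_second / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


print(desired_server_count(950, 100))   # 10: scale up for a traffic peak
print(desired_server_count(50, 100))    # 2: scale down, but never below the minimum
print(desired_server_count(5000, 100))  # 20: capped at the maximum
```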

Metrics and Automated Alerting

Metrics involve collecting and analyzing data points that provide insights into various aspects of system behavior, such as resource utilization, response times, error rates, and more.

Automated alerting complements metrics by enabling proactive monitoring. This involves setting predefined thresholds or conditions based on metrics. When a metric crosses or exceeds these thresholds, automated alerts get triggered. These alerts can notify system administrators or operators, allowing them to take immediate action to address potential issues before they impact the system or users.

When used together, metrics and automated alerting create a robust monitoring and troubleshooting system, helping ensure that anomalies or problems are quickly detected and resolved.
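
As a rough sketch of threshold-based alerting, the function below compares the latest metric values against predefined limits. The metric names, thresholds, and the `check_thresholds` function are hypothetical; a real setup would typically rely on a monitoring service rather than hand-written checks.

```python
def check_thresholds(metrics, thresholds):
    """Return one alert message for every metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


latest_metrics = {"error_rate": 0.07, "p99_latency_ms": 180}
alert_thresholds = {"error_rate": 0.05, "p99_latency_ms": 250}
alerts = check_thresholds(latest_metrics, alert_thresholds)
print(alerts)  # ['error_rate=0.07 exceeds threshold 0.05']
```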

Now that you know a bit about what reliability means in the context of Distributed Systems, we can move on to Message Queues.

What is a Message Queue?

A message queue is a communication mechanism used in distributed systems to enable asynchronous communication between different components or services. It acts as an intermediary that allows one component to send a message to another without the need for direct, synchronous communication.

In a typical setup, multiple nodes (called producers) create messages and send them to a message queue. A node called the consumer then processes these messages, and may perform a series of actions (for instance database reads or writes) while processing each one.

Now let’s look at an actual example where a message queue may be useful. Let’s assume we have an e-commerce website that allows millions of orders to be processed.

Processing an order may take place in the following steps:

A user creates an order. This sets off a request to a web server, that in turn creates a message that is placed in the orders queue.
A consumer reads the message, and in turn calls different services while processing the message (for instance the inventory checks, the payment service, the shipping service)
Once all processing steps have completed, the consumer removes the message from the queue.

Note that in case there are parts of the system that fail, the message can be left in the queue to be re-processed.

Even in cases where there is a total outage on the processing side of things, messages can simply pile up in the queue and be consumed once services are functional again. This is an example of a queue being useful in multiple failure scenarios.

Let’s look at some code for this scenario using AWS SQS, which is a popular message queue service that allows users to create queues, send messages to the queue, and also consume messages from queues for processing.

The example below uses Boto3, the AWS SDK for Python, to talk to SQS.

First, we’ll place an order, assuming we already have an SQS queue called OrderQueue in place.

import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Let's assume the queue is called OrderQueue
# This is the queue in which orders are placed
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to send an order message
# This places an order in the queue, which can at any time be
# picked up by a consumer and then processed
def send_order(order_details):
    message_body = json.dumps(order_details)
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(f'Order sent with ID: {response["MessageId"]}')

# Using the queue to place an order
# Defining a sample order

order = {
    'order_id': '12345',
    'customer_id': '67890',
    'items': [
        {'product_id': 'abc123', 'quantity': 2},
        {'product_id': 'xyz456', 'quantity': 1}
    ],
    'total_price': 59.99
}

# Sending the order to the queue which is expected to be picked up 
# by a consumer and processed eventually.
send_order(order)

Then once the order has been placed, here’s some code that illustrates how it’ll be picked up for processing:

import boto3
import json

# Create an SQS client
sqs = boto3.client('sqs')

# Processing orders from the same queue defined above
queue_url = 'https://sqs.us-east-1.amazonaws.com/2233334/OrderQueue'

# Function to receive and process orders
# Picking up a maximum of 10 messages at a time to process
def receive_orders():
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Up to 10 messages
        WaitTimeSeconds=10
    )

    messages = response.get('Messages', [])

    for message in messages:
        order_details = json.loads(message['Body'])
        print(f'Processing order: {order_details}')

        # Processing the order with details such as 
        # processing payments, updating the inventory levels,
        # processing shipping etc.

        # Delete the message after processing
        # This is important since we don't want an
        # order to be processed multiple times.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message['ReceiptHandle']
        )

# Receive a batch of orders
receive_orders()
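
The consumer above deletes every message it receives. To leave a message in the queue when processing fails, as described earlier, the delete can be made conditional on success. The following is a hedged sketch: `process_batch` and `handle_order` are hypothetical names, and the SQS client is passed in as a parameter. It relies on the fact that SQS makes a message visible again once its visibility timeout expires if the message was never deleted.

```python
import json


def process_batch(sqs_client, queue_url, handle_order):
    """Receive up to 10 messages and delete only the ones that were handled successfully."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,
    )
    processed, failed = 0, 0
    for message in response.get("Messages", []):
        try:
            handle_order(json.loads(message["Body"]))
        except Exception as error:
            # No delete here: the message stays in the queue and will be
            # delivered again after its visibility timeout expires.
            print(f"Leaving message {message['MessageId']} in the queue: {error}")
            failed += 1
            continue
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed += 1
    return processed, failed
```

With a real queue, this could be called as `process_batch(boto3.client('sqs'), queue_url, handle_order)`, where `handle_order` wraps the payment, inventory, and shipping steps.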

What is an Intermediary in a Distributed System?

In the context of what we’re discussing here, a message queue is an intermediary. Quoting Amazon AWS’ definition of a message queue:

“Amazon Simple Queue Service (Amazon SQS) lets you send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.”

This is a wonderfully succinct and accurate description of why a message queue (an intermediary) is important.

In a message queue, messages are placed in a queue data structure, which you can think of as a temporary storage area. The producer places messages in the queue, and the consumer retrieves and processes them at its own pace. This decoupling of producers and consumers allows for greater flexibility, scalability, and fault tolerance in distributed systems.
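
The decoupling described above can be illustrated without any cloud service, using Python's standard library `queue` module as a stand-in for a real message queue. This is a toy, single-process sketch, and `produce` and `drain` are hypothetical helper names.

```python
import queue

order_queue = queue.Queue()


def produce(order_id):
    """Producer side: enqueue and return immediately, without waiting for a consumer."""
    order_queue.put({"order_id": order_id})


def drain(handler):
    """Consumer side: process whatever is currently queued, at the consumer's own pace."""
    handled = 0
    while True:
        try:
            message = order_queue.get_nowait()
        except queue.Empty:
            return handled
        handler(message)
        handled += 1


# Producers keep working even though no consumer is running yet
for order_id in ("1001", "1002", "1003"):
    produce(order_id)
print(order_queue.qsize())  # 3

# The consumer comes online later and catches up
processed = []
drain(processed.append)
print([message["order_id"] for message in processed])  # ['1001', '1002', '1003']
```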

How Message Queues Help Make Distributed Systems More Reliable

Now let's discuss how Message Queues help make Distributed Systems more reliable.

1. Message Queues Provide Flexibility

Message queues allow for asynchronous communication between components. This means that producers can send messages to the queue without waiting for immediate processing by consumers. This allows components to work independently and at their own pace, providing flexibility in terms of processing times. So this is a great way to make designs flexible, and as self contained as possible.

2. Message Queues Make Systems Scalable

Message queues are often the bread and butter of scalable distributed systems for the following reasons:

Multiple producers can add messages to a message queue. This raises the ceiling and allows us to easily horizontally scale applications.
Multiple consumers can read from a message queue. This again allows us to easily scale throughput if needed in a lot of scenarios.
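
Here is a small sketch of several consumers reading from one queue, again using the standard library as a stand-in for a real message queue. The `run_consumers` function and the consumer names are hypothetical. Which consumer handles which message is not deterministic, but every message is handled exactly once.

```python
import queue
import threading


def run_consumers(messages, worker_count=3):
    """Let several consumer threads drain one shared queue and record who handled what."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            with lock:
                results.append((name, message))

    threads = [
        threading.Thread(target=worker, args=(f"consumer-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_consumers(list(range(10)))
print(sorted(message for _, message in results))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```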

3. Message Queues Make Systems Fault Tolerant

What happens if a distributed system is overwhelmed? We sometimes need to have the ability to cut the cord in order to get the system back to a working state. We’d ideally want the ability to process requests that weren’t processed when the system was down.

This is exactly what a message queue can help us with. We may have hundreds of thousands of requests that weren’t processed, but are still in the queue. These can be processed once our system is back online.

Challenges with Message Queues

Message queues aren’t a silver bullet for scaling problems in distributed systems.

Here are some situations where message queues may be useful:

Asynchronous Processing: Message queues are generally an excellent choice wherever asynchronous processing is required. In workflows such as sending confirmation emails or generating reports after an order is placed, message queues can decouple these tasks from the primary application flow.
Load Balancing: As we saw in our example for message queues, in scenarios where traffic spikes occur, message queues can buffer incoming requests, allowing multiple consumers to process messages concurrently. This helps distribute the load evenly across available resources.
Fault Tolerance: In systems where reliability is crucial, message queues provide a mechanism for handling failures. If a service is temporarily down, messages can be retained in the queue until the service is available again, ensuring that no data is lost unless intended.

Here are some situations where message queues may not be useful:

Message queues can be great in scenarios where ordering of messages does not matter. But in situations where order does matter, they can sometimes be slow and more expensive to use.
Designing systems with queues that have multiple consumers isn’t trivial. What happens if a message is processed twice? Is idempotency a requirement? Or does it break our use case? These complexities can often lead us to situations where message queues may not be the best solution.
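
One common way to cope with a message being delivered twice is to make the consumer idempotent. The sketch below is a minimal in-memory illustration: `IdempotentOrderProcessor` is a hypothetical name, and a real system would keep the set of processed IDs in a durable store shared by all consumers rather than in process memory.

```python
class IdempotentOrderProcessor:
    """Processes each order_id at most once, even if the same message arrives again."""

    def __init__(self):
        self._processed_ids = set()
        self.charges = []

    def process(self, order):
        order_id = order["order_id"]
        if order_id in self._processed_ids:
            return False  # duplicate delivery: skip the side effects
        self.charges.append((order_id, order["total_price"]))
        self._processed_ids.add(order_id)
        return True


processor = IdempotentOrderProcessor()
order = {"order_id": "12345", "total_price": 59.99}
print(processor.process(order))  # True
print(processor.process(order))  # False
print(len(processor.charges))    # 1
```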

Summary

In this article, you learned about reliability in distributed systems, and how message queues can help make such systems more reliable. Here’s a summary of the key takeaways:

Reliability is central to distributed systems and there are a few common ways this is handled across the tech industry. Data replication, load distribution, and capacity planning are some ways that can improve the reliability of a system.
Message Queues are intermediaries that can store messages from producers. They can be picked up by consumers at a rate that's generally independent of the rate of production.
Queues are flexible, allowing us to immediately stem the flow of unwanted event processing in case of an unforeseen event.
Despite the versatility of message queues, they're not a panacea for reliability issues. There are often multiple considerations to be kept in mind while processing messages in a message queue.

How to Use Time To Live in Event-Driven Architecture in AWS

Anant Chowdhary — Wed, 19 Jun 2024 18:08:45 +0000

Distributed systems generally involve the storage and exchange of huge amounts of data. Not all data is created the same, and some of it can even expire – by design.

As the Buddha said, "All conditioned things are impermanent."

In this article, we'll look at how the concept of time to live can help us with this type of data and when it makes sense to use it.

What is Time to Live (TTL) in Distributed Systems?
How to use TTL in Message Queues (AWS SQS)
How to use TTL in Object Storage Systems (AWS S3)
How to use TTL in Databases (AWS DynamoDB)
How to use TTL in Event Based Architecture
Summary

What is Time to Live (TTL) in Distributed Systems?

TTL, as the name suggests, is the amount of time a piece of data stays relevant or stays stored in a distributed system or a component of a distributed system. A TTL may be set on any piece of data that isn't needed indefinitely.

Knowing when and when not to use a TTL can sometimes be tricky. It can also affect a system's design, cost, and scaling considerations. In the following sections, we'll look at when it makes sense to use a TTL and when it doesn't.

Where does TTL make sense?

As mentioned above, it makes sense to use TTL for any piece of data that is ephemeral. Some common examples of use cases where you can set a TTL on data are:

Cached data: Cached data is pretty much omnipresent in distributed systems. For instance, a very popular social media post's resources (image, video, audio) may be cached on a CDN (Content Delivery Network)'s servers. You don't want this data to live forever on the server, so in some cases it may make sense to add a TTL to this data, so that it is automatically removed after a certain period of time.
Analytics Data: Most if not all large scale systems store some form of metrics that help analyze things like latency, system health, and product metrics amongst others. In a large number of cases, you wouldn't want these metrics to be stored in systems forever. Only recent data (say 60 days or 180 days) may be useful in most cases. A TTL on data in this case makes sense, especially if you have constraints on memory.
Indexed data: Search is a feature that's ubiquitous across products. Be it social media apps, e-mail or search engines – indexed data is vital to blazing fast searches. Indexed data, however, can become stale after a while, so it makes sense for the index to expire after some time. Hence, a TTL here can be useful.
Social media apps with short lived content: Social media apps with short lived content are extremely popular and images/videos posted are often short lived. In case these images do not need to be stored for posterity, they can benefit from a TTL being set on them. In addition to being memory efficient, it also aids privacy.
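
To make the idea of expiring cached data concrete, here is a minimal in-memory cache with a TTL. It is a sketch under stated assumptions: `TTLCache` is a hypothetical class, expired entries are only removed lazily when they are read, and the clock is injectable so that the behaviour can be demonstrated without waiting.

```python
import time


class TTLCache:
    """A tiny cache whose entries stop being returned once their TTL has elapsed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily evict the expired entry
            return default
        return value


# A fake clock stands in for real time in this demonstration
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("post:42:image", "cached-bytes")

now[0] = 59.0
print(cache.get("post:42:image"))  # cached-bytes
now[0] = 60.0
print(cache.get("post:42:image"))  # None
```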

Where does TTL not make sense?

In the above section we looked at a few cases where TTL makes sense. What about cases where TTL isn't common and isn't useful? Let's look at some examples:

Media stored for streaming platforms: Streaming platforms often use cloud storage solutions such as Amazon AWS S3 to store objects that correspond to the media they stream to customers. These forms of media are generally not ephemeral and are expected to stay on platforms for years if not decades. Since such data isn't expected to expire anytime, TTL does not make sense here.
Bank transactions: Bank transactions produce some of the most sensitive data that are stored in cloud-based and distributed systems. For audit and book-keeping purposes, these pieces of data are generally stored for decades. So, since this data seemingly never expires, there's generally no use for a TTL here. This isn't to say that this form of data can't be moved from fast access databases/caches to slower and cheaper forms of data storage, though.

How to Use TTL in Message Queues (AWS SQS)

AWS SQS is a distributed message queuing solution that is the backbone of many versatile distributed systems across the world. Message queues can process billions of messages and are used almost universally across distributed systems around the world.

In this section, we'll look at how TTLs can be useful while we consider design options with respect to message queues.

What happens if a message queue's consumers have been backed up for several days, or messages simply haven't been consumed for a while? We have the option of setting a custom Time To Live on SQS messages.

By default, the retention period is 4 days. The maximum TTL at the time of writing is 14 days. So it's important to be aware of constraints such as these while using AWS SQS to design systems.

Note that with AWS SQS, a retention period is set on the queue itself, and not individually for each message.

Boto3 is the AWS SDK for Python. It enables developers to create, configure, and manage AWS services and resources programmatically. Boto3 is widely used for prototyping and in production systems, and offers a user-friendly interface for accessing services like S3, EC2, and DynamoDB.

Here's a code snippet using Boto that will help you set the MessageRetentionPeriod attribute which is the formal name for TTL in this context.

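import boto3

# your_aws_access_key_id, your_aws_secret_access_key, and your_queue_url
# are placeholders for your own values
sqs = boto3.client(
    'sqs',
    aws_access_key_id=your_aws_access_key_id,
    aws_secret_access_key=your_aws_secret_access_key,
    region_name='your_region'
)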

# Set the desired retention period in seconds
retention_period_seconds = 86400  # Example: 1 day

# Set the queue attributes
response = sqs.set_queue_attributes(
    QueueUrl=your_queue_url,
    Attributes={
        'MessageRetentionPeriod': str(retention_period_seconds)
    }
)

Visibility Timeout in Message Queues (AWS SQS)

Note that while it's tempting to think of Visibility Timeout in SQS as Time To Live, these aren't the same thing. Time To Live, or Retention Period, controls how long a message is kept in the queue.

Visibility timeout instead refers to a generally shorter window that starts when a consumer picks up a message. During that window, the message is hidden from other consumers, and the consumer is expected to process and delete it. If it doesn't, the message becomes visible in the SQS queue again, and its receive count goes up by one each time it is received.
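
The behaviour of a visibility timeout can be modelled with a toy in-memory queue. This sketch is not how SQS is implemented: `ToyQueue` is a hypothetical class, and time is passed in explicitly as `now` so that the example is deterministic.

```python
class ToyQueue:
    """In-memory toy model of a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message_id -> state

    def send(self, message_id, body):
        self._messages[message_id] = {"body": body, "invisible_until": 0, "receive_count": 0}

    def receive(self, now):
        """Return the first visible message and hide it for visibility_timeout seconds."""
        for message_id, state in self._messages.items():
            if now >= state["invisible_until"]:
                state["invisible_until"] = now + self.visibility_timeout
                state["receive_count"] += 1
                return message_id, state["body"], state["receive_count"]
        return None

    def delete(self, message_id):
        self._messages.pop(message_id, None)


toy_queue = ToyQueue(visibility_timeout=30)
toy_queue.send("m1", "order-12345")

print(toy_queue.receive(now=0))   # ('m1', 'order-12345', 1)
print(toy_queue.receive(now=10))  # None: still hidden from other consumers
print(toy_queue.receive(now=30))  # ('m1', 'order-12345', 2): never deleted, so visible again
toy_queue.delete("m1")
print(toy_queue.receive(now=100))  # None: deleted messages never come back
```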

How to Use TTL in Object Storage Systems (AWS S3)

The all-versatile AWS S3, which is an object storage solution, gives users the ability to set a Time To Live on objects stored in S3 buckets.

S3 is extremely flexible with the way TTLs are set on objects / buckets. You can set Lifecycle rules to specify what objects or what versions of an object you'd like to remove.

Managing your storage lifecycle is a great read on the AWS Documentation website.
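
As a hedged sketch of what such a Lifecycle rule might look like with Boto3, the code below builds a rule that expires objects under a prefix after a number of days and applies it with `put_bucket_lifecycle_configuration`. The function names, the rule ID, the bucket name, and the prefix are hypothetical, and the S3 client is passed in as a parameter.

```python
def build_expiration_rule(prefix, days, rule_id="expire-old-objects"):
    """Build a lifecycle configuration that expires objects under `prefix` after `days` days."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(s3_client, bucket_name, prefix, days):
    """Apply the rule to a bucket. Note that this call replaces the bucket's existing lifecycle configuration."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_expiration_rule(prefix, days),
    )


print(build_expiration_rule("tmp/", 30)["Rules"][0]["Expiration"])  # {'Days': 30}
```

With a real bucket, this could be called as `apply_expiration_rule(boto3.client('s3'), 'your-bucket-name', 'tmp/', 30)`.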

How to Use TTL in Databases (AWS DynamoDB)

Some types of data in databases are prime candidates to have a TTL set on them. Pieces of data such as logs and analytics data may become stale very fast, and/or they may lose utility with time.

TTL in DynamoDB provides a cost-effective approach that lets you automatically remove items that are no longer relevant. It is supported natively: you enable it on a DynamoDB table by naming a TTL attribute, and each item then carries its own expiration time in that attribute.

Here's a code snippet that lets you set the TTL on a DynamoDB table (again, using Boto):

import boto3

# your_table_name and your_ttl_attribute_name below are placeholders
ddb_client = boto3.client('dynamodb')

# Enable Time To Live (TTL) on an existing DynamoDB table
ttl_response = ddb_client.update_time_to_live(
    TableName=your_table_name,
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': your_ttl_attribute_name
    }
)

# Check for a successful status code in the response
if ttl_response['ResponseMetadata']['HTTPStatusCode'] == 200:
    print("Time To Live (TTL) has been successfully enabled.")
else:
    print("Failed to enable Time To Live (TTL)")

Here, your_ttl_attribute_name is the attribute that DynamoDB looks at to determine whether or not an item is to be deleted. The attribute is generally set to some timestamp in the future, stored as a number in Unix epoch seconds. Once that timestamp has passed, DynamoDB treats the item as expired and removes it from the table in the background, so the deletion does not necessarily happen at that exact moment.
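
Here is a small sketch of how an item with a TTL attribute might be built before it is written to the table. The `build_item_with_ttl` function, the `message_id` key, and the `expires_at` attribute name are hypothetical. The item uses the low-level DynamoDB attribute format, in which numbers are sent as strings under the `N` type.

```python
import time


def build_item_with_ttl(message_id, ttl_seconds, ttl_attribute_name="expires_at", now=None):
    """Build a DynamoDB item whose TTL attribute is an epoch-seconds timestamp in the future."""
    current = int(time.time()) if now is None else int(now)
    return {
        "message_id": {"S": message_id},
        ttl_attribute_name: {"N": str(current + ttl_seconds)},
    }


item = build_item_with_ttl("m-001", ttl_seconds=3600, now=1700000000)
print(item)
# {'message_id': {'S': 'm-001'}, 'expires_at': {'N': '1700003600'}}
```

The item could then be written with `ddb_client.put_item(TableName=your_table_name, Item=item)`, provided `expires_at` is the attribute name that TTL was enabled with.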

How to Use TTL in Event-Based Architecture

So far we've discussed Time To Live and where it can be useful. What about its implications? Lots of cloud-based solutions provide notifications that indicate that a piece of data has reached its expiration, and allow you to take actions based on the expiration of that data.

Let's look at a common use case. Suppose you have a social media app that you're building that lets users send each other ephemeral messages. Now while the contents of these messages themselves are ephemeral, you may still want to retain a log of what users a particular user exchanged messages with, even though the contents of the message (audio/video/image) may have expired.

The diagram below explains a possible architecture in a little more detail:

Social Media App Architecture Example

Suppose a user exchanges messages with another user. An entry corresponding to a message is stored in ActiveMessageDB which, for the purpose of simplicity, we'll suppose is a NoSQL database that stores messages.

If the app here allows for expiring messages, you could set a TTL on the entry. While the message entry itself is deleted after the TTL is reached, an event can be fired off to let a system know that the message is being deleted.

In the above diagram, the event is picked up by an AWS Lambda instance and a much smaller amount of data is written to another database MessageLogDB which isn't as frequently accessed as ActiveMessageDB. What we just saw is an instance of event-based architecture being coupled with TTL.
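
The Lambda side of this architecture might look roughly like the sketch below. It rests on several assumptions that the article does not state: that ActiveMessageDB is a DynamoDB table with a stream that includes old item images, that the Lambda function is triggered by that stream, and that messages carry `sender_id` and `recipient_id` attributes (hypothetical names). In DynamoDB Streams, deletions performed by TTL are marked with a `userIdentity` whose `principalId` is `dynamodb.amazonaws.com`, which is how the sketch tells them apart from ordinary deletes.

```python
def extract_ttl_deletions(event):
    """Pick out TTL-driven deletions from a DynamoDB Streams event and reduce them to log entries."""
    entries = []
    for record in event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue
        identity = record.get("userIdentity") or {}
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue  # an ordinary delete, not a TTL expiry
        old_image = record.get("dynamodb", {}).get("OldImage", {})
        entries.append({
            "sender_id": old_image.get("sender_id", {}).get("S"),
            "recipient_id": old_image.get("recipient_id", {}).get("S"),
        })
    return entries


def lambda_handler(event, context):
    entries = extract_ttl_deletions(event)
    # Writing `entries` to MessageLogDB is left out of this sketch
    return {"logged": len(entries)}
```

Only the small log entry survives; the message contents themselves are gone once the TTL has removed the item.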

Summary

TTL is the amount of time a piece of data stays relevant or stays stored in a distributed system or a component of a distributed system.
TTL makes sense in use cases where data can be deleted, can expire, or its form can change after a certain period of time.
TTL is popular and generally easy to set on many distributed systems offerings.
TTL can be paired with event driven architecture to transform data.

How to Deal with Traffic Surges in Distributed Systems

Anant Chowdhary — Fri, 17 May 2024 06:50:28 +0000

Web and Distributed Systems can often get overwhelmed with traffic.

What leads to systems being overwhelmed, why does it happen, and what are some common strategies we can use to deal with this? We'll answer these questions in this article.

What is traffic in the context of distributed systems?
Why can traffic surges be problematic?
Ways to deal with high traffic loads
Exponential Backoff and Retries
Summary

What is Traffic in the Context of Distributed Systems?

Traffic in distributed systems generally refers to the exchange of data between end users and the entry point to a system that may rely on distributed components.

The patterns of traffic a system sees usually inform multiple design decisions, since they impact the performance, scalability, and reliability of the system.

Why Can Traffic Surges be Problematic?

Traffic surges can often cripple systems that aren’t equipped to deal with them.

You may have come across instances of social media services such as Instagram or TikTok being down. In some cases, this may be due to surges of traffic.

Here are some common problems a surge in traffic may cause:

Congestion: As traffic increases, network congestion may increase. In some cases, this may lead to packet loss and increased latency, and hurt the performance of systems.
Imbalanced load: Not all distributed systems balance load well. A sudden spike in traffic may lead to failures in particular sub-systems. As an example, let’s think of a celebrity’s tweets being stored on a shard. In the scenario where an event leads to millions of people accessing the celebrity’s tweets, the shard that stores the celebrity’s tweets may get overwhelmed.
Cascading failures: Imagine a set of dominoes that are placed right next to each other. One domino falling may lead to the entire set of dominoes falling. Distributed systems are similar. If components aren't loosely coupled, a single point of failure may lead to cascading failures. It is therefore important to consider cascading failures when designing distributed systems with high traffic loads in mind.

Ways to Deal with High Traffic Loads

No system can handle an unbounded amount of traffic without failing.

Fortunately, there are some design decisions you can take to ameliorate the problems discussed above, and make systems more resilient against failure when they see a sudden spike in traffic.

Now, let's cover some of the commonly used solutions that can help deal with surges in traffic.

First, horizontal scaling is the process of adding capacity to a system by running more resources in parallel, for instance by adding more servers or more CDN nodes.

In effect, it is adding more resources instead of increasing the capacity of a single node in the network.

Distributing traffic across servers can lead to improved performance, lower latency, and improved response times in general.

Next, load balancing can sometimes be closely linked with horizontal scaling. However, load balancing by itself also can be very useful in situations where we see a sudden surge in traffic.

Load balancers can smartly route requests to servers so that traffic is well balanced across systems, and doesn't overwhelm one particular system.

In addition, caching can dramatically reduce the need for traffic (requests) to go all the way to a server for fulfilment. Some types of requests, such as those that access static content, are great candidates for caching.

Similar to an example that we discussed in the above sections, let's assume that there's a sudden spike in people viewing a celebrity's tweets. The static content on the page, such as the celebrity's display picture, can be easily cached. This will prevent a request that goes all the way to the profile picture database, and therefore may help prevent read failures, and in turn cascading failures.
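
The effect of caching in this scenario can be shown with a small sketch. Here `functools.lru_cache` stands in for a real caching layer such as a CDN, and `fetch_profile_picture_from_db` is a hypothetical stand-in for the profile picture database.

```python
from functools import lru_cache

backend_calls = 0


def fetch_profile_picture_from_db(user_id):
    """Stand-in for an expensive read from the profile picture database."""
    global backend_calls
    backend_calls += 1
    return f"picture-bytes-for-{user_id}"


@lru_cache(maxsize=1024)
def get_profile_picture(user_id):
    # Only cache misses reach the backend
    return fetch_profile_picture_from_db(user_id)


# A sudden spike: a thousand views of the same celebrity's profile
for _ in range(1000):
    get_profile_picture("celebrity")

print(backend_calls)  # 1
```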

Lastly, consider a scenario where a client internal to a distributed system sends a request to the server and the request fails. Clients often retry requests, but this may lead to cascading retries.

This is a scenario where multiple clients (the original one, and the one downstream) may be retrying their requests, and as a result, a system downstream may be inundated with requests and that by itself may lead to cascaded failures.

A single retry request can lead to an exponential number of retries in other parts of a distributed system

In the figure above, we can see that two requests from the server (the original request plus one retry after a failure at the Queue) lead to:

1) Four requests from the serverless component to the Notification Topic (two requests to the serverless component and two retries).

2) Eight requests from the Notification Topic to the Queue (four requests to the queue and four retries).

Even a single failure at the end component (the Queue in this case) leads to an exponentially growing number of retry requests to it.
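
The arithmetic behind this amplification is simple enough to check in code. The sketch below assumes, as in the example above, that every component makes one retry (two attempts in total) for each request it handles; `requests_reaching_layer` is a hypothetical helper.

```python
def requests_reaching_layer(layer, attempts_per_call=2):
    """Number of requests sent by the component at the given layer when every caller retries."""
    return attempts_per_call ** layer


hops = [
    "server to serverless component",
    "serverless component to Notification Topic",
    "Notification Topic to Queue",
]
for layer, hop in enumerate(hops, start=1):
    print(f"{hop}: {requests_reaching_layer(layer)} requests")
# server to serverless component: 2 requests
# serverless component to Notification Topic: 4 requests
# Notification Topic to Queue: 8 requests
```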

A common antidote to this problem is to use exponential backoff while retrying requests.

Exponential Backoff and Retries

Exponential backoff, as the name suggests, refers to introducing a delay before the next attempt instead of immediately retrying. We increase the delay time exponentially with each attempt.

For instance, the first retry might wait for one second, the second retry waits for two seconds, the third for four seconds, and so on.

Note that since a retry attempt isn't made immediately, the probability of cascading retries goes down compared to if retries were made immediately.
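
The delay before each retry follows the pattern initial delay multiplied by the backoff factor raised to the attempt number. The sketch below computes that schedule. The optional `max_delay` cap is an addition for illustration and is not part of the description above.

```python
def backoff_delays(max_retries=5, initial_delay=1, backoff_factor=2, max_delay=None):
    """Return the wait time before each retry, optionally capped at max_delay."""
    delays = []
    for attempt in range(max_retries):
        delay = initial_delay * (backoff_factor ** attempt)
        if max_delay is not None:
            delay = min(delay, max_delay)
        delays.append(delay)
    return delays


print(backoff_delays())             # [1, 2, 4, 8, 16]
print(backoff_delays(max_delay=5))  # [1, 2, 4, 5, 5]
```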

Here's some code that illustrates exponential backoff in action:


import time

import requests


def exponential_backoff_retry(url, max_retries=5, initial_delay=1, backoff_factor=2):
    retry_count = 0
    while retry_count < max_retries:
        try:
            response = requests.get(url)
            # Check if the response was successful (status code 2xx)
            if response.status_code // 100 == 2:
                return response
            # If not successful, raise an exception to trigger a retry.
            # raise_for_status() only raises for 4xx and 5xx responses, so
            # raise explicitly for any other non-2xx status code.
            response.raise_for_status()
            raise requests.exceptions.HTTPError(
                f"Unexpected status code: {response.status_code}"
            )
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            # Calculate the exponential backoff delay
            delay = initial_delay * (backoff_factor ** retry_count)
            print(f"Retrying in {delay} seconds...")
            time.sleep(delay)
            retry_count += 1
    # If max retries reached, raise an exception
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} retries")

The code above retries HTTP GET requests using an exponential backoff strategy.

Inside the while loop, we attempt the request and check whether the response is successful (successful HTTP requests have a status code of 2xx).

Note that if the request fails, we raise an exception and retry the request after calculating an exponential delay.

This process is continued until either a request succeeds or the maximum number of retries is reached. If the maximum retries are exhausted without success, we raise a RuntimeError.
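
The same pattern is not tied to HTTP. Below is a minimal sketch of a generic variant that retries any callable and accepts the sleep function as a parameter, so the backoff behavior can be observed without a network or real waiting. The names retry_with_backoff and flaky_operation are hypothetical and not part of any library.

```python
import time


def retry_with_backoff(operation, max_retries=5, initial_delay=1, backoff_factor=2, sleep=time.sleep):
    """Call operation() until it succeeds, backing off exponentially between failed attempts."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as error:
            # Give up once the last allowed attempt has failed
            if attempt == max_retries - 1:
                raise RuntimeError(f"Gave up after {max_retries} attempts") from error
            sleep(initial_delay * backoff_factor ** attempt)


# Hypothetical flaky operation: fails twice, then succeeds
calls = {"count": 0}


def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping
recorded_delays = []
result = retry_with_backoff(flaky_operation, sleep=recorded_delays.append)
print(result, recorded_delays)  # ok [1, 2]
```

Unlike the listing above, this variant does not sleep after the final failed attempt, since no further request will follow it.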

Summary

To summarize, we delved into the significance of traffic within distributed systems, emphasizing its influence on system performance and resilience.

We looked at the complications arising from traffic surges, including network congestion, load imbalances, and cascading failures, which can render systems vulnerable to collapse.

To address these challenges, we looked at some strategic measures such as horizontal scaling, load balancing, caching, and retry strategies. Particularly, we looked at the effectiveness of exponential backoff in mitigating cascading retries, thereby enhancing system robustness.

By keeping in mind some of these solutions, systems can better manage sudden spikes in traffic, ensuring sustained functionality and minimizing potential downtime, ultimately bolstering overall system reliability.

These are just a few of the numerous methods that are used industry-wide to deal with surges in traffic.

Asynchronous vs Batch Data Processing in Distributed Systems – Explained with Examples

Anant Chowdhary — Wed, 20 Mar 2024 15:13:11 +0000

Distributed systems often process and store huge amounts of data. Processing this data efficiently is typically an ongoing endeavor, and how the processing is designed almost always affects the end-user experience of a product.

Two popular modes of processing data are Batch Processing and Asynchronous Processing. We'll learn more about both in this article, along with when to use each approach.

Batch Processing of Data
What is Batch Processing?
When Do We Use Batch Processing?
Real World Example of Batch Processing of Data
What Does Batch Processing Look Like in Code?
Asynchronous Processing of Data
What is Asynchronous Processing?
When Do We Use Asynchronous Processing?
Real World Example of Asynchronous Processing of Data
What Does Async Processing Look Like in Code?
Summary

Batch Processing of Data

What is Batch Processing?

Batch Processing, as you may have guessed, waits for a certain amount of data to be accumulated, and then processes this batch of data in one go. In other words, this means that in most scenarios we would wait for some number of events to complete and then process the data.

This is different from asynchronous processing of data, where we process an event and its associated data as soon as it occurs. More on that soon.
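
To make the idea of accumulating data concrete, here is a minimal sketch of a size-based batcher. The Batcher class and its method names are assumptions made for illustration. Real systems often also flush on a timer or a schedule.

```python
class Batcher:
    """Collects events and processes them together once batch_size is reached."""

    def __init__(self, batch_size, process_batch):
        self.batch_size = batch_size
        self.process_batch = process_batch
        self.pending = []

    def add(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Process whatever has accumulated so far, even a partial batch."""
        if self.pending:
            self.process_batch(self.pending)
            self.pending = []


processed_batches = []
batcher = Batcher(batch_size=3, process_batch=processed_batches.append)
for event in range(7):
    batcher.add(event)

print(processed_batches)  # [[0, 1, 2], [3, 4, 5]]
batcher.flush()           # process the leftover partial batch
print(processed_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```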

Now that you know a bit more about batch processing, it'll be useful to see a couple of real world examples.

When Do We Use Batch Processing?

Batch processing is used in lots of scenarios, such as:

Large volume of data: When we have a very large amount of data, it is often more resource-efficient to let the data collect over a period of time and then process it.
Data that isn't time sensitive: Since batch processing waits for data to collect, it is generally not suitable for processing data that's very time sensitive. On the other hand, it is possible to process batches of data within short intervals of time.
Scheduled Processing of data: In lots of instances, we need a large amount of data to be processed at regular intervals. Automated system backup and updates, for example, are generally scheduled for particular intervals. Batch processing can be very useful in such scenarios.

Real World Example of Batch Processing of Data

A popular real world use case for batch processing is credit card transactions.

Many financial institutions choose to settle credit card transactions in batches instead of settling them in real time. Since the settlement of transactions is generally not very time sensitive, this gives systems the time to run various other analyses / jobs on the transactions such as fraud detection, currency conversions etc.

Credit Card Transactions and Batch Processing

The diagram above shows a very high level example of a lifecycle of a credit card transaction. The steps are as follows:

The credit card transaction takes place at the Point of Sale (POS).
A gateway forwards the request to a serverless component that writes the transaction to a staging database where the transaction is stored temporarily.
At the end of the business day, the transactions in the staging database are reconciled and go through fraud detection. This is the component where batch processing takes place (note that we waited for some data to collect, and processed a large amount of data).
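
The lifecycle above can be sketched with in-memory stand-ins. Everything here is hypothetical: staging_db is a plain list standing in for the staging database, and looks_fraudulent is a placeholder rule, not how real fraud detection works.

```python
staging_db = []  # hypothetical stand-in for the staging database


def record_transaction(card_id: str, amount: float) -> None:
    """Store the transaction temporarily instead of settling it right away."""
    staging_db.append({"card_id": card_id, "amount": amount})


def looks_fraudulent(transaction: dict) -> bool:
    """Placeholder rule for illustration only."""
    return transaction["amount"] > 10_000


def settle_end_of_day() -> dict:
    """Process everything collected during the day in one batch."""
    settled = [t for t in staging_db if not looks_fraudulent(t)]
    flagged = [t for t in staging_db if looks_fraudulent(t)]
    staging_db.clear()
    return {"settled": settled, "flagged": flagged}


# Transactions arrive throughout the day...
record_transaction("card-1", 42.50)
record_transaction("card-2", 25_000.00)
record_transaction("card-1", 9.99)

# ...and are reconciled together at the end of the business day
result = settle_end_of_day()
print(len(result["settled"]), len(result["flagged"]))  # 2 1
```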

What Does Batch Processing Look Like in Code?

We saw an example of a distributed system that uses batch processing above. What would batch processing look like in code?

Below you'll see some code that lets you process a batch of SQS messages:

import boto3


def process_batch_messages(sqs, queue_url):
    partial_response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # This sets the maximum batch size to 10
        WaitTimeSeconds=10,  # We wait for a maximum of 10 seconds
    )
    if 'Messages' in partial_response:
        messages = partial_response['Messages']
        for message in messages:
            # do something with each message

            # remove the message from the queue after processing
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])


if __name__ == '__main__':
    # Initialize sqs client
    sqs = boto3.client(
        'sqs',
        aws_access_key_id='Your access key id',
        aws_secret_access_key='your secret access key',
        region_name='Your AWS region'
    )
    your_queue_url = 'your-queue-url'
    process_batch_messages(sqs, your_queue_url)

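The above code waits for the earlier of two events: either 10 seconds having passed, or messages becoming available in the queue. A single call returns at most 10 messages, so a batch may contain fewer than 10 if fewer are available when the call returns.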
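
The snippet above handles a single batch. A consumer would typically repeat the call until nothing is left, which the sketch below illustrates. It uses only the receive_message and delete_message calls already shown. InMemoryQueueClient is a hypothetical stand-in so the loop can run without AWS credentials, and the function name drain_queue_in_batches is likewise made up.

```python
def drain_queue_in_batches(sqs, queue_url, handle_message, max_batch_size=10, wait_seconds=10):
    """Receive batches until a call comes back empty; return the size of each batch."""
    batch_sizes = []
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=max_batch_size,
            WaitTimeSeconds=wait_seconds,
        )
        messages = response.get("Messages", [])
        if not messages:
            return batch_sizes
        for message in messages:
            handle_message(message)
            # Delete only after the message has been handled
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        batch_sizes.append(len(messages))


class InMemoryQueueClient:
    """Hypothetical stand-in that mimics the two SQS client calls used above."""

    def __init__(self, bodies):
        self.pending = [{"Body": body, "ReceiptHandle": f"handle-{i}"} for i, body in enumerate(bodies)]
        self.deleted = []

    def receive_message(self, QueueUrl, MaxNumberOfMessages, WaitTimeSeconds):
        batch = self.pending[:MaxNumberOfMessages]
        self.pending = self.pending[MaxNumberOfMessages:]
        return {"Messages": batch} if batch else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)


client = InMemoryQueueClient([f"message {i}" for i in range(25)])
handled = []
sizes = drain_queue_in_batches(client, "hypothetical-queue-url", handled.append)
print(sizes)  # [10, 10, 5]
```

This sketch treats one empty response as "nothing left for now". A long-running consumer would usually keep polling instead of returning.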

Asynchronous Processing of Data

What is Asynchronous Processing?

The word asynchronous is generally defined as "events that are not coordinated in time". As the definition suggests, asynchronous processing of data does not rely on coordination of data events, and these events are processed as and when they occur.

This means that as soon as an event occurs, the event is processed and the data corresponding to the event may be stored in a sub system, passed on to another component in the system, or may simply lead to another event being fired off.

When Do We Use Asynchronous Processing?

You'll use asynchronous processing of data (sometimes also referred to as async) in various scenarios.

Microservices: Microservices often handle requests that need an immediate response. Because each request is processed "per event" rather than waiting for a batch, results are usually returned to clients within a very short period of time (low latency).
User Interfaces: User-facing UI components often need async processing of data. For instance, multiple data fetches can be performed in the background using async calls while a user is using an application. This keeps the application smooth and responsive, without the UI components "freezing".
Systems that require real time responses: Many interactive systems require real time processing of data. In the past few years, video calls and meetings have become increasingly popular. Since systems like these require immediate requests and responses (and in some cases streams of data being processed), async processing of data is used here.
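
The background fetches mentioned in the user interface scenario can be sketched with Python's asyncio. The fetch coroutine and the resource names are hypothetical, and asyncio.sleep stands in for network latency.

```python
import asyncio


async def fetch(resource: str, delay: float) -> str:
    """Hypothetical data fetch; asyncio.sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{resource} loaded"


async def load_screen() -> list:
    # Start all three fetches at once instead of waiting for each in turn
    return await asyncio.gather(
        fetch("profile", 0.03),
        fetch("messages", 0.02),
        fetch("notifications", 0.01),
    )


results = asyncio.run(load_screen())
print(results)  # ['profile loaded', 'messages loaded', 'notifications loaded']
```

asyncio.gather returns results in the order the coroutines were passed in, even though the fetches finish in a different order. Because they overlap, the total wait is roughly that of the slowest fetch rather than the sum of all three.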

Real World Example of Asynchronous Processing of Data

Chat apps are a great example of asynchronous processing of data. Here, if User 1 types a message and sends it to User 2, the message must be written to the required databases / systems, delivered to User 2, and possibly read by User 2 without any delay.

Since this is real time processing of the event that occurred here (the event being that a message was sent), this is an example of asynchronous processing of data.

Exchange of messages in a chat app

In the above diagram we see that User 1 sends a message through their phone. The message gets routed to a message server which ultimately creates an entry in a messages database (Messages DB).

Now that the Messages DB has an entry, an event is fired off that is consumed by the Notification Pusher. The Notification Pusher then puts a notification related to the message in User 2's notification queue.

Whenever User 2's device comes online or has access to the internet, they receive a message notification.

Note that we did not wait for any data to collect, nor did we process this data after any specific time delay. We processed the event as soon as it happened. So this is an example of asynchronous processing of data.
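
The flow in the diagram can be sketched with in-memory stand-ins. All names below are hypothetical: a list plays the role of the Messages DB, a dictionary of deques plays the role of the per-user notification queues, and a direct function call stands in for the event being fired.

```python
from collections import defaultdict, deque

messages_db = []                          # stand-in for the Messages DB
notification_queues = defaultdict(deque)  # one notification queue per user


def notification_pusher(message: dict) -> None:
    """Consumes the 'message stored' event and queues a notification for the recipient."""
    notification_queues[message["to"]].append(f"New message from {message['sender']}")


def message_server(sender: str, to: str, text: str) -> None:
    """Stores the message, then fires the event immediately; nothing waits for a batch."""
    message = {"sender": sender, "to": to, "text": text}
    messages_db.append(message)
    notification_pusher(message)  # event fired as soon as the entry exists


def device_comes_online(user: str) -> list:
    """Delivers and clears any notifications waiting for the user."""
    pending = list(notification_queues[user])
    notification_queues[user].clear()
    return pending


message_server("user1", "user2", "Hello!")
print(device_comes_online("user2"))  # ['New message from user1']
```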

What Does Async Processing Look Like in Code?

Can we modify the code that we saw in the section for batch processing to work for async processing? Remember that the batch version asks for up to 10 messages per call and waits for a maximum of 10 seconds.

If we change the batch size to 1, we would effectively process a message as soon as it is received.

import boto3


def process_async_messages(sqs, queue_url, batch_size):
    partial_response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=batch_size,  # This sets the maximum batch size
        WaitTimeSeconds=10,  # We wait for a maximum of 10 seconds
    )
    if 'Messages' in partial_response:
        messages = partial_response['Messages']
        for message in messages:
            # do something with each message

            # remove the message from the queue after processing
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])


if __name__ == '__main__':
    # Initialize sqs client
    sqs = boto3.client(
        'sqs',
        aws_access_key_id='Your access key id',
        aws_secret_access_key='your secret access key',
        region_name='Your AWS region'
    )
    your_queue_url = 'your-queue-url'
    # A batch size of 1 means each message is handled as soon as it is received
    process_async_messages(sqs, your_queue_url, 1)

Note that in the above code we modified process_batch_messages to accept a batch_size parameter and renamed the function to process_async_messages. With a batch size of 1, this function processes a message as soon as the queue receives one (assuming the queue receives a message within the wait time of 10 seconds).

Summary

Let's summarize batch and asynchronous data processing.

Batch Processing is a paradigm where you wait for an amount of data to collect or some time to pass before the data is processed.

Batch processing is often used in scenarios where you have large volumes of data, data that isn't time sensitive, and data that can be processed on a set schedule. The example we discussed above was that of a credit card transaction.

Asynchronous processing of data, on the other hand, is used to process data related to events as soon as they occur.

This approach is often used when dealing with data processed in microservices, user interfaces, and in general with systems needing real time request-response processing. We looked at an example of a chat app in the above discussion and learnt how asynchronous processing of data is applicable to the scenario.
