Atharva Shah - freeCodeCamp.org

The Python Decorator Handbook

Atharva Shah — Fri, 26 Jan 2024 17:17:03 +0000

Python decorators provide an easy yet powerful syntax for modifying and extending the behavior of functions in your code.

A decorator is essentially a function that takes another function, augments its functionality, and returns a new function – without permanently modifying the original function itself.

This tutorial will walk you through 11 handy decorators to help add functionality like timing execution, caching, rate limiting, debugging and more. Whether you want to profile performance, improve efficiency, validate data, or manage errors, these decorators have got you covered!

The examples here focus on the common usage patterns and utilities of decorators that can come in handy in your day-to-day programming and save you a lot of effort. Understanding the flexibility of decorators will help you write clean, resilient, and optimized application code.

Here are the decorators covered in this tutorial:

Log Arguments and Return Value of a Function
Get the Execution Time of a Function
Convert Function Return Value to a Specified Data Type
Cache Function Results
Validate Function Arguments Based on Condition
Retry a Function Multiple Times on Failure
Enforce Rate Limits on a Function
Handle Exceptions and Provide Default Response
Enforce Type Checking on Function Arguments
Measure Memory Usage of a Function
Cache Function Results with Expiration Time
Conclusion

But first, a little introduction.

How Python Decorators Work

Before diving in, let's understand some key benefits of decorators in Python:

Enhancing functions without invasive changes: Decorators augment functions transparently without altering the original code, keeping the core logic clean and maintainable.
Reusing functionality across places: Common capabilities like logging, caching, and rate limiting can be built once in decorators and applied wherever needed.
Readable and declarative syntax: The @decorator syntax makes the enhancement visible right at the function's definition site.
Modularity and separation of concerns: Decorators promote loose coupling between functional logic and secondary capabilities like performance, security, logging etc.

The takeaway is that decorators unlock simple yet flexible ways of transparently enhancing Python functions for improved code organization, efficiency, and reuse without introducing complexity or redundancy.

Here is a basic example of decorator syntax in Python with annotations:

# Decorator function
def my_decorator(func):

    # Wrapper function
    def wrapper():
        print("Before the function call") # Extra processing before the function
        func() # Call the actual function being decorated
        print("After the function call") # Extra processing after the function
    return wrapper # Return the nested wrapper function

# Apply the decorator to the function
@my_decorator
def my_function():
    print("Inside my function")

# Call the decorated function
my_function()

A decorator in Python is a function that takes another function as an argument and extends its behavior without modifying it. The decorator function wraps the original function by defining a wrapper function inside of it. This wrapper function executes code before and after calling the original function.

Specifically, when defining a decorator function such as my_decorator in the example, it takes a function as an argument, which we generally call func. This func will be the actual function that is decorated under the hood.

The wrapper function inside my_decorator can execute arbitrary code before and after calling func(), which invokes the original function. When applying @my_decorator before the definition of my_function, it passes my_function as an argument to my_decorator, so func refers to my_function in that context.

my_decorator then returns the wrapper function, and the name my_function is rebound to that wrapper. So now my_function has been decorated by my_decorator. When it is later called, the wrapper code defined inside my_decorator executes before and after the original function body runs. This allows decorators to transparently extend the behavior of a function, without needing to modify the function itself.

And as you'll recall, the original my_function remains unchanged, keeping decorators non-invasive and flexible.

When my_function() is decorated with @my_decorator, it is automatically enhanced. The my_decorator function here returns a wrapper function, and this wrapper is what executes whenever my_function() is called.

First, the wrapper prints "Before the function call" before actually calling the original my_function(). Then, after my_function() executes, it prints "After the function call".

So, additional behavior and printed messages are added before and after the my_function() execution in the wrapper, without directly modifying my_function() itself. The decorator allows you to extend my_function() in a transparent way without affecting its core logic, as the wrapper handles the enhanced behavior.
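To make the mechanics described above concrete, here is a minimal sketch showing that the @ syntax is shorthand for passing the function to the decorator and rebinding the result. The use of functools.wraps is an addition that the example above does not include: it copies the original function's name and docstring onto the wrapper. The function names greet and shout are hypothetical.

```python
import functools

def my_decorator(func):
    @functools.wraps(func)  # addition: copy func's __name__ and __doc__ onto the wrapper
    def wrapper():
        print("Before the function call")
        func()
        print("After the function call")
    return wrapper

def greet():
    """Print a greeting."""
    print("Hello")

# Manual application: this is what the @ syntax does behind the scenes
decorated_greet = my_decorator(greet)

@my_decorator
def shout():
    print("HELLO")

decorated_greet()
shout()

# Thanks to functools.wraps, the wrapper reports the original name
print(shout.__name__)
```

Without functools.wraps, shout.__name__ would be "wrapper", which makes logs and tracebacks harder to read once many functions share the same decorator.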

Applying a Decorator to a Function

So let's start exploring the top 11 practical decorators that every Python developer should know.

Log Arguments and Return Value of a Function

The Log Arguments and Return Value decorator tracks the input parameters and output of functions. This supports debugging by logging a clear record of data flow through complex operations.

def log_decorator(original_function):
    def wrapper(*args, **kwargs):
        print(f"Calling {original_function.__name__} with args: {args}, kwargs: {kwargs}")

        # Call the original function
        result = original_function(*args, **kwargs)

        # Log the return value
        print(f"{original_function.__name__} returned: {result}")

        # Return the result
        return result
    return wrapper

# Example usage
@log_decorator
def calculate_product(x, y):
    return x * y

# Call the decorated function
result = calculate_product(10, 20)
print("Result:", result)

Output:

Calling calculate_product with args: (10, 20), kwargs: {}
calculate_product returned: 200
Result: 200

In this example, the decorator function is named log_decorator() and accepts a function, original_function, as its argument. Within log_decorator(), a nested function called wrapper() is defined. This wrapper() function is what the decorator returns and effectively replaces the original function.

When the wrapper() function is invoked, it prints logging statements pertaining to the function call. Then it calls the original function, original_function, captures its result, prints the outcome, and returns the result.

The @log_decorator syntax above the calculate_product() function is a Python convention to apply the log_decorator as a decorator to the calculate_product function. So when calculate_product() is invoked, it's actually invoking the wrapper() function returned by log_decorator(). Therefore, log_decorator() acts as a wrapper, introducing logging statements before and after the execution of the original calculate_product() function.

Usage and Applications

This decorator is widely adopted in application development for adding runtime logging without interfering with business logic implementation.

For example, consider a banking application that processes financial transactions. The core transaction processing logic resides in functions like transfer_funds() and accept_payment(). To monitor these transactions, logging can be added by including @log_decorator above each function.

Then, when a transaction is triggered by calling transfer_funds(), the decorator prints the function name and arguments such as the sender, receiver, and amount before the actual transfer. After the function returns, it prints the return value, which can indicate whether the transfer succeeded or failed.

This type of logging with decorators allows you to track transactions without adding any code to core functions like transfer_funds(). The logic stays clean while debuggability and observability improve. Log messages can also be directed to a monitoring dashboard or log analytics system.
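As a rough sketch of the banking scenario, the variant below swaps print() for the standard logging module so that messages can be routed to files or external systems through logging handlers. The decorator name log_calls, the logger name "transactions", and the placeholder body of transfer_funds() are assumptions made for illustration.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("transactions")  # hypothetical logger name

def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Log the call before running the function
        logger.info("Calling %s with args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        # Log the outcome after the function returns
        logger.info("%s returned %r", func.__name__, result)
        return result
    return wrapper

@log_calls
def transfer_funds(sender, receiver, amount):
    # Placeholder logic: a real implementation would move money between accounts
    return amount > 0

transfer_funds("alice", "bob", amount=250)
```

Using the logging module instead of print() means the log level, format, and destination can be changed in one place without touching the decorator or the decorated functions.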

Get the Execution Time of a Function

This decorator is your ally in the quest for performance optimization. By measuring and logging the execution time of a function, this decorator facilitates a deep dive into the efficiency of your code, helping you pinpoint bottlenecks and streamline your application's performance.

It's ideal for scenarios where speed is crucial, such as real-time applications or large-scale data processing. And it allows you to identify and address performance bottlenecks systematically.

import time

def measure_execution_time(func):
    def timed_execution(*args, **kwargs):
        start_timestamp = time.time()
        result = func(*args, **kwargs)
        end_timestamp = time.time()
        execution_duration = end_timestamp - start_timestamp
        print(f"Function {func.__name__} took {execution_duration:.2f} seconds to execute")
        return result
    return timed_execution

# Example usage
@measure_execution_time
def multiply_numbers(numbers):
    product = 1
    for num in numbers:
        product *= num
    return product

# Call the decorated function
result = multiply_numbers([i for i in range(1, 10)])
print(f"Result: {result}")

Output:

Function multiply_numbers took 0.00 seconds to execute
Result: 362880

This code showcases a decorator that's designed to measure the execution duration of functions.

The measure_execution_time() decorator takes a function, func, and defines an inner function, timed_execution(), to wrap the original function. Upon invocation, timed_execution() records the start time, calls the original function, records the end time, calculates the duration, and prints it.

The @measure_execution_time syntax applies this decorator to functions below it, such as multiply_numbers(). Consequently, when multiply_numbers() is called, it invokes the timed_execution() wrapper, which logs the duration alongside the function result.

This example illustrates how decorators seamlessly augment existing functions with additional functionality, like timing, without direct modification.

Usage and Applications

This decorator is helpful in profiling functions to identify performance bottlenecks in applications. For example, consider an e-commerce site with several backend functions like get_recommendations(), calculate_shipping(), and so on. By decorating them with @measure_execution_time, you can monitor their runtime.

When get_recommendations() is invoked in a user session, the decorator will time its execution duration by recording a start and end timestamp. After execution, it will print the time taken before returning recommendations.

Doing this systematically across applications and analyzing outputs will show you the functions that are taking an unusually long time. The development team can then optimize such functions through caching, parallel processing, and other techniques to improve overall application performance.

Without such timing decorators, finding optimization candidates would require tedious logging code additions. Decorators provide visibility easily without contaminating business logic.
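One possible refinement, sketched here as an assumption rather than as part of the original example, is to use time.perf_counter(), a clock intended for measuring short durations, and to report the time even when the function raises. The body of calculate_shipping() is a made-up placeholder.

```python
import functools
import time

def measure_execution_time(func):
    @functools.wraps(func)
    def timed_execution(*args, **kwargs):
        start = time.perf_counter()  # high-resolution clock for interval timing
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the function returned normally or raised
            elapsed = time.perf_counter() - start
            print(f"Function {func.__name__} took {elapsed:.6f} seconds to execute")
    return timed_execution

@measure_execution_time
def calculate_shipping(weights):
    # Hypothetical pricing rule: 1.5 currency units per unit of weight
    return sum(weight * 1.5 for weight in weights)

total = calculate_shipping([2, 4, 6])
print(total)
```

Printing six decimal places instead of two helps with fast functions, which would otherwise all show up as 0.00 seconds.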

Convert Function Return Value to a Specified Data Type

The Convert Return Value Type decorator enhances data consistency in functions by automatically converting the return value to a specified data type, promoting predictability and preventing unexpected errors. It is particularly useful for downstream processes that require consistent data types, reducing runtime errors.

def convert_to_data_type(target_type):
    def type_converter_decorator(func):
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            return target_type(result)
        return wrapper
    return type_converter_decorator

@convert_to_data_type(int)
def add_values(a, b):
    return a + b

int_result = add_values(10, 20)
print("Result:", int_result, type(int_result))

@convert_to_data_type(str)
def concatenate_strings(str1, str2):
    return str1 + str2

str_result = concatenate_strings("Python", " Decorator")
print("Result:", str_result, type(str_result))

Output:

Result: 30 <class 'int'>
Result: Python Decorator <class 'str'>

The above code example shows a decorator that's designed to convert the return value of a function to a specified data type.

The decorator, named convert_to_data_type(), takes the target data type as a parameter and returns a decorator named type_converter_decorator(). Within this decorator, a wrapper() function is defined to call the original function, convert its return value to the target type using target_type(), and subsequently return the converted result.

The syntax @convert_to_data_type(int) that's applied above a function (such as add_values()) utilizes this decorator to convert the return value to an integer. Similarly, for concatenate_strings(), passing str formats the return value as a string.

This example also showcases how decorators seamlessly modify function outputs to desired formats without altering the core logic of the functions.

Usage and Application

This return value transformation decorator proves useful in applications where you need to automatically adapt functions to expected data formats.

For instance, you could use it in a weather API that returns temperatures by default in decimal format like 23.456 degrees. But the consumer front-end application expects an integer value to display.

Instead of changing the API function to return an integer, just decorate it with @convert_to_data_type(int). This will seamlessly convert the decimal temperature to the integer 23, in this example, before returning to the client app. Without any API function modification, you've reformatted the return value.

Similarly, for backend processing that expects JSON, return values can be serialized by passing a callable such as json.dumps, for example @convert_to_data_type(json.dumps); the json module itself is not callable and cannot be passed directly. The core logic stays unchanged while the presentation format adapts to your use case's needs. This avoids duplicating format-handling code across functions.

Decorators externally impose required data representations for seamless integration and reusability across application layers with mismatched formats.
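Because the wrapper simply calls target_type(result), any one-argument callable can act as the converter, not only a type. The sketch below illustrates this for the weather scenario. The functions get_temperature() and get_forecast() and their return values are invented for the example.

```python
import json

def convert_to_data_type(target_type):
    def type_converter_decorator(func):
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            return target_type(result)
        return wrapper
    return type_converter_decorator

@convert_to_data_type(round)  # round() to the nearest integer instead of truncating
def get_temperature():
    return 23.456

@convert_to_data_type(json.dumps)  # serialize the returned dict to a JSON string
def get_forecast():
    return {"city": "Oslo", "temperature": 23.456}

print(get_temperature())
print(get_forecast())
```

Note that int() truncates toward zero while round() rounds to the nearest integer, so the choice of converter changes the result for a value such as 23.9.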

Cache Function Results

This decorator optimizes performance by storing and retrieving function results, eliminating redundant computations for repeated inputs, and improving application responsiveness, especially for time-consuming computations.

def cached_result_decorator(func):
    result_cache = {}

    def wrapper(*args, **kwargs):
        cache_key = (*args, *kwargs.items())

        if cache_key in result_cache:
            return f"[FROM CACHE] {result_cache[cache_key]}"

        result = func(*args, **kwargs)
        result_cache[cache_key] = result

        return result

    return wrapper

# Example usage

@cached_result_decorator
def multiply_numbers(a, b):
    return f"Product = {a * b}"

# Call the decorated function multiple times
print(multiply_numbers(4, 5))  # Calculation is performed
print(multiply_numbers(4, 5))  # Result is retrieved from cache
print(multiply_numbers(5, 7))  # Calculation is performed
print(multiply_numbers(5, 7))  # Result is retrieved from cache
print(multiply_numbers(-3, 7))  # Calculation is performed
print(multiply_numbers(-3, 7))  # Result is retrieved from cache

Output:

Product = 20
[FROM CACHE] Product = 20
Product = 35
[FROM CACHE] Product = 35
Product = -21
[FROM CACHE] Product = -21

This code sample showcases a decorator that's designed to cache and reuse function call results efficiently.

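The cached_result_decorator() function takes another function and returns a wrapper. A cache dictionary (result_cache), created once in the decorator's scope and shared by every call to the wrapper, maps each unique combination of call arguments to its result.

Before executing the actual function, the wrapper() checks if the result for the current parameters is already in the cache. If so, it returns the cached result, prefixed here with "[FROM CACHE]" purely to make cache hits visible in the output. Otherwise, it calls the function, stores the result in the cache, and returns it. A cache used in real code would return the stored value unchanged, and its arguments must be hashable so they can serve as dictionary keys.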

The @cached_result_decorator syntax applies this caching logic to any function, such as multiply_numbers(). This ensures that, upon subsequent calls with the same arguments, the cached result is reused, preventing redundant calculations.

In essence, the decorator enhances functionality by optimizing performance through result caching.

Usage and Applications

Caching decorators like this are extremely useful in application development for optimizing performance of repetitive function calls.

For example, consider a recommendation engine calling predictive model functions to generate user suggestions. get_user_recommendations() prepares the input data and feeds it into the model for every user request. Instead of re-running computations, it can be decorated with @cached_result_decorator to introduce a caching layer.

Now the first time a unique set of user parameters is passed, the model runs and the result is cached. Subsequent calls with the same inputs directly return the cached model outputs, skipping the model recalculation.

This can substantially reduce latency when responding to user requests by avoiding duplicate model inferences. You can also monitor cache hit rates to judge whether model server infrastructure can be scaled down.

Decoupling such optimization concerns through caching decorators, rather than mixing them into function logic, improves modularity and readability and allows rapid performance gains. Caches can be configured and invalidated separately without intruding on business functions.
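For everyday use, Python's standard library already ships a caching decorator, functools.lru_cache, which also exposes hit and miss statistics and a way to clear the cache. The short sketch below applies it to a multiply_numbers() function similar to the earlier example. The print() call inside the function is only there to show when the body actually runs.

```python
import functools

@functools.lru_cache(maxsize=128)  # keep up to 128 most recently used results
def multiply_numbers(a, b):
    print(f"Computing {a} * {b}")
    return a * b

multiply_numbers(4, 5)  # computed, the body runs
multiply_numbers(4, 5)  # served from the cache, the body does not run
multiply_numbers(5, 7)  # computed, the body runs

# cache_info() reports hits, misses, maxsize, and current size
info = multiply_numbers.cache_info()
print(info.hits, info.misses)

# cache_clear() empties the cache, for example after the underlying data changes
multiply_numbers.cache_clear()
```

Like the hand-written decorator, lru_cache requires hashable arguments, so calls that pass lists or dictionaries need to be adapted first, for instance by converting them to tuples.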

Validate Function Arguments Based on Condition

This one checks if input arguments meet predefined criteria before execution, enhancing function reliability and preventing unexpected behavior. It is useful for parameters requiring positive integers or non-empty strings.

def check_condition_positive(value):
    def argument_validator(func):
        def validate_and_calculate(*args, **kwargs):
            if value(*args, **kwargs):
                return func(*args, **kwargs)
            else:
                raise ValueError("Invalid arguments passed to the function")
        return validate_and_calculate
    return argument_validator

@check_condition_positive(lambda x: x > 0)
def compute_cubed_result(number):
    return number ** 3

print(compute_cubed_result(5))  # Output: 125
print(compute_cubed_result(-2))  # Raises ValueError: Invalid arguments passed to the function

Output:

125
Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\test.py", line 16, in <module>
    print(compute_cubed_result(-2))  # Raises ValueError: Invalid arguments passed to the function
  File "C:\Program Files\Sublime Text 3\test.py", line 7, in validate_and_calculate
    raise ValueError("Invalid arguments passed to the function")
ValueError: Invalid arguments passed to the function

This code showcases how you can implement a decorator for validating function arguments.

The check_condition_positive() is a decorator factory that generates an argument_validator() decorator. This validator, when applied with @check_condition_positive() above the compute_cubed_result() function, checks if the condition (in this case, that the argument should be greater than 0) holds true for the passed arguments.

If the condition is met, the decorated function is executed – otherwise, a ValueError exception is raised.

This succinct example illustrates how decorators serve as a mechanism for validating function arguments before their execution, ensuring adherence to specified conditions.

Usage and Applications

Such parameter validation decorators are extremely useful in applications to help enforce business rules, security constraints, and so on.

For example, an insurance claims processing system would have a function process_claim() that takes details like claim id, approver name, and so on. Certain business rules dictate who can approve claims.

Rather than cluttering the function logic itself, you can decorate it with @check_condition_positive(), passing a condition that checks whether the approver's role is sufficient for the claim amount. If a junior agent tries to approve a large claim (thus violating the rules), the decorator catches it by raising an exception before process_claim() even executes.

Similarly, input data validation constraints for security and compliance can be imposed without touching individual functions. Decorators ensure from the outside that invalid arguments never reach the application logic.

Common validation patterns should be reused across multiple functions. This improves security and promotes separation of concerns by isolating constraints from core logic flow in a modular way.
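The claims scenario could be sketched roughly as follows. Everything specific here is hypothetical: the check_condition name, the custom error message parameter, the "junior" and "senior" roles, and the approval limit of 1,000. The key point is that the condition receives the same arguments as the decorated function, so its signature has to match.

```python
import functools

def check_condition(condition, message="Invalid arguments passed to the function"):
    def argument_validator(func):
        @functools.wraps(func)
        def validate_and_call(*args, **kwargs):
            # The condition is called with exactly the arguments meant for func
            if not condition(*args, **kwargs):
                raise ValueError(message)
            return func(*args, **kwargs)
        return validate_and_call
    return argument_validator

JUNIOR_APPROVAL_LIMIT = 1_000  # hypothetical business rule

def approver_may_approve(claim_id, approver_role, amount):
    # Senior approvers may approve anything; others only up to the limit
    return approver_role == "senior" or amount <= JUNIOR_APPROVAL_LIMIT

@check_condition(approver_may_approve, message="Approver is not authorized for this amount")
def process_claim(claim_id, approver_role, amount):
    return f"Claim {claim_id} approved for {amount}"

print(process_claim("C-1", "junior", 500))
try:
    process_claim("C-2", "junior", 5_000)
except ValueError as error:
    print(f"Rejected: {error}")
```

Keeping the rule in a named function such as approver_may_approve() lets the same check be reused on other functions and tested on its own.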

Retry a Function Multiple Times on Failure

This decorator comes in handy when you want to automatically retry a function after failure, enhancing its resilience in situations involving transient failures. It is typically used for external services or network requests prone to intermittent failures.

import sqlite3
import time

def retry_on_failure(max_attempts, retry_delay=1):
    def decorator(func):
        def wrapper(*args, **kwargs):
            for _ in range(max_attempts):
                try:
                    result = func(*args, **kwargs)
                    return result
                except Exception as error:
                    print(f"Error occurred: {error}. Retrying...")
                    time.sleep(retry_delay)
            raise Exception("Maximum attempts exceeded. Function failed.")

        return wrapper
    return decorator

@retry_on_failure(max_attempts=3, retry_delay=2)
def establish_database_connection():
    connection = sqlite3.connect("example.db")
    db_cursor = connection.cursor()
    db_cursor.execute("SELECT * FROM users")
    query_result = db_cursor.fetchall()
    db_cursor.close()
    connection.close()
    return query_result

try:
    retrieved_data = establish_database_connection()
    print("Data retrieved successfully:", retrieved_data)
except Exception as error_message:
    print(f"Failed to establish database connection: {error_message}")

Output:

Error occurred: no such table: users. Retrying...
Error occurred: no such table: users. Retrying...
Error occurred: no such table: users. Retrying...
Failed to establish database connection: Maximum attempts exceeded. Function failed.

This example introduces a decorator that's designed for retrying function executions in the event of failures. It has a specified maximum attempt count and delay between retries.

The retry_on_failure() is a decorator factory, taking parameters for maximum retry count and delay, and returning a decorator() that manages the retry logic.

Within the wrapper() function, the decorated function is executed in a loop, up to the specified maximum number of attempts.

In case of an exception, it prints an error message, waits for the delay specified by retry_delay, and retries. If all attempts fail, it raises an exception indicating that the maximum attempts have been exceeded.

The @retry_on_failure() applied above establish_database_connection() integrates this retry logic, allowing for up to 3 attempts in total with a 2-second delay after each failed attempt.

This demonstrates the utility of decorators in seamlessly incorporating retry capabilities without altering the core function code.

Usage and Application

This retry decorator can prove extremely useful in application development for adding resilience against temporary or intermittent errors.

For instance, consider a flight booking app that calls a payment gateway API through process_payment() to handle customer transactions. Sometimes network blips or high load at the payment provider's end can cause transient errors in the API response.

Rather than directly showing failures to customers, the process_payment() function can be decorated with @retry_on_failure to handle such scenarios implicitly. Now when a payment fails once, it will seamlessly retry sending the request, up to the configured number of attempts, before finally reporting the error if it persists.

This shields users from temporary hiccups instead of exposing them directly to unreliable infrastructure behavior. The application also remains available even if dependent services fail occasionally.

The decorator keeps the retry logic in one place instead of spreading it across the API's code. Failures beyond the app's control are handled gracefully rather than surfacing to users as application faults. This demonstrates how decorators lend better resilience without complicating business logic.

Enforce Rate Limits on a Function

By controlling how often a function can be called, the Enforce Rate Limits decorator ensures effective resource management and guards against misuse. It is especially helpful for preventing API abuse or conserving resources, where restricting function calls is essential.

import time

def rate_limiter(max_allowed_calls, reset_period_seconds):
    def decorate_rate_limited_function(original_function):
        calls_count = 0
        last_reset_time = time.time()

        def wrapper_function(*args, **kwargs):
            nonlocal calls_count, last_reset_time
            elapsed_time = time.time() - last_reset_time

            # If the elapsed time is greater than the reset period, reset the call count
            if elapsed_time > reset_period_seconds:
                calls_count = 0
                last_reset_time = time.time()

            # Check if the call count has reached the maximum allowed limit
            if calls_count >= max_allowed_calls:
                raise Exception("Rate limit exceeded. Please try again later.")

            # Increment the call count
            calls_count += 1

            # Call the original function
            return original_function(*args, **kwargs)

        return wrapper_function
    return decorate_rate_limited_function

# Allowing a maximum of 6 API calls within 10 seconds.
@rate_limiter(max_allowed_calls=6, reset_period_seconds=10)
def make_api_call():
    print("API call executed successfully...")

# Make API calls
for _ in range(8):
    try:
        make_api_call()
    except Exception as error:
        print(f"Error occurred: {error}")
time.sleep(10)
make_api_call()

Output:

API call executed successfully...
API call executed successfully...
API call executed successfully...
API call executed successfully...
API call executed successfully...
API call executed successfully...
Error occurred: Rate limit exceeded. Please try again later.
Error occurred: Rate limit exceeded. Please try again later.
API call executed successfully...

This code showcases the implementation of a rate-limiting mechanism for function calls using a decorator.

The rate_limiter() function, specified with maximum calls and a period in seconds to reset the count, serves as the core of the rate-limiting logic. The decorator, decorate_rate_limited_function(), employs a wrapper to manage the rate limits by resetting the count if the period has elapsed. It checks if the count has reached the maximum allowed, and then either raises an exception or increments the count and executes the function accordingly.

Applied to make_api_call() using @rate_limiter(), it restricts the function to six calls per 10-second window, where a new window starts on the first call made after the previous one has expired. This introduces rate limiting without changing the function logic, ensuring that calls adhere to limits and preventing excessive use within set intervals.

Usage and Application

Rate limiting decorators like this are very useful in application development for controlling usage of APIs and preventing abuse.

For instance, a travel booking application may rely on a third-party Flight Search API for checking live seat availability across airlines. While most usage is legitimate, some users could call this API excessively, degrading overall service performance.

By decorating the API integration function with @rate_limiter(100, 60), the application can restrict excessive calls internally. This limits the booking module to at most 100 Flight API calls per minute. Additional calls are rejected by the decorator without ever reaching the actual API.

This protects the downstream service from overuse and enables a fairer distribution of capacity across the application's features.

Decorators provide easy rate control for both internal and external-facing APIs without changing functional code. Usage quotas are enforced in one place, safeguarding services and infrastructure, and it's all thanks to application-side controls using wrappers.
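A caller that hits the limit usually wants to know how long to wait. The following sketch is one way to provide that, under a few assumptions: the RateLimitExceeded exception class and its retry_after attribute are hypothetical additions, time.monotonic() replaces time.time() so that system clock adjustments do not affect the window, and the tiny limits are chosen only to keep the demonstration fast.

```python
import functools
import time

class RateLimitExceeded(Exception):
    """Hypothetical exception that tells the caller how long to wait."""
    def __init__(self, retry_after):
        super().__init__(f"Rate limit exceeded. Try again in {retry_after:.1f} seconds.")
        self.retry_after = retry_after

def rate_limiter(max_allowed_calls, reset_period_seconds):
    def decorator(func):
        calls_count = 0
        window_start = time.monotonic()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls_count, window_start
            now = time.monotonic()
            # Start a fresh window once the current one has expired
            if now - window_start > reset_period_seconds:
                calls_count = 0
                window_start = now
            if calls_count >= max_allowed_calls:
                raise RateLimitExceeded(window_start + reset_period_seconds - now)
            calls_count += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limiter(max_allowed_calls=2, reset_period_seconds=0.5)
def search_flights(route):
    return f"Results for {route}"

results = []
for _ in range(3):
    try:
        results.append(search_flights("BOM-DEL"))
    except RateLimitExceeded as error:
        results.append("rejected")
        time.sleep(max(error.retry_after, 0) + 0.05)  # wait out the current window

results.append(search_flights("BOM-DEL"))
print(results)
```

Raising a dedicated exception type, rather than a bare Exception, also lets callers catch rate-limit errors separately from genuine failures.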

Handle Exceptions and Provide Default Response

The Handle Exceptions decorator is a safety net for functions, gracefully handling exceptions and providing default responses when they occur. It shields the application from crashing due to unforeseen circumstances, ensuring smooth operation.

def handle_exceptions(default_response_msg):
    def exception_handler_decorator(func):
        def decorated_function(*args, **kwargs):
            try:
                # Call the original function
                return func(*args, **kwargs)
            except Exception as error:
                # Handle the exception and provide the default response
                print(f"Exception occurred: {error}")
                return default_response_msg
        return decorated_function
    return exception_handler_decorator

# Example usage
@handle_exceptions(default_response_msg="An error occurred!")
def divide_numbers_safely(dividend, divisor):
    return dividend / divisor

# Call the decorated function
result = divide_numbers_safely(7, 0)  # This will raise a ZeroDivisionError
print("Result:", result)

Output:

Exception occurred: division by zero
Result: An error occurred!

This code showcases exception handling in functions using decorators.

The handle_exceptions() decorator factory, accepting a default response, produces exception_handler_decorator(). This decorator, when applied to functions, attempts to execute the original function. If an exception arises, it prints error details, and returns the specified default response.

The @handle_exceptions() syntax above a function incorporates this exception-handling logic. For instance, in divide_numbers_safely(), division by zero triggers an exception, which the decorator catches, preventing a crash and returning the default "An error occurred!" response.

Essentially, these decorators adeptly capture exceptions in functions, providing a seamless means of incorporating handling logic and preventing crashes.

Usage and Applications

Exception handling decorators greatly simplify application error management and help hide unreliable behavior from users.

For example, an e-commerce website may rely on payment, inventory, and shipping services to complete orders. Instead of scattering complex exception blocks everywhere, a core order processing function like place_order() can be decorated to achieve resilience.

The @handle_exceptions decorator applied above it would absorb any third-party service outage or intermittent issue during order finalization. On exception, it logs the error for debugging while serving a graceful "Order failed, please try again later" message to the customer. This avoids exposing complex failure root causes like payment timeouts to the end user.

Decorators shield customers from unreliable service issues without changing business code. They provide friendly default responses when errors happen. This improves the customer experience.

Also, decorators give developers visibility into those errors behind the scenes. So they can focus on systematically fixing the root causes of failures. This separation of concerns through decorators reduces complexity. Customers see more reliability, and you get actionable insights into faults – all while keeping business logic untouched.
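Catching every Exception can hide programming mistakes, so one option is to let the decorator accept the exception types it should absorb. The sketch below is an illustration under assumptions: the exceptions parameter, the "orders" logger name, and the gateway_available flag used to simulate an outage are all hypothetical.

```python
import functools
import logging

logger = logging.getLogger("orders")  # hypothetical logger name

def handle_exceptions(default_response, exceptions=(Exception,)):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions:
                # logger.exception records the message together with the traceback
                logger.exception("%s failed", func.__name__)
                return default_response
        return wrapper
    return decorator

@handle_exceptions("Order failed, please try again later",
                   exceptions=(TimeoutError, ConnectionError))
def place_order(order_id, gateway_available=True):
    if not gateway_available:
        raise TimeoutError("payment gateway timed out")  # simulated outage
    return f"Order {order_id} placed"

print(place_order(42))
print(place_order(43, gateway_available=False))
```

Exceptions outside the listed types, such as a TypeError caused by a bug, still propagate, so genuine defects are not silently replaced by the default message.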

Enforce Type Checking on Function Arguments

The Enforce Type Checking decorator ensures data integrity by verifying function arguments conform to specified data types, preventing type-related errors, and promoting code reliability. It is particularly useful in situations where strict data type adherence is crucial.

import inspect

def enforce_type_checking(func):
    def type_checked_wrapper(*args, **kwargs):
        # Get the function signature and parameter names
        function_signature = inspect.signature(func)
        function_parameters = function_signature.parameters

        # Iterate over the positional arguments
        for i, arg_value in enumerate(args):
            parameter_name = list(function_parameters.keys())[i]
            parameter_type = function_parameters[parameter_name].annotation
            if not isinstance(arg_value, parameter_type):
                raise TypeError(f"Argument '{parameter_name}' must be of type '{parameter_type.__name__}'")

        # Iterate over the keyword arguments
        for keyword_name, arg_value in kwargs.items():
            parameter_type = function_parameters[keyword_name].annotation
            if not isinstance(arg_value, parameter_type):
                raise TypeError(f"Argument '{keyword_name}' must be of type '{parameter_type.__name__}'")

        # Call the original function
        return func(*args, **kwargs)

    return type_checked_wrapper

# Example usage
@enforce_type_checking
def multiply_numbers(factor_1: int, factor_2: int) -> int:
    return factor_1 * factor_2

# Call the decorated function
result = multiply_numbers(5, 7)  # No type errors, returns 35
print("Result:", result)

result = multiply_numbers("5", 7)  # Type error: 'factor_1' must be of type 'int'

Output:

Result: 35
Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\test.py", line 36, in <module>
    result = multiply_numbers("5", 7)  # Type error: 'factor_1' must be of type 'int'
  File "C:\Program Files\Sublime Text 3\test.py", line 14, in type_checked_wrapper
    raise TypeError(f"Argument '{parameter_name}' must be of type '{parameter_type.__name__}'")
TypeError: Argument 'factor_1' must be of type 'int'

The enforce_type_checking decorator validates whether the arguments passed to a function match the specified type annotations.

Inside the type_checked_wrapper, it examines the signature of the decorated function, retrieves parameter names and type annotations, and ensures that the provided arguments align with the expected types. Positional arguments are matched to parameters by position, and keyword arguments by name. If a type mismatch is detected, a TypeError is raised. As written, the check assumes every parameter carries a plain class annotation such as int; an unannotated parameter would be rejected.

The multiply_numbers function, whose arguments are annotated as integers, shows the decorator in action. Passing integers executes the function without issues, while passing a string raises an exception. This type checking is enforced without altering the original function body.

Usage and Applications

Type checking decorators are applied to detect issues early and improve reliability. For example, consider a web application backend with a data access layer function get_user_data() annotated to expect integer user IDs. Its queries would fail if string IDs flow into it from frontend code.

Rather than adding explicit checks and raising exceptions locally, you can use this decorator. Any upstream or consumer code that passes an invalid type is then caught automatically when the function is called. The decorator compares the annotations with the argument types and raises a TypeError before the call reaches the database layer.

This runtime protection ensures that only correctly typed data flows across layers, preventing obscure errors further down. Type safety is enforced without extra checks cluttering the business logic.
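As a hedged variation on the decorator above, the sketch below uses inspect.Signature.bind() to map arguments to parameter names and skips parameters that have no annotation or whose annotation is not a plain class. The name enforce_annotated_types and the body of get_user_data() are assumptions made for illustration, not part of the original example.

```python
import functools
import inspect
import typing

def enforce_annotated_types(func):
    signature = inspect.signature(func)  # inspected once, at decoration time

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # bind() pairs each positional and keyword argument with its parameter name
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            parameter = signature.parameters[name]
            annotation = parameter.annotation
            # Skip *args/**kwargs, unannotated parameters, and anything that is
            # not a plain class (for example list[int] or a string annotation)
            if parameter.kind in (parameter.VAR_POSITIONAL, parameter.VAR_KEYWORD):
                continue
            if annotation is inspect.Parameter.empty:
                continue
            if not isinstance(annotation, type) or typing.get_origin(annotation) is not None:
                continue
            if not isinstance(value, annotation):
                raise TypeError(
                    f"Argument '{name}' must be of type '{annotation.__name__}'"
                )
        return func(*args, **kwargs)

    return wrapper

@enforce_annotated_types
def get_user_data(user_id: int, include_inactive=False):
    # Hypothetical stand-in for a database lookup
    return {"user_id": user_id, "include_inactive": include_inactive}

print(get_user_data(42))        # accepted: user_id is an int
try:
    get_user_data("42")         # rejected before any lookup would run
except TypeError as error:
    print("Rejected:", error)
```

Skipping what cannot be checked with isinstance() is a design choice: the decorator stays silent on annotations it does not understand instead of raising confusing errors.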

Measure Memory Usage of a Function

In data-intensive applications or resource-constrained environments, the Measure Memory Usage decorator acts as a memory detective, offering insight into how much memory a function consumes. It does not optimize anything by itself; it reports where allocations happen so you can decide what to optimize.

import tracemalloc

def measure_memory_usage(target_function):
    def wrapper(*args, **kwargs):
        tracemalloc.start()

        # Call the original function
        result = target_function(*args, **kwargs)

        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics("lineno")

        # Print the top memory-consuming lines
        print(f"Memory usage of {target_function.__name__}:")
        for stat in top_stats[:5]:
            print(stat)

        # Return the result
        return result

    return wrapper

# Example usage
@measure_memory_usage
def calculate_factorial_recursive(number):
    if number == 0:
        return 1
    else:
        return number * calculate_factorial_recursive(number - 1)

# Call the decorated function
result_factorial = calculate_factorial_recursive(3)
print("Factorial:", result_factorial)

Output:

Memory usage of calculate_factorial_recursive:
C:\Program Files\Sublime Text 3\test.py:29: size=1552 B, count=6, average=259 B
C:\Program Files\Sublime Text 3\test.py:8: size=896 B, count=3, average=299 B
C:\Program Files\Sublime Text 3\test.py:10: size=416 B, count=1, average=416 B
Memory usage of calculate_factorial_recursive:
C:\Program Files\Sublime Text 3\test.py:29: size=1552 B, count=6, average=259 B
C:\Program Files\Python310\lib\tracemalloc.py:226: size=880 B, count=3, average=293 B
C:\Program Files\Sublime Text 3\test.py:8: size=832 B, count=2, average=416 B
C:\Program Files\Python310\lib\tracemalloc.py:173: size=800 B, count=2, average=400 B
C:\Program Files\Python310\lib\tracemalloc.py:505: size=592 B, count=2, average=296 B
Memory usage of calculate_factorial_recursive:
C:\Program Files\Sublime Text 3\test.py:29: size=1440 B, count=4, average=360 B
C:\Program Files\Python310\lib\tracemalloc.py:535: size=1240 B, count=3, average=413 B
C:\Program Files\Python310\lib\tracemalloc.py:67: size=1216 B, count=19, average=64 B
C:\Program Files\Python310\lib\tracemalloc.py:193: size=1104 B, count=23, average=48 B
C:\Program Files\Python310\lib\tracemalloc.py:226: size=880 B, count=3, average=293 B
Memory usage of calculate_factorial_recursive:
C:\Program Files\Python310\lib\tracemalloc.py:558: size=1416 B, count=29, average=49 B
C:\Program Files\Python310\lib\tracemalloc.py:67: size=1408 B, count=22, average=64 B
C:\Program Files\Sublime Text 3\test.py:29: size=1392 B, count=3, average=464 B
C:\Program Files\Python310\lib\tracemalloc.py:535: size=1240 B, count=3, average=413 B
C:\Program Files\Python310\lib\tracemalloc.py:226: size=832 B, count=2, average=416 B
Factorial: 6

This code showcases a decorator, measure_memory_usage, designed to measure the memory consumption of functions.

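The decorator, when applied, starts memory tracking before the original function is called. Once the function completes its execution, a memory snapshot is taken and up to five of the source lines that allocated the most memory are printed. Because calculate_factorial_recursive() calls itself through its decorated name, the wrapper runs at every recursion level, which is why the output above contains four reports for a single call with an argument of 3. Some of the reported lines belong to tracemalloc.py itself, since taking snapshots also allocates memory.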

Illustrated through the example of calculate_factorial_recursive(), the decorator allows you to monitor memory usage without altering the function itself, offering valuable insights for optimization purposes.

In essence, it provides a straightforward means to assess and analyze the memory consumption of any function during its runtime.
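If one summary per top-level call is enough, a variation like the following may be easier to read. It is a sketch under stated assumptions, not the handbook's decorator: the names report_peak_memory and factorial are invented here, and it relies on tracemalloc.get_traced_memory(), which returns the current and peak traced sizes in bytes.

```python
import functools
import tracemalloc

def report_peak_memory(target_function):
    @functools.wraps(target_function)
    def wrapper(*args, **kwargs):
        # Tracing already active (for example, an inner recursive call):
        # run the function without starting a second measurement.
        if tracemalloc.is_tracing():
            return target_function(*args, **kwargs)

        tracemalloc.start()
        try:
            result = target_function(*args, **kwargs)
            current_bytes, peak_bytes = tracemalloc.get_traced_memory()
        finally:
            tracemalloc.stop()  # always stop, even if the function raised

        print(f"{target_function.__name__}: current={current_bytes} B, peak={peak_bytes} B")
        return result

    return wrapper

@report_peak_memory
def factorial(number):
    return 1 if number == 0 else number * factorial(number - 1)

print("Factorial:", factorial(3))
```

Here only the outermost call starts and stops tracing, so the recursive example prints a single report. The trade-off is that the per-line breakdown from take_snapshot() is lost, and no report is printed if tracing was already switched on elsewhere in the program.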

Usage and Applications

Memory measurement decorators like these are extremely valuable in application development for identifying and troubleshooting memory bloat or leak issues.

For example, consider a data streaming pipeline with critical ETL components like transform_data() that processes large volumes of information. Though the process seems fine during regular loads, high volume data like Black Friday sales could cause excessive memory usage and crashes.

Rather than debugging manually, you can decorate processors such as transform_data() with @measure_memory_usage to reveal useful insights. It prints the most memory-intensive lines during peak data flow without any change to the function body.

The aim is to pinpoint the specific stages that consume memory rapidly and address them through better algorithms or optimization.

Such decorators help build diagnostics into critical paths so that abnormal consumption trends are recognized early. Instead of surfacing late as production issues, problems can be identified through profiling before release. This reduces debugging headaches and minimizes runtime failures by making memory tracking easy to instrument.

Cache Function Results with Expiration Time

Designed for data that goes out of date, the Cache Function Results with Expiration Time decorator combines caching with time-based expiration, so that cached data is refreshed regularly and does not go stale.

import time

def cached_function_with_expiry(expiry_time):
    def decorator(original_function):
        cache = {}

        def wrapper(*args, **kwargs):
            key = (*args, *kwargs.items())

            if key in cache:
                cached_value, cached_timestamp = cache[key]

                if time.time() - cached_timestamp < expiry_time:
                    return f"[CACHED] - {cached_value}"

            result = original_function(*args, **kwargs)
            cache[key] = (result, time.time())

            return result

        return wrapper

    return decorator

# Example usage

@cached_function_with_expiry(expiry_time=5)  # Cache expiry time set to 5 seconds
def calculate_product(x, y):
    return f"PRODUCT - {x * y}"

# Call the decorated function multiple times
print(calculate_product(23, 5))  # Calculation is performed
print(calculate_product(23, 5))  # Result is retrieved from cache
time.sleep(5)
print(calculate_product(23, 5))  # Calculation is performed (cache expired)

Output:

PRODUCT - 115
[CACHED] - PRODUCT - 115
PRODUCT - 115

This code showcases a caching decorator that has an automatic cache expiration time.

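The function cached_function_with_expiry() generates a decorator that, when applied, uses a dictionary called cache to store function results and their corresponding timestamps. The wrapper() function checks whether the result for the current arguments is in the cache. If it is present and within the expiry time, the cached result is returned; otherwise, the function is called and the cache entry is replaced. The "[CACHED] - " prefix on cache hits is there only to make the demonstration visible, and it means a cache hit returns a slightly different string than a fresh call.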

Illustrated using calculate_product(), the decorator initially calculates and caches the result. Subsequent calls retrieve the cached result until the expiry period, at which point the cache is refreshed through a recalculation.

In essence, this implementation prevents redundant calculations while automatically refreshing results after the specified expiry period.
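For use outside a demonstration, you would usually want a cache hit to return exactly the same value as a fresh call. The sketch below is one possible variation, with assumptions labeled: the name cache_with_expiry and the call_log list are invented for the example, time.monotonic() is swapped in for time.time() so system clock adjustments do not affect expiry, and keyword arguments are sorted so their order does not change the cache key.

```python
import functools
import time

def cache_with_expiry(expiry_seconds):
    def decorator(original_function):
        cache = {}

        @functools.wraps(original_function)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to be usable as a dictionary key
            key = (args, tuple(sorted(kwargs.items())))

            if key in cache:
                cached_value, stored_at = cache[key]
                if time.monotonic() - stored_at < expiry_seconds:
                    return cached_value  # returned unchanged on a cache hit

            result = original_function(*args, **kwargs)
            cache[key] = (result, time.monotonic())
            return result

        return wrapper

    return decorator

call_log = []  # records every real computation, for illustration only

@cache_with_expiry(expiry_seconds=0.2)
def calculate_product(x, y):
    call_log.append((x, y))
    return x * y

calculate_product(23, 5)     # computed
calculate_product(23, 5)     # served from the cache
time.sleep(0.3)
calculate_product(23, 5)     # entry expired, computed again
print("Real computations:", len(call_log))
```

Expired entries are only replaced when the same arguments are requested again, so a long-running process with many distinct arguments would also need some form of eviction.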

Usage and Applications

Automatic cache expiry decorators are very useful in application development for optimizing the performance of data-fetching modules.

For example, consider a travel website that calls a backend API, get_flight_prices(), to show live prices to users. While caches reduce calls to expensive flight data sources, static caching leads to displaying stale prices.

Instead, you can use @cached_function_with_expiry(60) to refresh the data roughly every minute. The first user call fetches live prices and caches them, while subsequent requests within a 60-second window reuse the cached pricing. Entries expire after that period, so no user is shown a cached price that is more than a minute old.

This allows you to optimize data flows without worrying about serving outdated results indefinitely. The decorator keeps cached data at most one expiry period behind the upstream source through a configurable refresh interval. Redundant recalculations within that window are avoided, and end users still see reasonably current information. Common caching patterns get packaged conveniently for reuse across the codebase with customized expiry rules.

Conclusion

Python decorators continue to see widespread usage in application development for cleanly inserting common cross-cutting concerns. Authentication, monitoring, and access restrictions are some standard examples of use cases that use decorators in frameworks like Django and Flask.

The popularity of web APIs has also led to common adoption of rate limiting and caching decorators for performance.

Decorators have been part of Python for a long time. The @decorator syntax for functions was added in Python 2.4, released in 2004 (PEP 318). Before that, the same effect required reassigning the function by hand, as in func = decorator(func). The dedicated syntax opened the door to elegant solutions in the style of aspect-oriented programming. From web to data science, decorators continue to empower abstraction and modularity across Python domains.

The examples in this handbook only scratch the surface of what custom tailored decorators can enable. Based on any specific objective like security, throttling user requests, transparent encryption, and so on, you can create innovative decorators to address your needs. Structuring logic processing pipelines using a composition of specialized single-responsibility decorators also encourages reuse over redundancy.
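To make the idea of composing single-responsibility decorators concrete, here is a minimal sketch. The decorators log_calls and uppercase_result are hypothetical names invented for this illustration; the point is only the stacking order.

```python
import functools

def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

def uppercase_result(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@log_calls          # outermost: applied last, runs first
@uppercase_result   # innermost: wraps greet directly
def greet(name):
    return f"hello, {name}"

print(greet("ada"))
```

Decorators are applied from the bottom up, so the stack is equivalent to greet = log_calls(uppercase_result(greet)). Using functools.wraps in each layer keeps the original function name visible through the whole stack.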

Understanding decorators not only improves development skills but unlocks ways to dictate program behaviour flexibly. I encourage you to assess common needs across your codebases that can be abstracted into standalone decorators. With some practice, it becomes easy to spot cross-cutting concerns and extend functions efficiently without breaking a sweat.

If you liked this lesson and would like to explore more insightful tech content, including Python, Django, and System Design reads, check out my Blog. You can also view my projects with proof of work on GitHub and connect with me on LinkedIn for a chat.

How to Use Databricks Delta Lake with SQL – Full Handbook

Atharva Shah — Tue, 05 Sep 2023 13:57:32 +0000

Welcome to the Databricks Delta Lake with SQL Handbook! Databricks is a unified analytics platform that brings together data engineering, data science, and business analytics into a collaborative workspace.

Delta Lake, an open-source storage layer that underpins tables on Databricks, provides enhanced reliability, performance, and data quality for big data workloads.

This is a hands-on training guide where you will get a chance to dive into the world of Databricks and learn how to effectively use Delta Lake for managing and analyzing data. It'll provide you with the essential SQL skills to efficiently interact with Delta tables and perform advanced data analytics.

Prerequisites

This handbook is designed for beginner-level SQL users who have some experience with cloud platforms and clusters. Although no prior experience with Databricks is required, it is recommended that you have a basic understanding of the following concepts:

Databases: Familiarity with the basic structure and functionality of databases will be helpful.
SQL Queries: Knowledge of SQL syntax and the ability to write basic queries is essential.
Jupyter Notebooks: Understanding how Jupyter notebooks work and being comfortable with running code cells is recommended.

While this handbook assumes a certain level of familiarity with databases, SQL, and Jupyter notebooks, it will guide you step-by-step through each process, ensuring that you understand and follow along with the material.

As such, no installation is necessary, as all the work will be done on Databricks Delta Notebooks running in the cluster. Everything has already been provisioned, eliminating the need for any setup or configuration.

By the end of this handbook, you will have gained a solid foundation in using SQL with Databricks, enabling you to leverage its powerful capabilities for data analysis and manipulation.

Let's get started!

Here are the sections of this tutorial:

Introduction to Databricks

What is Databricks?
Key features and benefits
Getting started with Databricks Workspace
Notebook basics and interactive analytics

Introduction to Delta

Understanding Delta Lake
Advantages of using Delta
Use cases of Delta in real-world scenarios
Supported languages and platforms for Delta

How to Create and Manage Tables

Creating tables from various data sources
SQL Data Definition Language (DDL) commands
SQL Data Manipulation Language (DML) commands
Creating tables from a Databricks dataset
Saving the loaded CSV file to Delta using Python

Delta SQL Command Support

Delta SQL commands for data management
Performing UPSERT (UPDATE and INSERT) operations

Advanced SQL Queries

Handling data visualization in Delta
Advanced aggregate queries in Delta
Counting diamonds by clarity using SQL
Adding table constraints for data integrity

How to Work with DataFrames

Creating a DataFrame from a Databricks dataset
Data manipulation and displaying results using DataFrames

Version Control and Time Travel in Delta

Understanding version control and time travel in Delta
Restoring data to a specific version
Utilizing autogenerated fields for metadata tracking

Delta Table Cloning

Deep and shallow copying of Delta tables
Efficiently cloning Delta tables for data exploration and analysis

Conclusion

Introduction to Databricks

Databricks is a unified analytics platform that combines data engineering, data science, and machine learning into a single collaborative environment. Leveraging Apache Spark, it processes and analyzes vast amounts of data efficiently.

Databricks offers benefits like seamless scalability, real-time collaboration, and simplified workflows, making it a favored choice for data-driven enterprises.

Its versatility suits various use cases: from ETL processes and data preparation to advanced analytics and AI model development. Databricks aids in uncovering insights from structured and unstructured data, empowering businesses to make informed decisions swiftly.

You can see its application in finance for fraud detection, healthcare for predictive analytics, e-commerce for recommendation engines, and so on. Basically, Databricks accelerates data-driven innovation, transforming raw information into actionable intelligence.

To follow along with this tutorial, you should first create a Community Edition account so you can create your clusters.

Create a Databricks Community Edition Account

Once you've created your account, head over to the Community Edition login page. Once you have signed in, you'll be greeted with a screen very similar to the one shown below.

Databricks User Dashboard with options to create workspaces, notebooks, and import data

From the sidebar on the left, you can create your workspaces, and upload datasets and files that you wish to process.

To follow along, click on the link highlighted in the image above (the one that says "create a notebook"). It will launch a new notebook on the Databricks platform where we'll be writing all the code.

You can also access all your notebooks from the left sidebar or from the "Recents" tab on the home screen once you log in.

You can find all the code, instructions, and steps used in this handbook with explanations in one of the public notebooks I have created here.

On creating a new notebook, you should create a cluster to run your commands and process the data. Clusters in the Databricks Delta platform are groups of computing resources that drive efficient data processing. They execute tasks in parallel, speeding up tasks like ETL and analysis.

Clusters offer tailored resource allocation, ensuring optimal performance and scalability. Supporting multiple users and tasks concurrently, clusters encourage collaboration. Leveraging Apache Spark, they enable advanced analytics and machine learning.

Clusters also execute the reads and writes that Delta's ACID transactions keep consistent. Overall, clusters enable high-performance data handling, which is essential for tasks ranging from data preparation to sophisticated analytics and AI model training.

Provision a cluster by creating a new resource to run commands in the notebook

Proceed with the standard configuration

Now that we have the notebook and cluster set up, we can start with the code. But before we do that, here are a few key terms to know. They relate to the platform rather than to SQL syntax, which is covered below.

Data Ingestion

Data ingestion in Delta involves loading data from external sources, for example through third-party tools such as Fivetran. Delta stores table data as Parquet files, a columnar storage format. To load data into Delta, we can use Spark (for example, through PySpark in Python) and specify the storage location. Data can also be loaded with SQL using the COPY INTO command, and the loaded tables can then be queried with standard SQL syntax.

Dashboards

Visualizations created in SQL notebooks within Delta can be added to custom dashboards for BI/Analytics. These dashboards are lightweight and provide real-time updates based on data refreshment. This enables users to create insightful and interactive dashboards for data analysis and reporting. You need not create your dashboards from scratch. Popular Dashboard templates are available.

Policies

Delta provides data governance through the Unity Catalog, ensuring that users only have access to databases and tables they are permitted to view or edit. This granular control over data access enhances security and data privacy within the system.

History

Moderators or superusers can access the history of each query run against all databases, along with timestamps and query execution times. This feature helps in understanding query patterns and optimizing database performance based on usage insights.

Optimization

To improve query performance, Delta offers various optimization techniques, such as database indexing, clustering, Bloom filter indexing, and leveraging MPP paradigms like MapReduce. Knowledge of normalization and schema design also contributes to writing efficient SQL queries.

Alerts

Delta allows users to set alerts based on comparison operators applied to query results. For example, when a sales count query returns a value below a threshold, an alert can be triggered via Slack, ticketing tools, or emails. Customizable alerts ensure timely notifications for critical data events.

Persona-Based Design

The Databricks Platform is designed to cater to different personas, including Data Science/Analytics and BI/MLOps specialists. Users get segregated interfaces tailored to their roles. However, the Unity Catalog can aggregate all these views, providing a cohesive experience.

SQL Workspace

The SQL Workspace in Delta provides an interface similar to MySQL Workbench or PgAdmin. Users can perform SQL queries on datasets without the need to load the data repeatedly, as done in notebooks. This efficient querying enhances the SQL-based data analysis experience.

Integration with other BI Tools

Databricks integrates well with Tableau and PowerBI. You can import your data points and visualizations seamlessly and get consistent and synced results in the BI tools of your choice. With the click of a button, live queries are generated against the Databricks datasets.

Introduction to Delta

Delta Lake is an open storage format used to save your data in your Lakehouse. Delta provides an abstraction layer on top of files. It's the storage foundation of your Lakehouse.

Why Delta Lake?

Running an ingestion pipeline on Cloud Storage can be very challenging. Data teams typically face the following challenges:

Hard to append data (Adding newly arrived data leads to incorrect reads).
Modification of existing data is difficult (GDPR/CCPA requires making fine-grained changes to the existing data lake).
Jobs failing mid-way (Half of the data appears in the data lake, the rest may be missing).
Data quality issues (It’s a constant headache to ensure that all the data is correct and high quality).
Real-time operations (Mixing streaming and batch leads to inconsistency).
Costly to keep historical versions of the data (Regulated environments require reproducibility, auditing, and governance).
Difficult to handle large metadata (For large data lakes, the metadata itself becomes difficult to manage).
“Too many files” problems (Data lakes are not great at handling millions of small files).
Hard to get great performance (Partitioning the data for performance is error-prone and difficult to change).

These challenges have a real impact on team efficiency and productivity, as teams spend unnecessary time fixing low-level technical issues instead of focusing on high-level business implementation.

Because Delta Lake solves all the low-level technical challenges of saving petabytes of data in your lakehouse, it lets you focus on implementing a simple data pipeline while providing blazing-fast query answers for your BI and analytics reports.

In addition, Delta Lake is a fully open source project under the Linux Foundation and is adopted by most of the data players. You know you own your data and won't have vendor lock-in.

Features and Capabilities

You can think about Delta as a file format that your engine can leverage to bring the following capabilities out of the box:

ACID transactions
Support for DELETE/UPDATE/MERGE
Unify batch & streaming
Time Travel
Clone zero copy
Generated partitions
CDF - Change Data Feed (DBR runtime)
Blazing-fast queries

This hands-on quickstart guide is going to focus on:

Loading Databases and Tabular Data from a variety of sources
Writing DDL, DML, and DTL queries on these datasets
Visualizing Datasets to get conclusive results
Time travel and Restoring database
Performance Optimization

How to Create and Manage Tables

Okay, time to code! If you still have the notebook that we created earlier along with the clusters open, you can start by following along with the code below. Don't worry, explanations for every step will follow.

Select the dropdown next to the notebook title and ensure SQL is selected, since this handbook is all about Delta Lake with SQL.

Select Notebook language to be SQL

How to Create Tables from a Databricks Dataset

Databricks notebooks are very much like Jupyter Notebooks. You have to insert your code into cells and run them one by one or together. All the output is shown cell by cell, progressively.

Databricks notebook interface

Here's the code from the image above:

DROP TABLE IF EXISTS diamonds; 
CREATE TABLE diamonds 
USING csv 
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

In the code above, two SQL statements (DROP TABLE and CREATE TABLE) are used to create a table named diamonds in a database. The table is based on data from a CSV file located at the specified path.

If a table with the same name already exists, the DROP TABLE IF EXISTS diamonds statement ensures it is deleted before creating a new one. The table will have the same schema as the CSV file, with the first row assumed to be the header containing column names ("header 'true'").

Here's a command that returns all the records from the diamonds table:

SELECT * from diamonds

The above query returns all the records from the diamonds table

Here's another command:

describe diamonds;

Table metadata returned by the describe command

In SQL, the DESCRIBE statement is used to retrieve metadata information about a table's structure. The specific syntax for the DESCRIBE statement can vary depending on the database system being used.

However, its primary purpose is to provide details about the columns in a table, such as their names, data types, constraints, and other properties.

Saving the loaded CSV file to Delta using Python

The best part about using the Databricks platform is that it allows you to write Python, SQL, Scala, and R interchangeably in the same notebook.

You can switch up the languages at any given point by using the "Delta Magic Commands". You can find a full list of magic commands at the end of this handbook.

%python

diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

diamonds.write.format("delta").mode("overwrite").save("/delta/diamonds")

Data is read from a CSV file located at /databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv into a Spark DataFrame named diamonds. The first row of the CSV file is treated as the header, and Spark infers the schema for the DataFrame based on the data.

The DataFrame diamonds is then written in Delta Lake format. If a table already exists at the specified location (/delta/diamonds), it will be overwritten. If it does not exist, a new table will be created.


DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'

The SQL statements above drop any existing table named diamonds and create a new Delta Lake table named diamonds using the data stored in the Delta Lake format at the /delta/diamonds/ location.

You can run a SELECT statement to ensure that the table appears as expected:

SELECT * from diamonds

The same diamonds table result set once restored from Delta Lake

Delta SQL Command Support

In the world of databases, there are two fundamental types of commands: Data Manipulation Language (DML) and Data Definition Language (DDL). These commands play a crucial role in managing and organizing data within a database. In this section, we will explore what DML and DDL commands are, how they differ, and examples of how they are used.

Databricks Notebooks support all the SQL commands including DDL and DML commands highlighted here

Data Manipulation Language (DML) Commands

DML commands are used to manipulate or modify data stored in a database. These commands allow users to insert, retrieve, update, and delete data from database tables. Let's take a closer look at some commonly used DML commands:

SELECT: The SELECT command is used to retrieve data from one or more tables in a database. It allows you to specify the columns and rows you want to extract by using conditions and filters. For example, SELECT * FROM Customers retrieves all the records from the Customers table.

INSERT: The INSERT command adds new data into a table. It allows you to specify the value for each column or select values from another table. For example, INSERT INTO Customers (Name, Email) VALUES ('John Doe', 'john@example.com') adds a new customer record to the Customers table.

UPDATE: The UPDATE command is used to modify existing data in a table. It allows you to change the values of specific columns based on certain conditions. For example, UPDATE Customers SET Email = 'new@example.com' WHERE ID = 1 updates the email address of the customer with ID of 1.

DELETE: The DELETE command is used to remove data from a table. It allows you to delete specific rows based on certain conditions. For example, DELETE FROM Customers WHERE ID = 1 deletes the customer record with ID of 1 from the Customers table.

Data Definition Language (DDL) Commands

DDL commands are used to define the structure and organization of a database. These commands allow users to create, modify, and delete database objects such as tables, indexes, and constraints.

Let's explore some commonly used DDL commands:

CREATE: Creates a new database object, such as a table or an index. It allows you to define the columns, data types, and constraints for the object. For example, CREATE TABLE Customers (ID INT, Name VARCHAR(50), Email VARCHAR(100)) creates a new table named Customers with three columns.

ALTER: Modifies the structure of an existing database object. It allows you to add, modify, or delete columns, constraints, or indexes. For example, ALTER TABLE Customers ADD COLUMN Phone VARCHAR(20) adds a new column named Phone to the Customers table.

DROP: Deletes an existing database object. It permanently removes the object and its associated data from the database. For example, DROP TABLE Customers deletes the Customers table from the database.

TRUNCATE: The TRUNCATE command is used to remove all the data from a table, while keeping the table structure intact. It is faster than the DELETE command when you want to remove all records from a table. For example, TRUNCATE TABLE Customers removes all records from the Customers table.
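The DDL commands can be tried out the same way. The sketch below again assumes an in-memory SQLite database rather than Delta. SQLite has no TRUNCATE statement, so a DELETE without a WHERE clause stands in for it here; the helper column_names is a hypothetical name introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def column_names(table):
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# CREATE: define a new table with three columns.
conn.execute("CREATE TABLE Customers (ID INT, Name VARCHAR(50), Email VARCHAR(100))")
columns_after_create = column_names("Customers")

# ALTER: add a Phone column to the existing table.
conn.execute("ALTER TABLE Customers ADD COLUMN Phone VARCHAR(20)")
columns_after_alter = column_names("Customers")

# TRUNCATE stand-in: SQLite lacks TRUNCATE, so delete every row instead.
conn.execute("INSERT INTO Customers (ID, Name) VALUES (1, 'John Doe')")
conn.execute("DELETE FROM Customers")
row_count = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]

# DROP: remove the table and its data entirely.
conn.execute("DROP TABLE Customers")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

After the row removal the table still exists with its four columns, while after DROP it no longer appears in the catalog.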

Delta Lake supports standard DML including UPDATE, DELETE and MERGE INTO, providing developers with more control to manage their big datasets.

Here's an example that uses the INSERT, UPDATE, and SELECT commands:

INSERT INTO diamonds(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) values (53941, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');

UPDATE diamonds SET carat = 0.20 WHERE _c0 = 53941;

select * from diamonds where _c0=53941;

Fetching a unique record from the table

In the example above, an initial row is inserted into the diamonds table with specific values for each column.

Then the carat value for the row with _c0 equal to 53941 is updated to 0.20.

The final SELECT statement retrieves the row with _c0 equal to 53941, showing its current state after the INSERT and UPDATE operations. This confirms that both the insertion and the update were successful.

DELETE FROM diamonds where _c0=53941;

select * from diamonds where _c0=53941;

The above DELETE command paired with the WHERE clause removes the row from the table, and the subsequent SELECT query validates this by returning an empty result set.

UPSERT Operation

The "upsert" operation updates the record if it exists, and inserts it if it doesn't.

CREATE TABLE diamond__mini(_c0 int, carat double, cut string, color string, clarity string, depth double, table double, price int, x double, y double, z double);

delete from diamond__mini;

INSERT INTO diamond__mini(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) values (1, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');
INSERT INTO diamond__mini(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) values (2, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');
INSERT INTO diamond__mini(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) values (90000, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');

select * from diamond__mini;

Creating a subset diamond__mini to demonstrate the UPSERT operation

In this scenario, we created a table named diamond__mini to stage rows for an upsert (that is, an insert or update) into the diamonds table.

diamond__mini contains only 3 records. Two of them (with _c0 values 1 and 2) have keys that already exist in the diamonds table, and one (with _c0 value 90000) does not.

The code first creates the diamond__mini table with a schema that matches the diamonds table.

It then clears diamond__mini by deleting all existing records, so the upsert test starts from a clean slate.

Next, three INSERT statements add three records with different _c0 values to diamond__mini, including one with _c0 = 90000.

Lastly, we select all records from diamond__mini to confirm that the staged rows are in place before the upsert runs.

Since the _c0 values 1 and 2 already exist in the diamonds table, the corresponding rows in diamond__mini will be treated as updates to the existing rows when the two tables are merged.

On the other hand, the row with _c0 = 90000 does not exist in the diamonds table, so it will be treated as an insert.

The describe command shows the metadata of the new table:

describe diamond__mini

Fetching metadata of the newly created table

Here's another example that uses the upsert operation:

Upsert operation on the diamonds and diamond__mini tables

-- perform UPSERT operation based on matching column and row criteria from diamond__mini to diamonds table. If a match is found, record will update otherwise it will be inserted.

MERGE INTO diamonds as d USING diamond__mini as m
  ON d._c0 = m._c0
  WHEN MATCHED THEN
    UPDATE SET *
  WHEN NOT MATCHED THEN
    INSERT *;

select * from diamonds where _c0 in (1 ,2, 90000)

UPSERT operation successful. Values for records with _c0 = [1,2] were updated and 90,000 was inserted

In this example, a MERGE operation is performed between two tables: diamonds (target table) and diamond__mini (source table). The MERGE statement compares the records in both tables based on the common _c0 column.

Here's a concise explanation:

The MERGE statement matches records with the same _c0 value in both tables (diamonds and diamond__mini).
When a match is found (based on _c0), it performs an UPDATE on the target table (diamonds) using the values from the source table (diamond__mini). This is done for all columns using UPDATE SET *.
If no match is found for a record from the source table (diamond__mini), it performs an INSERT into the target table (diamonds) using the values from the source table for all columns (using INSERT *).
After the MERGE operation, a SELECT statement retrieves the records from the target table (diamonds) with _c0 values 1, 2, and 90000 to observe the changes made during the merge.

The MERGE statement is used to synchronize data between the diamonds and diamond__mini tables based on their common _c0 column, updating existing records and inserting new ones.
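The matching logic can be sketched in plain Python. This is only an illustration of the upsert semantics, not how Delta implements MERGE; the function name merge_into and the sample carat values are assumptions made for the example.

```python
def merge_into(target, source, key="_c0"):
    """Upsert every source row into target, matching rows on `key`."""
    by_key = {row[key]: dict(row) for row in target}
    updated, inserted = [], []
    for row in source:
        if row[key] in by_key:
            updated.append(row[key])    # WHEN MATCHED THEN UPDATE SET *
        else:
            inserted.append(row[key])   # WHEN NOT MATCHED THEN INSERT *
        by_key[row[key]] = dict(row)    # either way, the source row wins
    return list(by_key.values()), updated, inserted

# Made-up sample rows: keys 1 and 2 exist in the target, 90000 does not.
diamonds = [{"_c0": 1, "carat": 0.23}, {"_c0": 2, "carat": 0.21}]
diamond_mini = [
    {"_c0": 1, "carat": 0.22},
    {"_c0": 2, "carat": 0.22},
    {"_c0": 90000, "carat": 0.22},
]

merged, updated, inserted = merge_into(diamonds, diamond_mini)
print(updated, inserted)
```

Keys 1 and 2 end up in the updated list and 90000 in the inserted list, mirroring the two branches of the MERGE statement.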

Advanced SQL Queries

Data Visualization in Delta

On the Databricks Delta platform, you can use SQL queries to visualize data and gain valuable insights without the need for complex programming. Here are some ways to visualize data using SQL queries in Databricks Delta:

Basic SELECT Queries: Retrieves data from your Delta tables. By selecting specific columns or applying filters with WHERE clauses, you can quickly get an overview of the data's characteristics.
Aggregate Functions: SQL provides a variety of aggregate functions like COUNT, SUM, AVG, MIN, and MAX. By using these functions, you can summarize and visualize data at a higher level. You can perform operations such as counting the number of records, calculating average values, or finding the maximum and minimum values.
Grouping and Aggregating Data: The GROUP BY clause in SQL allows you to group data based on specific columns, and then apply aggregate functions to each group. This enables generation of meaningful insights by analyzing data on a category-wise basis.
Window Functions: SQL window functions, like ROW_NUMBER, RANK, and DENSE_RANK, are valuable for partitioning data and calculating rankings or running totals. These functions enable analyzing data in a more granular way and help discover patterns.
Joining Tables: SQL JOIN operations combine data from multiple Delta tables. Joins make it possible to merge related data, perform cross-table analysis, and build more advanced visualizations.
Subqueries and CTEs: SQL subqueries and Common Table Expressions (CTEs) allow you to break down complex problems into manageable parts. These techniques can simplify analysis and make SQL queries more organized and maintainable.
Window Aggregates: SQL window aggregates, such as SUM and AVG used with the OVER clause, enable you to perform calculations on specific windows or ranges of data. This is useful for analyzing trends over time or within specific subsets of your data.
CASE Statements: CASE statements in SQL help you create conditional expressions, allowing you to categorize or group data based on certain conditions. This can aid in creating custom labels or grouping data into different categories for visualization purposes.

The platform's powerful SQL capabilities empower data analysts and developers to extract meaningful insights from their Delta Lake data, all without the need for additional programming languages or tools.

-- aggregate query to get average price based on diamond colors
SELECT color, avg(price) AS avg_price FROM diamonds GROUP BY color ORDER BY color

Tabular View for the Query

This SQL query above is used to retrieve the average price of diamonds based on their colors.

Let's break down the code:

SELECT color, avg(price) AS avg_price specifies the columns that will be selected in the result set. It selects the color column and calculates the average price using the avg() function. The calculated average is aliased as avg_price for easier reference in the result set.

The FROM diamonds clause specifies the table from which data will be retrieved. In this case, the table is named diamonds.

GROUP BY color groups the data by the color column. The result set will contain one row for each unique color, and the average price will be calculated for each group separately.

ORDER BY color arranges the result set in ascending order based on the color column. The output will be sorted alphabetically by color.
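To check the breakdown on something small, the sketch below runs the same query against a handful of made-up rows. It assumes an in-memory SQLite database instead of Delta, and the sample colors and prices are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diamonds (color TEXT, price INTEGER)")

# Made-up sample rows: two 'E' diamonds, one 'I', one 'D'.
conn.executemany(
    "INSERT INTO diamonds VALUES (?, ?)",
    [("E", 300), ("E", 500), ("I", 334), ("D", 400)],
)

# Same query as above: one row per color, sorted alphabetically.
result = conn.execute(
    "SELECT color, avg(price) AS avg_price FROM diamonds GROUP BY color ORDER BY color"
).fetchall()
print(result)
```

The two 'E' rows collapse into a single row with their mean price, and the colors come back in alphabetical order.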

Visualized Results for the Query

Count of Diamonds by Clarity

SELECT clarity, COUNT(*) AS count
FROM diamonds
GROUP BY clarity
ORDER BY count DESC;

This SQL query above calculates the count of diamonds for each clarity level and presents the results in descending order. It selects the clarity column and uses the COUNT() function to count the number of occurrences for each clarity value.

The result set is grouped by clarity and sorted in descending order based on the count of diamonds.
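The same count-and-sort idea can be sketched with Python's collections.Counter. The clarity values below are made up, and the order of groups with equal counts is not something to rely on, just as in SQL.

```python
from collections import Counter

# Made-up clarity values standing in for the clarity column.
clarities = ["SI2", "SI1", "VS1", "SI1", "VS2", "SI1", "VS2"]

# most_common() returns (value, count) pairs sorted by count, highest first,
# which corresponds to GROUP BY clarity ... ORDER BY count DESC.
counts = Counter(clarities).most_common()
print(counts)
```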

Pie Chart visualization based on the above query

Average Price by Depth Range

-- This SQL query calculates the average price of diamonds grouped into depth ranges (60-62 and 62-64), and 'Other' for all other depth values, from the 'diamonds' table. The results are ordered in descending order based on the average price.

SELECT CASE 
         WHEN depth BETWEEN 60 AND 62 THEN '60-62'
         WHEN depth BETWEEN 62 AND 64 THEN '62-64'
         ELSE 'Other'
       END AS depth_range,
       AVG(CAST(price AS DOUBLE)) AS avg_price
FROM diamonds
GROUP BY depth_range
ORDER BY avg_price DESC;

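Here, we are calculating the average price of diamonds grouped into depth ranges. It uses a CASE statement to categorize the diamonds into three depth ranges: '60-62' for depths between 60 and 62, '62-64' for depths between 62 and 64, and 'Other' for all other depth values. Because BETWEEN is inclusive on both ends and CASE returns the first matching branch, a depth of exactly 62 is placed in the '60-62' range.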

The AVG() function is then used to calculate the average price for each depth range. The result set is grouped by the depth_range column and ordered in descending order based on the average price.
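The bucketing rule is easy to get wrong at the boundaries, so here is a small Python sketch that mirrors the CASE expression. The function name depth_range and the sample depths are assumptions made for the example.

```python
def depth_range(depth):
    """Mirror the CASE expression: branches are checked from top to bottom."""
    if 60 <= depth <= 62:      # BETWEEN 60 AND 62 (inclusive on both ends)
        return "60-62"
    if 62 <= depth <= 64:      # BETWEEN 62 AND 64 (inclusive on both ends)
        return "62-64"
    return "Other"

# Sample depths chosen around the bucket boundaries.
labels = {d: depth_range(d) for d in (59.9, 60.0, 62.0, 62.1, 64.0, 64.1)}
print(labels)
```

A depth of 62.0 satisfies both conditions but lands in '60-62' because that branch is evaluated first.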

Average price based on the grouped depth range, achieved using CASE syntax

Price Distribution by Table

--  Calculate the median, first quartile (q1), and third quartile (q3) prices for each unique 'table' in the 'diamonds' table based on the 'price' column. The results are grouped by 'table' and provide valuable statistical insights into the price distribution within each category.

SELECT table, 
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY CAST(price AS DOUBLE)) AS median_price,
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY CAST(price AS DOUBLE)) AS q1_price,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY CAST(price AS DOUBLE)) AS q3_price
FROM diamonds
GROUP BY table;

This SQL query calculates the median, first quartile (q1), and third quartile (q3) prices for each unique table value in the diamonds table. It uses the PERCENTILE_CONT() function to calculate these statistical measures.

The function is applied to the price column, which is cast as a double for accurate calculations. The result set is grouped by the table column, providing insights into the price distribution within each table category.
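PERCENTILE_CONT computes a continuous percentile, interpolating linearly between the two nearest values when the requested position falls between rows. The sketch below shows that calculation in plain Python; the function name and the sample prices are assumptions for illustration, and it does not reproduce the per-table grouping.

```python
import math

def percentile_cont(values, p):
    """Continuous percentile with linear interpolation between neighbors."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)   # fractional, zero-based row position
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Made-up prices for a single 'table' group.
prices = [326.0, 334.0, 400.0, 500.0]

median_price = percentile_cont(prices, 0.5)
q1_price = percentile_cont(prices, 0.25)
q3_price = percentile_cont(prices, 0.75)
print(q1_price, median_price, q3_price)
```

With four prices, the median falls halfway between the second and third values, so the result is a number that does not appear in the data itself.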

Calculating median, Q1, and Q3 figures based on the price

Price Factor by X, Y and Z

-- Calculate the average price of diamonds grouped by their x, y, and z values from the 'diamonds' table. The results are ordered in descending order based on the average price, providing valuable insights into the average price of diamonds with different x, y, and z dimensions.

SELECT x, y, z, AVG(CAST(price AS DOUBLE)) AS avg_price
FROM diamonds
GROUP BY x, y, z
ORDER BY avg_price DESC;

This query will calculate the average price of diamonds grouped by their x, y, and z values from the diamonds table. It selects the columns x, y, z, and uses the AVG() function to calculate the average price for each combination of x, y, and z values.

The result set is then ordered in descending order based on the average price, providing insights into the average price of diamonds with different dimensions.

Visualization showing average price of diamonds grouped by their x, y, and z values from the 'diamonds' table

Add Constraints

-- This SQL code snippet alters the 'diamonds' table by dropping the existing constraint 'id_not_null' if it exists. Then, it adds a new constraint named 'id_not_null' to ensure that the column '_c0' must not contain null values, enforcing data integrity in the table.

ALTER TABLE diamonds DROP CONSTRAINT IF EXISTS id_not_null;
ALTER TABLE diamonds ADD CONSTRAINT id_not_null CHECK (_c0 is not null);

-- This command will fail because it inserts a row with a null _c0:
INSERT INTO diamonds(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) values (null, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');

Note that this statement does not insert anything, because the row violates the id_not_null CHECK constraint. Whenever a constraint is not satisfied, an error is thrown. In this case, this exact error is shown:

Error in SQL statement: DeltaInvariantViolationException: CHECK constraint id_not_null (_c0 IS NOT NULL) violated by row with values:
 - _c0 : null

This SQL code snippet demonstrates the alteration of the diamonds table to enforce data integrity.

The first line of code, ALTER TABLE diamonds DROP CONSTRAINT IF EXISTS id_not_null;, checks if a constraint named id_not_null exists in the diamonds table and drops it if it does. This step ensures that any existing constraint with the same name is removed before adding a new one.

The second line of code, ALTER TABLE diamonds ADD CONSTRAINT id_not_null CHECK (_c0 is not null);, adds a new constraint named id_not_null to the diamonds table. This constraint specifies that the column _c0 must not contain null values. It ensures that whenever data is inserted or updated in this table, the '_c0' column cannot have a null value, maintaining data integrity.

However, the subsequent command, INSERT INTO diamonds(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) VALUES (null, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');, attempts to insert a row into the diamonds table with a null value in the _c0 column.

Since the newly added constraint prohibits null values in this column, the INSERT operation will fail, preserving the data integrity specified by the constraint.
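The same behavior can be reproduced locally as a rough sketch. It assumes an in-memory SQLite database, where the CHECK constraint has to be declared in CREATE TABLE because SQLite does not support adding one with ALTER TABLE; the reduced two-column schema is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declare the constraint up front (SQLite cannot add it later with ALTER TABLE).
conn.execute(
    "CREATE TABLE diamonds ("
    "  _c0 INTEGER, carat REAL,"
    "  CONSTRAINT id_not_null CHECK (_c0 IS NOT NULL))"
)

# A row with a valid _c0 is accepted.
conn.execute("INSERT INTO diamonds (_c0, carat) VALUES (1, 0.22)")

# A row with a null _c0 violates the CHECK constraint and is rejected.
try:
    conn.execute("INSERT INTO diamonds (_c0, carat) VALUES (NULL, 0.22)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("Insert rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM diamonds").fetchone()[0]
```

Only the valid row remains in the table, which is the data integrity guarantee the constraint is meant to provide.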

How to Work with Dataframes

The best part is that you are not just restricted to using SQL to achieve this. Below, the same thing is done by first loading the dataset into diamonds with Python and then using pyspark library functions to do complex queries.

%python
diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

In the Databricks Delta Lake platform, the spark object represents the SparkSession, which is the entry point for interacting with Spark functionality. It provides a programming interface to work with structured and semi-structured data.

The spark.read.csv() function is used to read a CSV file into a DataFrame. In this case, it reads the diamonds.csv file from the specified path. The arguments passed to the function include:

"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv": This is the path to the CSV file. You can replace this with the actual path where your file is located.
header="true": This specifies that the first row of the CSV file contains the column names.
inferSchema="true": This instructs Spark to automatically infer the data types of the columns in the DataFrame.

Once the CSV file is read, it is stored in the diamonds variable as a DataFrame. The DataFrame represents a distributed collection of data organized into named columns. It provides various functions and methods to manipulate and analyze the data.

By reading the CSV file into a DataFrame on the Databricks Delta Lake platform, you can leverage the rich querying and processing capabilities of Spark to perform data analysis, transformations, and other operations on the diamonds data.
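To make the header and inferSchema options more concrete, here is a rough plain-Python sketch of what they do conceptually. It uses the standard csv module on a tiny made-up sample, infers types value by value, and is not how Spark performs schema inference internally.

```python
import csv
import io

# Tiny made-up sample; the first line plays the role of the header row.
SAMPLE_CSV = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n"

def infer(value):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# DictReader uses the first row as column names (similar to header="true").
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [{name: infer(value) for name, value in row.items()} for row in reader]

# Report the inferred type per column, based on the first row.
schema = {name: type(value).__name__ for name, value in rows[0].items()}
print(schema)
```

Without inference every value would stay a string, which is why numeric operations such as averaging prices benefit from inferSchema.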

Manipulating the data and displaying the results

The below example showcases that on the Databricks Delta Lake platform, you are not limited to using only SQL queries. You can also leverage Python and its rich ecosystem of libraries, such as PySpark, to perform complex data manipulations and analyses.

By using Python, you have access to a wide range of functions and methods provided by PySpark's DataFrame API. This allows you to perform various transformations, aggregations, calculations, and sorting operations on your data.

Whether you choose to use SQL or Python, the Databricks Delta Lake platform provides a flexible environment for data processing and analysis, enabling you to unlock valuable insights from your data.

%python
from pyspark.sql.functions import avg

display(diamonds.select("color","price").groupBy("color").agg(avg("price")).sort("color"))

Firstly, the from pyspark.sql.functions import avg statement imports the avg function from the pyspark.sql.functions module. This function is used to calculate the average value of a column.

Next, the diamonds.select("color", "price").groupBy("color").agg(avg("price")).sort("color") expression performs the following operations:

diamonds.select("color", "price") selects only the color and price columns from the diamonds DataFrame.

groupBy("color") groups the data based on the color column.

agg(avg("price")) calculates the average price for each group (color). The avg("price") argument specifies that we want to calculate the average of the "price" column.

sort("color") sorts the resulting DataFrame in ascending order based on the color column.

Finally, the display() function is used to visualize the resulting DataFrame in a tabular format.

Version Control and Time Travel in Delta

Databricks Delta’s time travel capabilities simplify building data pipelines. They come in handy when auditing data changes, reproducing experiments and reports, or rolling back transactions. They are also useful for disaster recovery, letting us undo changes and shift back to any specific version of a table.

As you write into a Delta table or directory, every operation is automatically versioned. You can then query a table by referring to a timestamp or a version number.

The command below returns a list of all the versions and timestamps in a table called diamonds:

DESCRIBE HISTORY diamonds;

DESCRIBE HISTORY table_name returns a list of all the versions of the table along with their timestamps and operations. It also shows which user performed each operation.

Restore Setup

Delta provides built-in support for backup and restore strategies to handle issues like data corruption or accidental data loss. In our scenario, we'll intentionally delete some rows from the main table to simulate such situations.

We'll then use Delta's restore capability to revert the table to a point in time before the delete operation. By doing so, we can verify that the deletion took effect and that the data was then restored correctly to its previous state. This feature ensures data safety and provides an easy way to recover from undesirable changes or failures.

Here's the code:

-- Delete 10 records from the main table
DELETE FROM diamonds where `_c0` in (1,2,3,4,5,6,7,8,9,10);
SELECT COUNT(*) from diamonds;

Row count after deleting 10 records from the main table

SELECT COUNT(*) FROM diamonds VERSION AS OF 19;

Row count by referencing a previous version of the table

Restoring From A Version Number

Illustration of how a Version Restore works in Databricks Notebooks

The code below restores the diamonds table to version 19, using Delta's table versioning. After the restoration, a SELECT statement retrieves all data from the diamonds table, which now matches its state at version 19.

Unlike a VERSION AS OF query, which only reads a historical snapshot, RESTORE changes the current state of the table, so the rows deleted earlier are available again.

-- restore the state of the diamonds table to that of version 19 (refer to the table history shown earlier)

RESTORE TABLE diamonds TO VERSION AS OF 19;
SELECT * from diamonds;

SELECT query running against a restored version of the database
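The relationship between versions, time travel reads, and restores can be sketched with a toy Python class. This is a hypothetical model that keeps a full snapshot per version; Delta itself relies on a transaction log, and the class and method names here are invented for the example.

```python
class VersionedTable:
    """Toy table that records one snapshot per operation (hypothetical model)."""

    def __init__(self, rows):
        self.history = [("CREATE", list(rows))]   # version 0

    def current(self):
        return list(self.history[-1][1])

    def as_of(self, version):
        # Read-only look at an older snapshot, like VERSION AS OF.
        return list(self.history[version][1])

    def delete_where(self, predicate):
        kept = [row for row in self.current() if not predicate(row)]
        self.history.append(("DELETE", kept))

    def restore(self, version):
        # Restoring adds a new version whose contents equal the old snapshot.
        self.history.append(("RESTORE", self.as_of(version)))

table = VersionedTable([{"_c0": i} for i in range(1, 21)])
table.delete_where(lambda row: row["_c0"] <= 10)

count_after_delete = len(table.current())
count_as_of_0 = len(table.as_of(0))

table.restore(0)
count_after_restore = len(table.current())
operations = [op for op, _ in table.history]
print(operations)
```

Note that the restore does not erase history: the delete stays in the log, and the restore appears as one more entry after it.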

Autogenerated Fields

Let us see how to use auto-increment in Delta with SQL. The code below creates a table called test__autogen with an autogenerated field named id. The id column is defined as BIGINT GENERATED ALWAYS AS IDENTITY, meaning its values are generated by the engine at insert time.

The id column works as an auto-incrementing surrogate key: each new record receives a unique identifier without any manual input. Here the values are configured to start at 10000 with an increment of 1.

This pattern is commonly used for primary keys, and it saves developers from having to generate unique identifiers manually.

%sql 
CREATE TABLE IF NOT EXISTS test__autogen (
  id BIGINT GENERATED ALWAYS AS IDENTITY ( START WITH 10000 INCREMENT BY 1 ), 
  name STRING, 
  surname STRING, 
  email STRING, 
  city STRING) ;

-- Note that we don't insert data for the id. The engine will handle that for us:
INSERT INTO test__autogen (name, surname, email, city) VALUES ('Atharva', 'Shah', 'highnessatharva@gmail.com', 'Pune, IN');
INSERT INTO test__autogen (name, surname, email, city) VALUES ('James', 'Dean', 'james@proton.mail', 'Tokyo, JP');

-- The ID is automatically generated!
SELECT * from test__autogen;

Records with an autogenerated id
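Conceptually, an identity column behaves like a counter owned by the table. The sketch below models that idea in plain Python; the AutoIdTable class is hypothetical and only imitates the START WITH 10000 INCREMENT BY 1 configuration.

```python
import itertools

class AutoIdTable:
    """Toy table whose id values come from an internal counter (hypothetical)."""

    def __init__(self, start=10000, step=1):
        self._ids = itertools.count(start, step)
        self.rows = []

    def insert(self, **columns):
        # The caller never supplies an id; the table assigns the next one.
        row = {"id": next(self._ids), **columns}
        self.rows.append(row)
        return row

table = AutoIdTable()
table.insert(name="Atharva", city="Pune, IN")
table.insert(name="James", city="Tokyo, JP")

ids = [row["id"] for row in table.rows]
print(ids)
```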

Delta Table Cloning

Cloning Delta tables allows you to create a replica of an existing Delta table at a specific version. This feature is particularly valuable when you need to transfer data from a production environment to a staging environment or when archiving a specific version for regulatory purposes.

There are two types of clones available:

Deep Clone: This type of clone copies both the source table data and metadata to the clone target. In other words, it replicates the entire table, making it independent of the source.
Shallow Clone: A shallow clone only replicates the table metadata without copying the actual data files to the clone target. As a result, these clones are more cost-effective to create. However, it's crucial to note that shallow clones act as pointers to the main table. If a VACUUM operation is performed on the original table, it may delete the underlying files and potentially impact the shallow clone.

It's important to remember that any modifications made to either deep or shallow clones only affect the clones themselves and not the source table.

Cloning Delta tables is a powerful feature that simplifies data replication and version archiving, enhancing data management capabilities within your Delta Lake environment.
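The difference between the two clone types, and the VACUUM caveat, can be illustrated with a toy model in Python. Everything here is a simplification: the storage dictionary, file names, and helper functions are assumptions, and a real Delta table tracks its files through a transaction log.

```python
# Toy object store: file name -> rows held in that data file.
storage = {"part-0.parquet": [1, 2, 3], "part-1.parquet": [4, 5]}

# A "table" is just metadata listing which files belong to it.
source = {"files": ["part-0.parquet", "part-1.parquet"]}

def shallow_clone(table):
    # Copy metadata only; the clone points at the source's data files.
    return {"files": list(table["files"])}

def deep_clone(table, prefix):
    # Copy metadata and data; the clone gets its own files.
    new_files = []
    for name in table["files"]:
        copy_name = f"{prefix}/{name}"
        storage[copy_name] = list(storage[name])
        new_files.append(copy_name)
    return {"files": new_files}

def readable(table):
    return all(name in storage for name in table["files"])

shallow = shallow_clone(source)
deep = deep_clone(source, "deep_clone")

# Simulate a VACUUM on the source removing one of its old data files.
del storage["part-0.parquet"]

shallow_ok = readable(shallow)
deep_ok = readable(deep)
print(shallow_ok, deep_ok)
```

The shallow clone breaks because it still points at the removed file, while the deep clone keeps working from its own copies.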

Difference between a Shallow Clone and a Deep Clone

The code below shows how to clone a table using shallow and deep clones:

-- Shallow clone (zero copy)
CREATE TABLE IF NOT EXISTS diamonds__shallow__clone
  SHALLOW CLONE diamonds
  VERSION AS OF 19;

SELECT * FROM diamonds__shallow__clone;

-- Deep clone (copy data)
CREATE TABLE IF NOT EXISTS diamonds__deep__clone
  DEEP CLONE diamonds;

SELECT * FROM diamonds__deep__clone;

Selecting records from the deep cloned table

Delta Magic Commands

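Databricks notebooks offer magic commands: short commands prefixed with % that switch a cell's language or run utility tasks. They are notebook features rather than Delta-specific ones, but they are handy when managing Delta tables, for example when running a %sql cell inside a Python notebook.

Several of the commands listed below (such as %lsmagic, %who, and %env) come from IPython, so their availability can depend on the Databricks Runtime version you use.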

%run: runs a Python file or a notebook.
%sh: executes shell commands on the cluster nodes.
%fs: allows you to interact with the Databricks file system.
%sql: allows you to run SQL queries.
%scala: switches the notebook context to Scala.
%python: switches the notebook context to Python.
%md: allows you to write markdown text.
%r: switches the notebook context to R.
%lsmagic: lists all the available magic commands.
%jobs: lists all the running jobs.
%config: allows you to set configuration options for the notebook.
%reload: reloads the contents of a module.
%pip: allows you to install Python packages.
%load: loads the contents of a file into a cell.
%matplotlib: sets up the matplotlib backend.
%who: lists all the variables in the current scope.
%env: allows you to set environment variables.

Conclusion

This in-depth handbook explored the power of Databricks, a platform that unifies analytics and data science in a single workspace. We went through Databricks Workspace, interactive analytics, and Delta Lake, emphasizing its data manipulation and analysis capabilities.

Delta, a data integrity and agility engine, supports SQL commands as well as sophisticated queries. Data frames are used to shape and display data to improve insights. Retrospection and accuracy are enabled through version control and time travel. Delta's table cloning lets you replicate a table at a given version, for example to test changes in staging or to archive a snapshot.


FastAPI Handbook – How to Develop, Test, and Deploy APIs

Atharva Shah — Tue, 25 Jul 2023 20:54:10 +0000

Welcome to the world of FastAPI, a sleek and high-performance web framework for constructing Python APIs. Don't worry if you're new to API programming – we'll start at the beginning.

An API (Application Programming Interface) connects software programs, allowing them to communicate and exchange information. APIs are essential in modern software development because they expose an application's backend functionality to clients and other services.

After reading this quick start guide, you will be able to develop a course administration API using FastAPI and MongoDB. The best part is that you will not only be writing APIs but also testing and containerizing the app.

In this walkthrough project, we'll create a Python backend system using FastAPI, a fast web framework, and a MongoDB database for course information storage and retrieval.

The system will allow users to access course details, view chapters, rate individual chapters, and aggregate ratings.

The project is designed for Python developers with basic programming knowledge and some NoSQL knowledge. Familiarity with MongoDB, Docker, and PyTest is not required since I will be highlighting everything you need to know for the scope of this project.

What We'll Build

Here's what we are going to be building:

FastAPI Backend: It will serve as the interface for handling API requests and responses. FastAPI is chosen for its ease of use, performance, and intuitive design.

MongoDB Database: A NoSQL database to store course information. MongoDB's flexible schema allows us to store data in JSON-like documents, making it suitable for this project.

Course Information: Users will be able to view various course details, such as course name, description, instructor, etc.

Chapter Details: The system will provide information about the chapters in a course, including chapter names, descriptions, and any other relevant data.

Chapter Rating: Users will have the ability to rate individual chapters. We will implement functionality to record and retrieve chapter ratings.

Course Aggregated Rating: The system will calculate and display the aggregated rating for each course based on the ratings of its chapters.

This walkthrough shows how to set up a development environment, build a FastAPI backend, integrate MongoDB, define API endpoints, add chapter rating functionality, and compute aggregate course ratings. It covers fundamental project concepts as well as Python, MongoDB, and NoSQL databases.

By the end, this useful backend system will manage chapter details, course information, and user ratings, serving as the basis for a complex and rewarding project.

The goal is to create a system that processes course-related queries. The course information must then be retrieved from MongoDB depending on the request. Lastly, the response data must be returned in a standard format (JSON).

We'll begin with a script that reads the course information from courses.json. This data will be stored in the MongoDB instance. Once the data has been loaded, our API code may connect to this database to allow for simple data retrieval.

The interesting aspect is creating several endpoints with FastAPI. Our API will be able to:

Fetch a list of all courses
Show a comprehensive course overview
List detailed information about certain chapters
Record user scores for each chapter.

Additionally, for each course, we will aggregate all reviews, providing visitors with relevant information regarding course popularity and quality.

This tutorial focuses on building a scalable, efficient, and user-friendly API. Once we've tested everything, we'll containerize the application using Docker. This will greatly simplify deployment, maintenance, and installation.

Here are the sections of this tutorial:

API Methods
Client and Server
How to Set Up the MongoDB Database
How to Parse and Insert Course Data into MongoDB
How to Design the FastAPI Endpoints
Automated API Endpoint Testing with PyTest
How to Containerize the Application with Docker
Conclusion

API Methods

HTTP (Hypertext Transfer Protocol) methods specify the action to be taken on a resource. The following are the methods most often used in API development:

GET: Requests information from a server. When a client submits a GET request, it is requesting data from the server.

POST: Sends data to the server for processing. When a client submits a POST request, it is often delivering data to the server to create or update a resource.

PUT: Updates server data. When a client submits a PUT request, the resource indicated in the request is updated.

DELETE: A client sending a DELETE request is asking for the removal of the specified resource.

Client and Server

The client is often a front-end application that sends requests to the server, such as a web browser or a mobile app. The server, on the other hand, is the back-end application in charge of processing client requests and responding appropriately.

A request is a communication delivered by the client to the server that specifies the intended action and any required data. The HTTP method, URL (Uniform Resource Locator), headers, and, in the case of POST or PUT requests, the data payload are all part of a request.

After the server gets the request, it processes it and returns a response. The response is the message given back to the client by the server that contains the requested data or the outcome of the activity.

A response generally comprises an HTTP status code indicating the success or failure of the request, as well as any data sent back to the client by the server.
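The request and response cycle, together with the four HTTP methods, can be sketched as a single Python function. This is a simplified in-memory model, not FastAPI code: the handle function, the /courses/1 path, and the choice of status codes are assumptions made for illustration.

```python
from http import HTTPStatus

courses = {}  # in-memory stand-in for the server's data store

def handle(method, path, payload=None):
    """Return a (status, body) pair for a request, like a tiny server would."""
    course_id = path.rsplit("/", 1)[-1]
    if method == "GET":
        if course_id in courses:
            return HTTPStatus.OK, courses[course_id]
        return HTTPStatus.NOT_FOUND, None
    if method == "POST":
        courses[course_id] = payload           # create the resource
        return HTTPStatus.CREATED, payload
    if method == "PUT":
        if course_id not in courses:
            return HTTPStatus.NOT_FOUND, None
        courses[course_id] = payload           # replace the resource
        return HTTPStatus.OK, payload
    if method == "DELETE":
        if courses.pop(course_id, None) is None:
            return HTTPStatus.NOT_FOUND, None
        return HTTPStatus.NO_CONTENT, None
    return HTTPStatus.METHOD_NOT_ALLOWED, None

created = handle("POST", "/courses/1", {"name": "Intro"})
fetched = handle("GET", "/courses/1")
updated = handle("PUT", "/courses/1", {"name": "Intro v2"})
deleted = handle("DELETE", "/courses/1")
missing = handle("GET", "/courses/1")
print(created[0], fetched[0], updated[0], deleted[0], missing[0])
```

Each call plays the role of one request, and the returned status code tells the client whether the action succeeded.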

Diagram showing how APIs work

How to Set Up the MongoDB Database

MongoDB is a type of NoSQL database. It is non-relational and saves information as collections and documents.

Install MongoDB for your operating system from the official website.

Now run the mongosh command in your terminal to verify that the installation was successful.

Running the mongosh command should yield this output

Connect to the MongoDB server with MongoDB Compass. I recommend that you set up MongoDB by specifying settings such as port number, storage engine, authentication, and so forth.

Create a new MongoDB connection

Now that the connection is established, the next step is to create a database. Call this database "courses". It will be empty for now. In just a minute we'll insert the documents using a Python script.

How to Parse and Insert Course Data into MongoDB

You could insert records one by one, but it is best to use a JSON file to simplify that process. Download this file courses.json from GitHub. All course information is present in it (as a list of courses).

Specifically, each course has the following structure:

name: The title of the course.
date: Creation date as a UNIX timestamp.
description: The description of the course.
domain: List of the course domain(s).
chapters: List of the course chapters. Each chapter has a title (name) and content (text).
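As a rough illustration of that structure, a single course could look like the Python dictionary below. The name, date, and domain are taken from the sample response shown later in this tutorial. The description is abbreviated and the chapter entry is a placeholder, so treat those values as assumptions.

```python
from datetime import datetime, timezone

# Illustrative course document; the chapter entry is a placeholder.
course = {
    "name": "Introduction to Programming",
    "date": 1659906000,  # UNIX timestamp (seconds since 1970-01-01 UTC)
    "description": "An introduction to programming using Python.",
    "domain": ["programming"],
    "chapters": [
        {"name": "Placeholder chapter title", "text": "Placeholder chapter content"},
    ],
}

# Convert the UNIX timestamp into a readable, timezone-aware date.
created = datetime.fromtimestamp(course["date"], tz=timezone.utc)
print(created.isoformat())  # 2022-08-07T21:00:00+00:00
```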

You will need a few Python packages for this project.

BSON - Binary serialization format that is used in MongoDB for efficient data storage and retrieval. It comes bundled with PyMongo.
FastAPI - Web framework for creating Python APIs that offer high performance, automatic validation, interactive documentation, and support for async operations.
PyMongo - Official MongoDB driver for Python. It serves as a high-level API for integrating MongoDB within Python.
Uvicorn - ASGI server that runs the FastAPI application. It is responsible for server startup and for serving incoming requests.
Starlette - ASGI framework that powers FastAPI and allows rapid prototyping development.
Pydantic - Integrated data validation and parsing library. We need it to create interactive API documentation while automatically validating incoming request data and enforcing data type rules.

Get them installed via the pip commands like so:

pip install fastapi pymongo uvicorn starlette pydantic

Now, let's write a Python script to insert all this course data into the database so that we can start building API routes. Spin up your IDE, create a file called script.py, and make sure it is in the same directory as the courses.json file.

""" 
Script to parse course information from courses.json, create the appropriate databases and
collection(s) on a local instance of MongoDB, create the appropriate indices (for efficient retrieval)
and finally add the course data on the collection(s).
"""

import pymongo
import json

# Connect to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["courses"]
collection = db["courses"]

# Read courses from courses.json
with open("courses.json", "r") as f:
    courses = json.load(f)

# Create index for efficient retrieval
collection.create_index("name")

# add rating field to each course
for course in courses:
    course['rating'] = {'total': 0, 'count': 0}

# add rating field to each chapter
for course in courses:
    for chapter in course['chapters']:
        chapter['rating'] = {'total': 0, 'count': 0}

# Add courses to collection
for course in courses:
    collection.insert_one(course)

# Close MongoDB connection
client.close()

This script populates a MongoDB database with the course information from the JSON file.

It begins by connecting to the local MongoDB instance and reading the course data from a file called courses.json. It then creates an index on the name field to speed up data retrieval and adds a rating field to each course and to each chapter. Lastly, the course data is inserted into the MongoDB collection and the connection is closed.
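The two rating loops in the script can also be folded into a single pass. The sketch below shows one way to do that on plain Python data. The helper name add_rating_fields and the sample course are made up for illustration and are not part of the original script.

```python
def add_rating_fields(courses):
    """Give every course and every chapter an empty rating, in one pass."""
    for course in courses:
        course["rating"] = {"total": 0, "count": 0}
        for chapter in course.get("chapters", []):
            chapter["rating"] = {"total": 0, "count": 0}
    return courses


# Placeholder data standing in for the contents of courses.json.
sample = [{"name": "Demo course", "chapters": [{"name": "Intro", "text": "..."}]}]
add_rating_fields(sample)
print(sample[0]["rating"])                 # {'total': 0, 'count': 0}
print(sample[0]["chapters"][0]["rating"])  # {'total': 0, 'count': 0}
```

With the fields prepared like this, PyMongo's insert_many method can insert the whole list in one call instead of calling insert_one once per course.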

It's a straightforward script for managing course data in a database. After you run the script, all records from courses.json should be in the courses database. Switch to MongoDB Compass to verify it.

You should be able to see the JSON items in your courses database after running the python script

How to Design the FastAPI Endpoints

These API endpoints provide an efficient way to manage course information, retrieve course details, and allow user interactions for rating chapters.

I recommend designing the API endpoints first along with the HTTP request type before writing the code. This acts as a good reference and provides clarity during the coding process.

Endpoint	Request Type	Description
/courses	GET	Get a list of all available courses with sorting options. Options: sort by title (ascending), date (descending), or total course rating (descending). Optional filtering based on domain is supported.
/courses/{course_id}	GET	Get the overview of a specific course identified by course_id.
/courses/{course_id}/{chapter_id}	GET	Get information about a specific chapter within a course.
/courses/{course_id}/{chapter_id}	POST	Rate a specific chapter within a course. Options: positive rating (1), negative rating (-1). The ratings are aggregated for each course.

Okay, time to dive into the API code. Create a brand new Python file and call it main.py:

import contextlib
from fastapi import FastAPI, HTTPException, Query
from pymongo import MongoClient
from bson import ObjectId
from fastapi.encoders import jsonable_encoder

app = FastAPI()
client = MongoClient('mongodb://localhost:27017/')
db = client['courses']

The code imports essential modules and creates an instance of the FastAPI class named app. It also establishes a connection to the local MongoDB server using the PyMongo library, and the db variable now stores a reference to the courses database.

Let's go over each of these endpoints in more detail now.

The Get All Courses Endpoint (`/courses` – GET)

This endpoint allows you to retrieve a list of all available courses. You can sort the courses based on different criteria, such as alphabetical order (based on the course title in ascending order), date (in descending order), or total course rating (in descending order). Also, we'll allow users to filter the courses based on their domain.

@app.get('/courses')
def get_courses(sort_by: str = 'date', domain: str = None):
    # set the rating.total and rating.count to all the courses based on the sum of the chapters rating
    for course in db.courses.find():
        total = 0
        count = 0
        for chapter in course['chapters']:
            with contextlib.suppress(KeyError):
                total += chapter['rating']['total']
                count += chapter['rating']['count']
        db.courses.update_one({'_id': course['_id']}, {'$set': {'rating': {'total': total, 'count': count}}})


    # sort_by == 'date' [DESCENDING]
    if sort_by == 'date':
        sort_field = 'date'
        sort_order = -1

    # sort_by == 'rating' [DESCENDING]
    elif sort_by == 'rating':
        sort_field = 'rating.total'
        sort_order = -1

    # sort_by == 'alphabetical' [ASCENDING]
    else:  
        sort_field = 'name'
        sort_order = 1

    query = {}
    if domain:
        query['domain'] = domain


    courses = db.courses.find(query, {'name': 1, 'date': 1, 'description': 1, 'domain':1,'rating':1,'_id': 0}).sort(sort_field, sort_order)
    return list(courses)

This code defines an endpoint in the FastAPI application to retrieve a list of all available courses. The endpoint can be accessed using an HTTP GET request to the '/courses' URL.

The @app.get() decorator is attached to the get_courses function and it takes care of this.

When a request is made to this endpoint, the code first calculates the total course rating by summing up the ratings of all the chapters in each course. It then updates the rating field of each course in the MongoDB database with the computed total and count of ratings.
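The aggregation step can be sketched on plain dictionaries, without a database. The function name aggregate_course_rating and the sample chapters are assumptions for illustration. Unlike the endpoint, this version skips a chapter entirely when it has no rating field.

```python
def aggregate_course_rating(chapters):
    """Sum the chapter ratings into a single course-level rating."""
    total = 0
    count = 0
    for chapter in chapters:
        rating = chapter.get("rating")
        if rating is None:
            continue  # chapters without a rating contribute nothing
        total += rating["total"]
        count += rating["count"]
    return {"total": total, "count": count}


# Placeholder chapters: two rated, one never rated.
chapters = [
    {"name": "A", "rating": {"total": 3, "count": 5}},
    {"name": "B", "rating": {"total": -1, "count": 1}},
    {"name": "C"},
]
print(aggregate_course_rating(chapters))  # {'total': 2, 'count': 6}
```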

Next, the code determines the sorting mode based on the sort_by query parameter. If sort_by is set to date, the courses will be sorted by their creation date in descending order. If it is set to rating, the courses will be sorted by their total rating in descending order. Otherwise, the courses will be sorted alphabetically by their names in ascending order.
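The same if/elif/else decision can be written as a lookup table, which keeps the sort modes in one place. This is only a sketch: SORT_MODES and resolve_sort are hypothetical names, and the endpoint above does not use them.

```python
# (field, direction) pairs; -1 is descending and 1 is ascending, as in PyMongo.
SORT_MODES = {
    "date": ("date", -1),
    "rating": ("rating.total", -1),
    "alphabetical": ("name", 1),
}


def resolve_sort(sort_by):
    """Return the sort field and direction, falling back to alphabetical."""
    return SORT_MODES.get(sort_by, SORT_MODES["alphabetical"])


print(resolve_sort("rating"))   # ('rating.total', -1)
print(resolve_sort("unknown"))  # ('name', 1)
```

As in the endpoint, any unrecognized sort_by value falls back to alphabetical order.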

If the optional domain query parameter is provided, the code will filter the courses based on the specified domain.
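Because domain is stored as a list, the MongoDB query {'domain': domain} matches any course whose list contains that value, not only courses where it is the first entry. The in-memory sketch below mimics that behavior. The helper filter_by_domain and the sample courses are made up for illustration.

```python
def filter_by_domain(courses, domain=None):
    """Keep courses whose domain list contains the requested domain."""
    if not domain:
        return list(courses)  # no filter requested
    return [c for c in courses if domain in c.get("domain", [])]


# Placeholder courses; one of them belongs to two domains.
courses = [
    {"name": "Calculus", "domain": ["mathematics"]},
    {"name": "Statistics for Programmers", "domain": ["programming", "mathematics"]},
    {"name": "Intro to Python", "domain": ["programming"]},
]
matches = filter_by_domain(courses, "mathematics")
print([c["name"] for c in matches])  # ['Calculus', 'Statistics for Programmers']
```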

Finally, the code queries the MongoDB database to retrieve the relevant course information, including the course name, creation date, description, domain, and rating. The courses are sorted according to the selected sorting mode and returned as a list.

That was the code explanation, but what about the actual API response? Run the command below in your terminal from the current working directory:

uvicorn main:app --reload

Uvicorn is an ASGI webserver. You can interact with API endpoints right on your local machine without any external server. On running the above command you should see a success message stating that the server has started.

Fire up your browser and enter http://127.0.0.1:8000/courses in the URL bar. The output that you will see will be the JSON response directly from the server.

Verify that the first object contains the following (your rating values may differ, depending on how many ratings have been submitted):

{
"name": "Introduction to Programming",
"date": 1659906000,
"description": "An introduction to programming using a language called Python. Learn how to read and write code as well as how to test and \"debug\" it. Designed for students with or without prior programming experience who'd like to learn Python specifically. Learn about functions, arguments, and return values (oh my!); variables and types; conditionals and Boolean expressions; and loops. Learn how to handle exceptions, find and fix bugs, and write unit tests; use third-party libraries; validate and extract data with regular expressions; model real-world entities with classes, objects, methods, and properties; and read and write files. Hands-on opportunities for lots of practice. Exercises inspired by real-world programming problems. No software required except for a web browser, or you can write code on your own PC or Mac.",
"domain": [
    "programming"
    ],
"rating": {
    "total": 6,
    "count": 12
    }
}

Guess what? It is a list of all the courses that we stored in our database. Your front-end application may now iterate over all these items and present them in a fancy way to the user. That is the power of APIs.

The rating for the entire course is updated as the aggregated sum of its chapters' ratings.

At this point, if you wish to see the documentation for your API, navigate to the http://127.0.0.1:8000/docs endpoint. This interactive documentation comes prepackaged with FastAPI. How cool is that?

FastAPI docs for all your API endpoints

Don't like the plain old look of the docs? Fret not, there is also a /redoc endpoint with a slightly fancier interface. Just navigate to http://127.0.0.1:8000/redoc and you will be greeted with this screen.

FastAPI alternate redoc interface with search and download options

The Get Course Overview Endpoint (`/courses/{course_id}` – GET)

You'll use this endpoint to get an overview of a specific course. Simply provide the course_id in the URL, and the API will return detailed information about that particular course.

@app.get('/courses/{course_id}')
def get_course(course_id: str):
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, 'chapters': 0})
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    try:
        course['rating'] = course['rating']['total']
    except KeyError:
        course['rating'] = 'Not rated yet' 

    return course

This code snippet searches the MongoDB database for the course with the specified course_id and extracts the course information while leaving out the chapters field.

If it cannot find the course, it raises an HTTPException with the status code 404. If it finds it, it tries to access the rating field and replaces it with its 'total' value to display the total rating. If the course has no rating field, the rating is set to 'Not rated yet'.

Finally, it returns the course information as a JSON response, including the total rating but without the chapters field.
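One detail worth knowing: ObjectId(course_id) raises an error when the string is not a valid ID, and the endpoint above does not catch it, so a malformed course_id would most likely produce a server error rather than a clean 404. A string ID is valid when it consists of 24 hexadecimal characters. The sketch below checks that with the standard library only. The helper name looks_like_object_id is an assumption, and in a real project you could rely on PyMongo's own ObjectId.is_valid check instead.

```python
import re

# A string ObjectId is 24 hexadecimal characters.
_OBJECT_ID_PATTERN = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_object_id(value):
    """Return True if value has the shape of a string ObjectId."""
    return isinstance(value, str) and _OBJECT_ID_PATTERN.fullmatch(value) is not None


print(looks_like_object_id("6431137ab5da949e5978a281"))  # True
print(looks_like_object_id("not-an-id"))                 # False
```

Running a check like this before the database lookup would let the endpoint answer with a 404 (or a 400) for malformed IDs.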

Single Course Overview Endpoint Response

Get Specific Chapter Information Endpoint (`/courses/{course_id}/{chapter_id}` – GET)

Hitting this endpoint returns specific information about a chapter within a course. By specifying both the course_id and the chapter_id in the URL, you can access the details of that particular chapter.

@app.get('/courses/{course_id}/{chapter_id}')
def get_chapter(course_id: str, chapter_id: str):    
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, })
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    chapters = course.get('chapters', [])
    try:
        chapter = chapters[int(chapter_id)]
    except (ValueError, IndexError) as e:
        raise HTTPException(status_code=404, detail='Chapter not found') from e
    return chapter

As you might expect, course_id identifies the course, and chapter_id identifies the chapter inside that course.

When a request is made to this endpoint, the code first searches the MongoDB database for the course with the specified course_id, excluding the _id field from the result.

If the course with the supplied course_id cannot be found in the database, the code raises an HTTPException with the status code 404, indicating that the course could not be located.

The code then uses the dictionary's get() method to retrieve the list of chapters for the course, falling back to an empty list if the 'chapters' field does not exist.

Using the chapter_id provided in the request, the code then attempts to retrieve the matching chapter from the list of chapters. If the chapter_id is not a valid integer or is out of range for the list of chapters, the code raises an HTTPException with the status code 404. This indicates that it could not locate the chapter.

If it locates the chapter, the response contains information on the individual chapter within the course.
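The lookup logic can be sketched on a plain list. One subtlety is that Python lists accept negative indices, so with the endpoint as written a chapter_id of "-1" would return the last chapter instead of a 404. The helper below, whose name find_chapter is made up for illustration, rejects negative values as well.

```python
def find_chapter(chapters, chapter_id):
    """Return the chapter at position chapter_id, or None if it does not exist."""
    try:
        index = int(chapter_id)
    except ValueError:
        return None  # chapter_id is not a number
    if index < 0 or index >= len(chapters):
        return None  # out of range, including negative indices such as "-1"
    return chapters[index]


# Placeholder chapters for illustration.
chapters = [{"name": "First"}, {"name": "Second"}]
print(find_chapter(chapters, "1"))    # {'name': 'Second'}
print(find_chapter(chapters, "990"))  # None
print(find_chapter(chapters, "-1"))   # None
```

An endpoint using a helper like this would raise the 404 HTTPException whenever it returns None.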

Chapter Detail Endpoint

Rate Chapter Endpoint (`/courses/{course_id}/{chapter_id}` – POST)

This endpoint allows users to rate individual chapters within a course. You can provide a rating of 1 for a positive review or -1 for a negative review. The API aggregates all the ratings for each course, providing valuable feedback for future improvements.

Up until now, we've mostly seen GET requests. But now let's see how you can send data to the server, validate it, and insert it in the application database.

@app.post('/courses/{course_id}/{chapter_id}')
def rate_chapter(course_id: str, chapter_id: str, rating: int = Query(..., gt=-2, lt=2)):
    course = db.courses.find_one({'_id': ObjectId(course_id)}, {'_id': 0, })
    if not course:
        raise HTTPException(status_code=404, detail='Course not found')
    chapters = course.get('chapters', [])
    try:
        chapter = chapters[int(chapter_id)]
    except (ValueError, IndexError) as e:
        raise HTTPException(status_code=404, detail='Chapter not found') from e
    try:
        chapter['rating']['total'] += rating
        chapter['rating']['count'] += 1
    except KeyError:
        chapter['rating'] = {'total': rating, 'count': 1}
    db.courses.update_one({'_id': ObjectId(course_id)}, {'$set': {'chapters': chapters}})
    return chapter

We have put in place an endpoint for users to rate each chapter within a course using an HTTP POST request to the /courses/{course_id}/{chapter_id} URL. Users can provide a rating value of 1 for a positive rating or -1 for a negative rating. The code queries the MongoDB database to find the course with the specified course_id, excluding the _id field.

If it doesn't find the course, it raises an HTTP exception with a status code of 404. The code retrieves the list of chapters, setting the default value to an empty list.

If the chapter_id is not a valid integer or is out of range, it raises an HTTPException with a status code of 404. If the chapter is found, the code updates its rating by incrementing the total rating value with the provided rating and incrementing the count value.

If the chapter does not have an existing rating field, it creates one and initializes it with the provided rating and a count of 1. The updated chapter list is then written back to the database, and the updated chapter is returned as the response, providing feedback to the user about their rating for that chapter.
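Note that Query(..., gt=-2, lt=2) accepts -1, 0, and 1, so a rating of 0 passes validation even though only 1 and -1 are meaningful. The sketch below shows the update logic on a plain dictionary with a stricter check. The function name apply_rating and the sample chapter are assumptions for illustration.

```python
def apply_rating(chapter, rating):
    """Add a +1 or -1 rating to a chapter, creating the rating field if needed."""
    if rating not in (-1, 1):
        raise ValueError("rating must be 1 or -1")
    current = chapter.setdefault("rating", {"total": 0, "count": 0})
    current["total"] += rating
    current["count"] += 1
    return chapter


# Placeholder chapter that has never been rated.
chapter = {"name": "Demo chapter"}
apply_rating(chapter, 1)
apply_rating(chapter, -1)
apply_rating(chapter, 1)
print(chapter["rating"])  # {'total': 1, 'count': 3}
```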

POST Request to add a rating to a chapter

To make a POST request, open the docs and click on the request highlighted in the above image. Then, click on "Try it out", fill in the post data, and press the Execute button right below. This sends the POST data to the server which is then validated.

If all the submitted data is as expected, the server accepts and shows the 200 status code meaning that the operation was successful. The submitted data is now in the MongoDB document.

Post Request Success

That's a wrap on the API development part.

Automated API Endpoint Testing with PyTest

As the complexity of modern web applications increases, so does the number of API endpoints and their interactions.

In a dynamic e-commerce web app, there could be hundreds of endpoints, each supporting multiple HTTP request methods. And these endpoints might be intricately interconnected.

Ensuring the proper functioning of all these endpoints after each development iteration becomes a formidable task for developers and QA teams. Here is where automated testing comes to the rescue.

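Install pytest first if you have not already (pip install pytest). Recent FastAPI versions also need the httpx package for the TestClient. Then create a file test_app.py in the same directory as courses.json and main.py: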

from fastapi.testclient import TestClient
from pymongo import MongoClient
from bson import ObjectId
import pytest
from main import app

client = TestClient(app)
mongo_client = MongoClient('mongodb://localhost:27017/')
db = mongo_client['courses']

That sets up an automated testing environment.

FastAPI Test Client simulates HTTP requests to the web app. With this, you can pretend to be a user, sending requests to your app and getting responses back, just like a real user would.

The MongoDB connection gives the tests direct access to the stored course data, so you can compare API responses against what is in the database.

Ideally, tests run against a separate test database so that they cannot affect the actual course documents. Note that the setup above connects to the same courses database that main.py uses, so a test that writes data (such as the rating test later on) changes the real documents.

With this configuration, you can now create test functions that send requests to your FastAPI app using the TestClient. If you want to keep your real data untouched, point both the app and the tests at a separate database while the tests run.

How to Test the "Get Courses List" Endpoint

These test functions use TestClient to interact with the "/courses" endpoint of the FastAPI application. They check if the endpoint behaves as expected when different parameters, such as sorting and filtering by domain, are provided.

The tests verify the status codes, data presence, sorting order, and domain filtering in the API responses, ensuring the functionality of the course endpoint is correct and reliable.

def test_get_courses_no_params():
    response = client.get("/courses")
    assert response.status_code == 200

def test_get_courses_sort_by_alphabetical():
    response = client.get("/courses?sort_by=alphabetical")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['name']) == courses


def test_get_courses_sort_by_date():
    response = client.get("/courses?sort_by=date")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['date'], reverse=True) == courses

def test_get_courses_sort_by_rating():
    response = client.get("/courses?sort_by=rating")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert sorted(courses, key=lambda x: x['rating']['total'], reverse=True) == courses

def test_get_courses_filter_by_domain():
    response = client.get("/courses?domain=mathematics")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])

def test_get_courses_filter_by_domain_and_sort_by_alphabetical():
    response = client.get("/courses?domain=mathematics&sort_by=alphabetical")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])
    assert sorted(courses, key=lambda x: x['name']) == courses

def test_get_courses_filter_by_domain_and_sort_by_date():
    response = client.get("/courses?domain=mathematics&sort_by=date")
    assert response.status_code == 200
    courses = response.json()
    assert len(courses) > 0
    assert all([c['domain'][0] == 'mathematics' for c in courses])
    assert sorted(courses, key=lambda x: x['date'], reverse=True) == courses

Pay attention to the assert statements. Each one compares an expected result with the actual result. If the comparison is true, the test continues. If it is false, the assert raises an AssertionError and the test fails. The objective is to get all the tests to pass.
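The sorting assertions work by sorting the response again and checking that nothing moved. Here is a small sketch of that idea on made-up data, with no API involved:

```python
# Placeholder courses, ordered by name ascending and by date descending.
courses = [
    {"name": "Algebra", "date": 300},
    {"name": "Biology", "date": 200},
    {"name": "Calculus", "date": 100},
]

# Both comparisons are true, so these asserts pass silently.
assert sorted(courses, key=lambda c: c["name"]) == courses
assert sorted(courses, key=lambda c: c["date"], reverse=True) == courses

# A false comparison raises AssertionError, which pytest reports as a failed test.
failed = False
try:
    assert sorted(courses, key=lambda c: c["date"]) == courses
except AssertionError:
    failed = True
print(failed)  # True
```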

How to Test the "Get Single Course Info" Endpoint

The tests use TestClient to send requests to the "/courses/{course_id}" endpoint and retrieve the same course directly from the MongoDB database using the db.courses.find_one function. Comparing the API response data to the database data helps you determine whether the endpoint handles existing and non-existent course IDs correctly.

def test_get_course_by_id_exists():
    response = client.get("/courses/6431137ab5da949e5978a281")
    assert response.status_code == 200
    course = response.json()
    # get the course from the database
    course_db = db.courses.find_one({'_id': ObjectId('6431137ab5da949e5978a281')})
    # get the name of the course from the database
    name_db = course_db['name']
    # get the name of the course from the response
    name_response = course['name']
    # compare the two
    assert name_db == name_response


def test_get_course_by_id_not_exists():
    response = client.get("/courses/6431137ab5da949e5978a280")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Course not found'}

How to Test the "Get Course Chapter Info" Endpoint

These tests use the TestClient to request the "/courses/{course_id}/{chapter_id}" endpoint and expect it to return the chapter information for the given course ID and chapter number.

We use assertions to check that the response includes the expected data, or a "Chapter not found" response for a non-existent chapter. This validates that the API retrieves the correct chapter and handles both existing and non-existent chapters.

def test_get_chapter_info():
    response = client.get("/courses/6431137ab5da949e5978a281/1")
    assert response.status_code == 200
    chapter = response.json()
    assert chapter['name'] == 'Big Picture of Calculus'
    assert chapter['text'] == 'Highlights of Calculus'


def test_get_chapter_info_not_exists():
    response = client.get("/courses/6431137ab5da949e5978a281/990")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Chapter not found'}

How to Test the "Post Course Rating" Endpoint

To test the rating capability, the test function specifies the course ID, chapter ID, and rating variables. It uses the TestClient's post method to submit a POST request to the "/courses/{course_id}/{chapter_id}" endpoint, providing the course ID and chapter number in the URL and passing the rating as a query parameter.

The TestClient mimics a user rating a certain chapter of a course. The test expects a successful response with a 200 status code, checks the JSON content for the "name" and "rating" keys as well as the nested "total" and "count" keys, and asserts that the total rating and rating count are greater than 0. Keep in mind that the total is only positive while the chapter's positive ratings outnumber its negative ones.

def test_rate_chapter():
    course_id = "6431137ab5da949e5978a281"
    chapter_id = "1"
    rating = 1

    response = client.post(f"/courses/{course_id}/{chapter_id}?rating={rating}")

    assert response.status_code == 200

    # Check if the response body has the expected structure
    assert "name" in response.json()
    assert "rating" in response.json()
    assert "total" in response.json()["rating"]
    assert "count" in response.json()["rating"]

    assert response.json()["rating"]["total"] > 0
    assert response.json()["rating"]["count"] > 0

def test_rate_chapter_not_exists():
    # Use the real rating route with a chapter index that is out of range
    response = client.post("/courses/6431137ab5da949e5978a281/990?rating=1")
    assert response.status_code == 404
    assert response.json() == {'detail': 'Chapter not found'}

This verification makes sure that the rating addition endpoint works as intended, with the API returning the correct success code and expected information about the chapter, including its name and updated rating details.

By running the pytest command, all the test functions in the test_app.py file will be executed, and you'll get feedback on whether the endpoints are functioning as expected or if any errors or regressions have occurred. This allows developers and QA teams to catch issues early in the development cycle and maintain the application's reliability and stability.

As you can see in the image below, all the tests are passing. Good job! As you keep on adding more features and endpoints to the app, keep adding the associated tests in order to validate correctness. Taking this one step further and writing the tests before the code they exercise is called Test Driven Development (TDD).

Running API Tests with Pytest

Running the pytest command shows the output illustrated in the image above. It says that 13 tests passed. This means that all our endpoints are functional and return the expected responses.

Endpoint tests like these catch regressions, confirm that the components (routes, validation, and database access) work together, and verify that errors such as a missing course or chapter are handled properly. Load, performance, and security testing are separate concerns that need their own tools.

Pytest helps you make sure that API endpoints work well together, and also helps you deal with failures and edge cases.

How to Containerize the Application with Docker

You can put your application and all of its dependencies together into a single unit called a container. This is called containerization. It separates the application from the underlying system, which maintains consistency across different operating systems.

Docker is a modern containerization technology that makes it easier to create, distribute, and execute containers. It enables developers to consistently and reproducibly build, ship, and execute apps without building from source.

Get Docker installed from here: https://www.docker.com/get-started.

Dockerizing Python programs helps you make sure that they run consistently across multiple computers, eliminating compatibility difficulties. It containerizes the software, its dependencies, and customizations, making it portable.

In the same directory as other files, make a new file called Dockerfile. Note that it does not require any extension.

# Use an official Python runtime as a parent image
FROM python:3.9-slim-buster

# Set the working directory to /app
WORKDIR /app

# Copy the requirements file first so the dependency layer can be cached
COPY ./requirements.txt /app/requirements.txt

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir --upgrade -r /app/requirements.txt

# Copy the rest of the current directory contents into the container at /app
COPY . /app

# Start the FastAPI app in main.py with Uvicorn when the container launches
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

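Starting with the official Python 3.9 slim image, the Dockerfile defines the image's blueprint.

It sets the working directory to /app, which is where the application code will be stored. The project's requirements are listed in the requirements.txt file, which is copied into the container. The tutorial has not created this file so far, so add one to the project directory that lists the packages installed earlier (fastapi, pymongo, uvicorn, starlette, and pydantic), one per line.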

The RUN command uses pip to install Python requirements. COPY moves the app's code from the host to the container's /app directory. CMD provides the command that will be executed when the container starts.

In this case, it runs "uvicorn main:app" (the main.py FastAPI app) with host set to 0.0.0.0 and port 80.
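One thing to watch for: inside a container, localhost usually refers to the container itself, so the hard-coded mongodb://localhost:27017/ address in main.py will typically not reach a MongoDB server running on your host machine. A common way around this is to read the connection settings from environment variables. The sketch below shows the idea. The variable names MONGO_URI and MONGO_DB_NAME and the host name mongo are assumptions, not something the original code defines.

```python
import os


def mongo_settings(environ=os.environ):
    """Read the MongoDB URI and database name, with local defaults."""
    uri = environ.get("MONGO_URI", "mongodb://localhost:27017/")
    db_name = environ.get("MONGO_DB_NAME", "courses")
    return uri, db_name


# With nothing set, the defaults match the values used in this tutorial.
print(mongo_settings({}))
# ('mongodb://localhost:27017/', 'courses')

# Hypothetical override, as it might look inside a container.
print(mongo_settings({"MONGO_URI": "mongodb://mongo:27017/"}))
# ('mongodb://mongo:27017/', 'courses')
```

In main.py, the returned values would be passed to MongoClient and used to select the database. When starting the container, docker run accepts -e NAME=value to set such variables.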

How to Run the Docker Container

Build the Docker image in the same directory as the Dockerfile using: **docker build -t my_python_app .**

Containerizing the FastAPI app with Docker

Run the container in detached mode using the command **docker run -d -p 80:80 my_python_app**.

Once you do this, you can view the status of the containers and the image from Docker Desktop.

Docker Desktop shows that our container image is now in a running state on port 80

How to Terminate the Docker Container

Find the container ID or name with **docker ps**. Then stop the container by passing that ID or name to **docker stop**.

This walkthrough has only addressed development, testing, and containerization. Just note that post-deployment container security, if neglected, introduces risks like vulnerabilities, misconfigurations, and attacks. You should ideally take advantage of a CNAPP (Cloud Native Application Protection Platform) to scan images, stick to best practices, and monitor running containers for protection.

The takeaway is that Docker containerization allows bundling of Python scripts with dependencies, making them consistent and portable. The Dockerfile describes how the image should be created.

Running the container after it has been constructed is as simple as issuing a single command. It's just as simple to put a stop to it. Docker makes it simple to manage Python application distribution.

Conclusion

This tutorial was a quick start guide to help you leverage the power of FastAPI. We built a course administration API that efficiently handles queries related to courses.

We did this by importing course data from a JSON file into MongoDB and then creating multiple endpoints for users to access course lists, overviews, chapter information, and user scores. We also added a review aggregation feature to demonstrate using HTTP POST and HTTP GET methods so that you can grab data as well as post data to the server.

PyTest helped us handle automated testing, ensuring dependability and stability. We then containerized the application with Docker, which simplifies deployment and maintenance.

My GitHub Repository contains the complete code covered in this quick start walkthrough. Subscribe to my technical blog for technical cheat sheets and resources.

How to Build a Tiered List Maker with Python

Atharva Shah — Fri, 07 Jul 2023 20:48:16 +0000

Hello Pythonistas! Do you want to level up your Python and API skills while also building something really useful? Well, then you're in the right place.

This hands-on tutorial showcases how to leverage Python's capabilities to code an interactive tiered list builder right within your terminal.

We'll use some helpful Python libraries along the way to build a practical tool that allows you to rank and organize your favorite albums engagingly and efficiently in seconds.

Project Overview

Tiered lists are categorization tools used to rank items based on preference. They're used in music, movies, and other areas. The album tiered list in this project assigns albums to different tiers depending on your personal choices.

This step-by-step guide leverages the power of Python libraries like Rich, PyLast, Pillow, and Pick to make a tiered list builder right within the terminal.

Consider easily categorizing your albums into different tiers, such as "S-Tier" for all-time favorites or "B-Tier" for those undiscovered gems. You'll have complete control over how your music collection is organized according to your preferences.

A high level overview of the walkthrough

At the end of this project, you can expect to export all your tiered lists. Here is an example of what it might look like. This can be done for any of the artists of your choice.

Final project outcome

Get Your LastFM API Key

LastFM is a music database and online platform that offers a sophisticated music recommendation system as well as an API. It allows developers to access and download data from their database.

This is a necessary step because the CLI app requests the album metadata and cover from the LastFM API.

First, you'll want to create a LastFM Developer Account.

Never share API credentials. Use environment variables to store them.

Next, copy the API Key and the Shared Secret. Set them as environment variables.

On Windows:

setx LASTFM_API_KEY "your_api_key"
setx LASTFM_API_SECRET "your_api_secret"

On Linux/MacOS:

export LASTFM_API_KEY="your_api_key"
export LASTFM_API_SECRET="your_api_secret"

Import the Modules

Here are the modules you need to have installed to kickstart the project:

json: Encoding and decoding JSON responses from APIs.
os: File and directory operations.
datetime: Formatting and mathematical operations on date and time.
io: Stream-like interface for in-memory byte data.
typing: Type-hinting for improved readability
pylast: A Python wrapper library around the LastFM API.
requests: Make HTTP requests with online services and APIs.
pick: An interactive selection menu for selecting from a list directly in the terminal.
PIL: Image processing and manipulation (for example, drawing, resizing, and saving)
rich: Lovely terminal formatting.

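Get the third-party packages installed using pip (the Python package manager). The standard-library modules (json, os, datetime, io, typing) ship with Python and need no installation.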

pip install pylast requests pick Pillow rich

Now that the setup is done, spin up your code editor, and let's get to building.

import json
import os
from datetime import datetime
from io import BytesIO
from typing import List

import pylast
import requests
from pick import pick
from PIL import Image, ImageDraw, ImageFont
from rich import print
from rich.panel import Panel
from rich.table import Table

This is a CLI-based application. So any choices you make will be made directly within the terminal. Two choices are presented at the startup screen to the user:

Create a Tiered List: Enter the name of the list and the artist. The application will fetch metadata and album covers from the LastFM API and save them to a JSON file.
Export the Tiered List to Image: Use Pillow to export the gathered JSON data to a beautiful PNG/JPG image. The image will have rows and columns to indicate tiers and albums.

To start, let's present an interactive menu to the user:

The pick module presents a choice selection menu in the terminal. Use arrow keys to navigate and hit Enter to confirm.

Ignore the first four options, as they are out of the scope of this walkthrough. You can just use the pass statement instead of invoking those functions to prevent any errors.

To achieve this, you will need to write the following driver code at the end of your file.

LASTFM_API_KEY = os.environ.get("LASTFM_API_KEY")
LASTFM_API_SECRET = os.environ.get("LASTFM_API_SECRET")
network = pylast.LastFMNetwork(api_key=LASTFM_API_KEY, api_secret=LASTFM_API_SECRET)

def start():    
    global network
    startup_question = "What Do You Want To Do?"
    options = ["Rate by Album", "Rate Songs", "See Albums Rated", "See Songs Rated", "Make a Tier List", "See Created Tier Lists", "EXIT"]
    selected_option, index = pick(options, startup_question, indicator="→")

    if index == 0:
        rate_by_album()
    elif index == 1:
        rate_by_song()
    elif index == 2:
        see_albums_rated()
    elif index == 3:
        see_songs_rated()
    elif index == 4:
        create_tier_list()
    elif index == 5:
        see_tier_lists()
    elif index == 6:
        exit()
start()

As seen in the code above, the os.environ.get() function retrieves the value of an environment variable you set in the previous section.
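One caveat is that os.environ.get() returns None when a variable is not set, so a missing key would only surface later as an API error. A minimal sketch of an early check follows. The helper name read_lastfm_credentials and its env parameter are illustrative assumptions and are not part of the original project.

```python
import os
from typing import Mapping, Tuple


def read_lastfm_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (api_key, api_secret), raising early if either one is missing."""
    # Hypothetical helper: the original code reads both variables directly.
    names = ("LASTFM_API_KEY", "LASTFM_API_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["LASTFM_API_KEY"], env["LASTFM_API_SECRET"]
```

Passing the mapping as a parameter keeps the check easy to try out with a plain dictionary instead of the real environment.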

network is probably the most important variable. It is the pylast client object, and the rest of the application relies on it for tasks such as:

Fetching albums of an artist
Fetching metadata about an artist
Fetching metadata about an album
Fetching album covers

The cover URLs it returns are validated separately with requests by checking for the 200 (OK) response status.

Then, start() initiates the application, presents a startup question using the pick function, stores user choices, and executes various actions based on the selected option.

The pick function accepts the following parameters:

options: The list of options to choose from. In the startup menu these are the actions; later on they will be the albums.
title: The title or question to display to the user above the options.
multiselect: A flag indicating whether multiple options can be selected (multiple choice or single choice).
indicator: The symbol or character used to indicate the selected option.
min_selection_count: The minimum number of options that must be selected when multiselect is enabled. The startup menu is single choice, so it keeps the default value.

Note: All the code below has to be placed above the driver code. We are going to define several functions, one for each option.

How to Save State in JSON

JSON files are easy to work with and maintain even as the app schema changes. This is why you will be storing the tier list data in JSON format. It's a persistent storage method that allows you to update the album and song ratings, as well as tier lists, even when the program is rerun.

Surely you don't want the user data to be lost when the application restarts? Therefore, a save state is required. It's a database most of the time. But for the sake of simplicity, let's store and retrieve user data using JSON.

def load_or_create_json() -> None:
    if os.path.exists("albums.json"):
        with open("albums.json") as f:
            ratings = json.load(f)
    else:
        # create a new json file with empty dict
        with open("albums.json", "w") as f:
            ratings = {"album_ratings": [], "song_ratings": [], "tier_lists": []}
            json.dump(ratings, f)

This custom function either loads an existing JSON file or produces one if none exists. It guarantees that the application has a file for storing and retrieving album and song ratings, as well as tier lists.

If the file does not exist, the function creates a new file named "albums.json" in write mode, initializes the ratings variable as a dictionary containing empty lists, and writes that dictionary to the file with json.dump().
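Since load_or_create_json() returns None, callers have to open the file again to read the data. A possible variant that returns the loaded state is sketched below. The names load_state and EMPTY_STATE and the path parameter are assumptions made for illustration and do not come from the original code.

```python
import json
import os

# Same top-level keys as the original function writes.
EMPTY_STATE = {"album_ratings": [], "song_ratings": [], "tier_lists": []}


def load_state(path: str = "albums.json") -> dict:
    """Load the saved state, creating the file with empty lists if needed."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    # Copy the lists so callers never mutate the shared template.
    state = {key: list(value) for key, value in EMPTY_STATE.items()}
    with open(path, "w") as f:
        json.dump(state, f)
    return state
```

With this shape, a caller could write data = load_state() instead of calling the loader and then opening albums.json a second time.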

How to Write Utility Functions

Utility or helper functions in menu-driven programming perform common tasks or operations related to menu options. These functions are reusable and modular, making code more organized and easier to maintain. Examples include:

Display Menu
Input Validation
Data Persistence
Formatting and Display
Error Handling
Common Operations.

These functions handle common tasks required by multiple menu options, promoting code reusability and reducing redundancy. Encapsulating these functions in menu logic helps maintain code flow, and facilitates testing, debugging, and future modifications.

Think of them as bridges that help connect two functions better and isolate trivial logic that can be used on the fly. This project relies on two helper functions.

Remove album from list

First, we'll write a function to remove the picked album from the list to prevent repetition across different tiers. Here's what that looks like:

def create_tier_list_helper(albums_to_rank, tier_name):
    # if there are no more albums to rank, return an empty list
    if not albums_to_rank:
        return []

    question = f"Select the albums you want to rank in {tier_name}"
    tier_picks = pick(options=albums_to_rank, title=question, multiselect=True, indicator="→", min_selection_count=0)
    tier_picks = [x[0] for x in tier_picks]

    for album in tier_picks:
        albums_to_rank.remove(album)

    return tier_picks

This allows users to rank albums inside certain tiers and facilitates the creation of tier lists.

It takes two arguments: albums_to_rank and tier_name. If there are no more albums to rank, the function returns an empty list. Otherwise, users choose albums from albums_to_rank, the choices are saved in tier_picks, each chosen album is removed from albums_to_rank, and tier_picks is returned.

The returned value tier_picks is a Python list.
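The detail that makes this work is that albums_to_rank.remove() mutates the caller's list, which is why albums picked for one tier no longer show up in the next prompt. The sketch below isolates that behavior without a terminal. take_picks is a hypothetical stand-in for the helper, and the hard-coded selections imitate the (option, index) tuples that pick returns when multiselect=True.

```python
def take_picks(albums_to_rank, selections):
    """Remove the selected albums from albums_to_rank and return their names."""
    # Each selection mimics pick's multiselect result: (option, index).
    tier_picks = [option for option, _index in selections]
    for album in tier_picks:
        albums_to_rank.remove(album)  # mutates the caller's list in place
    return tier_picks


albums = ["After Hours", "Kiss Land", "Echoes of Silence"]
s_tier = take_picks(albums, [("After Hours", 0)])
a_tier = take_picks(albums, [("Kiss Land", 0), ("Echoes of Silence", 1)])
print(s_tier, a_tier, albums)
```

After both calls the albums list is empty, so a third tier would hit the early return and get an empty list.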

Return cover of selected album

Next, write a function that returns the cover of an album users select. Here's what it looks like:

def get_album_cover(artist, album):
    # fallback image used when the cover URL is missing or unreachable
    placeholder = "https://community.mp3tag.de/uploads/default/original/2X/a/acf3edeb055e7b77114f9e393d1edeeda37e50c9.png"
    album = network.get_album(artist, album)
    album_cover = album.get_cover_image()
    # check if it is a valid url
    try:
        response = requests.get(album_cover)
        if response.status_code != 200:
            album_cover = placeholder
    except requests.RequestException:
        # covers network errors as well as a missing or malformed URL
        album_cover = placeholder
    return album_cover

This retrieves the album cover image for a specified artist and album name via the LastFM API. It validates the cover image URL from the API response with an HTTP request.

The album cover URL is returned if the request succeeds with a 200 status. Otherwise, a fallback placeholder image is returned in its place.

The network object that you created earlier has several handy methods. The first line gets the album object, and the second gets the cover image URL for that album directly from LastFM.
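The fallback decision itself can be separated from the network call, which makes it easy to check without hitting LastFM. The following is a sketch under stated assumptions: choose_cover, get_status, and the PLACEHOLDER_COVER value are hypothetical names introduced here, not part of the original project.

```python
from typing import Callable, Optional

# Hypothetical placeholder; the original code uses a hosted fallback image URL.
PLACEHOLDER_COVER = "placeholder.png"


def choose_cover(url: Optional[str], get_status: Callable[[str], int]) -> str:
    """Return url if it answers with HTTP 200, otherwise the placeholder."""
    if not url:
        return PLACEHOLDER_COVER
    try:
        return url if get_status(url) == 200 else PLACEHOLDER_COVER
    except Exception:
        # Any lookup failure (timeout, DNS error, ...) falls back as well.
        return PLACEHOLDER_COVER
```

In the application, get_status could be something like lambda u: requests.get(u).status_code, while a test can pass a stub that returns a fixed status code.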

How to Add the Tiered List Data to JSON

Once the user picks the "create tier list" option from the menu, the script presents the available tiers and asks for a valid artist and a name for the tier list so that it can be stored in the JSON file.

The script then validates the artist and retrieves the metadata using the LastFM API.

Use the network object to validate if the artist exists. If yes, request all the albums for that artist. Populate a list with these albums and set the option to that list so it shows up in the choices for the S tier.

In the image below, the (x) mark indicates the user has selected that particular album to be in the S-Tier.

This is a prompt for users to select albums that they want to move to the S-Tier. Navigate with arrow keys to select zero, one or more albums from the list.

After the user has selected these albums, you would like to serialize this list and put it into a JSON file that will be used to generate the actual image later. This JSON file needs to have a data definition.

Think about how databases have a schema. They have tables and columns and rows that describe the nature and the format of the data.

Similarly, we are going to define the schema of the JSON file to store all these tier list choices. Each tier list object contains the following properties:

tier_list_name: The name given to the tier list.
artist: The name of the artist for whom the tier list is created.
s_tier, a_tier, b_tier, c_tier, d_tier, e_tier: Arrays that hold the albums and their corresponding cover art for each tier. Albums are represented as objects with "album" and "cover_art" properties.
time: Creation timestamp.
Each tier array contains zero or more album objects, with "album" holding the album name and "cover_art" holding the URL of the cover image.

This is the sample JSON schema. Once the user makes the choices in the terminal, a serialized Python object similar to this containing the tier list data will be written to the JSON file.

{
  "tier_lists": [
        {
            "tier_list_name": "THE WEEKND RANKED",
            "artist": "the weeknd",
            "s_tier": [
                {
                    "album": "After Hours",
                    "cover_art": "https://lastfm.freetls.fastly.net/i/u/300x300/7d957bd27dd562bee7aaa89eafa0bbe6.jpg"
                }
            ],
            "a_tier": [
                {
                    "album": "Kiss Land",
                    "cover_art": "https://lastfm.freetls.fastly.net/i/u/300x300/01ad150445023de653c50dbbc3e10dbc.jpg"
                },
                {
                    "album": "Echoes of Silence",
                    "cover_art": "https://lastfm.freetls.fastly.net/i/u/300x300/4f257619898b44b7a8f95431045e9ffe.png"
                }
            ],
            "b_tier": [],
            "c_tier": [],
            "d_tier": [],
            "e_tier": [
                {
                    "album": "I Feel It Coming",
                    "cover_art": "https://lastfm.freetls.fastly.net/i/u/300x300/974deeb8c348d0ad0c0fa10941dd67e8.jpg"
                }
            ],
            "time": "2023-04-23 23:56:14.652417"
        }
    ]
}

You want to dynamically write to this JSON file as the user keeps making tier lists. That is, the file should grow to hold every new tier list and its album covers. The code below does exactly that:

def create_tier_list():
    load_or_create_json()
    with open("albums.json") as f:
        album_file = json.load(f)

    print("TIERS - S, A, B, C, D, E")

    question = "Which artist do you want to make a tier list for?"
    artist = input(question).strip().lower()

    try:
        get_artist = network.get_artist(artist)
        artist = get_artist.get_name()
        albums_to_rank = get_album_list(artist)

        # keep only the album name by splitting the string at the first - and removing the first element
        albums_to_rank = [x.split(" - ", 1)[1] for x in albums_to_rank[1:]]

        question = "What do you want to call this tier list?"
        tier_list_name = input(question).strip()

        # repeat until the user enters at least one character
        while not tier_list_name:
            print("Please enter at least one character")
            tier_list_name = input(question).strip()

        # S TIER
        question = "Select the albums you want to rank in S Tier:"
        s_tier_picks = create_tier_list_helper(albums_to_rank, "S Tier")
        s_tier_covers = [get_album_cover(artist, album) for album in s_tier_picks]
        s_tier = [{"album":album,"cover_art": cover} for album, cover in zip(s_tier_picks, s_tier_covers)]

        # A TIER
        question = "Select the albums you want to rank in A Tier:"
        a_tier_picks = create_tier_list_helper(albums_to_rank, "A Tier")
        a_tier_covers = [get_album_cover(artist, album) for album in a_tier_picks]
        a_tier = [{"album":album,"cover_art": cover} for album, cover in zip(a_tier_picks, a_tier_covers)]

        # B TIER
        question = "Select the albums you want to rank in B Tier:"
        b_tier_picks = create_tier_list_helper(albums_to_rank, "B Tier")
        b_tier_covers = [get_album_cover(artist, album) for album in b_tier_picks]
        b_tier = [{"album":album,"cover_art": cover} for album, cover in zip(b_tier_picks, b_tier_covers)]

        # C TIER
        question = "Select the albums you want to rank in C Tier:"
        c_tier_picks = create_tier_list_helper(albums_to_rank, "C Tier")
        c_tier_covers = [get_album_cover(artist, album) for album in c_tier_picks]
        c_tier = [{"album":album,"cover_art": cover} for album, cover in zip(c_tier_picks, c_tier_covers)]

        # D TIER
        question = "Select the albums you want to rank in D Tier:"
        d_tier_picks = create_tier_list_helper(albums_to_rank, "D Tier")
        d_tier_covers = [get_album_cover(artist, album) for album in d_tier_picks] 
        d_tier = [{"album":album,"cover_art": cover} for album, cover in zip(d_tier_picks, d_tier_covers)]
        # E TIER
        question = "Select the albums you want to rank in E Tier:"
        e_tier_picks = create_tier_list_helper(albums_to_rank, "E Tier")
        e_tier_covers = [get_album_cover(artist, album) for album in e_tier_picks]
        e_tier = [{"album":album,"cover_art": cover} for album, cover in zip(e_tier_picks, e_tier_covers)]

        # check if all tiers are empty and if so, exit
        if not any([s_tier_picks, a_tier_picks, b_tier_picks, c_tier_picks, d_tier_picks, e_tier_picks]):
            print("All tiers are empty. Exiting...")
            return


        # # add the albums that were picked to the tier list
        tier_list = {
            "tier_list_name": tier_list_name,
            "artist": artist,
            "s_tier": s_tier, 
            "a_tier": a_tier,
            "b_tier": b_tier,
            "c_tier": c_tier,
            "d_tier": d_tier,
            "e_tier": e_tier,
            "time": str(datetime.now())
        }

        # add the tier list to the json file
        album_file["tier_lists"].append(tier_list)

        # save the json file
        with open("albums.json", "w") as f:
            json.dump(album_file, f, indent=4)

        return

    except pylast.PyLastError:
        print("❌[b red] Artist not found [/b red]")

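This is the core function used to create tier lists for albums and store them in albums.json. Note that get_album_list() is not defined in this walkthrough, so it has to come from the full project code. Here's what's going on in the function: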

The user enters the artist's name, and the function retrieves the artist's information from the LastFM API.
Next, the user provides a name for the tier list they want to create.
For each tier (S, A, B, C, D, E), the user selects albums to rank within that tier through the helper function you wrote earlier.
The cover art for each selected album is retrieved via get_album_cover(), and the selected albums and their corresponding cover art are stored as dictionaries in the respective tier.
If all tiers are empty, the function exits and nothing is written to the JSON file.
Otherwise, the tier list is added to the JSON file, which is saved in the current working directory (the directory you run the script from).
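The six near-identical tier blocks follow one pattern, so the steps above could also be expressed as a loop. The sketch below is one possible refactor, not the author's code. build_tier_list and TIER_KEYS are hypothetical names, and the two callables stand in for create_tier_list_helper and get_album_cover so the logic can run without a terminal or network access.

```python
from datetime import datetime

TIER_KEYS = ["s_tier", "a_tier", "b_tier", "c_tier", "d_tier", "e_tier"]


def build_tier_list(name, artist, albums_to_rank, pick_for_tier, cover_for):
    """Build one tier list dict, or return None if every tier is empty."""
    tier_list = {"tier_list_name": name, "artist": artist}
    for key in TIER_KEYS:
        label = key[0].upper() + " Tier"  # "s_tier" -> "S Tier"
        picks = pick_for_tier(albums_to_rank, label)
        tier_list[key] = [
            {"album": album, "cover_art": cover_for(artist, album)}
            for album in picks
        ]
    if not any(tier_list[key] for key in TIER_KEYS):
        return None  # nothing was ranked, so nothing should be saved
    tier_list["time"] = str(datetime.now())
    return tier_list
```

In the application, the call might look like build_tier_list(tier_list_name, artist, albums_to_rank, create_tier_list_helper, get_album_cover), with the result appended to album_file["tier_lists"] when it is not None.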

This is the selection prompt for the next tier (A-Tier). The albums selected for the previous tier no longer appear, because they have already been ranked.

How to Use Pillow for Visual Transformations

Now that you have all the JSON data for your tier lists, you want to export all that to an image so that you can share it with your friends or post it on the web. But how should you do this? Let's break it down:

First, you'll want to determine the number of tiers. Then, determine the position and sizing of both the tier list grid and the album cover squares.

Here, you'll want to think about dynamic width and height offsets. How should you prevent overflow of images, add new rows, or maintain minimum height?

All this is related to the image canvas. Pillow is an excellent choice for this. You can resize, adjust, and expand the dimensions of all your images as well as the background canvas on the fly based on the user input and selection.

Tier list template made with Pillow. Refer to the code below for an explanation.

The most logical way to tackle this is to pass the tier list object to a function and let it loop over all the tiers. Inside each tier, let it loop over all the records and add an item. If the album cover exceeds the max width, add a new row so it does not overflow. Continue this until all the albums in each tier are processed. Voilà!

def image_generator(file_name, data):

    # return if the file already exists
    if os.path.exists(file_name):
        return

    # Set the image size and font
    image_width = 1920
    image_height = 5000
    font = ImageFont.truetype("arial.ttf", 15)
    tier_font = ImageFont.truetype("arial.ttf", 30)

    # Make a new image with the size and background color black
    image = Image.new("RGB", (image_width, image_height), "black")
    text_cutoff_value = 20

    #Initialize variables for row and column positions
    row_pos = 0
    col_pos = 0
    increment_size = 200

    """S Tier"""
    # leftmost side - make a square with text inside the square and fill color
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="red")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "S Tier", font=tier_font, fill="white")
        col_pos += increment_size

    for album in data["s_tier"]:
        # Get the cover art
        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))

        # Resize the cover art
        cover_art = cover_art.resize((increment_size, increment_size))

        # Paste the cover art onto the base image
        image.paste(cover_art, (col_pos, row_pos))

        # Draw the album name on the image with the font size 10 and background color white
        draw = ImageDraw.Draw(image)

        # Get the album name
        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")

        # Increment the column position
        col_pos += 200
        # check if the column position is greater than the image width
        if col_pos > image_width - increment_size:
            # add a new row
            row_pos += increment_size + 50
            col_pos = 0 

    # add a new row to separate the tiers
    row_pos += increment_size + 50
    col_pos = 0

    """A TIER"""
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="orange")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "A Tier", font=tier_font, fill="white")
        col_pos += increment_size

    for album in data["a_tier"]:
        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))
        cover_art = cover_art.resize((increment_size, increment_size))
        image.paste(cover_art, (col_pos, row_pos))
        draw = ImageDraw.Draw(image)

        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")

        col_pos += 200
        if col_pos > image_width - increment_size:
            row_pos += increment_size + 50
            col_pos = 0 

    row_pos += increment_size + 50
    col_pos = 0

    """B TIER"""
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="yellow")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "B Tier", font=tier_font, fill="black")
        col_pos += increment_size

    for album in data["b_tier"]:
        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))
        cover_art = cover_art.resize((increment_size, increment_size))
        image.paste(cover_art, (col_pos, row_pos))
        draw = ImageDraw.Draw(image)

        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")
        col_pos += 200
        if col_pos > image_width - increment_size:
            # add a new row
            row_pos += increment_size + 50
            col_pos = 0

    row_pos += increment_size + 50
    col_pos = 0

    """C TIER"""
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="green")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "C Tier", font=tier_font, fill="black")
        col_pos += increment_size

    for album in data["c_tier"]:
        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))       
        cover_art = cover_art.resize((increment_size, increment_size))
        image.paste(cover_art, (col_pos, row_pos))
        draw = ImageDraw.Draw(image)

        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")

        col_pos += 200
        if col_pos > image_width - increment_size:
            row_pos += increment_size + 50
            col_pos = 0

    row_pos += increment_size + 50
    col_pos = 0


    """D TIER"""
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="blue")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "D Tier", font=tier_font, fill="black")
        col_pos += increment_size

    for album in data["d_tier"]:
        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))
        cover_art = cover_art.resize((increment_size, increment_size))
        image.paste(cover_art, (col_pos, row_pos))        
        draw = ImageDraw.Draw(image)

        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")

        col_pos += 200
        if col_pos > image_width - increment_size:
            # add a new row
            row_pos += increment_size + 50
            col_pos = 0

    row_pos += increment_size + 50
    col_pos = 0


    """E TIER"""
    if col_pos == 0:
        draw = ImageDraw.Draw(image)
        draw.rectangle((col_pos, row_pos, col_pos + increment_size, row_pos + increment_size), fill="pink")
        draw.text((col_pos + (increment_size//3), row_pos+(increment_size//3)), "E Tier", font=tier_font, fill="black")
        col_pos += increment_size

    for album in data["e_tier"]:

        response = requests.get(album["cover_art"])
        cover_art = Image.open(BytesIO(response.content))
        cover_art = cover_art.resize((increment_size, increment_size))    
        image.paste(cover_art, (col_pos, row_pos))
        draw = ImageDraw.Draw(image)
        name = album["album"]
        if len(name) > text_cutoff_value:
            name = f"{name[:text_cutoff_value]}..."

        draw.text((col_pos, row_pos + increment_size), name, font=font, fill="white")
        col_pos += 200
        if col_pos > image_width - increment_size:
            row_pos += increment_size + 50
            col_pos = 0

    row_pos += increment_size + 50
    col_pos = 0

    image = image.crop((0, 0, image_width, row_pos))

    image.save(f"{file_name}")

This custom function takes two parameters (file name and data) and is responsible for converting the JSON data we stored into a nicely organized tier list image.

It first checks whether a file with the specified name already exists and returns early if it does. This saves computation if you have already made a tier list image with that name.

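It then sets the image size and fonts for constructing the tier list visual, creates a new image with a black background, initializes the row and column positions, and sets an increment size (the side length of each cover square). Note that ImageFont.truetype("arial.ttf", ...) raises an error if Pillow cannot find that font file, so you may need to point it to a font available on your system.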

The function then draws the S Tier section, starting with a red square that has the tier label inside it.

For each album in the S tier, the cover art is downloaded, resized, and pasted onto the canvas, and the album title (cut off after 20 characters) is drawn beneath it in the given font. Once the column position exceeds the image width minus one cover width, a new row is started.

This process is repeated for the A, B, C, D, and E tiers, with each tier having its own color. Finally, the canvas is cropped to the height that was actually used and the image is saved.

In a nutshell, this places all the album covers in rows and columns inside each tier, adding new rows as needed when a row reaches the width of the image. The row and column offsets are updated as covers are placed, so the image height grows with the number of albums.

This entire image is generated with the Pillow library by processing the data from the JSON file. First, each tier label is drawn at the left edge of the canvas, and then the selected albums are placed one after another. Any overflow is handled by adding a row beneath the current one.
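The placement arithmetic can be looked at separately from the drawing calls. The sketch below reproduces the row and column bookkeeping of the function above for a single tier, using the same numbers (1920 wide canvas, 200 pixel squares, 50 pixels for the caption). layout_tier and its parameter names are assumptions introduced for illustration.

```python
def layout_tier(album_count, row_pos, image_width=1920, cell=200, caption_gap=50):
    """Return the (x, y) position of each cover in a tier and the next tier's row."""
    positions = []
    col_pos = cell  # the first cell of the row is taken by the tier label
    for _ in range(album_count):
        positions.append((col_pos, row_pos))
        col_pos += cell
        if col_pos > image_width - cell:
            # No room for another cover: wrap to a new row.
            row_pos += cell + caption_gap
            col_pos = 0
    # Move down one more row so the next tier starts below this one.
    return positions, row_pos + cell + caption_gap


positions, next_row = layout_tier(9, 0)
print(positions[7], positions[8], next_row)  # (1600, 0) (0, 250) 500
```

With these values, eight covers fit next to the label, and the ninth wraps to the start of a new row 250 pixels further down. A loop over the six tiers could feed each returned next_row into the following call.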

How to Export the Created Image

You are almost there. This final function passes the tier list object data to the previously defined function to render an image using Pillow.

Think of it as a connecting link between two functions. It simply prints the success or failure message in the CLI to let users know the image generation status.

def see_tier_lists():
    load_or_create_json()
    with open("albums.json", "r") as f:
        data = json.load(f)

    if not data["tier_lists"]:
        print("❌ [b red]No tier lists have been created yet![/b red]")
        return

    for key in data["tier_lists"]:
        image_generator(f"{key['tier_list_name']}.png", key)
        print(f"✅ [b green]CREATED[/b green] {key['tier_list_name']} tier list.")

    print("✅ [b green]DONE[/b green]. Check the directory for the tier lists.")    
    return

Let the user know that the image is rendered in the current directory.
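Because the tier list name is used directly as the file name, a name containing characters such as "/" or ":" could fail to save on some systems. A small sanitizing step is sketched below. safe_file_name is a hypothetical helper that is not part of the original project, and it deliberately keeps only ASCII letters, digits, dots, dashes, and underscores.

```python
import re


def safe_file_name(tier_list_name: str, extension: str = ".png") -> str:
    """Turn a tier list name into a file name that is safe on common systems."""
    # Replace every run of other characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", tier_list_name).strip("._")
    # Fall back to a generic name if nothing usable is left.
    return f"{cleaned or 'tier_list'}{extension}"


print(safe_file_name("THE WEEKND RANKED"))  # THE_WEEKND_RANKED.png
```

It could be used in place of the f-string when calling image_generator(). Note that names written in non-Latin scripts would collapse to the generic fallback with this simple pattern.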

Key Takeaways

This tutorial demonstrated how to transform JSON data into tier list graphics using Python and the Pillow library. By combining image manipulation and API data retrieval, you can generate appealing representations of album rankings.

To recap, you learned:

How to retrieve album data using the LastFM API.
How to generate tier lists based on user input and album ratings.
How to use the Pillow library to create and manipulate images.
How to resize and paste album cover art onto the base image.
How to add text and tier labels to the image.
How to dynamically write to JSON files.

Want to grab the code from this tutorial? Get it from my GitHub Repo. It includes other CRUD functions like reviewing, rating, and viewing all your albums and artists right within the terminal.

This is also published as a Python package for ease of use. Refer to this release page on PyPI.

This project uses Python and image manipulation libraries to create visually engaging tier lists for gaming communities, music rankings, and content evaluations. Users can rate albums interactively right within their terminal and integrate other APIs or data sources to enhance the creative process. This practical application explores new possibilities in data visualization.

The Python Decorator Handbook

Table of Contents

How Python Decorators Work

Log Arguments and Return Value of a Function

Usage and Applications

Get the Execution Time of a Function

Usage and Applications

Convert Function Return Value to a Specified Data Type

Usage and Application

Cache Function Results

Usage and Applications

Validate Function Arguments Based on Condition

Usage and Applications

Retry a Function Multiple Times on Failure

Usage and Application

Enforce Rate Limits on a Function

Usage and Application

Handle Exceptions and Provide Default Response

Usage and Applications

Enforce Type Checking on Function Arguments

Usage and Applications

Measure Memory Usage of a Function

Usage and Applications

Cache Function Results with Expiration Time

Usage and Applications

Conclusion

How to Use Databricks Delta Lake with SQL – Full Handbook

Prerequisites

Table of Contents

Introduction to Databricks

Data Ingestion

Dashboards

Policies

History

Optimization

Alerts

Persona-Based Design

SQL Workspace

Integration with other BI Tools

Introduction to Delta

Why Delta Lake?

Features and Capabilities

How to Create and Manage Tables

How to Create Tables from a Databricks Dataset

Saving the loaded CSV file to Delta using Python

Delta SQL Command Support

Data Manipulation Language (DML)

Data Definition Language (DDL) Commands

UPSERT Operation

Advanced SQL Queries

Data Visualization in Delta

Count of Diamonds by Clarity

Average Price by Depth Range

Price Distribution by Table

Price Factor by X, Y and Z

Add Constraints

How to Work with Dataframes

Manipulate the data and displays the results

Version Control and Time Travel in Delta

Restore Setup

Restoring From A Version Number

Autogenerated Fields

Delta Table Cloning

Delta Magic Commands

Conclusion

References

FastAPI Handbook – How to Develop, Test, and Deploy APIs

What We'll Build

Table of Contents

API Methods

Client and Server

How to Set Up the MongoDB Database

How to Parse and Insert Course Data into MongoDB

How to Design the FastAPI Endpoints

The Get All Courses Endpoint (/courses – GET)

The Get Course Overview Endpoint (/courses/{course_id} – GET)

Get Specific Chapter Information Endpoint (/courses/{course_id}/{chapter_id} – GET)

Rate Chapter Endpoint (/courses/{course_id}/{chapter_id} – POST)

Automated API Endpoint Testing with PyTest

The Get All Courses Endpoint (`/courses` – GET)

The Get Course Overview Endpoint (`/courses/{course_id}` – GET)

Get Specific Chapter Information Endpoint (`/courses/{course_id}/{chapter_id}` – GET)

Rate Chapter Endpoint (`/courses/{course_id}/{chapter_id}` – POST)