infrastructure - freeCodeCamp.org

Building a Website in 2026: What Matters More Than Your Tech Stack

Manish Shivanandhan — Sun, 14 Jun 2026 02:17:56 +0000

For years, developers have debated which technology stack is best for building websites.

Some preferred React. Others chose Vue, Angular, Svelte, or server-side frameworks such as Laravel and Django.

Entire conferences, blogs, and social media discussions have been dedicated to comparing frameworks and programming languages.

In 2026, those debates matter less than many developers think.

A modern website can be built with almost any mature framework and still perform well. The bigger challenge is making sure people can actually find, trust, and use that website.

Discoverability, performance, infrastructure, structured data, and AI search visibility now have a greater impact on success than the choice between competing frontend libraries.

The websites that win today aren't necessarily built with the most fashionable technologies. They're built with a strong foundation that helps users and search systems understand, access, and trust their content.

In this article, we'll look at what really matters when building a website these days. We'll explore why performance, hosting, domain management, structured data, and content quality often have a bigger impact than the technology stack itself.

We'll also examine how AI-powered search is changing the way people find information online and what developers can do to improve their website's visibility.

What We'll Cover:

The Tech Stack Has Become a Commodity
Performance Is Still a Competitive Advantage
Domains and Infrastructure Still Matter
Hosting Is No Longer Just About Servers
Structured Data Has Become Essential
The Rise of AI Search and Answer Engines
Content Quality Is More Important Than Ever
User Experience Is the New Differentiator
The Future Is About Outcomes, Not Frameworks

The Tech Stack Has Become a Commodity

The web development ecosystem has matured significantly over the past decade. Most modern frameworks provide similar capabilities. They support component-based development, server-side rendering, API integrations, authentication systems, and performance optimization.

As a result, the gap between frameworks has narrowed.

A poorly optimized website built with the latest framework will often perform worse than a well-optimized website built with older technology. Users rarely care whether a page was built with React, Vue, or another framework. They care whether it loads quickly, works on mobile devices, and provides useful information.

Businesses care even more about outcomes. They want traffic, conversions, customer engagement, and revenue growth. None of those metrics improve simply because a team adopted a trendy technology stack.

This shift has forced development teams to focus on factors that have a direct impact on visibility and user experience.

Performance Is Still a Competitive Advantage

Despite advances in hosting and frontend tooling, website performance remains one of the strongest predictors of user satisfaction.

Research consistently shows that slower websites lead to higher bounce rates and lower conversion rates. Users expect pages to load almost instantly. Even a delay of a few seconds can cause visitors to abandon a website before interacting with its content.

Modern performance optimization goes beyond minimizing JavaScript bundles. Teams must consider image optimization, edge caching, content delivery networks, lazy loading, and server response times.

For example, an e-commerce website might reduce page load times by serving product images in modern formats such as WebP, implementing lazy loading for below-the-fold content, and using a CDN to deliver assets from locations closer to shoppers. These improvements often produce a more noticeable impact than migrating to a new frontend framework.
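
As a rough illustration of the image changes described above, here is a minimal Python sketch that renders the markup a product template might emit. The helper name picture_tag and the file paths are made up for this example. The HTML features it relies on (a WebP source inside a picture element, and the loading="lazy" attribute on img) are standard.

```python
from html import escape


def picture_tag(webp_src: str, fallback_src: str, alt: str, width: int, height: int) -> str:
    """Build a <picture> element that prefers WebP and defers offscreen loading.

    Hypothetical helper for illustration; a real site would usually do this
    in its template layer.
    """
    return (
        "<picture>"
        # Browsers that support WebP pick this source first
        f'<source srcset="{escape(webp_src, quote=True)}" type="image/webp">'
        # Fallback image; explicit width/height reserve space and reduce layout shift
        f'<img src="{escape(fallback_src, quote=True)}" alt="{escape(alt, quote=True)}" '
        f'width="{width}" height="{height}" loading="lazy">'
        "</picture>"
    )


markup = picture_tag("/img/shoe.webp", "/img/shoe.jpg", "Red running shoe", 640, 480)
print(markup)
```

Lazy loading is best kept for below-the-fold images. Applying it to the main image at the top of a page can delay the content users see first.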

Many websites spend months migrating between frameworks while ignoring performance bottlenecks that would have a much larger impact on user experience. In practice, improving page speed often delivers greater business value than rebuilding an application using a different frontend stack.

Performance has also become increasingly important for search visibility. Search engines reward websites that provide a fast and reliable user experience. A technically impressive website that loads slowly is unlikely to achieve its full potential.

Domains and Infrastructure Still Matter

Developers often focus on application code while overlooking the infrastructure that supports it.

A website's domain remains one of its most important digital assets. Domain management affects security, reliability, and long-term brand ownership. Choosing a reputable registrar and maintaining proper DNS configuration are critical responsibilities.

A simple example is setting up DNS failover and enabling registrar-level security features such as domain lock and two-factor authentication. These measures help prevent outages and unauthorized domain transfers that could take a website offline.

For many teams, services such as Namecheap and GoDaddy provide a straightforward way to manage domain registration, DNS records, SSL certificates, and related infrastructure. While these tasks may seem mundane compared to application development, they directly influence website availability and security.

DNS performance has become particularly important as websites adopt distributed architectures. Modern applications frequently rely on multiple services, APIs, content delivery networks, and edge platforms. A poorly configured DNS setup can introduce unnecessary latency and create reliability issues.

Infrastructure decisions also influence scalability. As traffic grows, websites must continue delivering fast and consistent experiences without requiring major architectural changes.

The most successful development teams treat infrastructure as a strategic asset rather than an afterthought.

Hosting Is No Longer Just About Servers

In the past, hosting primarily involved renting a server and deploying application code.

Today, hosting platforms offer far more than compute resources. They provide global content delivery networks, automatic scaling, integrated security features, observability tools, and deployment automation.

The rise of edge computing has changed how websites are delivered. Content can now be served from locations close to users, reducing latency and improving responsiveness.

A media website experiencing a sudden traffic spike after a story goes viral can benefit from automatic scaling and edge caching, maintaining fast load times without requiring engineers to provision additional infrastructure manually.

Modern hosting decisions affect everything from performance and reliability to search rankings and customer satisfaction.

This means developers should evaluate hosting providers based on outcomes rather than specifications. Raw server resources matter less than factors such as uptime, deployment speed, geographic distribution, and operational simplicity.

A website that remains available during traffic spikes creates a better user experience than one that struggles under load, regardless of the underlying technology stack.

Structured Data Has Become Essential

One of the most overlooked aspects of modern website development is structured data.

Search engines and AI systems increasingly rely on structured information to understand website content. Schema markup helps machines identify products, articles, organizations, events, reviews, and many other types of information.

For instance, an online store can use a Product schema to display pricing and availability information in search results. Similarly, a recipe website can implement a Recipe schema to surface cooking times, ratings, and ingredients directly within search experiences.
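
To make the Product example concrete, here is a minimal Python sketch that builds schema.org Product markup as JSON-LD. The function name product_jsonld and the sample product are assumptions for illustration. The property names (@type, offers, price, priceCurrency, availability) come from the schema.org vocabulary.

```python
import json


def product_jsonld(name: str, price: float, currency: str, in_stock: bool) -> str:
    """Return a JSON-LD script tag describing one product (illustrative helper)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            # schema.org expresses availability as an enumeration URL
            "availability": "https://schema.org/InStock"
            if in_stock
            else "https://schema.org/OutOfStock",
        },
    }
    return '<script type="application/ld+json">' + json.dumps(data, indent=2) + "</script>"


snippet = product_jsonld("Trail Running Shoe", 49.99, "USD", in_stock=True)
print(snippet)
```

The resulting script tag would typically be placed in the page's HTML. Which rich results a search engine actually shows is up to that search engine, so this markup improves eligibility rather than guaranteeing anything.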

Without structured data, websites force search systems to infer meaning from unstructured text. This increases the likelihood of misinterpretation.

Structured data improves the chances that content will appear in rich search results, featured snippets, knowledge panels, and other enhanced search experiences.

More importantly, structured data provides context that helps emerging AI systems understand content accurately.

As search evolves beyond traditional blue links, machine-readable information becomes increasingly valuable.

Developers who ignore structured data risk making their websites less visible, even if the content itself is excellent.

The Rise of AI Search and Answer Engines

Perhaps the biggest shift in website visibility is the growth of AI-powered search experiences.

Users increasingly ask questions directly to AI assistants rather than typing keywords into traditional search engines. These systems generate answers by combining information from multiple sources and presenting results in a conversational format.

This change creates new challenges for website owners.

Ranking on Google is no longer the only goal. Websites must also be structured in ways that help AI systems understand, retrieve, and reference their content.

A software company publishing detailed comparison guides, implementation tutorials, and clearly structured FAQs is more likely to be cited in AI-generated responses than a competitor relying solely on promotional landing pages.

This is where Answer Engine Optimization (AEO) is becoming important. Unlike traditional SEO, which focuses on improving rankings in search results, AEO focuses on increasing the likelihood that content will be selected, cited, or referenced within AI-generated responses.

AI-powered search systems evaluate content differently from traditional search engines. Rather than simply matching keywords, they attempt to identify sources that provide clear explanations, authoritative information, and direct answers to user questions. Content that is well structured, factually accurate, and easy to interpret tends to perform better in these environments.

Platforms such as DirJournal, an answer engine optimization platform, help businesses understand how their content appears across AI-driven search environments. As teams adapt to changing search behaviour, they're increasingly monitoring not only search rankings but also the frequency with which AI systems reference their brands, products, and expertise.

The websites that succeed in this environment are often those that publish clear, authoritative content supported by strong technical foundations.

In many cases, the same practices that improve traditional SEO also support AI discoverability. Fast websites, structured data, authoritative content, and clear information architecture all contribute to better visibility.

Content Quality Is More Important Than Ever

Technology can improve delivery, but content remains the primary reason users visit a website.

AI systems are becoming increasingly effective at identifying expertise, authority, and relevance. Thin content designed solely for search rankings is becoming less effective.

Modern websites must provide genuine value. They need original insights, practical examples, clear explanations, and trustworthy information.

For example, a cybersecurity vendor might publish original research on emerging threats, while a healthcare provider could create evidence-based patient guides reviewed by medical professionals. Content grounded in expertise tends to earn greater trust and visibility.

Developers building content-driven websites should think beyond page views and rankings. The goal is to create resources that answer real questions and solve real problems.

Content that demonstrates expertise is more likely to earn links, generate engagement, and be referenced by both search engines and AI systems.

The websites that stand out now are those that prioritize usefulness over optimization tricks.

User Experience Is the New Differentiator

As technology becomes more accessible, user experience becomes a larger competitive advantage.

Visitors expect intuitive navigation, accessible interfaces, responsive layouts, and consistent performance across devices.

Simple improvements such as reducing the number of checkout steps, increasing button sizes on mobile devices, or ensuring keyboard navigation works correctly can significantly improve usability and conversion rates.

Poor user experiences create friction that drives users away regardless of how advanced the underlying technology may be.

Accessibility deserves particular attention. Websites should be usable by people with diverse abilities and assistive technologies. Accessibility improvements often enhance usability for all visitors while supporting compliance requirements.

The best websites combine technical excellence with thoughtful design. They remove obstacles and help users accomplish their goals quickly and efficiently.

The Future Is About Outcomes, Not Frameworks

The web development industry has reached a point where most modern frameworks are capable of delivering excellent results.

The real challenge is no longer choosing the perfect technology stack.

Success depends on building websites that are fast, discoverable, reliable, secure, and understandable to both humans and machines. Performance optimization, domain management, hosting strategy, structured data, content quality, and AI search visibility now play a larger role in determining outcomes.

These days, the websites that succeed aren't necessarily built with the newest technologies. They're built with the strongest foundations.

Developers who focus on those foundations will create websites that continue to perform well regardless of how search engines, AI systems, or frontend frameworks evolve in the years ahead.

Hope you enjoyed this article. You can connect with me on LinkedIn.

How Large-Scale Platforms Handle Millions of Daily Transactions

Manish Shivanandhan — Sat, 13 Jun 2026 06:50:15 +0000

Every day, millions of people order food, stream videos, send messages, book rides, make payments, and shop online. Most of these actions take only a few seconds from the user's perspective. A user clicks a button, and the platform responds almost instantly.

Behind the scenes, however, these platforms are processing enormous numbers of transactions. A single popular application may handle thousands of requests every second and millions of transactions every day. Each transaction must be processed accurately, securely, and quickly.

In this article, we'll explore how large-scale platforms manage massive transaction volumes, the engineering challenges involved, and the architectural patterns developers use to build reliable systems.

What We'll Cover:

Why Transaction Volume Creates Unique Challenges
Breaking Monoliths Into Services
Using Load Balancers to Distribute Traffic
Why Databases Become Bottlenecks
Caching Frequently Accessed Data
Processing Tasks Asynchronously
Preventing Duplicate Transactions
Monitoring Everything
Preparing for Traffic Spikes
Building for Failure
The Importance of Consistency and Reliability
Conclusion

Why Transaction Volume Creates Unique Challenges

Handling a few hundred transactions per day is relatively straightforward. A single server and database can often manage the workload without difficulty. The challenge emerges as usage grows and systems begin serving thousands or even millions of users simultaneously.

Consider an online marketplace operating across multiple countries. At any given moment, thousands of users may be placing orders. Inventory must be updated in real time, payments must be processed accurately, notifications must be delivered, and fraud detection systems must evaluate transactions before approval. All of this happens within seconds.

At scale, even a minor delay can affect thousands of users. Systems must maintain low response times while preventing database bottlenecks, avoiding duplicate transactions, handling unexpected traffic spikes, and remaining reliable when failures occur.

To solve these problems, engineering teams rely on distributed systems and scalable architectural patterns.

Breaking Monoliths Into Services

Many successful platforms begin as monolithic applications where all functionality exists within a single codebase. While this approach works well during the early stages of growth, it can become increasingly difficult to scale as transaction volume increases.

To overcome this limitation, large platforms often adopt a service-oriented architecture. Instead of one application handling every responsibility, individual services are created for specific business functions such as user management, payments, inventory, notifications, and analytics.

A simplified order-processing workflow might look like this:

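def create_order(user_id, product_id):
    # Hold the item first so it can't be sold twice while the payment runs
    inventory.reserve(product_id)

    payment_result = payment.charge(user_id)

    if payment_result.success:
        order.create(user_id, product_id)
        notification.send_confirmation(user_id)
    else:
        # Payment failed: free the held item (release() is a hypothetical counterpart to reserve())
        inventory.release(product_id)

    return payment_result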

This separation allows each service to scale independently. If payment activity suddenly increases, engineers can allocate additional resources specifically to the payment service without affecting the rest of the platform. It also lets teams develop, deploy, and maintain services independently, improving both agility and reliability.

Using Load Balancers to Distribute Traffic

Relying on a single server to handle millions of daily transactions creates both a capacity ceiling and a single point of failure. To distribute incoming requests efficiently, platforms place load balancers in front of their application servers.

Instead of connecting directly to a server, users send requests to a load balancer. The load balancer determines which server is best positioned to handle each request based on factors such as current load, availability, and health status.

A simplified architecture looks like this:

Users
   |
Load Balancer
   |
-------------------
|        |        |
Server1 Server2 Server3

If one server becomes overloaded or fails, traffic can be redirected to healthier servers. This improves both performance and availability. Modern cloud providers offer managed load-balancing solutions that automatically distribute traffic based on resource utilization and server health.
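
The routing behaviour described above can be sketched in a few lines of Python. This is a toy round-robin balancer with manual health flags, not how any particular managed load balancer works. The class and method names are assumptions made for this example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates through servers and skips unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # In a real system a periodic health check would call this
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # At most one full lap is needed to find a healthy server
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["server1", "server2", "server3"])
print([balancer.pick() for _ in range(3)])
balancer.mark_down("server2")
print([balancer.pick() for _ in range(3)])
```

Round robin ignores how busy each server is. Real load balancers often use strategies such as least connections or weighted routing, and they detect failures through automated health checks.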

Why Databases Become Bottlenecks

Scaling application servers is often relatively easy: when they are stateless, more instances can simply be added behind the load balancer. But databases frequently become the most significant bottleneck in transaction-heavy systems.

Every transaction ultimately requires reading or writing data. Consider an online task management platform where users complete tasks and receive rewards. Each completed task may trigger multiple database operations, including verification of task completion, updating account balances, recording transaction history, and generating audit logs.

As transaction volume grows, database performance becomes critical. One common solution is read replication. Instead of relying on a single database instance, platforms create multiple replicas that handle read requests while the primary database focuses on write operations.

The architecture may resemble the following:

Primary DB
     |
-------------------------
|         |            |
Replica1 Replica2 Replica3

By distributing read traffic across multiple replicas, platforms reduce pressure on the primary database and improve response times for users.
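
Here is a minimal sketch of how an application might route queries in this setup. It assumes connections are represented by simple names and that any statement starting with SELECT is a read. Both are simplifications for illustration, and the class name ReplicatedDatabase is hypothetical.

```python
import random


class ReplicatedDatabase:
    """Routes reads to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def route(self, sql: str):
        # Naive read detection; real routers also handle transactions, CTEs, etc.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.replicas:
            return random.choice(self.replicas)
        # Writes (and reads when no replica exists) go to the primary
        return self.primary


db = ReplicatedDatabase("primary", ["replica1", "replica2", "replica3"])
print(db.route("SELECT balance FROM accounts WHERE id = 1"))
print(db.route("UPDATE accounts SET balance = 90 WHERE id = 1"))
```

One caveat worth keeping in mind: replicas usually lag slightly behind the primary. A read that must reflect a write the user has just made, such as a balance shown right after a payment, is often sent to the primary instead.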

Caching Frequently Accessed Data

Not every request needs to reach the database. In fact, repeatedly querying the database for the same information can significantly increase infrastructure costs and response times.

To address this, platforms use caching systems such as Redis to store frequently accessed data in memory. Information such as user profiles, product details, and application settings often changes infrequently and can be retrieved directly from the cache.

Without caching:

user = database.get_user(user_id)

With caching:

user = cache.get(user_id)

if user is None:
    # Cache miss: load from the database and keep a copy for later requests
    user = database.get_user(user_id)
    cache.set(user_id, user)

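Reading from an in-memory cache is typically much faster than querying a database. When a platform processes millions of requests every day, caching can dramatically improve performance while reducing backend load. The trade-off is freshness: cached entries need an expiry time or explicit invalidation so that users don't see stale data.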
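
The lookup pattern above is often called cache-aside. Below is a self-contained sketch with a small in-memory cache that expires entries after a fixed time. TTLCache is a stand-in written for this example, not the Redis API, and the 60-second expiry is an arbitrary choice.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    user = cache.get(user_id)
    if user is None:
        user = load_from_db(user_id)
        cache.set(user_id, user)
    return user


cache = TTLCache(ttl_seconds=60)
print(get_user(42, cache, lambda uid: {"id": uid, "name": "Ada"}))
```

With a shared cache such as Redis, the same idea is usually expressed by setting an expiry on each key so that every application server sees the same cached data.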

Processing Tasks Asynchronously

Users expect immediate responses. If every operation must finish before the system responds, applications quickly become sluggish under heavy load.

To improve responsiveness, large-scale systems separate critical user-facing actions from background processing tasks. Consider a payment transaction. The user needs confirmation that the payment was successful, but they don't need to wait for analytics updates, report generation, or email delivery.

A synchronous implementation might look like this:

process_payment()
send_email()
update_analytics()
generate_report()

A more scalable approach uses message queues:

process_payment()

queue.publish("send_email")
queue.publish("update_analytics")
queue.publish("generate_report")

Background workers consume these queued tasks and process them independently. This architecture improves user experience and enables systems to handle significantly larger transaction volumes.
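
The following runnable sketch shows the same idea using only Python's standard library. An in-process queue.Queue and one worker thread stand in for a real message broker and a pool of worker processes. The task names mirror the example above, and everything else is an assumption for illustration.

```python
import queue
import threading

tasks = queue.Queue()
completed = []


def worker():
    """Background worker: pulls tasks off the queue until it sees the stop signal."""
    while True:
        task = tasks.get()
        if task is None:  # None is used here as a stop signal
            tasks.task_done()
            break
        completed.append(f"done:{task}")  # placeholder for the real work
        tasks.task_done()


def process_payment():
    """Do the critical work, queue the rest, and return to the user right away."""
    for name in ("send_email", "update_analytics", "generate_report"):
        tasks.put(name)
    return "payment confirmed"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

result = process_payment()  # returns without waiting for the background tasks
tasks.join()                # only for the demo: wait until the queue is drained
tasks.put(None)
thread.join()
print(result, completed)
```

In production, the queue would normally live outside the application process so that queued tasks survive restarts and can be consumed by many workers.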

Preventing Duplicate Transactions

One of the most important challenges in transaction processing is preventing duplicate execution.

Network interruptions can create situations where users unknowingly submit the same request multiple times. Imagine a customer making a purchase. The payment succeeds, but the confirmation never reaches the user's device because of a network failure. Believing the payment failed, the customer clicks the button again.

Without safeguards, the platform could charge the customer twice.

Many systems solve this problem through idempotency keys. A simplified implementation looks like this:

def process_payment(request_id, amount):

    if payment_exists(request_id):
        return existing_payment(request_id)

    payment = create_payment(request_id, amount)
    return payment

If the same request arrives again, the system returns the original result instead of processing a second payment. This pattern is widely used in financial services, payment gateways, and banking applications.
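
One detail the simplified version leaves out is concurrency: two copies of the same request can arrive at almost the same moment, and both could pass the existence check before either creates the payment. The sketch below closes that gap with a lock around an in-memory store. The class name and the dictionary-based payment record are assumptions for illustration. Real systems commonly rely on a unique constraint on the idempotency key in the database instead.

```python
import threading


class PaymentProcessor:
    """In-memory idempotent payment handler (illustrative only)."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()
        self.charges_made = 0

    def process_payment(self, request_id, amount):
        # The lock makes "check, then create" a single atomic step
        with self._lock:
            if request_id in self._results:
                return self._results[request_id]
            self.charges_made += 1  # stands in for the real charge
            payment = {"request_id": request_id, "amount": amount, "status": "charged"}
            self._results[request_id] = payment
            return payment


processor = PaymentProcessor()
first = processor.process_payment("req-123", 50)
retry = processor.process_payment("req-123", 50)  # same key: no second charge
print(first is retry, processor.charges_made)
```

The client generates the request ID once per purchase attempt and sends the same value on every retry, which is what lets the server recognise the duplicate.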

Monitoring Everything

As systems grow more complex, visibility becomes essential. Engineering teams can't effectively troubleshoot issues they can't observe.

Modern platforms collect metrics from every layer of their infrastructure. Engineers continuously monitor request latency, database response times, error rates, queue depth, CPU utilization, and memory consumption.

A simple monitoring rule might look like this:

# error_rate: percentage of requests that failed over the last few minutes
if error_rate > 5:
    alert("High error rate detected")

Monitoring enables teams to identify problems before they impact users. It also provides valuable data for performance optimization and future capacity planning.
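
To show where a number like error_rate might come from, here is a small sketch that tracks the outcome of the most recent requests and computes the failure percentage over that window. The window size of 100 requests and the 5 percent threshold are arbitrary values chosen for this example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent request outcomes and flags a high failure percentage."""

    def __init__(self, window_size=100, threshold_percent=5.0):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)
        self.threshold_percent = threshold_percent

    def record(self, success: bool):
        self.outcomes.append(success)

    def error_rate(self) -> float:
        """Failure percentage over the current window (0.0 if no data yet)."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return 100.0 * failures / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold_percent


monitor = ErrorRateMonitor()
for ok in [True] * 90 + [False] * 10:
    monitor.record(ok)
print(monitor.error_rate(), monitor.should_alert())
```

Dedicated monitoring tools usually evaluate rules like this over time windows rather than request counts, and they often require the condition to hold for several minutes before alerting to avoid noise.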

Preparing for Traffic Spikes

Traffic patterns are rarely predictable. An e-commerce platform may experience enormous demand during holiday sales, while a ticketing website can receive millions of requests within minutes when a popular event goes live.

To handle these surges, platforms rely on autoscaling. Cloud infrastructure can automatically add resources as demand increases and remove them when traffic subsides.

A simplified scaling rule might look like this:

# cpu_usage: average CPU utilization (percent) across the server pool
if cpu_usage > 70:
    add_server()

Autoscaling helps maintain performance during peak periods while controlling infrastructure costs during quieter times.
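
A slightly fuller version of that rule also scales down and respects lower and upper bounds on the fleet size. The sketch below is a simplified decision function. The thresholds and limits are made-up defaults, not values from any cloud provider.

```python
def desired_servers(current, cpu_percent, min_servers=2, max_servers=20,
                    scale_up_at=70.0, scale_down_at=30.0):
    """Return how many servers should be running given average CPU usage."""
    if cpu_percent > scale_up_at:
        target = current + 1      # under pressure: add capacity
    elif cpu_percent < scale_down_at:
        target = current - 1      # mostly idle: remove capacity to save cost
    else:
        target = current          # within the comfortable range: do nothing
    # Never go below the minimum (for redundancy) or above the budget limit
    return max(min_servers, min(max_servers, target))


print(desired_servers(current=4, cpu_percent=85))
print(desired_servers(current=4, cpu_percent=10))
```

The gap between the two thresholds matters. If the scale-up and scale-down points are too close together, the system can flip back and forth between adding and removing servers.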

Building for Failure

One of the most important principles in distributed systems is accepting that failures are inevitable.

Servers crash. Databases become unavailable. Networks experience interruptions. Rather than hoping these events never occur, large-scale platforms design systems that can continue operating when failures happen.

For example, payment systems often include retry logic:

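for attempt in range(3):
    try:
        charge_customer()
        break
    except TransientPaymentError:  # hypothetical retryable error type
        if attempt == 2:
            raise  # out of attempts: surface the failure instead of hiding it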
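
The loop above retries immediately, which can add load to a service that is already struggling. A common refinement is to wait longer between attempts and add some randomness (jitter) so that many clients don't retry at the same instant. Here is a minimal sketch of that idea. The helper name, the TransientError class, and the delay values are assumptions for illustration.

```python
import random
import time


class TransientError(Exception):
    """Stands in for errors that are safe to retry, such as timeouts."""


def retry_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the failure
            delay = base_delay * 2 ** (attempt - 1)   # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay))   # add jitter


attempts = {"count": 0}


def flaky_charge():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("gateway timeout")
    return "charged"


print(retry_with_backoff(flaky_charge, base_delay=0.01))
```

Retrying a charge is only safe when the operation is idempotent, which is why retries and idempotency keys are usually used together.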

In addition, platforms implement redundancy by running multiple instances of critical components across different geographic regions and availability zones. If one component fails, another can take over with minimal disruption.

This strategy significantly improves availability and resilience.

The Importance of Consistency and Reliability

At scale, transaction processing isn't solely about speed. Accuracy is equally important.

Users may tolerate a slight delay, but they won't tolerate duplicate charges, missing funds, incorrect balances, or lost transactions. For this reason, large-scale transaction systems place a strong emphasis on consistency, auditing, logging, reconciliation, and recovery mechanisms.

Every transaction must be traceable. Every failure must be recoverable. These requirements become particularly important in industries such as finance, e-commerce, subscription billing, and task earning platforms where money and rewards move between users and businesses every day.
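
Reconciliation is one of the simpler mechanisms to illustrate. The sketch below compares an internal ledger with the records reported by a payment processor and lists anything that doesn't match. The data shapes (transaction ID mapped to an amount in cents) and the function name are assumptions made for this example.

```python
def reconcile(ledger, processor):
    """Compare two {transaction_id: amount_in_cents} mappings and report differences."""
    ledger_ids, processor_ids = set(ledger), set(processor)
    return {
        # Recorded internally but unknown to the processor
        "missing_from_processor": sorted(ledger_ids - processor_ids),
        # Charged by the processor but never recorded internally
        "missing_from_ledger": sorted(processor_ids - ledger_ids),
        # Present on both sides with different amounts
        "amount_mismatch": sorted(
            tx for tx in ledger_ids & processor_ids if ledger[tx] != processor[tx]
        ),
    }


ledger = {"tx-1": 1000, "tx-2": 2500, "tx-3": 400}
processor = {"tx-1": 1000, "tx-2": 2600, "tx-4": 900}
report = reconcile(ledger, processor)
print(report)
```

Amounts are kept as integer cents here because floating-point values can introduce rounding differences that look like mismatches.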

Conclusion

The ability to handle millions of daily transactions isn't the result of a single technology. It comes from combining multiple architectural principles that work together to create reliable, scalable systems.

Large-scale platforms distribute traffic across multiple servers, separate responsibilities into specialized services, cache frequently accessed data, process background work asynchronously, continuously monitor system health, and design for inevitable failures.

For developers, understanding these patterns provides valuable insight into how modern internet platforms operate behind the scenes. Whether you're building a payment processor, a SaaS platform, an online marketplace, or a task earning application, the same foundational principles apply.

As systems grow, scalability becomes less about writing more code and more about designing architecture that remains reliable under increasing demand. The platforms that succeed are the ones capable of delivering fast, accurate, and consistent transactions regardless of how many users arrive.

Hope you enjoyed this article. You can connect with me on LinkedIn.

Beyond NVIDIA: Where the AI Infra Trade Actually Shows Up

Nikhil Adithyan — Fri, 29 May 2026 22:26:37 +0000

The AI capex trade is usually discussed as if it were one clean idea. Capex simply means capital expenditure, or the money companies spend on long-term assets like data centers, chips, servers, power systems, and other infrastructure.

NVIDIA. Hyperscalers. Data centers. Power demand. Everything gets pushed into the same bucket and called "AI infrastructure."

But I don't think this is very useful anymore.

Capex doesn't move through the market as a headline. It moves through a chain. A cloud company decides to spend more on AI infrastructure, but that spending has to pass through chips, semiconductor equipment, servers, networking, data centers, power systems, cooling, and construction before it becomes usable compute.

That's where the story gets more interesting.

The obvious AI names still matter, but they're not the whole map. If AI capex is becoming one of the biggest investment cycles in the market, then the better question isn't just:

"Which companies are AI stocks?"

It's this:

"Where does the money actually travel?"

In this article, we'll use Python and EODHD data to build a simple AI capex map. The goal isn't to create a buy list. The goal is to separate the theme into layers, compare fundamentals with market recognition, and see where the AI infrastructure trade is already showing up in the data.

What We'll Cover:

Prerequisites
What We're Investigating
Import the Required Packages
Building the AI Capex Universe
Pulling the Financial Data Behind the Story
- Fundamentals Data
- Historical Prices Data
Separating Business Strength from Market Recognition
- Fundamental Signal
- Market Recognition Signal
The AI Capex Matrix: Where the Trade Actually Shows Up
Which AI Infrastructure Layers Has the Market Rewarded Most?
The Physical Infrastructure Layer Is No Longer Hidden
What the Market Has Already Noticed
What This Study Shows
Conclusion

Prerequisites

Before following along, you should be comfortable with basic Python, especially working with dictionaries, lists, functions, and pandas DataFrames.

You’ll also need:

Python 3.9 or later
An EODHD API key
The following Python libraries: requests, pandas, numpy, and matplotlib
Basic familiarity with financial metrics like revenue growth, profit margin, P/E ratio, stock returns, volatility, and drawdown

You don’t need advanced finance knowledge for this article. The goal is to show how data visualization can help map a market theme, not to build a complete valuation model or stock recommendation engine.

What We're Investigating

The lazy version of this article would be a list of AI stocks.

That's not what I want to do here.

The more useful approach is to treat AI capex as a spending chain and ask where each part of that chain appears in the market.

A company selling GPUs is exposed to the theme in one way. A company building electrical systems for data centers is exposed in a completely different way. Both can benefit from the same capex cycle, but the economics, margins, valuation, and market behavior may look very different.

So the investigation has three parts.

First, we'll create a working AI infrastructure universe across layers like chips, semiconductor equipment, servers, networking, data centers, power, cooling, and construction.

Second, we'll pull fundamentals and price data from EODHD to measure two things:

Fundamental signal: Is the business showing growth and profitability?
Market recognition signal: Has the stock already been rewarded by the market?

Third, we'll map the companies into a matrix and look for patterns.

The main output isn't a ranking of the "best AI infrastructure stocks." It's a clearer view of where the AI capex trade has already shown up, where it looks concentrated, and where the physical infrastructure layer starts becoming hard to ignore.

Import the Required Packages

We'll keep the setup light. This is an analysis notebook, not a production system.

import requests
import pandas as pd
import numpy as np
from datetime import date, timedelta
import matplotlib.pyplot as plt

These packages cover everything we need here.

requests will call the EODHD API, pandas will handle the tables, and numpy will help with basic calculations. We'll use date and timedelta for the one-year price window, and matplotlib for the charts.

Building the AI Capex Universe

There's one issue with analyzing AI infrastructure stocks: AI capex exposure isn't a clean financial field.

No API directly tells us that a company is "30% exposed to AI data center spending" or "highly tied to GPU infrastructure." So we need a research universe first.

For this article, I used an LLM as a research assistant to draft the first version of the AI capex chain, then manually reviewed the companies before pulling fundamentals and price data from EODHD.

The universe is split into layers:

Demand-side hyperscalers
AI compute and chips
Semiconductor equipment
Servers and storage
Networking
Data centers
Power and electrification
Cooling and industrial systems
Construction and engineering

ai_capex_universe = [
    {'ticker': 'MSFT.US', 'company': 'Microsoft', 'capex_layer': 'Demand-side hyperscalers', 'exposure_level': 'High', 'reason': 'Major cloud and AI infrastructure spender through Azure'},
    {'ticker': 'AMZN.US', 'company': 'Amazon', 'capex_layer': 'Demand-side hyperscalers', 'exposure_level': 'High', 'reason': 'Large AI and cloud infrastructure spender through AWS'},
    {'ticker': 'GOOGL.US', 'company': 'Alphabet', 'capex_layer': 'Demand-side hyperscalers', 'exposure_level': 'High', 'reason': 'Major AI infrastructure spender across Google Cloud and internal AI systems'},
    {'ticker': 'META.US', 'company': 'Meta Platforms', 'capex_layer': 'Demand-side hyperscalers', 'exposure_level': 'High', 'reason': 'Large AI compute and data center spending program'},

    {'ticker': 'NVDA.US', 'company': 'NVIDIA', 'capex_layer': 'AI compute and chips', 'exposure_level': 'Very High', 'reason': 'Core GPU and accelerator supplier for AI training and inference'},
    {'ticker': 'AMD.US', 'company': 'Advanced Micro Devices', 'capex_layer': 'AI compute and chips', 'exposure_level': 'High', 'reason': 'AI accelerator and data center CPU exposure'},
    {'ticker': 'AVGO.US', 'company': 'Broadcom', 'capex_layer': 'AI compute and chips', 'exposure_level': 'High', 'reason': 'Custom silicon and networking exposure for AI infrastructure'},
    {'ticker': 'MRVL.US', 'company': 'Marvell Technology', 'capex_layer': 'AI compute and chips', 'exposure_level': 'High', 'reason': 'Custom silicon, networking, and data infrastructure exposure'},

    {'ticker': 'AMAT.US', 'company': 'Applied Materials', 'capex_layer': 'Semiconductor equipment', 'exposure_level': 'High', 'reason': 'Supplies equipment used in advanced chip manufacturing'},
    {'ticker': 'LRCX.US', 'company': 'Lam Research', 'capex_layer': 'Semiconductor equipment', 'exposure_level': 'High', 'reason': 'Semiconductor manufacturing equipment supplier'},
    {'ticker': 'KLAC.US', 'company': 'KLA', 'capex_layer': 'Semiconductor equipment', 'exposure_level': 'High', 'reason': 'Process control and inspection tools for chip manufacturing'},
    {'ticker': 'ASML.US', 'company': 'ASML', 'capex_layer': 'Semiconductor equipment', 'exposure_level': 'Very High', 'reason': 'Critical lithography equipment supplier for advanced chips'},

    {'ticker': 'DELL.US', 'company': 'Dell Technologies', 'capex_layer': 'Servers and storage', 'exposure_level': 'High', 'reason': 'AI server and enterprise hardware exposure'},
    {'ticker': 'HPE.US', 'company': 'Hewlett Packard Enterprise', 'capex_layer': 'Servers and storage', 'exposure_level': 'Medium', 'reason': 'Server, storage, and enterprise infrastructure exposure'},
    {'ticker': 'SMCI.US', 'company': 'Super Micro Computer', 'capex_layer': 'Servers and storage', 'exposure_level': 'High', 'reason': 'AI server systems and data center hardware exposure'},

    {'ticker': 'ANET.US', 'company': 'Arista Networks', 'capex_layer': 'Networking', 'exposure_level': 'High', 'reason': 'Data center networking supplier tied to AI cluster buildouts'},
    {'ticker': 'CSCO.US', 'company': 'Cisco', 'capex_layer': 'Networking', 'exposure_level': 'Medium', 'reason': 'Networking and enterprise infrastructure exposure'},

    {'ticker': 'EQIX.US', 'company': 'Equinix', 'capex_layer': 'Data centers', 'exposure_level': 'Medium', 'reason': 'Global data center and interconnection infrastructure'},
    {'ticker': 'DLR.US', 'company': 'Digital Realty', 'capex_layer': 'Data centers', 'exposure_level': 'Medium', 'reason': 'Data center real estate exposure'},

    {'ticker': 'VRT.US', 'company': 'Vertiv', 'capex_layer': 'Power and electrification', 'exposure_level': 'High', 'reason': 'Power and thermal infrastructure for data centers'},
    {'ticker': 'ETN.US', 'company': 'Eaton', 'capex_layer': 'Power and electrification', 'exposure_level': 'Medium', 'reason': 'Electrical systems and power management exposure'},
    {'ticker': 'PWR.US', 'company': 'Quanta Services', 'capex_layer': 'Power and electrification', 'exposure_level': 'Medium', 'reason': 'Grid, power, and infrastructure construction exposure'},
    {'ticker': 'CEG.US', 'company': 'Constellation Energy', 'capex_layer': 'Power and electrification', 'exposure_level': 'Medium', 'reason': 'Power demand beneficiary from data center expansion'},

    {'ticker': 'TT.US', 'company': 'Trane Technologies', 'capex_layer': 'Cooling and industrial systems', 'exposure_level': 'Medium', 'reason': 'Cooling and climate systems exposure for buildings and infrastructure'},
    {'ticker': 'CARR.US', 'company': 'Carrier Global', 'capex_layer': 'Cooling and industrial systems', 'exposure_level': 'Medium', 'reason': 'Cooling, HVAC, and infrastructure systems exposure'},
    {'ticker': 'JCI.US', 'company': 'Johnson Controls', 'capex_layer': 'Cooling and industrial systems', 'exposure_level': 'Medium', 'reason': 'Building systems, controls, and cooling infrastructure exposure'},

    {'ticker': 'EME.US', 'company': 'EMCOR Group', 'capex_layer': 'Construction and engineering', 'exposure_level': 'Medium', 'reason': 'Electrical and mechanical construction exposure'},
    {'ticker': 'FIX.US', 'company': 'Comfort Systems USA', 'capex_layer': 'Construction and engineering', 'exposure_level': 'Medium', 'reason': 'Mechanical and electrical services for commercial infrastructure'}
]

universe = pd.DataFrame(ai_capex_universe)

universe.head()

This gives us the research universe.

The important thing is that this table doesn't prove anything by itself. It only defines the map. The actual comparison comes from the fundamentals and historical price data we pull next.

Pulling the Financial Data Behind the Story

The universe gives us the map, but the map is not the analysis.

Now we need actual data behind each company. For that, we'll use EODHD fundamentals and historical prices.

The fundamentals help us check business strength. The price data helps us see whether the market has already recognized the company as part of the AI capex trade.

Fundamentals Data

First, we'll pull fundamentals using EODHD's fundamentals endpoint.

api_key = 'YOUR EODHD API KEY'

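def get_fundamentals(ticker):
    url = f'https://eodhd.com/api/fundamentals/{ticker}?api_token={api_key}&fmt=json'
    # A timeout stops one stalled request from hanging the whole scan;
    # raise_for_status turns HTTP errors into exceptions the caller can catch
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()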

Note: Replace YOUR EODHD API KEY with your actual EODHD API key.

This function calls the fundamentals endpoint for one ticker and returns the full JSON response.
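As a side note, the same URL can be assembled with the standard library so that query parameters are escaped consistently. This is only a sketch: `build_eodhd_url` and `DEMO_TOKEN` are hypothetical names used for illustration, and the base URL is simply the one that appears in the f-string above.

```python
from urllib.parse import urlencode, quote

EODHD_BASE = 'https://eodhd.com/api'  # base URL copied from the article's f-string

def build_eodhd_url(endpoint, ticker, api_token, **params):
    """Hypothetical helper: build <base>/<endpoint>/<ticker>?api_token=...&fmt=json."""
    query = urlencode({'api_token': api_token, 'fmt': 'json', **params})
    return f'{EODHD_BASE}/{endpoint}/{quote(ticker)}?{query}'

# DEMO_TOKEN is a placeholder, not a real key
url = build_eodhd_url('fundamentals', 'MSFT.US', 'DEMO_TOKEN')
print(url)

# 'from' is a Python keyword, so date-range parameters are passed as a dict
price_url = build_eodhd_url('eod', 'NVDA.US', 'DEMO_TOKEN',
                            **{'from': '2025-01-01', 'to': '2025-12-31', 'period': 'd'})
print(price_url)
```

The dates in the second call are made up. In the notebook they would come from the one-year window.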

We don't need the entire response for this analysis, so we'll extract only the fields we care about.

def extract_fundamental_fields(ticker, data):
    general = data.get('General', {})
    highlights = data.get('Highlights', {})
    valuation = data.get('Valuation', {})
    technicals = data.get('Technicals', {})

    return {
        'ticker': ticker,
        'sector': general.get('Sector'),
        'industry': general.get('Industry'),
        'market_cap': highlights.get('MarketCapitalization'),
        'revenue_growth_yoy': highlights.get('QuarterlyRevenueGrowthYOY'),
        'profit_margin': highlights.get('ProfitMargin'),
        'operating_margin': highlights.get('OperatingMarginTTM'),
        'return_on_equity': highlights.get('ReturnOnEquityTTM'),
        'pe_ratio': highlights.get('PERatio'),
        'forward_pe': valuation.get('ForwardPE'),
        'beta': technicals.get('Beta')
    }

These fields give us a compact view of growth, profitability, valuation, and company context.
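To see how the nested `.get` lookups behave when part of a response is missing, here is a small sketch on a made-up payload. The `sample` dictionary and the `pick` helper are hypothetical. Only the section and field names come from the function above.

```python
# Hypothetical, trimmed-down response: 'Technicals' is missing on purpose
sample = {
    'General': {'Sector': 'Technology', 'Industry': 'Semiconductors'},
    'Highlights': {'MarketCapitalization': 1000, 'ProfitMargin': 0.25},
    'Valuation': {},
}

def pick(data, section, field):
    """Return data[section][field], or None when either level is absent or null."""
    return (data.get(section) or {}).get(field)

row = {
    'sector': pick(sample, 'General', 'Sector'),
    'profit_margin': pick(sample, 'Highlights', 'ProfitMargin'),
    'forward_pe': pick(sample, 'Valuation', 'ForwardPE'),
    'beta': pick(sample, 'Technicals', 'Beta'),
}
print(row)
```

The `or {}` also covers a section that is present but set to null, a case where `data.get('Technicals', {})` would return `None` and the following `.get` would raise.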

Now we can run this across the full universe.

fundamental_rows = []

for ticker in universe['ticker']:
    try:
        data = get_fundamentals(ticker)
        row = extract_fundamental_fields(ticker, data)
        fundamental_rows.append(row)
        print(f'{ticker} DONE')

    except Exception as e:
        fundamental_rows.append({
            'ticker': ticker,
            'sector': np.nan,
            'industry': np.nan,
            'market_cap': np.nan,
            'revenue_growth_yoy': np.nan,
            'profit_margin': np.nan,
            'operating_margin': np.nan,
            'return_on_equity': np.nan,
            'pe_ratio': np.nan,
            'forward_pe': np.nan,
            'beta': np.nan
        })
        print(f'{ticker} ERROR: {e}')

fundamentals = pd.DataFrame(fundamental_rows)

fundamentals.head()

The try block keeps the scan moving if one ticker fails. That matters because this universe mixes different types of companies, and one missing response should not break the whole analysis.

Historical Prices Data

Next, we'll pull one year of historical prices using EODHD's historical end-of-day prices endpoint.

price_start = date.today() - timedelta(days=365)
price_end = date.today()

def get_price_history(ticker):
    url = f'https://eodhd.com/api/eod/{ticker}?api_token={api_key}&fmt=json&from={price_start.isoformat()}&to={price_end.isoformat()}&period=d'
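    response = requests.get(url, timeout=30)
    response.raise_for_status()  # surface HTTP errors to the caller's try/except
    data = response.json()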
    prices = pd.DataFrame(data)

    if prices.empty:
        return pd.DataFrame()

    prices['date'] = pd.to_datetime(prices['date'], errors='coerce')
    prices['adjusted_close'] = pd.to_numeric(prices['adjusted_close'], errors='coerce')

    prices = prices.dropna(subset=['date', 'adjusted_close'])
    prices = prices.sort_values('date').reset_index(drop=True)

    return prices[['date', 'adjusted_close']]

We use adjusted close because it's cleaner for return calculations after splits and dividends.
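A quick illustration of why the raw close can mislead. The prices below are invented and model only a 2-for-1 split. A real adjusted series also accounts for dividends, which this sketch ignores.

```python
# Hypothetical prices around a 2-for-1 split: the raw close halves overnight,
# while the adjusted series rescales the pre-split prices.
raw_close = [100.0, 102.0, 51.0, 52.0]
split_factor = 2.0
split_index = 2  # first trading day after the split

adjusted_close = [
    price / split_factor if i < split_index else price
    for i, price in enumerate(raw_close)
]

def total_return(series):
    """Simple return from the first to the last observation."""
    return series[-1] / series[0] - 1

print(round(total_return(raw_close), 4))       # -0.48
print(round(total_return(adjusted_close), 4))  # 0.04
```

The raw series shows a 48% loss that never happened, while the adjusted series shows the 4% gain an investor would have seen.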

Now we'll convert the price history into a few market signals.

def calculate_market_signals(prices):
    if prices.empty or len(prices) < 60:
        return {
            'return_1y': np.nan,
            'return_6m': np.nan,
            'return_3m': np.nan,
            'volatility_1y': np.nan,
            'max_drawdown_1y': np.nan
        }

    prices = prices.copy()
    prices['daily_return'] = prices['adjusted_close'].pct_change()

    latest_close = prices['adjusted_close'].iloc[-1]

    # Returns over the full window, ~6 months (126 trading days), and ~3 months (63 trading days)
    return_1y = (latest_close / prices['adjusted_close'].iloc[0]) - 1
    return_6m = (latest_close / prices['adjusted_close'].iloc[-126]) - 1 if len(prices) >= 126 else np.nan
    return_3m = (latest_close / prices['adjusted_close'].iloc[-63]) - 1 if len(prices) >= 63 else np.nan

    # Annualize the daily volatility using ~252 trading days per year
    volatility_1y = prices['daily_return'].std() * np.sqrt(252)

    running_high = prices['adjusted_close'].cummax()
    drawdown = (prices['adjusted_close'] / running_high) - 1
    max_drawdown_1y = drawdown.min()

    return {
        'return_1y': return_1y,
        'return_6m': return_6m,
        'return_3m': return_3m,
        'volatility_1y': volatility_1y,
        'max_drawdown_1y': max_drawdown_1y
    }

These signals tell us how strongly the market has already responded to each company.
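To make the calculations concrete, here is the same arithmetic on a four-point price series in plain Python. The prices are hypothetical and far too few for a real signal. The point is only to show what each number means.

```python
import math
import statistics

prices = [100.0, 120.0, 90.0, 110.0]  # hypothetical adjusted closes

# Return over the whole window
total_return = prices[-1] / prices[0] - 1

# Day-over-day returns, then annualized volatility
daily_returns = [b / a - 1 for a, b in zip(prices, prices[1:])]
# statistics.stdev is the sample standard deviation, the same default pandas uses
annualized_vol = statistics.stdev(daily_returns) * math.sqrt(252)

# Max drawdown: the worst drop from any running high
running_high = prices[0]
max_drawdown = 0.0
for price in prices:
    running_high = max(running_high, price)
    max_drawdown = min(max_drawdown, price / running_high - 1)

print(round(total_return, 4))  # 0.1
print(round(max_drawdown, 4))  # -0.25
```

The series ends 10% higher, but the drop from 120 to 90 along the way is a 25% drawdown. That is why the two metrics are tracked separately.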

Now we run the same logic for every ticker.

market_rows = []

for ticker in universe['ticker']:
    try:
        prices = get_price_history(ticker)
        signals = calculate_market_signals(prices)
        signals['ticker'] = ticker
        market_rows.append(signals)
        print(f'{ticker} DONE')

    except Exception:
        market_rows.append({
            'ticker': ticker,
            'return_1y': np.nan,
            'return_6m': np.nan,
            'return_3m': np.nan,
            'volatility_1y': np.nan,
            'max_drawdown_1y': np.nan
        })
        print(f'{ticker} ERROR')

market_signals = pd.DataFrame(market_rows)

market_signals.head()

Finally, we merge the universe, fundamentals, and market signals into one dataset.

capex_data = universe.merge(fundamentals, on='ticker', how='left')
capex_data = capex_data.merge(market_signals, on='ticker', how='left')

print(capex_data.columns)
capex_data.head()

Separating Business Strength from Market Recognition

Now comes the part that makes the analysis useful.

If we only look at stock returns, we end up chasing what already moved. If we only look at fundamentals, we miss how the market is actually treating the theme.

So I split the analysis into two simple signals:

Fundamental Signal: is the business showing growth and profitability?
Market Recognition Signal: has the market already rewarded the stock?

First, we need a helper function to normalize each metric.

def min_max_score(series):
    series = pd.to_numeric(series, errors='coerce')

    if series.isna().all():
        return pd.Series(0, index=series.index)

    min_val = series.min()
    max_val = series.max()

    if min_val == max_val:
        return pd.Series(0.5, index=series.index)

    return (series - min_val) / (max_val - min_val)

This brings every metric into a 0 to 1 range, so growth, margins, returns, and drawdowns can be compared without mixing raw scales.
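A plain-Python version of the same scaling shows how it behaves on small inputs. This is a sketch that mirrors the logic above, with `None` standing in for NaN, and the input numbers are made up.

```python
def min_max(values):
    """Scale numbers to the 0..1 range; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return [0] * len(values)        # nothing to scale
    low, high = min(present), max(present)
    if low == high:
        return [0.5] * len(values)      # no spread, so everyone gets the midpoint
    return [None if v is None else (v - low) / (high - low) for v in values]

print(min_max([10, 20, 30]))     # [0.0, 0.5, 1.0]
print(min_max([10, None, 30]))   # [0.0, None, 1.0]
print(min_max([10, 20, 1000]))   # the outlier pushes 20 down to about 0.01
```

Two things follow from this. The scores are relative to this universe, not absolute measures. And a single extreme value compresses everyone else toward zero, which is worth remembering when one company dominates a metric.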

Fundamental Signal

Now we build the fundamental signal.

capex_data['revenue_growth_score'] = min_max_score(capex_data['revenue_growth_yoy'])
capex_data['profit_margin_score'] = min_max_score(capex_data['profit_margin'])
capex_data['operating_margin_score'] = min_max_score(capex_data['operating_margin'])
capex_data['roe_score'] = min_max_score(capex_data['return_on_equity'])

capex_data['fundamental_signal'] = (
    capex_data['revenue_growth_score'] * 0.35 +
    capex_data['operating_margin_score'] * 0.30 +
    capex_data['profit_margin_score'] * 0.20 +
    capex_data['roe_score'] * 0.15
) * 100

capex_data['fundamental_signal'] = capex_data['fundamental_signal'].round(2)
capex_data[['ticker', 'company', 'capex_layer', 'revenue_growth_yoy', 'operating_margin', 'profit_margin', 'return_on_equity', 'fundamental_signal']].sort_values('fundamental_signal', ascending=False).head(10)

This signal isn't trying to crown the best company. It's just checking whether the business data supports the AI capex story.

In my run, NVIDIA clearly stood out because its revenue growth and margins were on a different level. But the interesting part was not only NVIDIA. Names like KLA, Arista, Broadcom, Microsoft, Meta, Lam Research, Alphabet, and Super Micro also appeared near the top for different reasons.

That already tells us something important: the AI capex chain has different types of winners. Some are high-margin platform businesses. Some are semiconductor equipment names. Some are high-growth hardware names with thinner margins.
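For reference, here is the weighted sum behind the fundamental signal worked through for a single company. The weights come from the code above. The score values are hypothetical.

```python
import math

# Weights copied from the fundamental signal above; they sum to 1.0
WEIGHTS = {'revenue_growth': 0.35, 'operating_margin': 0.30,
           'profit_margin': 0.20, 'roe': 0.15}

def fundamental_signal(scores):
    """Weighted sum of 0..1 scores, scaled to 0..100."""
    return round(sum(scores[name] * w for name, w in WEIGHTS.items()) * 100, 2)

complete = {'revenue_growth': 0.8, 'operating_margin': 0.6,
            'profit_margin': 0.5, 'roe': 0.4}
missing_roe = {**complete, 'roe': float('nan')}

print(fundamental_signal(complete))     # 62.0
print(fundamental_signal(missing_roe))  # nan
```

The second call shows a side effect worth knowing about: one missing input makes the whole signal NaN. The pandas version behaves the same way, so a ticker with any missing fundamental field has no fundamental signal.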

Market Recognition Signal

Now we build the market recognition signal.

capex_data['return_1y_score'] = min_max_score(capex_data['return_1y'])
capex_data['return_6m_score'] = min_max_score(capex_data['return_6m'])
capex_data['return_3m_score'] = min_max_score(capex_data['return_3m'])
# Drawdowns are negative numbers, so a shallower drawdown gets a higher score
capex_data['drawdown_score'] = min_max_score(capex_data['max_drawdown_1y'])

capex_data['market_recognition_signal'] = (
    capex_data['return_1y_score'] * 0.40 +
    capex_data['return_6m_score'] * 0.30 +
    capex_data['return_3m_score'] * 0.20 +
    capex_data['drawdown_score'] * 0.10
) * 100

capex_data['market_recognition_signal'] = capex_data['market_recognition_signal'].round(2)
capex_data[['ticker','company','capex_layer','return_1y','return_6m','return_3m','max_drawdown_1y','market_recognition_signal']].sort_values('market_recognition_signal', ascending=False).head(10)

This is where the story gets more interesting.

The market recognition list wasn't just filled with hyperscalers or chip names. Comfort Systems, Vertiv, Quanta Services, Dell, Applied Materials, and Lam Research showed up strongly. That is the first clear sign that the AI capex trade is spreading into the physical infrastructure layer, not staying locked inside the usual mega-cap AI basket.

The AI Capex Matrix: Where the Trade Actually Shows Up

At this point, we have two separate lenses.

The fundamental signal tells us whether the business looks strong.
The market recognition signal tells us whether the stock has already been rewarded.

Now we can put both on the same chart.

plt.figure(figsize=(12, 8))

plot_data = capex_data.dropna(
    subset=['market_recognition_signal', 'fundamental_signal', 'market_cap']
).copy()

plot_data['bubble_size'] = np.sqrt(plot_data['market_cap']) / 5000

for layer in plot_data['capex_layer'].unique():
    layer_data = plot_data[plot_data['capex_layer'] == layer]

    plt.scatter(
        layer_data['market_recognition_signal'],
        layer_data['fundamental_signal'],
        s=layer_data['bubble_size'],
        alpha=0.6,
        label=layer
    )

for _, row in plot_data.iterrows():
    if row['market_recognition_signal'] > 55 or row['fundamental_signal'] > 45:
        plt.text(row['market_recognition_signal'] + 0.8, row['fundamental_signal'] + 0.8, row['ticker'].replace('.US', ''), fontsize=10)

# Compute the medians once: they draw the quadrant lines and position the labels
median_market = plot_data['market_recognition_signal'].median()
median_fundamental = plot_data['fundamental_signal'].median()

plt.axvline(median_market, linestyle='--', linewidth=1)
plt.axhline(median_fundamental, linestyle='--', linewidth=1)

plt.text(median_market + 2, median_fundamental + 55, 'Strong fundamentals,\nmore recognized', fontsize=10)
plt.text(4, median_fundamental + 55, 'Strong fundamentals,\nless recognized', fontsize=10)
plt.text(median_market + 2, 4, 'High market recognition,\nweaker fundamentals', fontsize=10)
plt.text(4, 4, 'Less clear in this framework', fontsize=10)

plt.title('AI Capex Matrix: Fundamentals vs Market Recognition')
plt.xlabel('Market Recognition Signal')
plt.ylabel('Fundamental Signal')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This is the most useful chart in the study.

It makes one thing clear: AI capex doesn't show up in one clean cluster.

NVIDIA is the obvious fundamental outlier. That makes sense. Its growth and margins are difficult to compare with almost anything else in the universe.

But the right side of the chart is where the broader story starts. AMD, Marvell, Vertiv, Comfort Systems, Dell, Lam Research, Applied Materials, and Quanta Services show stronger market recognition. That is a very different mix of companies. Some are chip-related. Some are equipment-related. Some are physical infrastructure names.

That matters because it shows the market isn't only rewarding the most obvious AI companies. It's also rewarding the companies that help turn AI capex into actual infrastructure.

This is the main shift in the article: the AI capex trade starts looking less like a tech basket and more like a buildout chain.
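The four quadrants on the chart can also be written out as a label per company, which makes them easier to filter or count. This is a minimal sketch: the tickers and signal values are placeholders, and only the quadrant names come from the chart.

```python
import statistics

# Hypothetical (market_recognition_signal, fundamental_signal) pairs
signals = {'AAA': (80.0, 70.0), 'BBB': (20.0, 65.0),
           'CCC': (75.0, 30.0), 'DDD': (25.0, 35.0)}

median_market = statistics.median(m for m, _ in signals.values())
median_fundamental = statistics.median(f for _, f in signals.values())

def quadrant(market, fundamental):
    """Place one company relative to the two median lines."""
    strong = fundamental > median_fundamental
    recognized = market > median_market
    if strong and recognized:
        return 'Strong fundamentals, more recognized'
    if strong:
        return 'Strong fundamentals, less recognized'
    if recognized:
        return 'High market recognition, weaker fundamentals'
    return 'Less clear in this framework'

for ticker, (market, fundamental) in signals.items():
    print(ticker, quadrant(market, fundamental))
```

Because the dividing lines are medians, roughly half the universe falls on each side of each line by construction. The quadrants describe relative position within this universe, not an absolute threshold.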

Which AI Infrastructure Layers Has the Market Rewarded Most?

The matrix is useful at the company level. But the AI capex trade also needs to be viewed by layer.

So next, I grouped the companies by capex_layer and calculated median returns and median signal scores.

layer_performance = capex_data.groupby('capex_layer').agg(
    company_count=('ticker', 'count'),
    median_return_1y=('return_1y', 'median'),
    median_return_6m=('return_6m', 'median'),
    median_fundamental_signal=('fundamental_signal', 'median'),
    median_market_recognition=('market_recognition_signal', 'median')
).reset_index()

layer_performance = layer_performance.sort_values('median_return_1y', ascending=False)

layer_performance

Then I plotted the median one-year return by infrastructure layer.

plt.figure(figsize=(11, 6))

plt.barh(layer_performance['capex_layer'], layer_performance['median_return_1y'] * 100)

plt.gca().invert_yaxis()

plt.title('Median 1Y Return by AI Infrastructure Layer', fontsize=14, pad=12)
plt.xlabel('Median 1Y Return (%)')
plt.ylabel('')

plt.grid(axis='x', alpha=0.25)

plt.tight_layout()
plt.show()

This chart is where the story becomes much less obvious.

Construction and engineering ranked at the top by median one-year return, followed by semiconductor equipment, AI compute and chips, and servers and storage. That's not the usual way people talk about the AI trade.

The takeaway is not that construction and engineering is automatically the best AI capex layer. The sample size is small, so the result should be read as directional. But it still tells us something useful: the market has been rewarding the physical buildout side of AI infrastructure, not just the companies selling chips or cloud services.
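To see why the small sample matters, consider how a median behaves with only two companies in a layer. The return values below are invented. The layer sizes match the universe defined earlier.

```python
import statistics

# Hypothetical one-year returns; layer sizes mirror the universe above
layer_returns = {
    'Construction and engineering': [0.5, 1.5],    # 2 companies
    'AI compute and chips': [0.2, 0.4, 0.6, 0.9],  # 4 companies
}

for layer, returns in layer_returns.items():
    print(layer, len(returns), statistics.median(returns))

# Swapping one name in a two-company layer moves its median a lot
swapped = statistics.median([0.5, 0.3])
print(swapped)
```

With two names, the median is just the average of both, so a single strong performer can lift the whole layer. That is the reason to read the layer ranking alongside `company_count`.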

That's the larger point. Once AI capex becomes real-world infrastructure, the trade starts showing up in companies tied to equipment, servers, electrical work, and construction.

The Physical Infrastructure Layer Is No Longer Hidden

This is the part of the AI capex trade that I find most useful.

The obvious AI story starts with chips and hyperscalers. But once the spending becomes real infrastructure, the list gets wider. AI data centers need servers, networking equipment, power systems, cooling, grid work, electrical construction, and physical capacity.

So I filtered the dataset to focus on the non-obvious infrastructure layers.

physical_layers = ['Power and electrification', 'Cooling and industrial systems', 'Construction and engineering',
                   'Data centers', 'Servers and storage', 'Networking']

physical_infra = capex_data[capex_data['capex_layer'].isin(physical_layers)].copy()
physical_infra = physical_infra.sort_values(['market_recognition_signal', 'fundamental_signal'], ascending=False)
physical_watchlist = physical_infra[['ticker', 'company', 'capex_layer', 'revenue_growth_yoy', 'operating_margin',
                                     'return_1y', 'return_6m', 'fundamental_signal', 'market_recognition_signal']].head(12)

physical_watchlist.head(10)

Comfort Systems, Vertiv, Dell, Quanta Services, Cisco, HPE, EMCOR, Equinix, Johnson Controls, and Digital Realty all sit in different parts of the physical buildout. Some are tied to servers. Some are tied to power and electrification. Some are tied to data centers, cooling, or construction.

The key point is simple: the market is already treating parts of the physical infrastructure layer as part of the AI capex story.

That doesn't mean every name here has the same quality or the same upside. The fundamental signals vary a lot. But the table shows why looking only at "AI software" or "AI chip" names misses a large part of the spending chain.

What the Market Has Already Noticed

This section is important because not every AI capex name is early.

Some companies in the chain have already moved aggressively. That doesn't make them weak companies, but it changes the question. At that point, the question is no longer just whether the company is exposed to AI infrastructure. The better question is whether the market has already priced in a large part of that exposure.

To check that, I sorted the universe by the market recognition signal.

market_already_noticed = capex_data.sort_values('market_recognition_signal', ascending=False).head(10).copy()

market_already_noticed['return_1y'] = (market_already_noticed['return_1y'] * 100).round(2)
market_already_noticed['return_6m'] = (market_already_noticed['return_6m'] * 100).round(2)
market_already_noticed['return_3m'] = (market_already_noticed['return_3m'] * 100).round(2)
market_already_noticed['max_drawdown_1y'] = (market_already_noticed['max_drawdown_1y'] * 100).round(2)

market_already_noticed = market_already_noticed[['ticker', 'company', 'capex_layer', 'return_1y', 'return_6m', 'return_3m', 
                                                 'max_drawdown_1y', 'market_recognition_signal', 'fundamental_signal']]

market_already_noticed

This list is a useful reality check.

Comfort Systems, AMD, Marvell, Vertiv, Lam Research, Dell, Applied Materials, Quanta Services, Cisco, and Alphabet all show up with strong market recognition. The mix is the important part. It includes chips, semiconductor equipment, servers, networking, power, construction, and a hyperscaler.

That tells us the AI capex trade has already broadened in price action. It's not waiting quietly in the background.

But this also means we need to be careful with the "hidden beneficiary" framing. Some infrastructure names have already delivered very large one-year returns. So the smarter follow-up question is not:

"Which companies are exposed?"

It's:

"How much of that exposure has the market already recognized?"

What This Study Shows

The AI capex trade is easier to understand when we stop treating it as one group of "AI stocks."

The data shows three things clearly.

First, the obvious names still matter. NVIDIA remains the cleanest fundamental outlier in this universe, and chip-related names continue to sit close to the center of the AI infrastructure story.

Second, the trade has already moved beyond chips. Semiconductor equipment, servers, networking, power, and construction names all show up in the market recognition data. That makes sense. AI infrastructure isn't just model training. It needs physical capacity, electrical systems, cooling, data centers, and buildout work.

Third, market recognition and business strength don't always move together. Some companies have strong fundamentals but quieter price action. Others have already moved aggressively, even if their fundamental signal isn't as strong. That's why a simple "AI beneficiary" label isn't enough.

Conclusion

AI capex isn't just a mega-cap tech story. It's a spending chain.

Once we trace that chain, the theme becomes broader and more interesting. It moves from chips to semiconductor equipment, from servers to networking, from data centers to power, cooling, and construction.

The goal of this study wasn't to find the best AI infrastructure stock. It was to build a clearer map of where the trade is already showing up.

That map matters because the next phase of the AI story may not be about who mentions AI the most. It may be about who sits closest to the infrastructure that makes AI possible.

GDPR Article 32 for Software Engineers: Technical Controls, Implementations, and Auditor Questions

Ayobami Adejumo — Thu, 28 May 2026 16:20:25 +0000

When I first read GDPR Article 32, I made a mistake. I thought it was a legal document.

But it's not. It's an infrastructure specification.

The regulation says you need "appropriate technical measures" to protect personal data. That phrase is terrifying because it's vague. What does "appropriate" mean? What counts as a "technical measure"? Who decides whether you've done enough?

The compliance consultant will give you a 50-page policy document. The auditor will ignore it and ask for your database schema.

This guide is the middle ground. I've implemented Article 32 controls for 12 SaaS companies. The same nine controls appear every time. The same three auditor questions appear every time.

This is a complete guide to the 9 technical controls you must implement, the exact code and commands for each, and the questions your GDPR auditor will ask.

What You'll Learn
Prerequisites
Part 1: Understanding Article 32
Part 2: Article 32(1)(a) — Pseudonymisation and Encryption
Part 3: Article 32(1)(b) — Confidentiality and Integrity
Part 4: Article 32(1)(c) — Availability and Resilience
Part 5: Article 32(1)(d) — Regular Testing
Part 6: Penetration Testing
Best Practices Summary
What's Next
Resources

What You'll Learn

The 9 technical controls required by GDPR Article 32(1)(a) through (d)
Exact PostgreSQL commands for pseudonymisation and field-level encryption
How to implement automatic logoff and unique user identification
Application-level audit logging that goes beyond CloudTrail
Integrity controls that prove data has not been altered
mTLS and TLS 1.3 for transmission security
The 5 auditor questions you must answer with evidence

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Familiarity with PostgreSQL and basic SQL
Basic understanding of AWS services (KMS, RDS, CloudTrail)
Comfort reading Python and JavaScript/Node.js code
A working knowledge of what GDPR is — if you are starting from scratch, read the ICO's GDPR overview first

Tools and access:

PostgreSQL 14 or later
An AWS account with IAM administrator access
Python 3.8 or later with cryptography library (pip install cryptography)
Node.js 16 or later
A compliance automation tool — Vanta or OneTrust — is optional but recommended for evidence collection

Estimated time: The controls in this guide take 2–4 weeks to implement fully, depending on your existing infrastructure. Individual controls range from 30 minutes (KMS key setup) to 5 days (full application-layer encryption rollout).

Part 1: Understanding Article 32 — The Technical Requirements

1.1. What Article 32 Actually Requires

Article 32 of the GDPR is titled "Security of processing." It requires controllers and processors to implement "appropriate technical and organisational measures" to ensure a level of security appropriate to the risk.

Here is the important distinction most teams miss: Article 32 is not a checklist of policies. A policy says "we encrypt personal data." Evidence says "here is the KMS key with automatic rotation, here is the application-layer encryption code, and here are the CloudTrail logs showing every decryption attempt." The auditor wants evidence, not documentation.

The four main requirements:

Section	Requirement	What It Means for Engineers
32(1)(a)	Pseudonymisation and encryption	Personal data must be stored so it cannot be attributed to a specific data subject without additional information held separately
32(1)(b)	Confidentiality, integrity, availability, and resilience	Systems must protect data from unauthorised access, alteration, loss, and be able to recover from incidents
32(1)(c)	Restoring availability and access	You must be able to restore data and regain system access after a physical or technical incident
32(1)(d)	Regular testing and risk assessment	You must have a process for regularly testing and evaluating your security measures

1.2. The Scope Question: What Data Is Covered?

Before implementing any controls, you must know what data falls under Article 32. The regulation applies to personal data — any information that can identify a living individual directly or indirectly.

Data types and their protection levels:

Category	Examples	Protection Level
Personal data	Name, email, phone, IP address	Standard
Sensitive personal data	Health data, biometric data, political opinions, religious beliefs	Enhanced
Pseudonymised data	Data where direct identifiers are replaced with a code	Standard
Anonymised data	Data that cannot be re-identified under any reasonable circumstances	Out of scope

The data mapping question your auditor will ask:

"Can you provide a data flow diagram showing where personal data enters your system, where it is stored, where it is processed, and how it is deleted?"

Before the auditor asks, run this command to list every RDS instance in your current AWS region along with its encryption status. Repeat it for each region you use, and note which instances hold personal data:

# List all RDS instances with their encryption status
# Any StorageEncrypted: false is a finding
aws rds describe-db-instances \
  --query 'DBInstances[*].{
    ID:DBInstanceIdentifier,
    Engine:Engine,
    StorageEncrypted:StorageEncrypted,
    AvailabilityZone:AvailabilityZone
  }' \
  --output table

Any instance showing StorageEncrypted: false must be addressed before your Article 32 audit.
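If you would rather check this in a script than read a table, the same data can be parsed from the CLI's JSON output. The sketch below assumes you saved the result of `aws rds describe-db-instances --output json`. The instance names in the sample are hypothetical.

```python
import json

# Hypothetical output of: aws rds describe-db-instances --output json
cli_output = '''
{"DBInstances": [
  {"DBInstanceIdentifier": "orders-db", "Engine": "postgres", "StorageEncrypted": true},
  {"DBInstanceIdentifier": "legacy-db", "Engine": "postgres", "StorageEncrypted": false}
]}
'''

def unencrypted_instances(raw_json):
    """Return identifiers of instances whose storage is not encrypted."""
    instances = json.loads(raw_json).get('DBInstances', [])
    return [i['DBInstanceIdentifier'] for i in instances
            if not i.get('StorageEncrypted', False)]

findings = unencrypted_instances(cli_output)
print(findings)  # ['legacy-db']
```

An empty list is the result you want to show the auditor.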

Part 2: Article 32(1)(a) — Pseudonymisation and Encryption

2.1. How to Implement Pseudonymisation at the Database Layer

Pseudonymisation replaces direct identifiers — names, email addresses, passport numbers — with a pseudonym or code. The goal is that the main working dataset cannot identify a data subject without access to a separately stored, separately protected lookup table.

Here is the incorrect approach — direct identifiers in plaintext:

-- Bad: Direct identifiers stored in the main working table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    full_name VARCHAR(255),       -- Direct identifier — should not be here
    email VARCHAR(255),           -- Direct identifier — should not be here
    passport_number VARCHAR(50)   -- Direct identifier — should not be here
);

This approach means any engineer, analyst, or attacker with SELECT access to the users table can immediately read and identify individuals. There is no separation between working data and identifying data.

Here is the correct implementation with a separate identifiers table:

-- Good: Pseudonymised main table with a separate, restricted lookup table

-- Step 1: Main working table uses only the pseudonym
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    pseudonym UUID NOT NULL UNIQUE DEFAULT gen_random_uuid(),  -- Non-guessable pseudonym; UNIQUE is required by the foreign key in Step 2
    created_at TIMESTAMP DEFAULT NOW(),
    account_status VARCHAR(50)
    -- No direct identifiers here
);

-- Step 2: Identifier lookup table — kept separate, access restricted
CREATE TABLE user_identifiers (
    pseudonym UUID PRIMARY KEY,
    full_name VARCHAR(255),
    email VARCHAR(255),
    passport_number VARCHAR(50),
    FOREIGN KEY (pseudonym) REFERENCES users(pseudonym)
);

-- Step 3: Grant minimal, role-based access
GRANT SELECT ON users TO app_role;                              -- Application uses pseudonym only
GRANT SELECT, INSERT, UPDATE ON user_identifiers TO identity_service_role;  -- Only the identity service sees names

What each part does:

gen_random_uuid() creates a version-4 UUID pseudonym for each user — unpredictable and not reversible without the lookup table
The main users table is safe for analytics, reporting, and general application use without exposing any identifying information
Only identity_service_role can read user_identifiers, so only the service holding that role can map a pseudonym back to a person. This role is assigned only to the specific service that handles identity operations
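The same split can be sketched in application code. This is a minimal Python illustration rather than the article's implementation: the `pseudonymise` helper and the `DIRECT_IDENTIFIERS` set are assumptions chosen to mirror the two tables above.

```python
import uuid

# Assumed set of direct identifiers, mirroring the user_identifiers columns
DIRECT_IDENTIFIERS = {"full_name", "email", "passport_number"}

def pseudonymise(record: dict) -> tuple[dict, dict]:
    """Split one record into a working row and a lookup row sharing a pseudonym."""
    pseudonym = str(uuid.uuid4())  # random version-4 UUID, like gen_random_uuid()
    working = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    lookup = {k: v for k, v in record.items() if k in DIRECT_IDENTIFIERS}
    working["pseudonym"] = pseudonym
    lookup["pseudonym"] = pseudonym
    return working, lookup

working_row, lookup_row = pseudonymise({
    "full_name": "Ada Example",
    "email": "ada@example.com",
    "passport_number": "X1234567",
    "account_status": "active",
})
print(sorted(working_row))  # column names only; no direct identifiers remain
```

The working row can be written to the main table and the lookup row to the restricted table, ideally over separate database credentials so that no single connection sees both.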

The auditor question you will receive:

"How do you ensure that pseudonymised data cannot be re-identified by an unauthorised party?"

Your evidence:

-- Show that only the identity service role has access to the identifiers table
SELECT grantee, privilege_type, table_name
FROM information_schema.role_table_grants
WHERE table_name = 'user_identifiers';

-- Expected output: only identity_service_role listed

2.2. How to Implement Encryption at Rest with Customer-Managed Keys

Storage-layer encryption protects data if someone physically steals the disk. But it does not protect against a privileged AWS employee, a compromised cloud administrator, or an authorised user with direct database access. Article 32 auditors know this distinction — and they will ask about it.

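Here is the incorrect approach (the default AWS managed key):

# Bad: storage encryption enabled without --kms-key-id
# RDS falls back to the AWS managed key (alias aws/rds), which you cannot configure
aws rds create-db-instance \
  --db-instance-identifier production-db \
  --engine postgres \
  --db-instance-class db.t3.medium \
  --allocated-storage 100 \
  --master-username dbadmin \
  --manage-master-user-password \
  --storage-encrypted

The problem: when the auditor asks "can you prove that AWS employees cannot decrypt your customer data?", the answer is no. You cannot edit the key policy, change the rotation schedule, or disable an AWS managed key. Note that running aws kms create-key yourself always produces a customer-managed key; AWS managed keys are the ones a service creates on your behalf.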

Here is the correct implementation — customer-managed key with automatic rotation:

# Step 1: Create a customer-managed KMS key
KEY_ID=$(aws kms create-key \
  --origin AWS_KMS \
  --description "Customer-managed key for production PII — Article 32 compliant" \
  --tags TagKey=Purpose,TagValue=GDPR TagKey=Environment,TagValue=production \
  --query 'KeyMetadata.KeyId' \
  --output text)

echo "Created KMS key: $KEY_ID"

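# Step 2: Enable automatic rotation every 90 days (the default period is 365 days)
aws kms enable-key-rotation --key-id "$KEY_ID" --rotation-period-in-days 90

# Step 3: Re-encrypt production data under the new key
# The KMS key of an existing RDS instance cannot be changed in place:
# snapshot it, copy the snapshot under the new key, then restore from the copy
aws rds create-db-snapshot \
  --db-instance-identifier production-db \
  --db-snapshot-identifier production-db-pre-cmk
aws rds wait db-snapshot-available --db-snapshot-identifier production-db-pre-cmk
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier production-db-pre-cmk \
  --target-db-snapshot-identifier production-db-cmk \
  --kms-key-id "$KEY_ID"
aws rds wait db-snapshot-available --db-snapshot-identifier production-db-cmk
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier production-db-cmk \
  --db-snapshot-identifier production-db-cmk
# Cut the application over to production-db-cmk once it is available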

The auditor question:

"Show me that your encryption keys are rotated automatically and that you can prove who has accessed them."

Your evidence:

# Verify rotation is enabled — expected output: true
aws kms get-key-rotation-status --key-id $KEY_ID \
  --query 'KeyRotationEnabled'

# Show the CloudTrail audit trail of every key usage event
aws logs filter-log-events \
  --log-group-name cloudtrail-logs \
  --filter-pattern '{ $.eventSource = "kms.amazonaws.com" }' \
  --query 'events[*].{Time:timestamp,Event:message}' \
  --output table
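Raw log events are hard to read in an audit meeting, so it can help to condense them into a per-caller summary. The sketch below is a hedged illustration: it assumes each log message is one CloudTrail record in JSON, and it reads only the standard `eventSource`, `eventName`, and `userIdentity.arn` fields. The `summarise_kms_usage` name and the sample records are hypothetical.

```python
import json
from collections import Counter

def summarise_kms_usage(messages):
    """Count KMS API calls per (caller ARN, event name) in CloudTrail records."""
    usage = Counter()
    for raw in messages:
        event = json.loads(raw)
        if event.get("eventSource") != "kms.amazonaws.com":
            continue  # ignore non-KMS events
        caller = event.get("userIdentity", {}).get("arn", "unknown")
        usage[(caller, event.get("eventName", "unknown"))] += 1
    return usage

# Hypothetical records, trimmed to the fields this sketch reads
APP = "arn:aws:sts::123456789012:assumed-role/app-role/session-1"
sample = [
    json.dumps({"eventSource": "kms.amazonaws.com", "eventName": "Decrypt",
                "userIdentity": {"arn": APP}}),
    json.dumps({"eventSource": "kms.amazonaws.com", "eventName": "Decrypt",
                "userIdentity": {"arn": APP}}),
    json.dumps({"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
                "userIdentity": {"arn": APP}}),
]

for (caller, action), count in summarise_kms_usage(sample).items():
    print(f"{caller}  {action}  {count}")
```

Any caller in the summary that you cannot map to a known service or engineer is worth investigating before the auditor asks about it.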

2.3. How to Implement Application-Layer Encryption for Sensitive Fields

Storage encryption is the floor. Application-layer encryption is the ceiling that Article 32 auditors are increasingly expecting for health data, financial records, and other sensitive personal data.

Here is the difference: with storage encryption only, a database administrator who runs SELECT email FROM users sees the plaintext email address. With application-layer encryption, they see gAAAAABm... — an encrypted byte string that only the application (with access to the Vault key) can decrypt.

# application_encryption.py
from cryptography.fernet import Fernet

class FieldEncryption:
    """
    Encrypts sensitive personal data fields before they are stored in the database.
    The encryption key is stored in HashiCorp Vault or AWS Secrets Manager — never in code.
    A database administrator with direct SQL access sees only encrypted bytes.
    """

    def __init__(self, key: str):
        # key must be a URL-safe base64-encoded 32-byte key (the format Fernet.generate_key() returns); retrieve it from Vault
        self.cipher = Fernet(key.encode())

    def encrypt_field(self, plaintext: str) -> str:
        """Encrypt a sensitive field before writing to the database."""
        if not plaintext:
            return None
        encrypted_bytes = self.cipher.encrypt(plaintext.encode())
        return encrypted_bytes.decode()

    def decrypt_field(self, ciphertext: str) -> str:
        """
        Decrypt a field when legitimately needed by the application.
        This method requires the Vault key — database admins cannot call it.
        """
        if not ciphertext:
            return None
        decrypted_bytes = self.cipher.decrypt(ciphertext.encode())
        return decrypted_bytes.decode()


# Usage in your application:
from vault_client import get_secret  # Your Vault or Secrets Manager client

# Retrieve the encryption key at application startup — never hardcode it
encryption_key = get_secret("gdpr/field-encryption-key")
encryptor = FieldEncryption(encryption_key)

# Before storing a user's health record
user.health_data_encrypted = encryptor.encrypt_field(user.health_data_plaintext)

# Before reading for a legitimate purpose (subject access request, etc.)
health_data = encryptor.decrypt_field(user.health_data_encrypted)

The auditor question:

"If a database administrator queries the users table directly, can they read customer health data in plaintext?"

Your evidence: Run a direct database query and show the auditor the encrypted output. Then demonstrate that the decryption key is not accessible to database administrators — it is retrieved only by the application through Vault.

Part 3: Article 32(1)(b) — Confidentiality and Integrity

3.1. How to Implement Automatic Logoff

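Article 32(1)(b) requires "the ability to ensure the ongoing confidentiality, integrity, availability and resilience of processing systems and services," and Article 32(2) lists "unauthorised disclosure of, or access to personal data" among the risks you must account for. A session that never expires, or expires only after 24 hours, is an access control gap. A user who logs in on a shared machine and walks away has left an open door.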

Here is the incorrect approach — a 24-hour JWT session:

// Bad: 24-hour access token with no inactivity check
const token = jwt.sign(
  { userId: user.id, role: user.role },
  process.env.JWT_SECRET,
  { expiresIn: '24h' }  // Too long — violates Article 32 intent
);

The problem: if a user logs in on a shared computer and closes the laptop without logging out, the session remains valid for up to 24 hours. Anyone who opens that laptop can access personal data.

Here is the correct implementation — a 15-minute access token with a rolling refresh:

// Good: Short-lived access token with rolling refresh via HTTP-only cookie

// Access token — valid for 15 minutes of activity
const accessToken = jwt.sign(
  { userId: user.id, role: user.role, type: 'access' },
  process.env.JWT_ACCESS_SECRET,
  { expiresIn: '15m' }
);

// Refresh token — valid for 8 hours total session duration
const refreshToken = jwt.sign(
  { userId: user.id, type: 'refresh' },
  process.env.JWT_REFRESH_SECRET,
  { expiresIn: '8h' }
);

// Set refresh token as HTTP-only cookie — not accessible to JavaScript
res.cookie('refreshToken', refreshToken, {
  httpOnly: true,    // Prevents XSS access
  secure: true,      // HTTPS only
  sameSite: 'strict', // Prevents CSRF
  maxAge: 8 * 60 * 60 * 1000  // 8 hours in milliseconds
});

// Session middleware that enforces absolute timeout
const MAX_TOTAL_SESSION_MS = 8 * 60 * 60 * 1000; // 8 hours

app.use((req, res, next) => {
  if (!req.session?.createdAt) return next();

  const sessionAge = Date.now() - req.session.createdAt;
  if (sessionAge > MAX_TOTAL_SESSION_MS) {
    req.session.destroy();
    return res.status(401).json({
      error: 'Session expired after 8 hours. Please log in again.'
    });
  }
  next();
});

The auditor question:

"Show me that your application terminates inactive sessions after a reasonable period."

Your evidence: A browser developer tools screenshot showing the cookie expiration time, plus a test recording showing that after 15 minutes of inactivity the user is presented with a re-authentication prompt.
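The evidence above rests on two separate limits: an inactivity timeout and an absolute session lifetime. As a minimal sketch of that decision, assuming the server records a creation time and a last-activity time for each session (the function name and the timestamps-in-seconds convention are illustrative, not part of any framework):

```python
IDLE_TIMEOUT_S = 15 * 60          # 15 minutes without activity
ABSOLUTE_TIMEOUT_S = 8 * 60 * 60  # 8 hours since login, regardless of activity

def session_is_valid(created_at: float, last_activity: float, now: float) -> bool:
    """Return True only if the session passes both the idle and absolute limits."""
    if now - created_at > ABSOLUTE_TIMEOUT_S:
        return False  # absolute lifetime exceeded, force a fresh login
    if now - last_activity > IDLE_TIMEOUT_S:
        return False  # user has been idle too long
    return True

# Logged in at t=0, last request at minute 50, checked one minute later
print(session_is_valid(created_at=0, last_activity=50 * 60, now=51 * 60))
```

The refresh endpoint would run a check like this before issuing a new access token, so an idle user cannot silently extend the session just because the refresh cookie is still present.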

3.2. How to Implement Unique User Identification with IRSA

Ensuring confidentiality and integrity under Article 32(1)(b) means being able to identify who accessed personal data. Shared service accounts make this impossible: the audit log shows data-export, but you cannot tell which engineer triggered the export.

Here is the incorrect approach — a shared service account:

# Bad: One shared Kubernetes service account used by multiple engineers and pipelines
apiVersion: v1
kind: ServiceAccount
metadata:
  name: data-export           # Three engineers and two pipelines share this identity
  namespace: production

When an audit log shows data-export performed a bulk user export at 03:17 UTC, you cannot answer the auditor's question: "who authorised this?"

Here is the correct implementation — IAM Roles for Service Accounts (IRSA):

# Step 1: Create a separate IAM role for each service identity
# This command creates a role that can only be assumed by the 'payment-service'
# Kubernetes service account in the 'production' namespace

aws iam create-role \
  --role-name eks-payment-service-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/YOUR_OIDC_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/YOUR_OIDC_ID:sub":
            "system:serviceaccount:production:payment-service"
        }
      }
    }]
  }'

# Step 2: Annotate the Kubernetes service account with its unique IAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service          # One service account, one service, one role
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-payment-service-role

Every AWS API call from payment-service now appears in CloudTrail as eks-payment-service-role — a unique, traceable identity. No shared accounts. No ambiguous audit logs.

The auditor question:

"How do you ensure that every action on personal data can be attributed to a specific individual or service?"

Your evidence:

# Verify no shared service accounts exist — every account should have a unique role annotation
kubectl get serviceaccounts --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}{end}'
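The kubectl output is easier to defend if it is checked automatically. The following sketch assumes the `namespace/name: role-arn` line format produced by the command above; the `audit_service_accounts` helper is hypothetical. It reports accounts with no role annotation and roles that more than one account shares.

```python
from collections import defaultdict

def audit_service_accounts(lines):
    """Return (accounts without a role annotation, roles shared by several accounts)."""
    missing = []
    by_role = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        # Account names cannot contain ':', so the first ':' separates name and ARN
        account, _, role_arn = line.partition(":")
        role_arn = role_arn.strip()
        if role_arn:
            by_role[role_arn].append(account.strip())
        else:
            missing.append(account.strip())
    shared = {role: accts for role, accts in by_role.items() if len(accts) > 1}
    return missing, shared

# Hypothetical kubectl output
output = [
    "production/payment-service: arn:aws:iam::123456789012:role/eks-payment-service-role",
    "production/export-a: arn:aws:iam::123456789012:role/eks-export-role",
    "production/export-b: arn:aws:iam::123456789012:role/eks-export-role",
    "default/default: ",
]
missing, shared = audit_service_accounts(output)
print("No role annotation:", missing)
print("Shared roles:", shared)
```

An account without an annotation is not automatically a finding, since many workloads never call AWS APIs. A role shared between accounts is the case that breaks attribution.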

Part 4: Article 32(1)(c) — Availability and Resilience

4.1. How to Implement Multi-AZ and Backup Requirements

Article 32(1)(c) requires "the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident." This is not a suggestion — it is a legal requirement. If your database is in a single Availability Zone and that AZ experiences a networking event, you are in violation.

Here is the incorrect approach — single-AZ RDS with no automated backups:

# Bad: Single-AZ RDS — one networking event makes personal data unavailable
resource "aws_db_instance" "production" {
  identifier              = "production-database"
  multi_az                = false   # No automatic failover
  backup_retention_period = 0       # No automated backups — Article 32 violation
}

If the Availability Zone has a networking issue, the database is unreachable. If the instance is corrupted, there are no backups to restore. Both scenarios violate Article 32(1)(c).

Here is the correct implementation — Multi-AZ with tested automated backups:

# Good: Multi-AZ RDS with 30-day backup retention
resource "aws_db_instance" "production" {
  identifier = "production-database"

  # Multi-AZ creates a synchronous standby replica in a different AZ
  # Automatic failover typically completes in 60-120 seconds; synchronous replication means no committed data is lost
  multi_az = true

  # 30-day backup retention — gives you recovery point flexibility
  backup_retention_period = 30
  backup_window           = "03:00-04:00"  # Low-traffic window for backup

  # Copy all tags to snapshots for compliance tracking
  copy_tags_to_snapshot = true

  # Performance Insights for monitoring query health
  performance_insights_enabled          = true
  performance_insights_retention_period = 7

  tags = {
    Environment        = "production"
    DataClassification = "personal-data"
    GDPRScope          = "article32"
  }
}

How to test your RTO and RPO monthly:

# Step 1: Find your most recent automated snapshot
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier production-database \
  --snapshot-type automated \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

echo "Testing restore of snapshot: $SNAPSHOT_ID"

# Step 2: Start the restore — measure the time
START_TIME=$(date +%s)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier gdpr-restore-test \
  --db-snapshot-identifier $SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=gdpr-rto-test Key=DeleteAfter,Value=$(date -d '+1 day' +%Y-%m-%d)

# Step 3: Wait for restore to complete
aws rds wait db-instance-available \
  --db-instance-identifier gdpr-restore-test

END_TIME=$(date +%s)
RTO_SECONDS=$((END_TIME - START_TIME))
echo "Restore completed in $((RTO_SECONDS / 60)) minutes"

# Step 4: Verify data integrity with a spot check
# Connect to the restored instance and verify record counts match production
# psql -h RESTORED_ENDPOINT -U admin -d production \
#   -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier gdpr-restore-test \
  --skip-final-snapshot

The auditor question:

"What is your Recovery Time Objective and Recovery Point Objective for personal data? When did you last test it?"

Your evidence: A documented monthly DR test log showing: snapshot used, restore start time, restore completion time, data verification query results, and the engineer who conducted the test.
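Each log entry can be reduced to two numbers and compared against your targets. The sketch below is illustrative only: the one-hour RTO and 24-hour RPO targets are placeholders, and it treats the snapshot's age at the start of the restore as the achieved RPO, which holds for snapshot-only restores (point-in-time recovery would usually do better).

```python
from datetime import datetime, timedelta

# Hypothetical targets; substitute the objectives from your own DR policy
RTO_TARGET = timedelta(hours=1)
RPO_TARGET = timedelta(hours=24)

def evaluate_restore_test(snapshot_created, restore_started, restore_completed):
    """Compute achieved RTO and RPO for one restore test and compare to targets."""
    rto = restore_completed - restore_started   # how long the restore took
    rpo = restore_started - snapshot_created    # how old the restored data was
    return {
        "rto_minutes": round(rto.total_seconds() / 60, 1),
        "rpo_minutes": round(rpo.total_seconds() / 60, 1),
        "rto_met": rto <= RTO_TARGET,
        "rpo_met": rpo <= RPO_TARGET,
    }

result = evaluate_restore_test(
    snapshot_created=datetime(2025, 4, 1, 3, 30),
    restore_started=datetime(2025, 4, 1, 9, 0),
    restore_completed=datetime(2025, 4, 1, 9, 42),
)
print(result)
```

Appending the returned dictionary to the monthly log gives the auditor measured values next to the stated objectives.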

Part 5: Article 32(1)(d) — Regular Testing

5.1. How to Implement Automated Vulnerability Scanning

Article 32(1)(d) requires "a process for regularly testing, assessing and evaluating the effectiveness of technical and organisational measures." This includes automated vulnerability scanning of every container image before it reaches production.

Here is the incorrect approach — no scanning in the deployment pipeline:

# Bad: No vulnerability scanning — a critical CVE in the base image deploys undetected
name: Deploy
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build -t myapp .
      - run: docker push myapp  # Deploys without any security check

If a critical CVE is present in the base image (such as a remote code execution vulnerability in OpenSSL), it goes straight to production. Under Article 32(1)(d), this is a finding.

Here is the correct implementation — Trivy scanning with pipeline enforcement:

# Good: Trivy scans every image — CRITICAL/HIGH CVEs block the deployment
name: Security Scan and Deploy
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # Required for the SARIF upload step
    steps:
      - uses: actions/checkout@v4

      - name: Build container image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Scan for vulnerabilities with Trivy
        uses: aquasecurity/trivy-action@master   # In production, pin to a release tag or commit SHA instead of a branch
        with:
          image-ref: 'myapp:${{ github.sha }}'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'         # Fail the pipeline — image cannot deploy with CRITICAL/HIGH CVEs

      - name: Upload scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()             # Upload results even if scan failed, for review
        with:
          sarif_file: 'trivy-results.sarif'

Trivy scans for:

CVEs in the base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)
Vulnerable versions of application dependencies (a known exploit in an npm or pip package your application uses)
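Misconfigurations in the Dockerfile (running as root, using latest tag instead of a pinned SHA), provided you add a separate Trivy configuration scan (scan-type: 'config'); the image scan shown above does not check the Dockerfile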

Results appear in the GitHub Security tab, creating a timestamped, searchable history of every scan. That history is your Article 32(1)(d) evidence.

How to run a weekly AWS Inspector assessment for running workloads:

# List all active CRITICAL findings across your AWS account
aws inspector2 list-findings \
  --filter-criteria '{
    "severity": [{"comparison": "EQUALS", "value": "CRITICAL"}],
    "findingStatus": [{"comparison": "EQUALS", "value": "ACTIVE"}]
  }' \
  --query 'findings[*].{
    Title:title,
    Resource:resources[0].id,
    Severity:severity,
    CVE:packageVulnerabilityDetails.vulnerabilityId
  }' \
  --output table

The auditor question:

"Show me your vulnerability management programme, including how you prioritise and remediate findings."

Your evidence: A weekly vulnerability report — generated automatically from the above command — showing active findings, severity, the GitHub issue created for each finding, and the closure date once remediated.
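A weekly report usually starts with a count per severity. As a rough sketch, assuming you run `aws inspector2 list-findings` with `--output json` and without the `--query` projection, the response contains a `findings` array whose items carry a `severity` field. The `severity_report` helper is hypothetical.

```python
import json
from collections import Counter

# Severity values used by Amazon Inspector, most urgent first
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFORMATIONAL", "UNTRIAGED"]

def severity_report(list_findings_json: str):
    """Return (severity, count) pairs, most urgent first, skipping empty buckets."""
    findings = json.loads(list_findings_json).get("findings", [])
    counts = Counter(f.get("severity", "UNTRIAGED") for f in findings)
    return [(sev, counts[sev]) for sev in SEVERITY_ORDER if counts[sev]]

# Hypothetical response, trimmed to the fields this sketch reads
response = json.dumps({"findings": [
    {"title": "CVE-0000-0001 - openssl", "severity": "CRITICAL"},
    {"title": "CVE-0000-0002 - libxml2", "severity": "HIGH"},
    {"title": "CVE-0000-0003 - zlib", "severity": "HIGH"},
]})

for severity, count in severity_report(response):
    print(f"{severity}: {count}")
```

Storing each week's counts alongside the date shows the trend over time, which speaks to the "evaluating the effectiveness" part of the requirement.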

Part 6: Article 32(1)(d) — Penetration Testing

6.1. Why Automated Scanning Is Not Enough

Article 32(1)(d) requires evaluating the effectiveness of security measures. Automated vulnerability scanners find known CVEs in libraries and OS packages. They cannot find:

Business logic vulnerabilities (an API endpoint that returns another user's data when given a specific parameter)
Authentication bypasses (a JWT implementation that accepts unsigned tokens)
Privilege escalation paths (an attacker can move from a low-privilege role to admin through a sequence of legitimate API calls)
Insecure direct object references (accessing /api/users/124 instead of /api/users/123 returns data for a different customer)
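To make the last item concrete, an insecure direct object reference comes down to a missing ownership check. The sketch below is a deliberately simplified illustration: the in-memory `USERS` store, the `Forbidden` exception, and the `get_user` function are all hypothetical stand-ins for a real data layer.

```python
class Forbidden(Exception):
    """Raised when a caller requests a record they do not own."""

# Hypothetical in-memory store standing in for the users table
USERS = {
    123: {"owner_id": 123, "email": "a@example.com"},
    124: {"owner_id": 124, "email": "b@example.com"},
}

def get_user(requested_id: int, authenticated_id: int, is_admin: bool = False) -> dict:
    """Return a user record only if the caller owns it or is an administrator."""
    record = USERS[requested_id]
    if record["owner_id"] != authenticated_id and not is_admin:
        # Without this check, /api/users/124 would leak another customer's data
        raise Forbidden(f"user {authenticated_id} may not read user {requested_id}")
    return record

print(get_user(123, authenticated_id=123)["email"])
try:
    get_user(124, authenticated_id=123)
except Forbidden as exc:
    print("blocked:", exc)
```

A scanner sees nothing wrong with an endpoint that omits this check, because both responses are valid HTTP 200s. A human tester who swaps the ID notices immediately.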

The ICO (UK Information Commissioner's Office) and the CNIL (France's data protection authority) both state in their guidance that annual manual penetration testing is expected for organisations processing significant volumes of personal data.

What an acceptable pen test scope looks like:

# Annual Penetration Test Scope — Article 32 Compliance

## Testing Period
Start: 2025-04-01  
End: 2025-04-14  
Testing firm: [Accredited firm — CREST or CHECK certified]

## In Scope
- Production web application: https://app.yourcompany.com
- Production API: https://api.yourcompany.com/v1/*
- Authentication flows: OAuth2, JWT, session management
- Data stores: PostgreSQL (via application access only, not direct DB access)
- AWS account: External reconnaissance of public-facing services only

## Testing Types
- External infrastructure testing (all public IP ranges)
- Web application testing (OWASP Top 10 2021)
- API security testing (all authenticated and unauthenticated endpoints)
- Authentication and session management testing
- GDPR-specific test cases (data subject rights endpoints, consent flows)

## Remediation SLAs
- CRITICAL: 24 hours from report delivery
- HIGH: 7 calendar days
- MEDIUM: 30 calendar days
- LOW: 90 calendar days

How to track and evidence remediation:

# Create GitHub issues for each finding on receipt of the pen test report
# This creates a traceable record of every finding and its resolution

for finding_id in $(cat pentest-report-findings.txt); do
  gh issue create \
    --title "Pen test finding: $finding_id" \
    --body "See pentest-report-2025-04.pdf, section $finding_id. Record the severity from the report and apply the matching remediation SLA." \
    --label "security,pentest" \
    --assignee "security-lead"   # GitHub login, without a leading @
done

The auditor question:

"When was your last penetration test? Show me the report and your remediation evidence."

Your evidence:

The penetration test report from a CREST or CHECK certified firm, dated within the last 12 months
A remediation tracker (GitHub issues or Jira) showing every CRITICAL and HIGH finding with a closure date
Evidence that all CRITICAL findings were closed within 24 hours (the git commit or deployment log)
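Checking closure dates against the remediation SLAs by hand is error-prone, so a small script can do it. This sketch encodes the four SLAs from the scope document above; the `sla_status` helper and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

# Remediation SLAs from the pen test scope document
SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
    "LOW": timedelta(days=90),
}

def sla_status(severity, reported_at, closed_at=None, now=None):
    """Report the SLA deadline for a finding and whether it was (or still is) met."""
    deadline = reported_at + SLA[severity]
    # For open findings, judge against the current time instead of a closure time
    reference = closed_at or now or datetime.now()
    return {
        "deadline": deadline,
        "closed": closed_at is not None,
        "within_sla": reference <= deadline,
    }

report_delivered = datetime(2025, 4, 15, 9, 0)
print(sla_status("CRITICAL", report_delivered, closed_at=datetime(2025, 4, 15, 20, 0)))
print(sla_status("HIGH", report_delivered, closed_at=datetime(2025, 4, 25, 9, 0)))
```

Running this over the issue tracker export produces the table an auditor will want: one row per finding, with deadline, closure date, and a pass or fail against the SLA.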

Here are the key takeaways from this guide:

✅ Do: Implement application-layer encryption for sensitive fields. Storage encryption alone is not enough — a DBA with direct database access can still read plaintext.

✅ Do: Use customer-managed KMS keys with automatic rotation. You need to prove control over the key material.

✅ Do: Store pseudonymised data separately from identifiers, with restricted role-based access to the lookup table.

✅ Do: Enforce automatic logoff after 15 minutes of inactivity with an 8-hour absolute session limit.

✅ Do: Use unique service accounts with IRSA. Every action on personal data must be attributable to a specific identity.

✅ Do: Test your backups monthly. Document RTO and RPO with actual restore test results.

✅ Do: Run Trivy in CI to block CRITICAL and HIGH CVEs before deployment.

✅ Do: Conduct an annual manual penetration test from a CREST or CHECK certified firm.

❌ Don't: Use 24-hour JWT sessions or sessions with no inactivity timeout.

❌ Don't: Store secrets in environment variables, .env files, or hardcoded in source code.

❌ Don't: Skip the annual penetration test. An auditor from the ICO or CNIL will not accept "we run automated scans" as a substitute.

❌ Don't: Use AWS-managed KMS keys if you need to prove key material control to your auditor.

Resources

ICO Guide to GDPR Article 32 — The UK Information Commissioner's Office official guidance on Article 32 security obligations
ENISA Guidelines on Article 32 — The EU Agency for Cybersecurity's SME guidelines on personal data security
Trivy by Aqua Security — Open-source container vulnerability scanner used in Part 5
OWASP Top 10 2021 — The standard reference for web application security risks, used in pen test scoping
AWS KMS Key Rotation Documentation — Official AWS documentation for automatic key rotation
PostgreSQL Row Security Policies — How to implement row-level security for granular access control on pseudonymised data
EKS IAM Roles for Service Accounts (IRSA) — Official AWS documentation for unique service account identity on EKS
CREST Certified Testing Firms — Directory of CREST-certified penetration testing firms for your annual Article 32 assessment

Ayobami Adejumo is a senior platform engineer and compliance infrastructure specialist. He writes about GDPR engineering controls, SOC2 implementation, and FinOps (cloud cost optimization).

Why Your “Simple Deploy” Turned Into a Week of Infrastructure Work

Manish Shivanandhan — Mon, 11 May 2026 18:31:13 +0000

If you're running production workloads, this guide is for you.

It's not about side projects, early-stage experiments, or a single-service app with low traffic.

This is for teams shipping real systems. Systems with users, uptime expectations, and release pressure.

Because at that stage, your deploy process is no longer a convenience. It's part of your product.

And right now, for most teams, it's the weakest part.

In this article, we'll look at why deployment complexity keeps growing as systems scale, how modern tooling unintentionally pushes teams into platform engineering work, and why many production teams are rethinking the infrastructure they manage themselves.

We'll also look at where Platform as a Service (PaaS) fits into this shift, what trade-offs it introduces, and when adopting one actually makes sense.

What We'll Cover:

The Promise You Were Sold
The Hidden Contract You Are Already Operating Under
You Are Already Acting Like a Platform Team
The Cost Is Not Complexity. It Is Time
Why “It Works on My Machine” Still Exists
Fragmentation Is the Root Problem
This Model Breaks as You Scale
The Shift Toward Platforms
What You Stop Paying For
From Infrastructure Work Back to Product Work
Collapsing the Stack
The Trade-Off You Are Actually Making
When This Becomes Urgent
When Managing Infra Still Makes Sense
What a “Simple Deploy” Actually Means
Closing Thought

The Promise You Were Sold

Every modern stack makes the same promise: Shipping is easy. Deploying is automated. Infrastructure is abstracted away. Push your code. Watch it go live.

That promise works, until it doesn’t.

And when it breaks, it doesn't fail gracefully. It expands.

A “simple deploy” turns into a multi-day investigation across systems you never intended to own.

Not because your team is careless. Because the model itself assumes you'll take on more responsibility than it admits.

The Hidden Contract You Are Already Operating Under

When you deploy today, you're not just shipping code. You're agreeing to run a distributed system of tools.

You own the build pipeline, the container lifecycle, the runtime configuration, the network rules, the secrets layer, the scaling logic, and the observability stack.

Each of these is presented as a separate concern. In reality, they're tightly coupled.

And you're the only layer holding them together. That's the hidden contract.

You Are Already Acting Like a Platform Team

If your deploy process involves CI pipelines, container registries, cloud services, environment variables, and monitoring tools, you're not just an application team anymore. You're running a platform.

You're defining how code moves from commit to production. You're deciding how failures are handled. And you're shaping how services communicate.

That's platform engineering work.

The issue isn't that this work exists. The issue is that most teams take it on unintentionally, without the structure, tooling, or dedicated ownership a real platform team would require.

The Cost Is Not Complexity. It Is Time

It's easy to describe this problem as “complexity.” But that undersells it.

The real cost shows up in how your team spends its time.

Deploys that should take minutes stretch into hours. Then days. Engineers context-switch from product work into debugging CI caches, fixing misconfigured secrets, or tracing network failures across services.

Releases slow down. Not because your team can't build features, but because shipping them becomes unpredictable.

Onboarding gets harder. New engineers don't just learn the codebase. They have to learn your deployment system.

None of this appears on a roadmap. But it directly impacts how fast you can move.

Why “It Works on My Machine” Still Exists

We were supposed to have solved this: Containers. Infrastructure as code. Reproducible builds.

Yet the gap between local and production still shows up at the worst possible moment.

Because the problem was never just environment parity. It's system parity.

Your local setup doesn't include the same limits, permissions, network paths, or scaling behavior as production.

Those differences only surface when everything is wired together. Which means they surface during deploys.

Fragmentation Is the Root Problem

Modern tooling didn't remove infrastructure complexity. It redistributed it.

Instead of managing servers, you manage integrations between services. Instead of a single failure domain, you have many.

A deploy can fail because of a CI issue, a registry timeout, a secret misconfiguration, a networking rule, or a scaling limit.

Each lives in a different system. Each requires different context.

Individually, these tools are well-designed. Collectively, they form a system that's hard to reason about under pressure.

This Model Breaks as You Scale

This only works while your system is small. But production systems don't stay small.

More services mean more pipelines. More configurations. More failure points.

Over time, the effort required to maintain your deployment system grows faster than the product itself.

That is the inflection point: where engineering time shifts away from building features and toward maintaining the machinery that ships them.

If you're already feeling that shift, it's not temporary. It's structural.

At some point, there's a question that becomes hard to ignore: Why are you still managing this yourself?

Not because you can't. But because it's no longer clear that you should.

The Shift Toward Platforms

This is where Platform as a Service changes the model. Not by adding more tools, but by taking ownership of the system those tools create.

A PaaS defines a path from code to production. That path is opinionated, constrained, and consistent.

Those constraints aren't limitations. They're what remove entire categories of failure.

Instead of assembling a deployment pipeline, you adopt one.

What You Stop Paying For

Moving to a PaaS is often framed as convenience. For production teams, it's closer to cost removal.

You stop spending time deciding how builds run, how services are exposed, how scaling is configured, and how logs are collected.

You stop debugging the integration points between those decisions. You trade flexibility for predictability.

And for most teams, predictability is the constraint that actually matters.

From Infrastructure Work Back to Product Work

The biggest change isn't in your architecture. It's in your allocation of engineering effort.

Time spent debugging deploys shifts back to building features. Time spent maintaining pipelines shifts to improving the product.

Deploys become routine again. Not because they're simpler in theory, but because the system around them is controlled.

Collapsing the Stack

The advantage of a PaaS isn't abstraction. It's consolidation.

Build, deploy, runtime, and observability are integrated into a single system.

There are fewer layers to coordinate. Fewer places to look when something fails. And fewer decisions to get wrong.

Platforms like Sevalla, Railway, and Render are pushing this further by tightening the loop between code and production, reducing both the number of systems involved and the surface area developers need to understand.

The goal is operational clarity.

The Trade-Off You Are Actually Making

The common objection is control. And it's valid. You give up the ability to customize every layer of your infrastructure.

But in practice, most teams aren't using that control to create differentiation. They're using it to keep a fragile system running, which keeps them stuck maintaining systems they shouldn’t own.

Every custom configuration adds another failure point. Another dependency. Another thing to maintain under pressure.

The trade-off isn't control versus convenience. It's control versus reliability.

When This Becomes Urgent

You don't need a major outage to justify a change. The signals show up earlier.

Deploys feel unpredictable. Releases slow down. Engineers spend more time on pipelines than product logic. Onboarding takes longer than it should.

These aren't isolated issues. They are indicators that your current model isn't scaling with your system.

When Managing Infra Still Makes Sense

A PaaS may not be right for every team.

If your app is still small, deployments are smooth, and your team isn't spending much time on infrastructure, you may not need a PaaS yet.

Some large companies also choose to build and manage their own platforms. For them, infrastructure is an important part of the business, so the extra work is worth it.

The important thing is making that choice on purpose.

Managing infrastructure is not always a bad thing. The real problem starts when app teams slowly take on platform work without enough people, clear ownership, or the right experience to handle it well.

What a “Simple Deploy” Actually Means

A simple deploy isn't one that feels easy when everything works. It's one that continues to work as your system grows.

It's predictable. Failures are rare. When they happen, they're easy to diagnose.

And most importantly, it doesn't require your engineers to think about infrastructure to ship code.

That outcome isn't achieved by adding more tools. It's achieved by reducing the system you have to manage.

Closing Thought

Your deploy didn't turn into a week of infrastructure work because you missed something. It turned into that because you're operating a model that expects you to.

You can continue investing in that model. Or you can adopt one where deploying is a solved problem.

For production teams, that's no longer a philosophical choice. It's an operational one.

Join my Applied AI newsletter to learn how to build and ship real AI systems. Practical projects, production-ready code, and direct Q&A. You can also connect with me on LinkedIn.

The Real Infrastructure Behind Remote Work (It’s Not Just Wi-Fi)

Manish Shivanandhan — Wed, 06 May 2026 22:44:54 +0000

Remote work looks simple from the outside: a laptop, a quiet corner, and a stable Wi-Fi connection. That's the image most people have in mind.

It suggests freedom without friction, mobility without tradeoffs.

But the reality is more complex. Remote work isn't powered by a single connection. It runs on a layered system of infrastructure that most people never think about until something breaks.

When your video call freezes, your VPN drops, or your access fails at the worst possible time, you start to see the hidden machinery.

To understand remote work properly, you have to look beyond Wi-Fi. What matters is the entire stack that sits underneath it.

What We'll Cover:

Connectivity Is a System, Not a Signal
The Cloud Is Your Real Workplace
Identity Has Replaced Location
The VPN Bottleneck
Real Mobility Requires Network Flexibility
Latency Is the Hidden Constraint
Hardware Still Matters
Collaboration Depends on Synchronization
The Illusion of Simplicity
Building a Resilient Remote Setup
Remote Work Is an Infrastructure Problem

Connectivity Is a System, Not a Signal

Wi-Fi is only the last hop in a much larger network. It's the interface, not the infrastructure.

When you join a call or access a system, your data travels through local routers, internet service providers, undersea cables, cloud networks, and finally into the services you depend on. Each layer introduces latency, reliability constraints, and points of failure.

This is why two networks that both show “full bars” can behave very differently. One might route traffic efficiently through stable backbone providers. The other might be congested, poorly peered, or geographically inefficient.

For remote workers, especially those who travel or move between cities, this variability becomes a constant factor. You're not just relying on a connection. You're relying on the quality of the path your data takes.

The Cloud Is Your Real Workplace

Your office is no longer a building. It's a distributed system.

Every tool you use, from document editing to project management, runs on cloud infrastructure. Platforms like Google Workspace, Microsoft 365, and Notion aren't just applications. They're environments where your work lives.

This shift changes the nature of reliability. In a traditional office, your main dependency was local infrastructure. Now, your ability to work depends on global uptime, distributed servers, and content delivery networks.

It also means that performance is tied to geography. The distance between you and a cloud region affects how responsive your tools feel. Even small delays compound over time, especially in collaborative workflows.

Remote work isn't just about accessing tools. It's about accessing them efficiently.
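To put a rough number on the geography point, here is a small sketch of the best-case round trip over a given distance. It assumes signals travel through fibre at roughly 200,000 km per second (about two-thirds of the speed of light in a vacuum) and ignores routing detours, queuing, and processing time, so real round trips will be higher. The function name and the distances are illustrative.

```python
# Approximate signal speed in optical fibre (assumption: ~2/3 of c).
FIBRE_SPEED_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round trip in milliseconds for a one-way distance in km."""
    return 2 * distance_km * 1000 / FIBRE_SPEED_KM_PER_S


# Hypothetical distances between a worker and a cloud region.
nearby_ms = min_round_trip_ms(500)         # 5.0 ms
far_away_ms = min_round_trip_ms(10_000)    # 100.0 ms

print(f"nearby region:  {nearby_ms} ms")
print(f"distant region: {far_away_ms} ms")
```

A tool that needs several sequential round trips per action multiplies this floor, which is why a distant region can feel sluggish even on a fast connection.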

Identity Has Replaced Location

In an office, access was tied to where you were. Inside the network meant trusted, while outside meant restricted.

Remote work breaks that model. Now, identity is the perimeter.

Authentication systems, single sign-on providers, and device trust mechanisms define whether you can work. Tools like Okta and Microsoft Entra ID act as gatekeepers to your entire workflow.

This introduces a new dependency layer. If identity systems fail or misbehave, work stops completely. It doesn't matter how strong your internet connection is. Without authentication, you can't access anything.

This is why remote work infrastructure is tightly coupled with security architecture. Convenience and control are constantly balanced, often in ways that users only notice when friction appears.

The VPN Bottleneck

For many organizations, remote access still runs through virtual private networks. A VPN creates a secure tunnel into corporate systems, but it also introduces overhead.

Traffic is routed through centralized gateways, which can become bottlenecks. Latency increases. Performance drops. Simple tasks feel slower than they should.

Modern architectures are shifting toward zero trust models, where access is granted per request rather than through a single tunnel. Cloudflare is one of the more widely used providers of this kind of access, especially among enterprises. But the transition is uneven.

Many remote workers still operate in hybrid setups, where some tools are cloud-native while others require legacy access paths.

This mismatch creates inconsistency. Some apps feel instant. Others feel like they belong to a different era.

Real Mobility Requires Network Flexibility

One of the promises of remote work is location independence. In practice, this is harder than it sounds.

Moving between networks introduces friction. Public Wi-Fi can be unreliable or insecure. Local SIM cards require setup, verification, and often physical access. Roaming charges can be unpredictable and expensive.

This is where newer connectivity models start to matter. An international eSIM allows you to provision mobile data across countries without swapping physical cards. It removes one layer of operational overhead.

More importantly, it gives you redundancy. If a local network fails, you can switch to a mobile connection instantly. That fallback can be the difference between missing a critical meeting and continuing without disruption.

Remote work isn't just about having a connection. It's about having options when that connection fails.

Latency Is the Hidden Constraint

Most people think in terms of speed. Faster internet is assumed to be better.

But for remote work, latency is often more important than bandwidth. A high-speed connection with poor latency will still feel slow in interactive tasks like video calls, remote desktops, or collaborative editing.

Latency is affected by distance, routing efficiency, and network congestion. It's also harder to control. You can't simply upgrade your plan to fix it.

This is why experienced remote workers optimize for stability over raw speed. A consistent connection with predictable latency is more valuable than a fast but volatile one.
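A back-of-the-envelope model makes the latency point concrete. The sketch below assumes an interactive action costs a number of sequential round trips plus the time to push its bytes through the link. The payload size, round-trip counts, and link figures are made up for illustration, and the model ignores congestion and protocol details.

```python
def exchange_time_s(payload_bytes: int, bandwidth_mbps: float,
                    rtt_ms: float, round_trips: int) -> float:
    """Rough time for one interaction: round-trip waits plus transfer time."""
    latency_cost = round_trips * (rtt_ms / 1000.0)
    bandwidth_cost = (payload_bytes * 8) / (bandwidth_mbps * 1_000_000)
    return latency_cost + bandwidth_cost


# Hypothetical interactive action: 20 KB of data over 4 sequential round trips.
payload = 20_000

# A fast link with poor latency versus a modest link with good latency.
fast_but_distant = exchange_time_s(payload, bandwidth_mbps=500, rtt_ms=250, round_trips=4)
slow_but_close = exchange_time_s(payload, bandwidth_mbps=20, rtt_ms=20, round_trips=4)

print(f"500 Mbps, 250 ms RTT: {fast_but_distant:.3f} s")  # about 1.0 s
print(f" 20 Mbps,  20 ms RTT: {slow_but_close:.3f} s")    # about 0.09 s
```

Under these assumptions the slower plan wins comfortably, because almost all of the time is spent waiting on round trips rather than moving data.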

Hardware Still Matters

It's easy to focus entirely on networks and software, but hardware plays a critical role.

Your laptop’s thermal performance affects sustained workloads. Your webcam and microphone influence how you're perceived in meetings. Your router determines how well your local network handles multiple devices.

Even power reliability becomes part of the equation. In some locations, unstable electricity can interrupt work more often than network issues.

Remote work infrastructure extends all the way to the physical layer. Ignoring it creates weak points that show up at the worst times.

Collaboration Depends on Synchronization

Working remotely isn't just about individual productivity. It's also about coordination.

Time zones, asynchronous communication, and real-time collaboration tools all interact in complex ways. A delay in one system can ripple through an entire team’s workflow.

For example, a slow connection during a shared document session can lead to version conflicts. A dropped call can delay decisions. A failed upload can block downstream tasks.

These aren't isolated issues. They're systemic effects of how distributed systems behave under imperfect conditions.

The more distributed your team becomes, the more important infrastructure reliability becomes.

The Illusion of Simplicity

Remote work tools are designed to feel simple. Join a call. Open a document. Send a message.

But this simplicity is an abstraction. Underneath it is a dense network of dependencies, each with its own failure modes.

When everything works, the system feels invisible. When something breaks, the complexity becomes obvious very quickly.

Understanding this helps set realistic expectations. It also changes how you approach your setup. Instead of optimizing for convenience alone, you start optimizing for resilience.

Building a Resilient Remote Setup

A robust remote work setup is not defined by a single tool or connection. It's defined by how well it handles failure.

This means having backup connectivity, whether through mobile data or an international eSIM. It means choosing tools that degrade gracefully under poor network conditions. It means understanding where your bottlenecks are and planning around them.

It also means accepting that no setup is perfect. The goal isn't to eliminate failure, but to reduce its impact.
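The idea of planning for failure can be sketched as a simple priority list of connections with a health probe for each. Everything here is hypothetical: the link names, the latency budget, and the probes, which are stand-ins that return fixed values. A real probe would time a small request to a service you depend on.

```python
from typing import Callable, Optional, Sequence, Tuple

# A probe returns measured latency in ms, or None if the link is unusable.
Probe = Callable[[], Optional[float]]


def pick_connection(links: Sequence[Tuple[str, Probe]],
                    max_latency_ms: float = 300.0) -> Optional[str]:
    """Return the first link, in priority order, that responds within budget."""
    for name, probe in links:
        latency_ms = probe()
        if latency_ms is not None and latency_ms <= max_latency_ms:
            return name
    return None  # nothing usable: time to reschedule rather than retry blindly


# Stand-in probes with fixed results, for illustration only.
links = [
    ("home-wifi", lambda: None),      # primary link is down
    ("mobile-esim", lambda: 120.0),   # backup responds in 120 ms
]

active = pick_connection(links)
print(active)  # mobile-esim
```

The point of the sketch is the ordering, not the probe: deciding in advance which fallback to use is what reduces the impact of a failure.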

Remote Work Is an Infrastructure Problem

The narrative around remote work often focuses on lifestyle: freedom, flexibility, and autonomy.

Those benefits are real, but they're built on top of infrastructure. Without reliable systems, the experience breaks down quickly.

What looks like a simple setup is actually a distributed architecture that spans networks, cloud platforms, identity systems, and physical hardware.

The better you understand that architecture, the better you can navigate it.

Wi-Fi is just the surface. The real work happens underneath.


The Hidden Tax of Infrastructure: Why Your Team Shouldn’t Be Running It Anymore

Manish Shivanandhan — Thu, 23 Apr 2026 17:05:15 +0000

Most engineering teams don't set out to manage infrastructure. They start with a product idea, a customer need, or a business problem.

Infrastructure enters the picture as a means to an end. Servers need to be provisioned. Databases need to be configured. Networks need to be secured. At first, this work feels necessary and even empowering. It gives teams control.

But over time, that control turns into a burden.

What begins as a few Terraform scripts or cloud console clicks evolves into a growing layer of responsibility.

Teams find themselves maintaining deployment pipelines, debugging networking issues, rotating credentials, patching systems, and responding to incidents unrelated to their product logic.

This is the hidden tax of infrastructure. It's not a line item in your budget, but it is paid every day in engineering time, cognitive load, and lost focus.

What We'll Cover:

Infrastructure is Not a One-Time Cost
The Cognitive Load Problem
Reliability is Harder Than it Looks
Security and Compliance Never Stand Still
The Illusion of Control
The Rise of PaaS as an Alternative
Speed is a Competitive Advantage
Cost is More Than the Cloud Bills
Rethinking Ownership

Infrastructure is Not a One-Time Cost

A common mistake teams make is treating infrastructure as a setup task. Something you “get right” once and move on from.

In reality, infrastructure is a continuous system. It changes with scale, traffic patterns, security threats, and team structure.

Every component you introduce adds a long tail of operational work. A load balancer isn't just a load balancer. It requires configuration tuning, monitoring, failover planning, and periodic upgrades. A database isn't just storage. It brings backup strategies, replication concerns, indexing decisions, and performance tuning.

Even with infrastructure-as-code tools, the maintenance burden doesn't disappear. It becomes codified, but it still exists. Engineers must review changes, manage state, handle drift, and respond when things break.

The cost compounds quietly. It shows up in slower delivery cycles, longer onboarding times for new engineers, and increased risk during deployments. It's not visible in sprint planning, but it's always there.

The Cognitive Load Problem

One of the most underestimated aspects of infrastructure management is cognitive load.

Modern systems are complex. Distributed architectures, microservices, container orchestration, and multi-region deployments all introduce layers of abstraction that engineers must understand.

When a team owns its infrastructure, every engineer becomes partially responsible for this complexity. Even if you have dedicated platform engineers, application developers still need to understand enough to debug issues and deploy changes safely.

This context switching has a real cost. An engineer working on a feature must also think about container resource limits, networking rules, observability gaps, and failure modes. Instead of focusing on business logic, they're juggling operational concerns.

Cognitive load slows teams down. It increases the chance of mistakes. It makes systems harder to reason about. And it reduces the time engineers spend on the work that actually differentiates your product.

Reliability is Harder Than it Looks

Running infrastructure in production means owning reliability. This includes uptime, latency, data integrity, and incident response. Many teams underestimate how difficult this is to do well.

High availability isn't just about redundancy. It requires careful design, testing, and ongoing validation. Failover mechanisms must be exercised. Monitoring systems must be tuned to detect real issues without creating noise. Incident response processes must be defined and practised.
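As one small illustration of tuning alerts to reduce noise, the sketch below fires only when a metric stays above a threshold for several consecutive readings, so a single blip does not page anyone. The threshold, the window length, and the function name are assumptions chosen for the example, not recommended values.

```python
from typing import Sequence


def should_alert(error_rates: Sequence[float],
                 threshold: float = 5.0,
                 consecutive: int = 3) -> bool:
    """Fire only if the last `consecutive` readings all exceed the threshold."""
    if len(error_rates) < consecutive:
        return False
    return all(rate > threshold for rate in error_rates[-consecutive:])


# Hypothetical error-rate readings (percent), one per minute.
single_blip = [1.0, 9.0, 1.0, 2.0]
sustained = [1.0, 6.0, 7.0, 8.0]

print(should_alert(single_blip))  # False
print(should_alert(sustained))    # True
```

Even a rule this small has to be owned by someone: the threshold needs revisiting as traffic changes, which is part of the ongoing work this section describes.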

When something goes wrong, the cost is immediate and visible. Engineers are pulled into debugging sessions. Customers are affected. Business metrics drop. Postmortems are written. Action items are created, which often add more infrastructure complexity.

Over time, teams build layers of safeguards and tooling to improve reliability. But each layer adds more to manage. The system becomes harder to change. The risk of unintended consequences increases.

This is the paradox of self-managed infrastructure. The more you invest in reliability, the more complex your system becomes, and the more effort it takes to maintain that reliability.

Security and Compliance Never Stand Still

Security is another dimension where the hidden tax becomes clear. Threats evolve constantly. Best practices change. Compliance requirements grow more stringent.

When you run your own infrastructure, you're responsible for staying ahead of these changes. This includes patching systems, managing access controls, encrypting data, auditing logs, and responding to vulnerabilities.

Even small gaps can have serious consequences. A misconfigured permission, an outdated dependency, or an exposed endpoint can lead to breaches. The cost of prevention is an ongoing effort. The cost of failure can be catastrophic.

Compliance adds another layer. For teams in regulated industries, infrastructure must meet specific standards. This often requires documentation, audits, and controls that go beyond basic security practices.

All of this work is necessary, but it doesn't directly contribute to your product’s value. It's part of the hidden tax you pay for owning infrastructure.

The Illusion of Control

One of the main reasons teams continue to manage their own infrastructure is the belief that it gives them control. They can customise everything. They can optimise for their specific needs. They aren't dependent on external platforms.

While this is true in theory, in practice, the level of control is often overstated. Most teams don't need deep customisation at the infrastructure level. They need reliability, scalability, and predictable behaviour.

The control you gain comes at the cost of responsibility. Every customisation must be maintained. Every optimisation must be monitored. Every deviation from standard patterns increases the risk of issues.

In many cases, teams end up recreating capabilities that are already available in managed platforms. They build internal tooling for deployment, scaling, and monitoring, only to maintain it indefinitely.

The question isn't whether you can manage your own infrastructure. It's whether you should. Most small to mid-sized teams shouldn't be managing infrastructure at all. If it's not your competitive advantage, it's a distraction.

When Managing Your Own Infrastructure Actually Makes Sense

It would be incorrect to say that no team should manage its own infrastructure. There are cases where it's not just justified, but necessary.

Large-scale systems with highly specific performance or latency requirements often need deep control over infrastructure. Companies operating at the scale of Netflix or Uber invest heavily in custom infrastructure because small optimisations can translate into significant cost savings or improvements in user experience.

Similarly, teams working in highly regulated environments may require strict control over data residency, auditability, and security boundaries. In some cases, compliance frameworks or internal risk policies limit the use of third-party platforms, making self-managed infrastructure the only viable option.

There's also a class of companies where infrastructure itself is part of the product. Cloud providers, developer platforms, and data infrastructure companies are clear examples. For these teams, building and operating infrastructure isn't a distraction. It's the core business.

Finally, organisations with mature platform engineering teams can justify owning infrastructure when they're able to abstract complexity away from application developers. In these setups, internal platforms function similarly to PaaS, but are tailored to the organisation’s specific needs.

The common thread across all of these cases is scale, specialisation, or strategic necessity. Managing infrastructure makes sense when it creates a clear competitive advantage or satisfies constraints that cannot be addressed otherwise.

For most small to mid-sized teams, none of these conditions apply. The infrastructure they build doesn't differentiate their product, but it still carries the full operational burden.

The Rise of PaaS as an Alternative

Platform-as-a-Service, or PaaS, changes the equation. Instead of managing infrastructure directly, teams deploy applications to a platform that handles the underlying complexity.

With PaaS, concerns like provisioning, scaling, load balancing, and patching are abstracted away. Engineers focus on code and configuration, not on servers and networks.

This doesn't eliminate all operational work, but it shifts the responsibility. The platform provider handles the heavy lifting. Your team benefits from standardised, battle-tested infrastructure without having to build and maintain it.

PaaS also reduces cognitive load. Developers interact with a simpler interface. Deployments become more predictable. Observability is often built in. This allows teams to move faster and with greater confidence.

Importantly, PaaS aligns infrastructure with application needs. Instead of designing infrastructure first and fitting applications into it, teams define what their application requires, and the platform provides it.

Heroku was the first to bring PaaS mainstream. Since Heroku is shutting down, I moved to Sevalla for its simplicity and the speed with which new features, especially agentic tools, are introduced. Here is a list of alternatives.

Speed is a Competitive Advantage

In most markets, speed matters. The ability to ship features quickly, respond to feedback, and iterate on ideas is a key competitive advantage.

Infrastructure management can slow this down. Changes require coordination. Deployments carry risk. Debugging issues takes time away from development.

By reducing the infrastructure burden, PaaS enables faster delivery. Teams can deploy changes more frequently. They can experiment with new ideas without worrying about underlying systems. They can recover from failures more quickly.

This isn't just about engineering efficiency. It has a direct impact on business outcomes. Faster delivery leads to better products, happier customers, and a stronger market position.

Cost is More Than the Cloud Bills

When teams evaluate infrastructure strategies, they often focus on direct costs. Cloud bills, reserved instances, and resource utilisation are measured and optimised.

But the hidden tax of infrastructure is mostly indirect. It includes engineering time spent on maintenance, the opportunity cost of delayed features, and the risk of outages and security incidents.

These costs are harder to quantify, but they're often larger than the direct costs. A single incident can consume days of engineering time. A delayed feature can impact revenue. A security breach can damage a reputation.

PaaS may appear more expensive on paper, but it often reduces total cost when you account for these hidden factors. It shifts spending from operational overhead to product development.
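A rough way to compare the two options is to add engineering time to the bill. The sketch below does only that. Every number in it is invented for illustration, and the result depends entirely on the figures you plug in, so treat it as a template for your own estimate rather than evidence.

```python
def monthly_total_cost(platform_bill: float,
                       infra_hours: float,
                       hourly_engineer_cost: float) -> float:
    """Direct bill plus the cost of engineering time spent on infrastructure."""
    return platform_bill + infra_hours * hourly_engineer_cost


# Hypothetical figures: a cheaper bill with more maintenance time,
# versus a higher bill with less maintenance time.
self_managed = monthly_total_cost(platform_bill=2_000, infra_hours=60, hourly_engineer_cost=80)
paas = monthly_total_cost(platform_bill=3_500, infra_hours=10, hourly_engineer_cost=80)

print(self_managed)  # 6800
print(paas)          # 4300
```

This still leaves out the harder items mentioned above, such as delayed features and incident risk, which usually push the comparison further in the same direction.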

Rethinking Ownership

The core question isn't about tools or technologies. It's about ownership. What should your team own, and what should it delegate?

Your product is your core asset. It's what differentiates you in the market. Infrastructure, while critical, is a means to support that product.

By continuing to manage infrastructure, teams take on responsibilities that don't directly contribute to their goals. They pay the hidden tax in time, focus, and risk.

PaaS offers a way to rebalance this. It allows teams to delegate infrastructure concerns and focus on building value.

The shift isn't always easy. It requires changes in mindset, tooling, and processes. But for many teams, it's a necessary step.

Because the real cost of infrastructure isn't what you pay your cloud provider. It's what you give up to run it yourself.


From Metrics to Meaning: How PaaS Helps Developers Understand Production

Manish Shivanandhan — Thu, 23 Apr 2026 16:52:22 +0000

Modern production systems generate more data than most developers can realistically process.

Every request emits logs. Every service exports metrics. Every dependency introduces another layer of signals.

In theory, this should make systems easier to understand. In practice, it does the opposite.

Dashboards become dense, alerts become noisy, and when something breaks, the same questions still come up: What's actually wrong? Who's affected? Where do you even start?

The problem isn't observability. It's interpretation.

Most teams aren't short on metrics. They're short on meaning.

And that gap exists because developers are often forced to reason about infrastructure when they should be focused on application behaviour.

Metrics exist to describe systems, but without the right level of abstraction, they become another layer of complexity.

This is where modern PaaS platforms change the equation. They don't remove metrics. Instead, they turn them into signals that developers can actually use.

This article breaks down five metrics that consistently matter in production systems. More importantly, it shows how a PaaS helps translate these metrics into something actionable, without requiring developers to act as infrastructure operators.

I’ll be using the Sevalla dashboard to explain these metrics, but other platforms like Railway and Render will have similar metrics.

What We'll Cover:

What a PaaS Actually Does
Latency Becomes a Clear Performance Signal
Error Rate Becomes a Reliable Indicator of Failure
Throughput Becomes Context Instead of a Problem
Resource Utilisation Moves Out of the Critical Path
Instance Health Becomes Invisible by Design
From Metrics to Meaning
Why This Matters for Developers
The Real Advantage Is Clarity

What a PaaS Actually Does

A Platform as a Service (PaaS) is an abstraction layer over infrastructure that handles deployment, scaling, networking, and runtime management for you.

Instead of provisioning servers, configuring load balancers, and setting up autoscaling rules, you deploy your application and the platform takes care of how it runs in production.

Platforms like Sevalla, Railway, and Render operate on this model. The key shift is responsibility.

In a traditional setup, developers are responsible for both application behaviour and infrastructure behaviour. If latency spikes or errors increase, you have to determine whether the issue is in your code, your scaling rules, or the underlying system.

A PaaS moves most of that infrastructure responsibility into the platform.

You still get access to metrics, but many of the variables behind those metrics – instance lifecycle, scaling decisions, resource allocation – are handled automatically.

This changes how you interpret what you see.

Metrics stop being signals that require cross-layer investigation, and start becoming signals that map more directly to application behaviour.

Now let's see what can happen if your team switches to using a PaaS.

Latency Becomes a Clear Performance Signal

Latency is the most direct representation of user experience. It tells you how long your system takes to respond.

When latency increases, users feel it immediately. Pages slow down. APIs become unreliable. Even small delays impact engagement.

Most developers know to look at percentiles like p95 or p99 instead of averages. The slowest requests are what define perceived performance.

But in many environments, understanding latency isn't straightforward.

A spike could come from inefficient code. Or from cold starts. Or from scaling delays. Or from network routing issues. Developers are forced to investigate layers they didn't build.

This is where a PaaS changes the role of latency.

Instead of being a starting point for infrastructure debugging, latency becomes a clean signal of application performance. Scaling, routing, and resource allocation are handled by the platform. What remains is a clearer relationship between code and outcome.

When latency increases, developers can focus on what they actually control: queries, logic, and dependencies.

The metric stays the same. The meaning becomes clearer.
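The earlier point about percentiles versus averages is easy to see with a few numbers. This sketch uses a simple nearest-rank percentile and a made-up set of latencies in which most requests are fast and a minority are slow. Monitoring tools may compute percentiles with different interpolation rules, so exact values can differ.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests and 10 slow ones.
latencies_ms = [100] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

print(mean_ms)  # 290.0
print(p50)      # 100
print(p95)      # 2000
```

The average suggests a mildly slow service, the median suggests nothing is wrong, and only the p95 shows that one in ten users waited two seconds.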

Error Rate Becomes a Reliable Indicator of Failure

Error rate answers a simple question. Is the system working or not?

It's usually measured as the percentage of requests that fail due to server-side issues. These are failures users can't recover from. A broken checkout flow or a failed API call directly impacts trust.

In theory, error rate should be one of the easiest metrics to act on. In practice, it rarely is.

Errors can come from application bugs, but also from timeouts, resource limits, failed deployments, or unstable instances. Developers end up correlating errors with infrastructure events just to understand what happened.

This slows everything down.

A PaaS reduces this ambiguity.

Failures caused by scaling, instance crashes, or transient infrastructure issues are handled at the platform level. Retries, isolation, and recovery mechanisms are built in.

What remains is a tighter link between error rate and application correctness.

When the error rate increases, it's far more likely to be something in the code or a dependency, not an invisible infrastructure issue.

This shifts the error rate from a noisy metric into a reliable signal.
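As a minimal sketch of the definition used in this section, the function below treats HTTP 5xx responses as server-side failures and reports them as a percentage of all requests. The status codes are a made-up sample, and a real system might also count timeouts or dropped connections that never produce a status code.

```python
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Percentage of requests that failed server-side (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * server_errors / len(status_codes)


# Hypothetical window of 100 responses. The 404s are client errors
# and are deliberately not counted as failures here.
window = [200] * 95 + [404] * 2 + [500, 502, 503]

rate = error_rate(window)
print(rate)  # 3.0
```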

Throughput Becomes Context Instead of a Problem

Throughput measures how many requests your system handles over time.

It provides context for everything else. Latency and error rate only make sense when you know how much traffic the system is handling.

A spike in latency during high traffic is expected. The same spike during low traffic is a warning sign.

But in many systems, throughput introduces operational complexity. Traffic changes require scaling decisions. Teams define autoscaling rules, tune thresholds, and try to predict demand. When things go wrong, they revisit those decisions.

Developers end up thinking about capacity instead of behaviour.

A PaaS shifts this responsibility. Scaling is automatic. Traffic spikes are absorbed by the platform. Developers don't need to decide how many instances should be running or when to scale.

Throughput becomes what it should be: context.

It helps explain what's happening, without forcing developers to manage how the system adapts.
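Reading latency in the context of throughput can be sketched as a small classification rule. The baselines, the 1.5x spike factor, and the wording of the labels are all assumptions for illustration. In practice you would derive baselines from your own traffic history.

```python
def interpret_latency(p95_ms: float, baseline_p95_ms: float,
                      rps: float, baseline_rps: float,
                      spike_factor: float = 1.5) -> str:
    """Classify a latency reading using request rate as context."""
    if p95_ms < baseline_p95_ms * spike_factor:
        return "normal"
    if rps >= baseline_rps * spike_factor:
        return "spike under high traffic: likely load-related"
    return "spike under normal or low traffic: investigate the application"


# Hypothetical baselines: p95 of 200 ms at 100 requests per second.
print(interpret_latency(210, 200, 100, 100))  # normal
print(interpret_latency(450, 200, 400, 100))  # spike under high traffic: likely load-related
print(interpret_latency(450, 200, 40, 100))   # spike under normal or low traffic: investigate the application
```

The same latency reading produces two different conclusions, which is the sense in which throughput acts as context rather than as a problem to manage.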

Resource Utilisation Moves Out of the Critical Path

Resource utilisation measures how much CPU, memory, and I/O your system consumes.

Traditionally, this has been central to running production systems. High CPU or memory usage signals potential issues. Teams monitor these metrics to avoid failures and plan scaling.

But for most developers, resource utilisation isn't where value is created.

Yet in many environments, developers are still responsible for interpreting these signals. They tune memory limits, investigate CPU spikes, and try to optimise resource usage to keep systems stable.

This is operational work.

A PaaS changes the role of these metrics.

Resource management is handled by the platform. Allocation, scaling, and isolation happen automatically. Developers don't need to constantly watch CPU graphs or memory charts to keep the system running.

These metrics still exist, but they move into the background.

They become diagnostic tools rather than primary signals.

Developers can focus on performance at the application level, instead of managing how infrastructure behaves under load.

Instance Health Becomes Invisible by Design

Instance health tracks restarts, crashes, and lifecycle events.

In many systems, this is a critical metric. Frequent restarts indicate instability. Memory leaks, crashes, or resource exhaustion often show up here first.

Teams monitor instance health to catch issues early and prevent cascading failures.

But this also reveals something important: developers are aware of, and responsible for, the lifecycle of infrastructure. They track restarts, investigate crashes, and try to stabilise the system manually.

A PaaS removes this responsibility.

Unhealthy instances are restarted automatically. Load is redistributed. Capacity is maintained without manual intervention.

Instance health doesn't disappear, but it no longer requires constant attention. It becomes part of the platform’s internal behaviour, not something developers need to actively manage.

From Metrics to Meaning

These five metrics haven't changed.

Latency still reflects performance. Error rate still reflects correctness. Throughput still reflects demand. Resource utilisation still reflects efficiency. Instance health still reflects stability.

What changes is how much work it takes to interpret them.

In lower-level environments, developers have to connect these signals themselves. A latency spike leads to checking throughput, then resource usage, then instance behaviour. Each step requires context, assumptions, and time.

This is where complexity accumulates.

A PaaS reduces that gap.

It handles scaling, recovery, and resource management so that metrics map more directly to application behaviour. The signals become easier to interpret because fewer variables are exposed.

Instead of asking multiple questions across layers, developers can move more directly from symptom to cause.

Why This Matters for Developers

Most developers don't want to manage infrastructure. They want to build features, ship improvements, and respond to user needs.

But as systems grow, operational responsibility expands. Monitoring becomes more complex. Debugging requires more context. A significant portion of time shifts from building to maintaining.

Metrics are part of this shift.

They're necessary, but they also reflect how much of the system you're responsible for understanding.

A PaaS doesn't eliminate metrics. It reduces the effort required to make sense of them.

It ensures that when something changes in production, the signals developers see are closer to the reality they care about: application behaviour, user experience, and system correctness.

The Real Advantage Is Clarity

The goal is not to have fewer metrics.

It's to have metrics that mean something without requiring deep infrastructure reasoning.

These five metrics form a complete picture of system health. But their real value depends on how directly they map to what developers control.

The more layers you have to reason about, the harder that mapping becomes.

A good PaaS removes those layers. It turns metrics from raw data into usable signals.

And that shift from metrics to meaning is what allows developers to understand production systems without being buried under them.


How to Get Started with Terraform

Manish Shivanandhan — Wed, 15 Apr 2026 16:25:48 +0000

Infrastructure has undergone a fundamental shift over the past decade.

What was once configured manually through dashboards and shell access is now defined declaratively in code. This shift isn't just about convenience. It's about repeatability, auditability, and control.

Terraform sits at the centre of this transformation. It allows you to define infrastructure using configuration files, apply those configurations consistently across environments, and evolve systems safely over time.

For teams building modern applications, especially on platform abstractions, Terraform becomes the control plane for everything from application deployment to databases and networking.

The open source Terraform provider from Sevalla extends this model by allowing teams to manage the entire application platform as code, not just underlying infrastructure. It enables you to define applications, databases, networking, storage, and deployment workflows in a single, unified configuration.

Instead of stitching together multiple tools or relying on manual setup, everything from code deployment to traffic routing and environment configuration can be expressed declaratively. This creates a consistent, repeatable system where environments can be replicated easily, changes are version-controlled, and production setups can evolve safely over time.

This article walks through how to go from zero to a production-ready setup using Terraform and the Sevalla Terraform Provider, focusing on practical concepts rather than theory.

What We'll Cover:

What Terraform Actually Does
Setting Up Terraform for the First Time
Understanding Providers, Resources, and Data Sources
Building a Real Application Stack
Managing Configuration and Secrets
Scaling and Process Configuration
Adding Networking and Traffic Management
Pipelines and Continuous Deployment
From Configuration to Production
Why Terraform Scales with You

What Terraform Actually Does

Terraform is an infrastructure-as-code tool that translates configuration files into real infrastructure. You describe the desired state of your system, and Terraform figures out how to achieve it.

At a high level, Terraform operates in three phases.

First, it initializes the working directory and downloads required providers. Providers are plugins that allow Terraform to interact with specific platforms.

Next, it creates an execution plan. This plan shows what resources will be created, modified, or destroyed to match your configuration.

Finally, it applies the plan, making the necessary API calls to bring your infrastructure into the desired state.

The key idea is that Terraform is declarative. You define what you want, not how to do it. Terraform handles the orchestration.
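To make the declarative idea concrete, here is a toy Python model of the planning step. It is a sketch under simplifying assumptions, not Terraform's actual algorithm: state is represented as a plain dictionary keyed by resource address, and the addresses and attributes are made up for the example.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compare desired and current state and list the actions needed."""
    shared = desired.keys() & current.keys()
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(key for key in shared if desired[key] != current[key]),
        "destroy": sorted(current.keys() - desired.keys()),
    }


# Hypothetical resource addresses and attributes
current_state = {
    "application.web": {"instances": 1},
    "database.legacy": {"size": "small"},
}
desired_state = {
    "application.web": {"instances": 2},
    "database.main": {"size": "small"},
}

print(plan(desired_state, current_state))
```

You only edit `desired_state`. The difference between the two dictionaries determines what has to be created, changed, or removed.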

This abstraction becomes extremely powerful as systems grow more complex.

Setting Up Terraform for the First Time

Getting started with Terraform requires very little setup. You install the CLI, create a working directory, and define a basic configuration.

A Terraform configuration is written in HCL, a domain-specific language designed to be human-readable. Even a simple configuration establishes the core concepts.

You define the required provider, configure authentication, and declare resources.

Here's a minimal example that provisions an application using a managed platform provider.

terraform {
  required_providers {
    sevalla = {
      source  = "sevalla-hosting/sevalla"
      version = "~> 1.0"
    }
  }
}

# Authentication is read from the SEVALLA_API_KEY environment variable
provider "sevalla" {}

# Look up the clusters available to this account
data "sevalla_clusters" "all" {}

# Deploy an application from a public Git repository
resource "sevalla_application" "web" {
  display_name = "my-web-app"
  cluster_id   = data.sevalla_clusters.all.clusters[0].id
  source       = "publicGit"
  repo_url     = "https://github.com/example/app"
}

This configuration does several things.

First, it declares the provider, which tells Terraform how to communicate with the platform. It also fetches available clusters using a data source. It then defines an application resource that points to a Git repository.

Even at this stage, you're already defining infrastructure in a reproducible way.

To execute this configuration, you set your API key and then run three Terraform commands: one to initialize the project, one to generate a plan, and one to apply it.

export SEVALLA_API_KEY="your-api-key"
terraform init
terraform plan
terraform apply

After applying, your application is deployed without manual steps.

Understanding Providers, Resources, and Data Sources

Terraform revolves around three core constructs.

Providers act as the bridge between Terraform and external systems. They expose APIs in a structured way that Terraform can use.

Resources represent the infrastructure you want to create. These are the building blocks of your system. Applications, databases, load balancers, and storage buckets are all modeled as resources.

Data sources allow you to query existing infrastructure. Instead of creating something new, you retrieve information that can be used elsewhere in your configuration.

The combination of these constructs allows you to build flexible and composable systems.

For example, you can fetch a list of available clusters using a data source and then dynamically assign your application to one of them. This reduces hardcoding and improves portability.

As your configuration grows, these abstractions help you maintain clarity and structure.

Building a Real Application Stack

A production system is rarely just a single application. It typically includes multiple components that need to work together.

With Terraform, you can define the entire stack in one place.

You might start with an application, then add a managed database, connect them internally, and expose the application through a load balancer.

A simplified flow looks like this.

You define the application resource that pulls code from a repository. You provision a database resource, such as PostgreSQL or Redis. You establish an internal connection between the application and the database. You configure environment variables for credentials. You optionally add a custom domain or routing layer.

Each of these components is a resource, and Terraform ensures they're created in the correct order.
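Terraform works out this order from the references between resources. The same idea can be sketched with a topological sort from Python's standard library. The resource names and dependency edges below are hypothetical and simply mirror the flow described above.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on (hypothetical names)
dependencies = {
    "application": set(),
    "database": set(),
    "internal_connection": {"application", "database"},
    "environment_variables": {"internal_connection"},
    "custom_domain": {"application"},
}

# static_order() yields every resource after all of its dependencies
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

Resources with no dependency between them, such as the application and the database here, can appear in either order, which is also why they can be created in parallel.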

This approach eliminates configuration drift. Instead of manually setting up each component, everything is defined in code and version-controlled.

It also makes environments consistent. Your staging and production setups can be identical except for a few variables.

Managing Configuration and Secrets

Production systems require configuration. This includes environment variables, API keys, and connection strings.

Terraform provides multiple ways to handle this.

You can define variables in your configuration and pass values at runtime. Sensitive values, such as API keys, are typically injected via environment variables.

For example, authentication is handled through an API key that can be set as an environment variable.

export SEVALLA_API_KEY="your-api-key"

This avoids hardcoding credentials in configuration files.
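If you wrap Terraform in a script, a small guard can fail fast when the key is missing. The following Python helper is a sketch. `require_env` is a hypothetical name, and only the `SEVALLA_API_KEY` variable comes from the example above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running terraform")
    return value
```

Calling `require_env("SEVALLA_API_KEY")` at the start of a deployment script surfaces a missing credential before any plan or apply step runs.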

You can also define environment variables as part of your infrastructure. This allows you to configure applications consistently across environments.

The important principle is separation of concerns. Infrastructure definitions should remain clean, while sensitive data is managed securely.

Scaling and Process Configuration

Modern applications often consist of multiple processes. A web server handles incoming requests, background workers process jobs, and scheduled tasks run periodically.

Terraform allows you to define these processes explicitly.

You can configure different process types, allocate resources, and scale them independently. This is particularly useful for handling variable workloads.

For example, you might scale web processes based on incoming traffic while keeping background workers at a steady level.

By defining this in code, scaling becomes predictable and repeatable.

You avoid manual intervention and ensure that your system behaves consistently under load.

Adding Networking and Traffic Management

As systems grow, managing traffic becomes more important.

Terraform enables you to define networking components such as load balancers and routing rules. You can map domains to applications, distribute traffic across multiple services, and control access.

This is essential for production readiness.

A load balancer can improve availability by distributing traffic across instances. Domain configuration ensures that users can access your application through a stable endpoint.

You can also define restrictions, such as IP allowlists, to enhance security.
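An IP allowlist boils down to a membership check against a set of CIDR ranges. The Python sketch below shows the logic using the standard `ipaddress` module. The function name and the example ranges (taken from the documentation address blocks) are assumptions for illustration, not part of any provider's API.

```python
import ipaddress


def is_allowed(client_ip: str, allowlist: list[str]) -> bool:
    """Return True if client_ip falls inside any CIDR range in the allowlist."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowlist)


office_ranges = ["203.0.113.0/24", "2001:db8::/32"]
print(is_allowed("203.0.113.7", office_ranges))
print(is_allowed("198.51.100.1", office_ranges))
```

When the rule lives in your Terraform configuration, the platform enforces this check for you and the list of ranges is reviewed like any other code change.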

All of this is managed declaratively, which reduces the risk of misconfiguration.

Pipelines and Continuous Deployment

Production systems require reliable deployment workflows.

You can use Terraform to define deployment pipelines and stages. This allows you to model how code moves from development to production.

You can define multiple stages, associate applications with each stage, and control how deployments are triggered.

This brings infrastructure and deployment logic into a single system.

Instead of relying on external scripts or manual processes, everything is defined in a structured and version-controlled way.

It also improves traceability. You can see exactly how a system is configured and how changes are applied over time.

From Configuration to Production

Moving from a simple setup to production involves more than just adding resources. It requires discipline in how you manage infrastructure.

Version control becomes critical. Every change to your infrastructure should go through code review. This reduces the risk of introducing breaking changes.

State management is another key aspect. Terraform keeps track of the current state of your infrastructure. This state must be stored securely and consistently, especially in team environments.

You also need to think about environment separation. Development, staging, and production should be isolated but defined using similar configurations.
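One way to picture "isolated but similar" is a shared base definition with a small set of per-environment overrides. The Python sketch below models that idea. All keys and values are hypothetical, and in a real Terraform project the overrides would typically live in per-environment variable files.

```python
# Settings shared by every environment (hypothetical keys and values)
BASE = {"instances": 1, "database_plan": "small", "custom_domain": None}

# Only the differences are listed per environment
OVERRIDES = {
    "development": {},
    "staging": {"instances": 2},
    "production": {
        "instances": 3,
        "database_plan": "large",
        "custom_domain": "app.example.com",
    },
}


def settings_for(environment: str) -> dict:
    """Merge the shared base with the overrides for one environment."""
    if environment not in OVERRIDES:
        raise ValueError(f"unknown environment: {environment}")
    return {**BASE, **OVERRIDES[environment]}


print(settings_for("staging"))
```

Keeping the overrides short makes it easy to see exactly how production differs from staging.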

Finally, observability should be integrated from the start. While Terraform provisions infrastructure, you need monitoring and logging to understand how it behaves in production.

Why Terraform Scales with You

Terraform works well for small projects, but its real value becomes apparent as systems grow.

As you add more services, environments, and dependencies, manual management becomes unsustainable. Terraform provides a structured way to manage this complexity.

It enforces consistency. It enables automation. It creates a single source of truth for your infrastructure.

Most importantly, it allows teams to move faster without sacrificing reliability.

By defining infrastructure as code, you reduce ambiguity. You make systems easier to understand, easier to debug, and easier to evolve.

That is what takes you from zero to production in a way that actually scales.


How to Make IT Operations More Efficient with AIOps: Build Smarter, Faster Systems

Balajee Asish Brahmandam — Fri, 09 May 2025 21:20:18 +0000

In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency.

These days, we can use artificial intelligence to manage and enhance IT operations. This is where AIOps comes into play.

AIOps is changing IT operations by letting teams create better, faster systems that can find and resolve problems on their own. It also helps them make the best use of resources and scale with fewer problems.

In this tutorial, you’ll learn about the key components of AIOps, how they interact with other IT systems, and how you can apply AIOps to improve the efficiency of your environment.

Here’s what we’ll cover:

What is AIOps?
- The Significance of AIOps for IT Operations
- AIOps can help address these challenges by
Getting Started with AIOps
Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management
Conclusion

What is AIOps?

AIOps stands for artificial intelligence for IT operations. It means using artificial intelligence and machine learning to enhance and streamline routine IT tasks.

AIOps systems apply machine learning methods to the vast volumes of data that IT systems generate, such as logs and metrics. The main objective of AIOps is to enable companies to identify and resolve IT issues more quickly and effectively.

Key components of AIOps include:

Anomaly detection: the process of spotting unusual patterns in a system's operation that might indicate a problem.
Event correlation: the process of examining data from several sources to determine how events relate to one another and to explain why issues arise.
Automated response: acting to resolve issues without human assistance.

The Significance of AIOps for IT Operations

The rise of hybrid and multi-cloud platforms, microservices architectures, and systems that can expand quickly is complicating IT operations. Conventional IT management tools often can't keep up with the size and speed of the systems that we need to monitor and maintain.

Here are some issues that often come up in standard IT operations:

Manual troubleshooting: IT teams sometimes have to comb through logs and reports by hand to identify the root of issues.
Long resolution times: The longer it takes to resolve a problem after discovery, the more downtime and dissatisfied users result.
Scalability: As systems grow, monitoring every component requires more and more manual labor.

AIOps can help address these challenges by

Improving incident resolution times: By correlating events and providing actionable insights, AIOps helps teams resolve problems in near real time.
Scaling effortlessly: AIOps can handle large volumes of data and events without additional staff, making it ideal for scaling operations.
Automating incident detection and response: AI models can detect issues and automatically resolve them, reducing manual intervention.

You can better understand AIOps by looking at its main components:

1. Machine Learning for Predictive Analytics

AIOps tools forecast future events by applying machine learning to historical data. Predictive analytics, for example, can inform teams when a system's performance is likely to decline, letting them address the issue before it worsens.

2. Automating and Self-Healing

AIOps lets your team automate daily tasks, reducing the need for human intervention. For instance, services can be restarted or resources can be reallocated automatically. This lowers operating costs and shortens problem resolution times.

3. Event Correlation and Root Cause Analysis

Event correlation is the technique of linking events from several related systems to identify the root cause of the problem. For instance, AIOps will examine server, network, and application logs to determine what’s wrong – whether it’s a network problem or a web application failure – and correct it.
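A very simple form of event correlation groups events from different sources that occur close together in time. The Python sketch below illustrates that idea only. The event fields, the 60-second window, and the "earliest event is the candidate root cause" heuristic are assumptions for the example, and real AIOps platforms use far richer signals.

```python
from datetime import datetime, timedelta


def correlate(events: list[dict], window: timedelta) -> list[list[dict]]:
    """Group events that occur within `window` of the previous event."""
    groups: list[list[dict]] = []
    for event in sorted(events, key=lambda e: e["time"]):
        if groups and event["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


events = [
    {"time": datetime(2025, 5, 9, 10, 0, 20), "source": "application", "message": "request timeouts"},
    {"time": datetime(2025, 5, 9, 10, 0, 0), "source": "network", "message": "packet loss on switch"},
    {"time": datetime(2025, 5, 9, 10, 30, 0), "source": "server", "message": "disk usage warning"},
]

for group in correlate(events, window=timedelta(seconds=60)):
    # Treat the earliest event in each group as the candidate root cause
    print(group[0]["source"], "->", [e["source"] for e in group[1:]])
```

In this toy data, the network event precedes the application timeouts by 20 seconds, so the two are grouped and the network event is flagged as the likely origin.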

Getting Started with AIOps

Enhancing your team’s IT operations with AIOps means adding AI-driven tools and procedures to your existing systems. These are the most important steps to start with:

1. Choose an AIOps Tool

There are several AIOps platforms available, each with its own set of features. Some popular AIOps tools include:

Moogsoft: An AIOps platform that uses machine learning for event correlation, anomaly detection, and incident management.
BigPanda: Focuses on automating incident management and root cause analysis.
Splunk IT Service Intelligence: Offers advanced analytics for monitoring and managing IT infrastructure.

When selecting an AIOps tool, consider the following:

Integration with existing tools: Ensure the platform integrates with your current monitoring, logging, and alerting systems.
Scalability: The platform should be able to handle large volumes of data and scale with your organization.
Ease of use: Look for a user-friendly interface and automation capabilities to minimize manual intervention.

2. Implement AIOps in Your IT Environment

These are the steps you’ll need to take to integrate AIOps into your IT operations:

Aggregate data: Collect data from various sources, including computers, network devices, cloud infrastructure, and applications, and consolidate it on one platform.
Determine thresholds and KPIs: Identify the key performance indicators that matter most for your company, such as error rates, system uptime, and response time.
Establish alerts and automation: Configure automatic responses that run when thresholds are crossed, for instance restarting services or allocating more resources.
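The second and third steps can be sketched as a threshold table plus a check that reports which KPIs are out of bounds. The KPI names and limits below are illustrative assumptions, and an AIOps platform would normally manage these for you.

```python
# Upper limits for each KPI (hypothetical names and values)
THRESHOLDS = {
    "error_rate": 0.05,         # fraction of failed requests
    "response_time_ms": 500.0,  # 95th percentile response time
    "cpu_utilization": 0.80,    # fraction of available CPU
}


def breached(kpis: dict) -> list[str]:
    """Return the names of KPIs whose current value exceeds its threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if kpis.get(name, 0.0) > limit
    )


current = {"error_rate": 0.02, "response_time_ms": 740.0, "cpu_utilization": 0.91}
for name in breached(current):
    print(f"Threshold crossed for {name}: trigger alert or automated response")
```

Each breached KPI can then be mapped to an action, such as sending an alert or restarting a service.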

3. Leverage Machine Learning for Anomaly Detection

Machine learning models play a central role in detecting anomalies. They learn from historical data and can flag patterns that deviate from normal behavior. This lets IT teams identify issues early, before they escalate.

Example: A machine learning model may detect a spike in CPU usage that is unusual for a particular time of day, triggering an alert or automatic remediation process, such as scaling the application to add more resources.

import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

# Example dataset (e.g., CPU usage or network traffic over time)
values = np.array([50, 51, 52, 53, 200, 55, 56, 57, 58, 60])
data = values.reshape(-1, 1)  # scikit-learn expects a 2D array

# Initialize Isolation Forest model for anomaly detection
# contamination=0.1 assumes about 10% of the points are outliers
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(data)

# Predict anomalies: -1 indicates anomaly, 1 indicates normal
predictions = model.predict(data)
anomalies = predictions == -1

# Plot the metric and highlight only the points flagged as anomalies
time_steps = np.arange(len(values))
plt.plot(time_steps, values, label="System Metric")
plt.scatter(time_steps[anomalies], values[anomalies], color="red", label="Anomalies")
plt.title("Anomaly Detection in System Metric")
plt.legend()
plt.show()

4. Automate Root Cause Analysis

AIOps platforms can automatically correlate data from various sources to identify the root cause of incidents. For instance, if an application is experiencing high response times, AIOps can check the server logs, network status, and database performance to determine if the issue is due to a server failure, database bottleneck, or network congestion.

import time

import splunklib.client as client
import splunklib.results as results

# Connect and log in to the Splunk server (replace with actual credentials)
service = client.connect(
    host='localhost',
    port=8089,
    username='admin',
    password='password'
)

# Perform a search query to find events related to system issues
search_query = 'search index=main "error" OR "fail" | stats count by sourcetype'

# Run the search
job = service.jobs.create(search_query)

# Wait for the search job to complete
while not job.is_done():
    print("Waiting for results...")
    time.sleep(2)

# Retrieve and process the results (JSONResultsReader requires JSON output)
for result in results.JSONResultsReader(job.results(output_mode='json')):
    if isinstance(result, dict):  # skip diagnostic messages
        print(result)

5. Set Up Automated Responses Using Webhooks

In AIOps, automated incident response is triggered through Webhooks or other messaging systems. For example, when an anomaly is detected, a Webhook can notify a team or initiate a resolution process.

import requests

# Send an alert to a webhook when an anomaly is found
def send_alert_to_webhook(anomaly_detected):
    webhook_url = 'https://your-webhook-url.com'
    payload = {
        "text": "Alert: Anomaly detected! Please review the system metrics immediately."
    }

    if anomaly_detected:
        # A timeout keeps the script from hanging if the endpoint is unreachable
        response = requests.post(webhook_url, json=payload, timeout=10)
        print("Alert sent to webhook")
        return response.status_code
    return None

# Simulate anomaly detection
anomaly_detected = True  # Set to True when an anomaly is found

# Trigger automated response (alert)
status_code = send_alert_to_webhook(anomaly_detected)

if status_code is None:
    print("No anomaly detected; no alert sent")
elif 200 <= status_code < 300:
    print("Webhook triggered successfully")
else:
    print("Failed to trigger webhook")

6. Automate System Cleanup with Ansible (Sample Playbook)

Automatic remediation is a major part of how AIOps resolves issues without human intervention. Here is an example of an Ansible playbook that restarts a service when a system metric exceeds a particular threshold.

- name: Automated Remediation for High CPU Load
  hosts: all
  become: true
  tasks:
    # This reads the 1-minute load average, not a CPU percentage
    - name: Check 1-minute load average
      shell: "top -bn1 | grep load | awk '{printf \"%.2f\", $(NF-2)}'"
      register: cpu_load
      changed_when: false

    # A load average above the number of vCPUs means the host is saturated
    - name: Restart service if CPU load is high
      service:
        name: "your-service-name"
        state: restarted
      when: cpu_load.stdout | float > ansible_processor_vcpus

Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management

Imagine a large-scale e-commerce company that operates in the cloud, hosting its infrastructure on AWS. The company’s platform is supported by hundreds of virtual machines (VMs), microservices, databases, and web servers.

As the company grows, so do the complexities of its IT operations, especially in managing system health, uptime, and performance. The company has a traditional monitoring setup in place using basic cloud-native tools. But as the platform scales, the sheer volume of data (logs, metrics, alerts) overwhelms the IT team, leading to delays in identifying the root cause of issues and resolving them in real time.

Challenges:

Incident overload: With hundreds of alerts coming in daily, the team struggled to prioritize critical incidents, which led to slower resolution times.
Manual processes: Identifying the root cause of issues required manual sifting through logs, which was time-consuming and error-prone.
Scalability issues: As the company scaled its infrastructure, manual intervention became increasingly inefficient, and the system could not dynamically respond to issues without human input.

AIOps implementation:

The company decided to implement an AIOps platform to streamline incident management, automate responses, and predict issues before they occurred.

Step 1: Setting Up Monitoring with Prometheus

First, we need to monitor system performance to collect metrics such as CPU usage and memory consumption. We’ll use Prometheus, an open-source monitoring tool, to collect this data.

Install Prometheus:

First, download and install Prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar -xvzf prometheus-2.27.1.linux-amd64.tar.gz
cd prometheus-2.27.1.linux-amd64/
./prometheus

Then, in a second terminal, install and run Node Exporter (to collect system metrics):

wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
cd node_exporter-1.1.2.linux-amd64/
./node_exporter

Next, configure Prometheus to scrape metrics from Node Exporter:

# Edit prometheus.yml to scrape metrics from the Node Exporter
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

Then stop the running Prometheus process and start it again with the updated configuration:

./prometheus --config.file=prometheus.yml

You can now access Prometheus via http://localhost:9090 to verify that it's collecting metrics.

Step 2: Collecting System Data (CPU Usage)

Now that we have Prometheus collecting metrics, we need to extract CPU usage data (which will be the focus of our anomaly detection) from Prometheus.

Querying Prometheus API for CPU Usage

We’ll use Python to query Prometheus and retrieve CPU usage data (for example, using the node_cpu_seconds_total metric). We’ll fetch the data for the last 30 minutes.

import requests
import pandas as pd
from datetime import datetime, timedelta

# Define the Prometheus URL and the query
prom_url = "http://localhost:9090/api/v1/query_range"
# avg() combines all CPU cores into one series; a 5m window holds
# several samples even at the default 1m scrape interval
query = 'avg(rate(node_cpu_seconds_total{mode="user"}[5m]))'

# Define the start and end times
end_time = datetime.now()
start_time = end_time - timedelta(minutes=30)

# Make the request to Prometheus API
response = requests.get(prom_url, params={
    'query': query,
    'start': start_time.timestamp(),
    'end': end_time.timestamp(),
    'step': 60
}, timeout=10)
response.raise_for_status()

# Each value is a [timestamp, "value"] pair; the value arrives as a string
data = response.json()['data']['result'][0]['values']
timestamps = [item[0] for item in data]
cpu_usage = [float(item[1]) for item in data]

# Create a DataFrame for easier processing
df = pd.DataFrame({
    'timestamp': pd.to_datetime(timestamps, unit='s'),
    'cpu_usage': cpu_usage
})

print(df.head())

Step 3: Anomaly Detection with Machine Learning

To detect anomalies in CPU usage, we’ll use Isolation Forest, a machine learning algorithm from Scikit-learn.

Train an Anomaly Detection Model:

First, install Scikit-learn:

pip install scikit-learn matplotlib

Then you’ll need to train the model using the CPU usage data we collected:

from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt

# Prepare the data for anomaly detection (CPU usage data)
cpu_usage_data = df['cpu_usage'].values.reshape(-1, 1)

# Train the Isolation Forest model (anomaly detection)
model = IsolationForest(contamination=0.05, random_state=42)  # 5% expected anomalies; fixed seed for repeatable results
model.fit(cpu_usage_data)

# Predict anomalies (1 = normal, -1 = anomaly)
predictions = model.predict(cpu_usage_data)

# Add predictions to the DataFrame
df['anomaly'] = predictions

# Visualize the anomalies
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['cpu_usage'], label='CPU Usage')
plt.scatter(df['timestamp'][df['anomaly'] == -1], df['cpu_usage'][df['anomaly'] == -1], color='red', label='Anomaly')
plt.title("CPU Usage with Anomalies")
plt.xlabel("Time")
plt.ylabel("CPU Usage (fraction of CPU time)")
plt.legend()
plt.show()

Step 4: Automating Incident Response with AWS Lambda

When an anomaly is detected (for example, high CPU usage), AIOps can automatically trigger a response, such as scaling up resources.

AWS Lambda for Automated Scaling

Here’s an example of how to use AWS Lambda to scale up EC2 instances when CPU usage exceeds a threshold.

First, create your AWS Lambda function that scales EC2 instances when CPU usage exceeds 80%.

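import boto3

ec2 = boto3.client('ec2')
INSTANCE_ID = 'i-1234567890'  # Replace with your EC2 instance ID

def lambda_handler(event, context):
    # Do nothing unless CPU usage exceeds the threshold (0.8 = 80% CPU usage)
    if event['cpu_usage'] <= 0.8:
        return {
            'statusCode': 200,
            'body': 'CPU usage within limits; no action taken.'
        }

    # The instance type can only be changed while the instance is stopped
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[INSTANCE_ID])
    ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, InstanceType={'Value': 't2.large'})
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

    return {
        'statusCode': 200,
        'body': f'Instance {INSTANCE_ID} scaled up due to high CPU usage.'
    }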

Then you’ll need to trigger the Lambda function. Set up AWS CloudWatch Alarms to monitor the output from the anomaly detection and trigger the Lambda function when CPU usage exceeds the threshold.

Step 5: Proactive Resource Scaling with Predictive Analytics

Finally, using predictive analytics, AIOps can forecast future resource usage and proactively scale resources before problems arise.

Predictive Scaling:

We’ll use a linear regression model to predict future CPU usage and trigger scaling events proactively.

Start by training a predictive model:

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

# Historical data (CPU usage trends)
rng = np.random.default_rng(42)  # fixed seed so the simulation is repeatable
data = pd.DataFrame({
    'timestamp': pd.date_range(start="2023-01-01", periods=100, freq='h'),
    'cpu_usage': rng.normal(50, 10, 100)  # Simulated data
})

X = np.arange(len(data)).reshape(-1, 1)  # Time steps
y = data['cpu_usage']

model = LinearRegression()
model.fit(X, y)

# Predict the next 10 hours (time steps 100 through 109)
future_steps = np.arange(len(data), len(data) + 10).reshape(-1, 1)
future_prediction = model.predict(future_steps)
print("Predicted CPU usage:", future_prediction)

If the predicted CPU usage exceeds a threshold, AIOps can trigger auto-scaling using AWS Lambda or Kubernetes.
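The decision itself can be a small function that looks at the forecast and decides whether to scale. The sketch below is one possible rule, with an assumed 80% threshold and a requirement that several forecast points exceed it so that a single outlier doesn't trigger scaling. The function name and parameters are hypothetical.

```python
def should_scale(predicted_cpu: list[float], threshold: float = 80.0, min_breaches: int = 3) -> bool:
    """Scale only if enough forecast points exceed the threshold."""
    breaches = sum(value > threshold for value in predicted_cpu)
    return breaches >= min_breaches


forecast = [72.0, 78.5, 81.2, 84.0, 86.3]
if should_scale(forecast):
    print("Forecast exceeds threshold: trigger the scaling workflow")
else:
    print("Forecast within limits: no action needed")
```

The result of this check is what you would pass on to the Lambda function or the Kubernetes autoscaling mechanism.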

Results:

Reduced incident resolution time: The average time to resolve incidents dropped from hours to minutes because AIOps helped the team identify issues faster.
Reduced false positives: By using anomaly detection, the system significantly reduced the number of false alerts.
Increased automation: With automated responses in place, the system dynamically adjusted resources in real time, reducing the need for manual intervention.
Proactive issue management: Predictive analytics enabled the team to address potential problems before they became critical, preventing performance degradation.

Conclusion

AIOps is transforming IT operations, enabling companies to build more efficient and responsive systems. By automating routine tasks, identifying issues before they worsen, and providing real-time data, AIOps is changing the role of IT teams.

AIOps can be an effective way to improve system performance, reduce downtime, and streamline your IT processes. You can start small and gradually add more functionality. Over time, you’ll see how AIOps makes your IT environment more efficient and opens it up to new ideas.
