<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Hamdaan Ali - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Hamdaan Ali - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:23:52 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/hamdaan/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Elevate Your Database Game: Supercharging Query Performance with Postgres FDW ]]>
                </title>
                <description>
                    <![CDATA[ Foreign data wrappers (FDWs) make remote Postgres tables feel local. That convenience is exactly why FDW performance surprises are so common. A query that looks like a normal join can execute like a distributed system: rows move across the network, r... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/fdw-pushdown/</link>
                <guid isPermaLink="false">69963f00d35b661838993bd0</guid>
                
                    <category>
                        <![CDATA[ performance ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PostgreSQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Hamdaan Ali ]]>
                </dc:creator>
                <pubDate>Wed, 18 Feb 2026 22:36:48 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771357398917/8db8c3fd-9f16-4631-aa48-2537e8a4cb45.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Foreign data wrappers (FDWs) make remote Postgres tables feel local. That convenience is exactly why FDW performance surprises are so common.</p>
<p>A query that looks like a normal join can execute like a distributed system: rows move across the network, remote statements get executed repeatedly, and the local planner quietly becomes a coordinator. In that world, “fast SQL” is not mainly about CPU or indexes. It’s about <strong>data movement</strong> and <strong>round-trips</strong>.</p>
<p>This handbook covers the mechanism that determines whether a federated query behaves like a clean remote query or a chatty distributed workflow: <strong>pushdown</strong>.</p>
<p>Pushdown is not “moving compute”. Pushdown determines whether filtering, joining, ordering, and aggregation occur at the data source or after the data has already crossed the wire. When pushdown works, the local server receives a reduced result set. When it doesn’t, Postgres often has to fetch broad intermediate sets and finish the work locally.</p>
<p>The chapters ahead will help you build a practical mental model of what is “shippable” in <code>postgres_fdw</code>, why some expressions are blocked, and how to read <code>EXPLAIN (ANALYZE, BUFFERS, VERBOSE)</code> without getting tricked by familiar plan shapes.</p>
<p>After the core method, the handbook covers tuning knobs that matter in production, schema and indexing considerations, benchmarking methodology, monitoring and logging, and a case study that shows what a real pushdown win looks like end-to-end.</p>
<p>The later sections go deeper into advanced shippability edge cases, cost model calibration, and regression-proofing FDW workloads.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-motivation">Motivation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-fdw-basics-without-the-setup-tax">FDW Basics Without the Setup Tax</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pushdown-mechanics">Pushdown Mechanics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shippable-operations-a-deep-dive">Shippable Operations: a Deep Dive</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pushdown-blockers-and-why-they-exist">Pushdown Blockers and Why They Exist</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-reading-explain-like-a-pro">Reading EXPLAIN Like a Pro</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-tune-postgresfdw">How to Tune postgres_fdw</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-schema-and-index-recommendations">Schema and Index Recommendations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-benchmarking-methodology">Benchmarking Methodology</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-monitoring-and-logging">Monitoring and Logging</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-study-refactoring-a-keycloak-coverage-query">Case Study: Refactoring a Keycloak Coverage Query</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-checklist-and-troubleshooting-guide">Checklist and Troubleshooting Guide</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-study-takeaways">Case Study Takeaways</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-operations-a-deeper-dive-into-shippability">Advanced Operations: A Deeper Dive into Shippability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-antipatterns-and-how-to-avoid-them">Common Anti‑Patterns and How to Avoid Them</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-extending-tuning-calibrating-cost-models">Extending Tuning: Calibrating Cost Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-further-case-studies-and-practical-examples">Further Case Studies and Practical Examples</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-monitoring-diagnostics-and-regression-testing">Monitoring, Diagnostics, and Regression Testing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-extended-guidelines-for-advanced-dbas">Extended Guidelines for Advanced DBAs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bringing-it-all-together">Bringing it All Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This handbook assumes basic comfort with Postgres query plans. It builds on <code>EXPLAIN (ANALYZE, BUFFERS)</code> rather than reintroducing SQL fundamentals, indexing, or join algorithms.</p>
<p>The focus here is federated execution: how foreign queries behave, and how to reason about them with the same clarity as local plans.</p>
<p>Here’s what you should already be comfortable with:</p>
<ul>
<li><p>Reading <code>EXPLAIN (ANALYZE, BUFFERS)</code> output and spotting obvious plan smells (row explosions, bad join order, missed indexes).</p>
</li>
<li><p>Basic join mechanics (nested loop, hash join, merge join) and why cardinality estimates matter.</p>
</li>
<li><p>Postgres statistics at a practical level (<code>ANALYZE</code>, correlation, and what “estimated rows vs actual rows” implies).</p>
</li>
</ul>
<p>And here’s what you need to follow along with the examples:</p>
<ul>
<li><p>A Postgres “local” instance that will run <code>postgres_fdw</code> and act as the coordinator.</p>
</li>
<li><p>A Postgres “remote” instance that holds the foreign tables.</p>
</li>
<li><p>Permission on the local side to:</p>
<ul>
<li><p><code>CREATE EXTENSION postgres_fdw;</code></p>
</li>
<li><p>create a <code>SERVER</code> and <code>USER MAPPING</code></p>
</li>
<li><p>create <code>FOREIGN TABLE</code> objects (or permission to use existing ones)</p>
</li>
</ul>
</li>
<li><p>A way to run queries and capture plans:</p>
<ul>
<li><code>psql</code> is enough, and so is any GUI, as long as you can run <code>EXPLAIN (ANALYZE, BUFFERS, VERBOSE)</code>.</li>
</ul>
</li>
</ul>
<p>We won’t go through a long environment setup walkthrough. The examples assume the FDW objects exist and focus on plans and behavior.</p>
<p>We also won’t go into general distributed systems theory. Only the pieces that show up in an FDW plan are used.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>The single most important lesson of this handbook is that <strong>FDW pushdown reduces data movement</strong>. It’s tempting to think of pushdown as merely changing where a calculation happens (“move the work to the remote”). But what really matters is whether the remote server is asked for only the rows you need.</p>
<p>When pushdown is working, the remote server performs the selective join and filtering, and the local Postgres receives a small, already reduced result set. When pushdown fails, the local server becomes a distributed query coordinator: it pulls large intermediate sets over the network and then finishes the heavy lifting locally.</p>
<p>Why does this matter? Because a refactor that makes more of your query shippable to the remote server can slash end‑to‑end latency without changing a single row of output. In the case study we'll explore later, rewriting a query so that the FDW can ship a joined remote query instead of performing multiple foreign scans and local joins reduces runtime from approximately <strong>166 ms to 25 ms</strong>. The business logic did not change – the <em>shape</em> of the work changed.</p>
<p>Below is a simple bar chart illustrating that dramatic drop. The chart uses actual timings from the case study. If you run the experiment yourself, the numbers may differ depending on your hardware and network, but the relative difference should be clear.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771117284661/ecadfc8b-7e45-4122-921d-5b06215d627a.png" alt="Bar chart titled &quot;Query Execution Time: Before vs After Refactor.&quot; The chart shows execution time in milliseconds on the vertical axis. The &quot;Before&quot; bar is much taller, over 160 ms, compared to the &quot;After&quot; bar, which is below 20 ms, indicating a significant improvement in execution time after refactoring." class="image--center mx-auto" width="840" height="630" loading="lazy"></p>
<h2 id="heading-motivation">Motivation</h2>
<p>Foreign data wrappers let you query remote data using the same SQL syntax you use locally. That convenience is exactly why they can be so deceptive.</p>
<p>A federated query may look like a normal join, but under the hood, it behaves like a distributed system: some part of the plan runs on the remote server, some on the local server, and every boundary between them is a network hop. The slow path is rarely “bad SQL” – it’s usually a combination of two things:</p>
<ol>
<li><p><strong>Too many rows are pulled over the network.</strong> Without pushdown, the FDW retrieves a large slice of the remote table and applies your filters and joins locally. This may lead to tens of thousands or millions of rows being shipped across the network when you only needed hundreds or fewer.</p>
</li>
<li><p><strong>Too many round-trips.</strong> If the plan performs a nested loop that drives a foreign scan, it can end up executing the same remote query hundreds or thousands of times. Each call might be fast on its own, but latency adds up.</p>
</li>
</ol>
<p>This isn't speculation. PostgreSQL's documentation makes clear that a foreign table <strong>has no local storage</strong> and that Postgres “asks the FDW to fetch data from the external source” <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a>. There is no local buffer cache or heap storage to hide mistakes. Every row you retrieve must traverse the network at least once. If your plan fetches more rows than it needs, or repeatedly does so, performance can degrade quickly.</p>
<p>That’s why you should treat the Remote SQL shown in <code>EXPLAIN (VERBOSE)</code> as part of your query plan. It tells you exactly what the remote server is being asked to do. If it’s missing your filters or joins, you know the local server will have to finish the job. The rest of this handbook will teach you how to read that plan, how to force pushdown when possible, and how to recognize the signs that something has gone wrong.</p>
<h2 id="heading-fdw-basics-without-the-setup-tax">FDW Basics Without the Setup Tax</h2>
<p>You might be tempted to skip this section if you've already created foreign tables in your own databases. Don't. Understanding the architecture of foreign data wrappers is essential to understanding why pushdown matters.</p>
<h3 id="heading-sqlmed-in-a-nutshell">SQL/MED in a nutshell</h3>
<p>PostgreSQL implements the <strong>SQL/MED</strong> (Management of External Data) standard through its FDW framework. To access a remote Postgres server via <code>postgres_fdw</code>, you perform four steps:</p>
<ol>
<li><p><strong>Install the extension</strong>: <code>CREATE EXTENSION postgres_fdw</code> tells Postgres to load the FDW code.</p>
</li>
<li><p><strong>Create a foreign server</strong>: <code>CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '...', port '...', dbname '...')</code>defines where the remote server resides and how to connect.</p>
</li>
<li><p><strong>Create a user mapping</strong>: <code>CREATE USER MAPPING FOR your_user SERVER foreign_server OPTIONS (user 'remote_user', password '...')</code> tells Postgres how to authenticate on the remote side.</p>
</li>
<li><p><strong>Create a foreign table</strong>: <code>CREATE FOREIGN TABLE remote_table (...) SERVER foreign_server OPTIONS (schema_name '...', table_name '...');</code> defines the columns and references the remote table.</p>
</li>
</ol>
<p>Once you've done that, you can run <code>SELECT</code> statements against the foreign table as if it were local. But the definition hides an important detail: there is no storage associated with that foreign table <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a>. Every time you <code>SELECT</code>, <code>INSERT</code>, <code>UPDATE</code>, or <code>DELETE</code>, the FDW must connect to the remote server, build a remote query, send it, and read the results. This overhead is small for simple queries but becomes critical as queries get more complex.</p>
<h3 id="heading-what-postgresfdw-does-and-does-not-do">What postgres_fdw does and does not do</h3>
<p><code>postgres_fdw</code> does two things for you:</p>
<ol>
<li><p>It builds remote SQL from your query, including pushing down safe filters, joins, sorts, and aggregates when it can.</p>
</li>
<li><p>It fetches rows from the remote server and hands them to the local executor. If some part of your query cannot be executed remotely, the local executor performs that part.</p>
</li>
</ol>
<p>The FDW tries hard to minimize data transfer by sending as much of your <code>WHERE</code> clause as possible to the remote server and by not retrieving unused columns <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a>. It also has a number of tuning knobs that we'll explore later (such as <code>fetch_size</code>, <code>use_remote_estimate</code>, <code>fdw_startup_cost</code>, and <code>fdw_tuple_cost</code><a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>). But the real win often comes from structuring your query so that the FDW can push work down.</p>
<p>There's one last architectural point to keep in mind: the remote server runs with a restricted session environment. In remote sessions opened by <code>postgres_fdw</code>, the <code>search_path</code> is set to <code>pg_catalog</code> only, and <code>TimeZone</code>, <code>DateStyle</code>, and <code>IntervalStyle</code> are set to specific values <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a>. This means that any functions you expect to run remotely must be schema‑qualified or packaged in a way that the FDW can find them. It also underscores why you should not override session settings for FDW connections unless you know exactly what you are doing <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a>.</p>
<h2 id="heading-pushdown-mechanics">Pushdown Mechanics</h2>
<p>At a high level, “pushdown” means pushing as much of your SQL query as possible to the remote server. But the FDW cannot simply send arbitrary SQL. It must be <em>safe</em> and <em>portable</em> for remote evaluation. Postgres uses the term <strong>shippable</strong> to describe expressions and operations that can be evaluated on the foreign server.</p>
<h3 id="heading-what-shippable-means-in-practice">What “shippable” means in practice</h3>
<p>An expression is considered shippable if it meets several conditions:</p>
<ol>
<li><p><strong>It uses built‑in functions, operators, or data types</strong>, or functions/operators from extensions that have been explicitly allow‑listed via the extensions option on the foreign server <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a>. If you use a custom function or an extension that has not been declared, the FDW assumes it cannot run remotely.</p>
</li>
<li><p><strong>It’s marked IMMUTABLE.</strong> Postgres distinguishes between <code>IMMUTABLE</code>, <code>STABLE</code>, and <code>VOLATILE</code> functions. Only immutable functions – those that always return the same output for the same inputs and don’t depend on session state – are candidates for pushdown <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=functions%20in%20such%20clauses%20must,to%20reduce%20the%20risk%20of">[5]</a>. This rule prevents time‑dependent functions, such as <code>now()</code> or <code>random()</code> from being evaluated remotely, because the result might differ between the local and remote servers.</p>
</li>
<li><p><strong>It doesn’t depend on local collations or type conversions</strong>. PostgreSQL’s docs warn that type or collation mismatches can lead to semantic anomalies <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a>. If the FDW cannot guarantee that a comparison behaves identically on both servers, it will refuse to push it down. For example, comparing a <code>citext</code> column to a <code>text</code> constant could be unsafe if the remote server doesn’t have the <code>citext</code> extension installed.</p>
</li>
</ol>
<p>From these rules, you can derive a mental checklist: avoid non‑immutable functions in your <code>WHERE</code> clause, keep your join conditions simple and typed correctly, and list any third‑party extensions you want to use in the foreign server’s extensions option so that they are considered shippable <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a>.</p>
<h3 id="heading-where-pushdown">WHERE pushdown</h3>
<p>If a <code>WHERE</code> clause consists entirely of shippable expressions, it will be included in the remote query. Otherwise, it will be evaluated locally. This matters because pushing a filter down reduces the number of rows returned to the local server.</p>
<p>Consider a predicate like this:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">WHERE</span> created_at &gt;= now() - <span class="hljs-type">interval</span> <span class="hljs-string">'30 days'</span>
</code></pre>
<p>Because <code>now()</code> is volatile (it returns a different value each time it’s called), Postgres cannot assume the remote server will interpret <code>now()</code> the same way. The FDW therefore pulls the entire table and applies the filter locally.</p>
<p>A better approach is to pass a parameter into the query or compute the cutoff timestamp once in the application and embed it into the SQL.</p>
<h3 id="heading-join-pushdown-conditions">Join pushdown conditions</h3>
<p>Joins are the next big lever. When <code>postgres_fdw</code> encounters a join between foreign tables on the <strong>same foreign server</strong>, it will send the entire join to the remote server unless it believes it will be more efficient to fetch the tables individually or unless the tables use different user mappings <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a>.</p>
<p>It applies the same precautions described for <code>WHERE</code> clauses: the join condition must be shippable, and both tables must be on the same server. Cross‑server joins are never pushed down – the FDW will perform them locally.</p>
<h3 id="heading-shippability-decision-tree">Shippability decision tree</h3>
<p>It can be helpful to visualize the shippability rules as a flowchart. Below is a simple decision tree that you can use when inspecting an expression or join clause.</p>
<p>It starts with the question of whether an expression is in a WHERE or JOIN clause. Further decisions are made based on factors like using volatile functions, built-in functions, type mismatches, or cross-server joins. The flowchart concludes with outcomes like "Not shippable, evaluated locally" or "Shippable, included in Remote SQL."</p>
<p>If you reach the left side of the tree, the expression will be evaluated locally. If you reach the right side, the FDW can ship it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771109842865/9dafcd32-c390-487d-8b35-2911d6075b13.png" alt="Flowchart for determining SQL expression shippability. It starts with the question of whether an expression is in a WHERE or JOIN clause. Further decisions are made based on factors like using volatile functions, built-in functions, type mismatches, or cross-server joins. The flowchart concludes with outcomes like &quot;Not shippable, evaluated locally&quot; or &quot;Shippable, included in Remote SQL.&quot;" class="image--center mx-auto" width="8192" height="2404" loading="lazy"></p>
<h2 id="heading-shippable-operations-a-deep-dive">Shippable Operations: a Deep Dive</h2>
<p>Postgres has been expanding what <code>postgres_fdw</code> can be pushed down over several versions. This section walks through each operation class and the conditions required for pushdown.</p>
<h3 id="heading-filters-where-clauses">Filters (WHERE clauses)</h3>
<p>As explained above, simple filters that use built‑in operators and immutable functions are generally pushed down. If you see a <code>Filter:</code> node above a Foreign Scan in your plan, it means some part of your predicate didn’t qualify. Common reasons include using <code>now()</code>, <code>timezone()</code> or other volatile functions, referencing a non‑allow‑listed extension, or comparing different collation settings.</p>
<p>When this happens, the entire table (or at least all rows matching other shippable conditions) is fetched, and the filter is applied locally.</p>
<p><strong>Plan smell:</strong> Look for a Foreign Scan node with a <code>Filter:</code> line directly above it. That means filtering happened locally. Also look for broad Remote SQL such as:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> remote_table <span class="hljs-keyword">WHERE</span> (<span class="hljs-type">name</span> = <span class="hljs-string">'Hamdaan'</span>)
</code></pre>
<p>with no group constraints. That's a sign that the filter was not pushed down.</p>
<h3 id="heading-joins">Joins</h3>
<p>Simple inner joins between foreign tables on the same foreign server are usually pushable. The join condition must satisfy the same shippability rules as filters. If the join involves more than one foreign server, if the join condition uses an unshippable function, or if the foreign tables use different user mappings, the FDW will fetch each table separately and join them locally <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a>. This can lead to large intermediate sets being transferred.</p>
<p><strong>Plan smell:</strong> A Hash Join or Merge Join where both inputs are Foreign Scan nodes indicates that the join was performed locally. Conversely, a single Foreign Scan representing a join and containing the <code>JOIN ... ON</code> clause in Remote SQL indicates that the join was pushed down.</p>
<h3 id="heading-aggregates-group-by-count-sum-and-so-on">Aggregates (GROUP BY, COUNT, SUM, and so on)</h3>
<p>Starting in PostgreSQL 10, aggregates can be pushed to the remote server when possible. The release notes state explicitly: “push aggregate functions to the remote server,” and explain that this <strong>reduces the amount of data that must be transferred from the remote server and offloads aggregate computation</strong> <a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a>.</p>
<p>To qualify, both the grouping expressions and the aggregate functions themselves must be shippable. If the FDW cannot push an aggregate, it will fetch the raw rows and perform the aggregation locally.</p>
<p><strong>Plan smell:</strong> Look for a <code>GroupAggregate</code> node above a Foreign Scan that returns many rows. When the aggregate is pushed down, there will be no local aggregate node. Instead, the Remote SQL will include a <code>GROUP BY</code> clause.</p>
<h3 id="heading-order-by-and-limit">ORDER BY and LIMIT</h3>
<p>Prior to PostgreSQL 12, sorting and limiting were rarely pushed down. In version 12, Etsuro Fujita’s patch allows ORDER BY sorts and LIMIT clauses to be pushed to <code>postgres_fdw</code> foreign servers <strong>in more cases</strong> <a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a>. For the sort or limit to be pushed, the underlying scan must be pushable, and the ordering expression must be shippable. Partitioned queries or complicated join trees may still cause the sort or limit to be applied locally.</p>
<p><strong>Plan smell:</strong> A local Sort or Limit node above a Foreign Scan indicates the operation was not pushed down. Conversely, a Remote SQL statement containing ORDER BY and LIMIT indicates that pushdown succeeded.</p>
<h3 id="heading-distinct">DISTINCT</h3>
<p>Distinct operations can be pushed down when the distinct expression list is shippable. But if the distinct is combined with unshippable expressions, or if the distinct is applied after a join that cannot be pushed down, the FDW will retrieve all rows and perform the distinct locally.</p>
<h3 id="heading-window-functions">Window functions</h3>
<p>In practice, window functions are rarely pushed down through <code>postgres_fdw</code>. They often require ordering or partitioning semantics that are difficult to represent portably. If you see a <code>WindowAgg</code> node in your plan, it’s almost always local. That doesn’t mean you can't use window functions with foreign tables, but you should expect them to incur network and CPU costs.</p>
<h3 id="heading-version-differences">Version differences</h3>
<p>Postgres developers continue to improve the FDW layer. Here are some notable changes by version:</p>
<ol>
<li><p><strong>PostgreSQL 9.6</strong> introduced remote join pushdown and allowed UPDATE/DELETE pushdown. Before 9.6, all joins were local.</p>
</li>
<li><p><strong>PostgreSQL 10</strong> introduced aggregate pushdown, enabling remote GROUP BY and aggregate functions <a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a>.</p>
</li>
<li><p><strong>PostgreSQL 12</strong> expanded ORDER BY and LIMIT pushdown <a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a>.</p>
</li>
<li><p><strong>PostgreSQL 15</strong> added pushdown for certain CASE expressions and other improvements.</p>
</li>
</ol>
<p>If you learned FDW behavior on an older version, revisit your assumptions.</p>
<h2 id="heading-pushdown-blockers-and-why-they-exist">Pushdown Blockers and Why They Exist</h2>
<p>When pushdown fails, it’s not due to bad luck. There’s always a reason grounded in safety or correctness. Here are the most common blockers and how to diagnose them.</p>
<h3 id="heading-nonimmutable-functions">Non‑immutable functions</h3>
<p>Functions marked <code>VOLATILE</code> or <code>STABLE</code> cannot be pushed down because their results may differ between the local and remote server. Examples include <code>now()</code>, <code>random()</code>, <code>current_user</code>, and user‑defined functions that look at session variables or query the database. Even functions you might think are harmless, like <code>age()</code> or <code>clock_timestamp()</code>, can cause pushdown to fail.</p>
<p><strong>Fix:</strong> Compute volatile values in your application or in a CTE before referencing the foreign table. For example, compute timestamp <code>'now' - interval '30 days'</code> as a constant and compare your <code>created_at</code> column against that constant. Alternatively, move the logic into a stored generated column on the remote table.</p>
<h3 id="heading-type-and-collation-mismatches">Type and collation mismatches</h3>
<p>The documentation warns that when types or collations don’t match between the local and remote tables, the remote server may interpret conditions differently <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a>. This is particularly insidious when text comparisons, case‑insensitive collations, or non‑default locale settings are used. If Postgres can't guarantee the same semantics, it will pull rows locally and evaluate the expression.</p>
<p><strong>Fix:</strong> Make sure that your foreign table definition uses the same data types and collations as the remote table. When in doubt, explicitly cast values to a common type.</p>
<h3 id="heading-crossserver-joins">Cross‑server joins</h3>
<p>Joins across different foreign servers cannot be pushed down. The FDW can only ship a join when both tables reside on the same remote server and use the same user mapping <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a>. Otherwise, it will perform two separate scans and join the results locally.</p>
<p><strong>Fix:</strong> If you frequently join tables across servers, consider consolidating the tables on a single server, materializing a view on one side, or pulling the smaller table into a temporary local table before joining.</p>
<h3 id="heading-mixed-local-and-foreign-joins">Mixed local and foreign joins</h3>
<p>A join between a local table and a foreign table will not be pushed down. Even though the foreign side might be pushdown‑eligible, the FDW cannot join it with local data on the remote server. A nested loop with a parameterized foreign scan is the typical pattern here, resulting in many remote calls.</p>
<p><strong>Fix:</strong> Filter or aggregate as much as possible on the foreign side first (via a CTE or by materializing a subset) before joining to local tables.</p>
<h3 id="heading-remote-session-settings-and-search-paths">Remote session settings and search paths</h3>
<p>Because <code>postgres_fdw</code> sets a restricted <code>search_path</code>, <code>TimeZone</code>, <code>DateStyle</code>, and <code>IntervalStyle</code> in remote sessions <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a>, any functions you call must be schema‑qualified or otherwise compatible. If a function relies on the current search path or session settings, it may break or produce different results on the remote side.</p>
<p><strong>Fix:</strong> Schema‑qualify remote functions and ensure that any environment‑dependent logic is safe to execute under the default FDW session settings. If necessary, attach <code>SET search_path</code> or other settings to your remote functions.</p>
<h3 id="heading-troubleshooting-matrix">Troubleshooting matrix</h3>
<p>The table below maps symptoms in your <code>EXPLAIN</code> plan to likely causes and fixes. Use it as a quick diagnostic tool when something looks off.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Symptom in plan</strong></td><td><strong>Likely cause</strong></td><td><strong>Suggested fix</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Foreign Scan has loops much greater than 1</td><td>Parameterized remote lookup caused by nested loop, join conditions not shippable</td><td>Rewrite join so the FDW can ship a single joined query, or batch remote requests via an <code>IN</code> list or temporary table</td></tr>
<tr>
<td>Broad Remote SQL that lacks scope predicates</td><td><code>WHERE</code> clause contains non‑immutable functions or unsupported operators</td><td>Replace volatile functions with constants or allow‑list extension functions, ensure types and collations match</td></tr>
<tr>
<td>Local Hash Join or Merge Join between two foreign tables</td><td>Join could not be pushed down (different servers, user mappings, or unshippable join expression)</td><td>Consolidate tables on one server, align user mappings, or rewrite the join condition</td></tr>
<tr>
<td>Local Sort, Limit, or Unique on top of a Foreign Scan</td><td><code>ORDER BY</code>, <code>LIMIT</code>, or <code>DISTINCT</code> could not be pushed down</td><td>Simplify sort expressions, push filters deeper, check PG version for improvements</td></tr>
<tr>
<td>Plan runs but gives wrong results when pushdown is enabled</td><td>Semantic mismatch due to type/collation differences or remote session settings <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a></td><td>Align types/collations, schema‑qualify functions, use stable session settings</td></tr>
</tbody>
</table>
</div><h2 id="heading-reading-explain-like-a-pro">Reading EXPLAIN Like a Pro</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771117830315/62ca8fde-2638-4ae1-b968-1100ac5251bb.png" alt="SQL execution plan analysis table with columns: exclusive, inclusive, rows x, rows, loops, and node details. Rows display Nested Loop Join, Hash Join, and Seq Scan operations with costs, times, and buffers. Highlighted cells indicate notable metrics." class="image--center mx-auto" width="1579" height="823" loading="lazy"></p>
<p>Many developers skim <code>EXPLAIN</code> plans for local queries, looking at the top nodes and overall cost. For FDW queries, you must invert that habit: read the foreign parts first. The Remote SQL string tells you what the remote server is being asked to do, and the loops field tells you how many times that remote call is executed.</p>
<h3 id="heading-inspect-the-foreign-scan-nodes">Inspect the Foreign Scan nodes</h3>
<p>Start by finding the Foreign Scan node(s). In <code>EXPLAIN (VERBOSE)</code>, each foreign scan includes a line like:</p>
<pre><code class="lang-pgsql">Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> ...
</code></pre>
<p>This line is not a trivial – it’s the actual SQL that will run on the remote server. Read it carefully. Does it include your <code>WHERE</code> predicates? Does it include your join conditions? If not, you know the local server will pick up the slack.</p>
<p>Look at the loops column. If the loops exceed 1, the same remote query is executed multiple times. For example:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">Foreign</span> Scan <span class="hljs-keyword">on</span> <span class="hljs-built_in">public</span>.user_entity  (<span class="hljs-keyword">rows</span>=<span class="hljs-number">1</span> loops=<span class="hljs-number">416</span>)
  Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> id, tenant_id <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.user_entity <span class="hljs-keyword">WHERE</span> enabled <span class="hljs-keyword">AND</span> service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span> <span class="hljs-keyword">AND</span> id = <span class="hljs-meta">$1</span>
</code></pre>
<p>This is the “N+1” problem in disguise. The plan executes the foreign scan once per outer row. Multiply the per‑loop cost by the number of loops to understand why the query is slow. The fix is to rewrite the query so that the join and filters are applied in a single remote call.</p>
<h3 id="heading-recognize-initplan-vs-subplan">Recognize InitPlan vs SubPlan</h3>
<p>An InitPlan runs once and caches its result. A SubPlan can run per outer row. In FDW queries, subplans often drive parameterized remote scans. If you see a SubPlan attached to a nested loop that feeds a foreign scan, suspect a parameterized remote lookup and look for ways to turn it into an InitPlan or merge it into a single remote query.</p>
<h3 id="heading-understand-cte-materialization">Understand CTE materialization</h3>
<p>Common table expressions (CTEs) behave differently depending on whether they are marked <code>MATERIALIZED</code> or <code>NOT MATERIALIZED</code>. A materialized CTE is computed once and stored in a temporary structure, then read by the rest of the query. A non‑materialized CTE is inlined into the parent query, allowing optimizations to span across the boundary.</p>
<p>In PostgreSQL 12 and later, CTEs are inlined by default unless they’re referenced multiple times or explicitly marked <code>MATERIALIZED</code>. Materializing a CTE that contains a foreign scan can freeze a broad remote fetch and prevent later clauses from being pushed down. On the other hand, materialization can prevent repeated remote scans if the CTE is referenced multiple times. Use this lever deliberately to control where remote work happens.</p>
<h3 id="heading-annotated-example">Annotated example</h3>
<p>Let's annotate a simplified excerpt from a real plan. The goal is to show how to quickly read the relevant parts.</p>
<pre><code class="lang-pgsql">Nested <span class="hljs-keyword">Loop</span>  (<span class="hljs-keyword">rows</span>=<span class="hljs-number">414</span> loops=<span class="hljs-number">1</span>)
  -&gt; Hash <span class="hljs-keyword">Join</span>  (<span class="hljs-keyword">rows</span>=<span class="hljs-number">416</span> loops=<span class="hljs-number">1</span>)
       -&gt; <span class="hljs-keyword">Foreign</span> Scan <span class="hljs-keyword">on</span> <span class="hljs-built_in">public</span>.user_entity (<span class="hljs-keyword">rows</span>=<span class="hljs-number">1</span> loops=<span class="hljs-number">416</span>)
            Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> id, tenant_id <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.user_entity <span class="hljs-keyword">WHERE</span> enabled <span class="hljs-keyword">AND</span> service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span> <span class="hljs-keyword">AND</span> id = <span class="hljs-meta">$1</span>
  -&gt; <span class="hljs-keyword">Foreign</span> Scan <span class="hljs-keyword">on</span> <span class="hljs-built_in">public</span>.user_attribute (<span class="hljs-keyword">rows</span>=<span class="hljs-number">671</span> loops=<span class="hljs-number">1</span>)
       Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> ua.user_id, ua.<span class="hljs-keyword">value</span> <span class="hljs-keyword">FROM</span> user_attribute ua <span class="hljs-keyword">JOIN</span> user_entity u <span class="hljs-keyword">ON</span> ua.user_id = u.id <span class="hljs-keyword">JOIN</span> tenant r <span class="hljs-keyword">ON</span> u.tenant_id = r.id <span class="hljs-keyword">WHERE</span> ua.name = <span class="hljs-string">'attribute A'</span> <span class="hljs-keyword">AND</span> r.name = <span class="hljs-string">'demo'</span> <span class="hljs-keyword">AND</span> u.enabled <span class="hljs-keyword">AND</span> u.service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span> <span class="hljs-keyword">AND</span> (g.name = <span class="hljs-string">'keycloak-group-a'</span> <span class="hljs-keyword">OR</span> g.parent_group = <span class="hljs-meta">$1</span>)
</code></pre>
<p>In the old plan, the first Foreign Scan executed 416 times, each time retrieving a single row. The Remote SQL only applies the filter on enabled and service_account_client_link – it doesn’t include the tenant or group scoping. That scoping is applied by the nested loop outside the foreign scan.</p>
<p>In the refactored plan, the second Foreign Scan results from combining user_attribute, user_entity, user_group_membership, keycloak_group, and tenant into a single remote query. It retrieves 671 rows in a single query and includes all relevant filters. There is no repeated remote call. The timing difference is driven by the different loop values and the selectivity of the Remote SQL.</p>
<h2 id="heading-how-to-tune-postgresfdw">How to Tune postgres_fdw</h2>
<p>Once you've structured your query for maximum pushdown, tuning knobs let you squeeze out further performance improvements and adjust planner decisions.</p>
<h3 id="heading-fetchsize">fetch_size</h3>
<p><code>fetch_size</code> controls how many rows <code>postgres_fdw</code> retrieves per network fetch. The default is <code>100</code> rows <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[9]</a>. A small fetch size means more round-trips and lower memory usage. A larger fetch size reduces network overhead at the cost of buffering more rows in memory.</p>
<p>In practice, increasing <code>fetch_size</code> to a few thousand can reduce latency for large result sets. It’s specified either at the foreign server or foreign table level:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">SERVER</span> foreign_server <span class="hljs-keyword">OPTIONS</span> (<span class="hljs-keyword">ADD</span> fetch_size <span class="hljs-string">'1000'</span>);
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">FOREIGN</span> <span class="hljs-keyword">TABLE</span> remote_table <span class="hljs-keyword">OPTIONS</span> (<span class="hljs-keyword">ADD</span> fetch_size <span class="hljs-string">'1000'</span>);
</code></pre>
<h3 id="heading-useremoteestimate">use_remote_estimate</h3>
<p>By default, the planner estimates the cost of foreign scans using local statistics. This can be wildly inaccurate if the foreign table has a different data distribution. Setting <code>use_remote_estimate</code> to true tells <code>postgres_fdw</code> to run <code>EXPLAIN</code> on the remote server to get row count and cost estimates. This can dramatically improve join order selection at the cost of an additional remote query during planning <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>. You can set this per table or per server:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">SERVER</span> foreign_server <span class="hljs-keyword">OPTIONS</span> (<span class="hljs-keyword">SET</span> use_remote_estimate <span class="hljs-string">'true'</span>);
</code></pre>
<h3 id="heading-fdwstartupcost-and-fdwtuplecost">fdw_startup_cost and fdw_tuple_cost</h3>
<p>These cost parameters model the overhead of starting a foreign scan and the cost per row fetched. Adjusting them can influence the planner’s choice of join strategy. A higher <code>fdw_startup_cost</code> discourages the planner from choosing plans with many small foreign scans (which might generate many remote calls). A higher <code>fdw_tuple_cost</code> discourages plans that fetch large numbers of rows <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>. Use these only after you have solid evidence from <code>EXPLAIN</code> and experiments.</p>
<h3 id="heading-analyze-and-analyzesampling">ANALYZE and analyze_sampling</h3>
<p>Running <code>ANALYZE</code> on a foreign table collects local statistics by sampling the remote table <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>. Accurate stats are essential for good estimates when <code>use_remote_estimate</code> is false.</p>
<p>But if the remote table changes frequently, these stats become stale quickly. The <code>analyze_sampling</code> option controls whether sampling happens on the remote side or locally. When <code>analyze_sampling</code> is set to <code>random</code>, <code>system</code>, <code>bernoulli</code>, or <code>auto</code>, <code>ANALYZE</code> will sample rows remotely instead of pulling all rows into the local server<a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>.</p>
<h3 id="heading-extensions">extensions</h3>
<p>The extensions option lists extensions whose functions and operators can be shipped to the remote server <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a>. If you rely on functions from citext, <code>pg_trgm</code>, or other extensions, add them to the server definition:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">SERVER</span> foreign_server <span class="hljs-keyword">OPTIONS</span> (<span class="hljs-keyword">SET</span> extensions <span class="hljs-string">'citext,pg_trgm'</span>);
</code></pre>
<h3 id="heading-a-quick-knob-impact-table">A quick knob impact table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Knob</strong></td><td><strong>Primary effect</strong></td><td><strong>When to change it</strong></td><td><strong>Possible downside</strong></td></tr>
</thead>
<tbody>
<tr>
<td>fetch_size</td><td>Number of rows per fetch</td><td>Result sets are large and latency dominates</td><td>Too large consumes memory</td></tr>
<tr>
<td>use_remote_estimate</td><td>Better row count/cost estimates</td><td>Planner misestimates foreign scans</td><td>Extra remote queries during planning</td></tr>
<tr>
<td>fdw_startup_cost</td><td>Penalty per foreign scan</td><td>Planner chooses many small foreign scans</td><td>Wrong values bias the planner</td></tr>
<tr>
<td>fdw_tuple_cost</td><td>Cost per row fetched</td><td>Planner pulls too many rows</td><td>Mis‑tuned values mislead planner</td></tr>
<tr>
<td>extensions</td><td>Which extension functions are shippable</td><td>Using extension functions in predicates</td><td>Extensions must exist and match on both servers</td></tr>
</tbody>
</table>
</div><h2 id="heading-schema-and-index-recommendations">Schema and Index Recommendations</h2>
<p>Pushdown doesn’t eliminate the need for good indexes. In fact, effective pushdown depends on the remote server having indexes that support the filter and join predicates you’re shipping.</p>
<p>Below are some patterns to watch for in FDW queries and the indexes that support them. You can adapt these to your own schema.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Table</strong></td><td><strong>Access pattern</strong></td><td><strong>Recommended index</strong></td><td><strong>Why</strong></td></tr>
</thead>
<tbody>
<tr>
<td>tenant (remote)</td><td>Filter by tenant.name</td><td>UNIQUE (name) or BTREE (name)</td><td>Resolves tenant ID quickly</td></tr>
<tr>
<td>keycloak_group (remote)</td><td>Filter by name, join by tenant_id, filter on parent_group</td><td>Composite (tenant_id, name) and (parent_group)</td><td>Supports resolving root group and walking one‑level hierarchy</td></tr>
<tr>
<td>user_group_membership (remote)</td><td>Join by user_id, filter by group_id</td><td>BTREE (group_id, user_id)</td><td>Efficiently finds users in a set of groups</td></tr>
<tr>
<td>user_attribute (remote)</td><td>Filter by name, join by user_id</td><td>Composite (name, user_id) (optionally include value)</td><td>Matches “attribute name → users → values” flow</td></tr>
<tr>
<td>user_entity (remote)</td><td>Filter by tenant_id, enabled, service_account_client_link IS NULL, join by id</td><td>Partial index on (tenant_id, id) with predicate on enabled and service_account_client_link IS NULL</td><td>Helps remote planner start from user table when tenant and user filters are applied</td></tr>
<tr>
<td>filtercategory (local)</td><td>Filter by category &amp;&amp; uuid[], join on (entitytype, entityid)</td><td>GIN index on category, BTREE (entitytype, entityid)</td><td>Speeds array overlap checks and join predicate</td></tr>
</tbody>
</table>
</div><p>In general, indexes should reflect the join order you expect the remote planner to use. If your Remote SQL starts with:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">FROM</span> user_attribute ua <span class="hljs-keyword">JOIN</span> user_entity u <span class="hljs-keyword">ON</span> ua.user_id = u.id <span class="hljs-keyword">JOIN</span> user_group_membership ugm <span class="hljs-keyword">ON</span> ...
</code></pre>
<p>ensure that indexes exist on <code>user_attribute(user_id</code>) and <code>user_group_membership(user_id)</code>.</p>
<h2 id="heading-benchmarking-methodology">Benchmarking Methodology</h2>
<p>It’s easy to claim a performance improvement without proper measurement. Here's a repeatable method you can use to benchmark FDW query changes.</p>
<ol>
<li><p><strong>Warm the caches.</strong> Run each query once to load data into the remote buffer cache and the local FDW connection. Discard the timings.</p>
</li>
<li><p><strong>Measure latencies.</strong> Use EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to capture execution times, buffer usage, and remote row counts. Be aware that EXPLAIN ANALYZE adds overhead, so record the raw execution time if possible by running the query directly.</p>
</li>
<li><p><strong>Record remote metrics.</strong> On the remote server, enable pg_stat_statements and track the calls, total_time, and rows for each remote query. This gives you a per‑query breakdown and confirms what Remote SQL is executed.</p>
</li>
<li><p><strong>Control for concurrency and network latency.</strong> Run benchmarks during a quiet period or isolate the test cluster. If your environment has high network latency, record the round‑trip time separately to attribute delays.</p>
</li>
<li><p><strong>Compare apples to apples.</strong> Benchmark the old and new queries under identical conditions. Use the same sample data, same remote server, and same connection settings.</p>
</li>
<li><p><strong>Look at row counts.</strong> The primary goal of pushdown is to reduce the number of rows shipped. Compare the rows column of each Foreign Scan node.</p>
</li>
</ol>
<p>Here's a simple matrix you can use to record your experiments:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Scenario</strong></td><td><strong>What you're testing</strong></td><td><strong>Expected change in Remote SQL</strong></td><td><strong>Metrics to record</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Baseline (old query)</td><td>Starting point: broad remote scans + local joins</td><td>Remote SQL lacks scoping predicates</td><td>p50/p95 latency, remote row count, local sort/hash time</td></tr>
<tr>
<td>Refactor (new query)</td><td>Join + filter pushdown</td><td>Remote SQL includes joins and filters</td><td>Same metrics, plus remote row count</td></tr>
<tr>
<td>Introduce a volatile function</td><td>Pushdown blocker test</td><td>Clause removed from Remote SQL</td><td>Remote row count increases, local filter cost increases</td></tr>
<tr>
<td>Type or collation mismatch</td><td>Semantic risk test</td><td>Remote SQL might change behavior or lose pushdown</td><td>Compare correctness and row counts</td></tr>
<tr>
<td>ORDER/LIMIT pushdown</td><td>Version‑dependent test</td><td>Remote SQL includes ORDER BY, LIMIT</td><td>Sort time shifts to remote. Row count should remain</td></tr>
<tr>
<td>use_remote_estimate on/off</td><td>Planning accuracy test</td><td>Planner uses remote estimates</td><td>Planning time, join order, and runtime difference</td></tr>
</tbody>
</table>
</div><h2 id="heading-monitoring-and-logging">Monitoring and Logging</h2>
<p>In production, you need to know when a query starts misbehaving. There are two places to look: the local server and the remote server.</p>
<h3 id="heading-local-metrics">Local metrics</h3>
<ol>
<li><p><strong>pg_stat_statements.</strong> This extension tracks planning and execution times, row counts, and buffer hits for each query. Look for high total times relative to rows or calls.</p>
</li>
<li><p><strong>Auto Explain or auto_explain.</strong> Turn on <code>auto_explain.log_min_duration_statement</code> to capture slow queries with plans. This will show you the Remote SQL executed and whether the plan changed.</p>
</li>
<li><p><strong>Connection pool metrics.</strong> Monitor connection counts and wait events related to FDW operations (for example, PostgresFdwConnect, PostgresFdwGetResult) as described in the documentation <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=,Extension">[10]</a>.</p>
</li>
</ol>
<h3 id="heading-remote-metrics">Remote metrics</h3>
<ol>
<li><p><strong>pg_stat_statements on the remote server.</strong> This lets you see which Remote SQL queries are being executed, how often, and how long they take. Compare these with the Remote SQL strings in your local EXPLAIN plans.</p>
</li>
<li><p><strong>Server logs.</strong> Increase <code>log_statement</code> or <code>log_min_duration_statement</code> on the remote server to capture long-running remote queries.</p>
</li>
</ol>
<p>Correlating local and remote metrics can reveal patterns such as a new code path causing a surge in remote queries or pushdown failures, leading to heavy remote scans.</p>
<h2 id="heading-case-study-refactoring-a-keycloak-coverage-query">Case Study: Refactoring a Keycloak Coverage Query</h2>
<p>The theory above may seem abstract until you see it play out in practice. Let's walk through a real example inspired by a Keycloak integration.</p>
<p>The original query calculated coverage: given a list of category IDs, it returned the percentage of users who had attributes mapped to those categories and a JSON array of entity counts. The query used a CTE to build a list of scoped users, then joined it with user attributes, category mappings, and a few other tables.</p>
<h3 id="heading-symptom">Symptom</h3>
<p>In a test environment with 100K user records, the query averaged 166 ms. This was slower than expected. Running <code>EXPLAIN (ANALYZE, BUFFERS, VERBOSE)</code> showed two foreign scans on the Keycloak database. The first scanned <code>user_entity</code> 416 times (loops = 416). The second pulled all rows from <code>user_attribute</code> where <code>name = 'attributeA'</code> before filtering by tenant and group locally.</p>
<p>Here's a simplified excerpt (numbers are approximate):</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">Foreign</span> Scan <span class="hljs-keyword">on</span> <span class="hljs-built_in">public</span>.user_entity  (actual <span class="hljs-type">time</span>=<span class="hljs-number">0.117</span>.<span class="hljs-number">.0</span><span class="hljs-number">.117</span> <span class="hljs-keyword">rows</span>=<span class="hljs-number">1</span> loops=<span class="hljs-number">416</span>)
  Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> id, tenant_id <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.user_entity <span class="hljs-keyword">WHERE</span> (enabled <span class="hljs-keyword">AND</span> service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span> <span class="hljs-keyword">AND</span> id = <span class="hljs-meta">$1</span>)
<span class="hljs-keyword">Foreign</span> Scan <span class="hljs-keyword">on</span> <span class="hljs-built_in">public</span>.user_attribute  (actual <span class="hljs-type">time</span>=<span class="hljs-number">41.267</span>.<span class="hljs-number">.80</span><span class="hljs-number">.352</span> <span class="hljs-keyword">rows</span>=<span class="hljs-number">80739</span> loops=<span class="hljs-number">1</span>)
  Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">value</span>, user_id <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.user_attribute <span class="hljs-keyword">WHERE</span> ((<span class="hljs-string">'attributeA'</span> = <span class="hljs-type">name</span>))
</code></pre>
<p>The first scan performed a single-row lookup 416 times. The second scan retrieved 80,739 rows because the only condition pushed down was <code>name = 'attributeA'</code>. Tenant and group scoping occurred locally. That meant 80k rows were transferred over the network and then filtered down to about 671 on the local side.</p>
<h3 id="heading-diagnosis">Diagnosis</h3>
<p>There were two main issues.</p>
<p>First was the N+1 remote calls on user_entity. The join to <code>user_entity</code> was not pushed down, so the plan executed a remote lookup for each row from <code>user_group_membership</code>. This created 416 remote queries.</p>
<p>Second was the unscoped attribute fetch. Because the <code>WHERE</code> clause included <code>user_entity.tenant_id = tenant.id</code> and <code>keycloak_group.name = 'groupA'</code> in a higher CTE, the FDW could not see those predicates when scanning <code>user_attribute</code>. It therefore fetched all rows with <code>name = 'attributeA'</code> and left the tenant and group filters to the local side.</p>
<h3 id="heading-refactor">Refactor</h3>
<p>The fix was to inline the tenant and group joins into the user_attribute scan to avoid the nested-loop pattern. The refactored <code>selected_user_attributes</code> CTE looked like this (simplified for readability):</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">WITH</span> selected_user_attributes <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span> ua.user_id, ua.<span class="hljs-keyword">value</span>
  <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.user_attribute ua
  <span class="hljs-keyword">JOIN</span> <span class="hljs-built_in">public</span>.user_entity u <span class="hljs-keyword">ON</span> u.id = ua.user_id
  <span class="hljs-keyword">JOIN</span> <span class="hljs-built_in">public</span>.user_group_membership ugm <span class="hljs-keyword">ON</span> ugm.user_id = u.id
  <span class="hljs-keyword">JOIN</span> <span class="hljs-built_in">public</span>.keycloak_group g <span class="hljs-keyword">ON</span> g.id = ugm.group_id
  <span class="hljs-keyword">JOIN</span> <span class="hljs-built_in">public</span>.tenant r <span class="hljs-keyword">ON</span> r.id = u.tenant_id
  <span class="hljs-keyword">WHERE</span> ua.name = <span class="hljs-string">'attributeA'</span>
    <span class="hljs-keyword">AND</span> u.enabled
    <span class="hljs-keyword">AND</span> u.service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span>
    <span class="hljs-keyword">AND</span> r.name = <span class="hljs-string">'tenantA'</span>
    <span class="hljs-keyword">AND</span> (g.name = <span class="hljs-string">'groupA'</span> <span class="hljs-keyword">OR</span> g.parent_group = (
         <span class="hljs-keyword">SELECT</span> id <span class="hljs-keyword">FROM</span> <span class="hljs-built_in">public</span>.keycloak_group <span class="hljs-keyword">WHERE</span> <span class="hljs-type">name</span> = <span class="hljs-string">'groupA'</span> <span class="hljs-keyword">AND</span> tenant_id= r.id
    ))
)
</code></pre>
<p>This single query expresses the same scoping logic that previously lived in separate CTEs. Because all the join conditions are on the same foreign server and use built‑in operators, the FDW can push down the entire join. The new plan looked like this:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">Foreign</span> Scan  (actual <span class="hljs-type">time</span>=<span class="hljs-number">7.840</span>.<span class="hljs-number">.7</span><span class="hljs-number">.856</span> <span class="hljs-keyword">rows</span>=<span class="hljs-number">671</span> loops=<span class="hljs-number">1</span>)
  Remote <span class="hljs-keyword">SQL</span>: <span class="hljs-keyword">SELECT</span> ua.user_id, ua.<span class="hljs-keyword">value</span> <span class="hljs-keyword">FROM</span> user_attribute ua <span class="hljs-keyword">JOIN</span> user_entity u <span class="hljs-keyword">ON</span> ua.user_id = u.id <span class="hljs-keyword">JOIN</span> user_group_membership ugm <span class="hljs-keyword">ON</span> ugm.user_id = u.id <span class="hljs-keyword">JOIN</span> keycloak_group g <span class="hljs-keyword">ON</span> g.id = ugm.group_id <span class="hljs-keyword">JOIN</span> tenant r <span class="hljs-keyword">ON</span> u.tenant_id= r.id <span class="hljs-keyword">WHERE</span> ua.name = <span class="hljs-string">'attributeA'</span> <span class="hljs-keyword">AND</span> u.enabled <span class="hljs-keyword">AND</span> u.service_account_client_link <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NULL</span> <span class="hljs-keyword">AND</span> r.name = <span class="hljs-string">'tenantA'</span> <span class="hljs-keyword">AND</span> (g.name = <span class="hljs-string">'groupA'</span> <span class="hljs-keyword">OR</span> g.parent_group = <span class="hljs-meta">$1</span>)
</code></pre>
<p>Only one remote query is executed, and it returns 671 rows. Tenant and group scoping occur on the remote server. There is no nested loop or repeated remote scan. The final runtime dropped to <strong>about 25 ms</strong>.</p>
<h3 id="heading-why-it-improved">Why it improved</h3>
<ol>
<li><p><strong>Fewer rows crossing the network.</strong> The old plan fetched 80k attribute rows and filtered them locally. The new plan fetched only the 671 scoped rows.</p>
</li>
<li><p><strong>No repeated remote calls.</strong> The old plan executed 416 remote scans of <code>user_entity</code>. The new plan performs one joined remote query.</p>
</li>
<li><p><strong>Less local work.</strong> Because the join and filtering happen remotely, the local side no longer hashes or filters large sets.</p>
</li>
</ol>
<h3 id="heading-key-takeaway">Key takeaway</h3>
<p>If you see a Foreign Scan with a high loops count or a Remote SQL that doesn’t contain your filters and joins, you’re leaving performance on the table. Merging filters and joins into a single remote query (subject to shippability rules) often yields orders-of-magnitude improvements.</p>
<h2 id="heading-checklist-and-troubleshooting-guide">Checklist and Troubleshooting Guide</h2>
<p>The following steps summarize how to approach FDW performance tuning:</p>
<ol>
<li><p><strong>Inspect the Remote SQL.</strong> Always run <code>EXPLAIN (VERBOSE)</code> and look at what is being sent to the remote. If your predicates are missing, the FDW isn't pushing them down.</p>
</li>
<li><p><strong>Check loops.</strong> If the loops are greater than 1 on a Foreign Scan, you are paying for repeated remote calls. Rewrite the query or reorder the joins to make the foreign scan run once.</p>
</li>
<li><p><strong>Make predicates shippable.</strong> Replace volatile functions with constants or parameters. Ensure operators and functions are built‑in or explicitly allow‑listed via the extensions option <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a>.</p>
</li>
<li><p><strong>Align types and collations.</strong> Use the same data types and collations on both sides to avoid semantic mismatches <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a>.</p>
</li>
<li><p><strong>Push joins to the same server.</strong> Consolidate tables on one foreign server if possible. Joins across servers cannot be pushed down <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a>.</p>
</li>
<li><p><strong>Use use_remote_estimate when planning seems off.</strong> Enabling remote estimates can improve join order selection <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>.</p>
</li>
<li><p><strong>Tune fetch_size and costs</strong> if your queries transfer many rows. A bigger fetch_size reduces round-trip; adjusting <code>fdw_startup_cost</code> and <code>fdw_tuple_cost</code> influences the planner <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>.</p>
</li>
<li><p><strong>Analyze foreign tables</strong> if you rely on local cost estimates. Keep in mind that stats can get stale quickly <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>.</p>
</li>
<li><p><strong>Monitor both servers.</strong> Use <code>pg_stat_statements</code> on local and remote servers to see how often remote queries run and how long they take.</p>
</li>
<li><p><strong>Test version upgrades.</strong> Each major release improves FDW pushdown semantics (for example, aggregates in 10 <a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a>, ORDER/LIMIT in 12 <a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a>). Retest after upgrading.</p>
</li>
</ol>
<h2 id="heading-case-study-takeaways">Case Study Takeaways</h2>
<p>Querying remote data with PostgreSQL’s <code>postgres_fdw</code> can be fast and convenient if you respect the underlying mechanics. Pushdown is the difference between streaming a trickle of relevant rows and hauling an ocean of data across the network. It isn't simply a matter of moving CPU cycles – it changes how much data moves, how many network round-trip occur, and how much your local server has to do.</p>
<p>The rules may seem restrictive – use only immutable functions, avoid cross‑server joins, align types and collations – but they exist to preserve correctness while enabling optimization.</p>
<p>By reading <code>EXPLAIN</code> from the bottom up, inspecting the Remote SQL, and understanding the shippability rules, you can spot slow patterns quickly. Armed with tuning knobs like <code>fetch_size</code> and <code>use_remote_estimate</code>, and a willingness to rewrite queries to make joins and filters pushable, you can often achieve dramatic performance gains without touching your hardware.</p>
<p>This case study shows that rewriting a query to enable a single-joined remote query reduced runtime from around <strong>166 ms to 25 ms</strong>. That sort of improvement is not rare. It’s what happens when you treat FDW queries as distributed queries rather than local queries in disguise.</p>
<p>The next time you debug a slow FDW query, remember this handbook. Check the Remote SQL. Count the loops. Ask yourself: “Am I doing the work close to the data, or am I bringing the data to the work?” Adjust accordingly, and you'll write queries that make the most of Postgres's federated capabilities while keeping your latency in check.</p>
<p>This section closes the case study loop and summarizes exactly what changed in the plan and why it produced a large end-to-end win. The following sections of the handbook turn that single win into a repeatable method: how Postgres determines what is shippable, how to quickly read FDW plans, which operations and versions matter, and how to debug common failure modes that prevent pushdown.</p>
<h2 id="heading-advanced-operations-a-deeper-dive-into-shippability">Advanced Operations: A Deeper Dive into Shippability</h2>
<p>The previous sections introduced the basic rules around what can be pushed to the remote and why. To really make sense of those rules, you need to see how they play out on the operations you use every day.</p>
<p>This section walks through filters, joins, aggregates, ordering, and limits, DISTINCT queries, and window functions in more detail. By the end, you should have a mental map of which operations to trust and which to double‑check when reading your plans.</p>
<h3 id="heading-filters-and-simple-predicates">Filters and simple predicates</h3>
<h4 id="heading-where-clauses-matter-more-than-you-think">WHERE clauses matter more than you think</h4>
<p>When you specify <code>WHERE attribute = 'value'</code> on a foreign table, the FDW will happily transmit that predicate to the remote server as long as the comparison uses built‑in types and immutable operators. For example:</p>
<ul>
<li><p><code>WHERE id = 42</code> is fine</p>
</li>
<li><p><code>WHERE lower(username) = 'hamdaan'</code> is fine if <code>lower()</code> is allow‑listed and immutable</p>
</li>
<li><p><code>WHERE created_at &gt;= now() - interval '7 days'</code> is not shippable because <code>now()</code> is volatile</p>
</li>
</ul>
<p>When such a predicate cannot be pushed, the FDW will fetch every row that matches all the shippable predicates and apply the rest locally. That means that a seemingly innocuous call to <code>now()</code> can blow up your network traffic.</p>
<p>The lesson is simple: compute volatile values up front (in your application or in a CTE) and reference them as constants in the query against the foreign table.</p>
<h4 id="heading-complex-expressions-are-not-automatically-unsafe">Complex expressions are not automatically unsafe</h4>
<p>Suppose you have <code>WHERE (status = 'active' AND (age BETWEEN 18 AND 29 OR age &gt; 65))</code>. This entire expression is shippable because it uses built‑in boolean logic, simple comparisons, and immutable operators. The FDW will deparse it into remote SQL and forward it. You only need to worry when one of the subexpressions introduces a function or operator that the FDW doesn’t recognize or cannot safely assume exists on the remote.</p>
<p>A good heuristic is: if you can express your filter using only simple comparisons, boolean logic, and built‑in functions, pushdown should work. When in doubt, check the Remote SQL.</p>
<h4 id="heading-array-and-json-operators">Array and JSON operators</h4>
<p>Modern Postgres makes heavy use of array and JSON functions. Many of these functions, like the array overlap operator <code>&amp;&amp;</code> used in the case study, are built‑in and can be shipped. But some JSON functions are provided by extensions (like <code>jsonb_path_query</code> or functions from the <code>pgjson</code> family).</p>
<p>If your filter uses one of these, ensure that the extension is available and allow‑listed on the foreign server. Otherwise, the FDW will fetch rows and perform the JSON logic locally. This is rarely what you want when dealing with large JSON columns.</p>
<h3 id="heading-joins-the-good-the-bad-and-the-ugly">Joins: the good, the bad, and the ugly</h3>
<h4 id="heading-sameserver-joins-are-your-friend">Same‑server joins are your friend</h4>
<p>If you join multiple foreign tables that are all defined on the same foreign server and user mapping, and if the join condition uses only shippable expressions, then the FDW can generate a single remote join. This is the ideal case.</p>
<p>For example, joining orders and customers on <code>orders.customer_id = customers.id</code> is pushable, as long as both tables reside on the same foreign server. The remote planner will use its own statistics and indexes to plan the join, and the local server will simply iterate through the result. Postgres 9.6 and later support this pattern <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a>.</p>
<h4 id="heading-crossserver-joins-break-pushdown">Cross‑server joins break pushdown</h4>
<p>If you attempt to join two foreign tables that live on different servers (or even on the same remote server but with different user mappings), postgres_fdw will fetch the tables separately and join them locally. This is almost always slower than pushing the join down, because you end up transferring both tables in their entirety.</p>
<p>The FDW design team chose not to support cross‑server joins because there is no portable way to tell two remote servers to cooperate on a join. Your options are: replicate one table on the other server, materialize the smaller table locally before joining, or restructure the query to filter aggressively on each side before joining locally.</p>
<h4 id="heading-mixed-localforeign-joins-are-tricky">Mixed local/foreign joins are tricky</h4>
<p>Joining a local table to a foreign table cannot be pushed down, for straightforward reasons: the remote server has no access to your local data. A common pattern that triggers repeated remote calls looks like this:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> u.id, a.<span class="hljs-keyword">value</span>
<span class="hljs-keyword">FROM</span> users u
<span class="hljs-keyword">LEFT JOIN</span> user_attribute a
  <span class="hljs-keyword">ON</span> a.user_id = u.id <span class="hljs-keyword">AND</span> a.name = <span class="hljs-string">'favorite_color'</span>;
</code></pre>
<p>If <code>users</code> is a local table and <code>user_attribute</code> is foreign, the plan may use a nested loop: for each local u, it executes a remote lookup in user_attribute to retrieve attributes.</p>
<p>The fix is to flip the query: retrieve all relevant rows from <code>user_attribute</code> in one remote scan, then join them locally. Or, if possible, create a small temporary table on the remote side with your u.id values, perform the join entirely remotely, and then fetch the results.</p>
<h4 id="heading-join-conditions-matter">Join conditions matter</h4>
<p>Even when joining two foreign tables on the same server, an unshippable join condition will force the join to be local. For example, <code>JOIN ON textcol ILIKE '%foo%'</code> is not pushable because <code>ILIKE</code> might not exist or behave identically on the remote.</p>
<p>If you need case‑insensitive matching, consider lowercasing both sides: <code>LOWER(textcol) = 'foo'</code> (assuming the remote server has the <code>lower()</code> function available and allowed). Similarly, joining on a cast expression (for example, <code>JOIN ON CAST(a.id AS text) = b.text_id</code>) can block pushdown. Define your columns with matching types instead.</p>
<h3 id="heading-aggregates-and-grouping">Aggregates and grouping</h3>
<p>Aggregates are where the data movement story shines. When you can push down a <code>GROUP BY</code> and aggregate functions like <code>COUNT</code>, <code>SUM</code>, <code>AVG</code>, or <code>MAX</code>, you reduce the result set to just the aggregated rows. This can be a difference of several orders of magnitude.</p>
<p>Postgres 10 introduced aggregate pushdown <a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a>. But not all aggregates are equal:</p>
<p><strong>Simple aggregates</strong> such as <code>COUNT(*)</code>, <code>SUM(col)</code>, <code>AVG(col)</code>, <code>MIN(col)</code>, and <code>MAX(col)</code> are shippable when applied to shippable expressions. Even <code>COUNT(DISTINCT col)</code> is often shippable, because the remote can deduplicate before counting. The FDW will wrap the aggregate in a remote query and return just the aggregated row.</p>
<p>If you see a GroupAggregate node on the local side, check whether all involved columns and functions are shippable. If they are, ensure that the join conditions above are also pushable.</p>
<p><strong>Filtered aggregates</strong> such as <code>COUNT(*) FILTER (WHERE x &gt; 5) or SUM(col) FILTER (WHERE status = 'active')</code> are often pushable, because they translate into <code>SUM(CASE WHEN condition THEN col ELSE 0 END) or COUNT(...)</code>. As long as the filter is shippable, the FDW will push it into the remote aggregate.</p>
<p><strong>User‑defined aggregates</strong> are rarely pushable. If you have a custom aggregate function, the FDW will not assume that it exists or behaves the same on the remote server. Even if you install the function on both servers, postgres_fdw won't push it unless the function is in an allow‑listed extension.</p>
<p><strong>Grouping sets and rollups</strong> are not currently pushable. When you write <code>GROUP BY GROUPING SETS (...) or ROLLUP(...)</code>, Postgres will compute the grouping locally even if the underlying scan is remote.</p>
<p>If you need complex rollups, consider performing them in two steps: push down the initial grouping to the remote server to reduce rows, then perform the rollup locally.</p>
<h3 id="heading-order-by-limit-and-distinct">ORDER BY, LIMIT, and DISTINCT</h3>
<p>Ordering and limiting rows may seem like purely cosmetic features, but they affect how much data is transferred. If the remote can sort and limit, the local server only receives the top N rows. If it cannot, the local server must sort everything.</p>
<p>Postgres 12 expanded the cases where <code>ORDER BY</code> and LIMIT are pushed down <a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a>. Here are guidelines:</p>
<ul>
<li><p><strong>Single foreign scan with simple sort:</strong> If your query selects from one foreign table and sorts by a shippable expression (for example, <code>ORDER BY created_at DESC</code>), the FDW will include <code>ORDER BY</code> in Remote SQL. It will also push down <code>LIMIT</code> and <code>OFFSET</code>. This is ideal because the remote server does the sort and sends only the top rows.</p>
</li>
<li><p><strong>Sort after join:</strong> If you sort after joining two foreign tables on the same server, and the join and sort expressions are shippable, the FDW may push both down. But if the sort requires columns from the local side or from a different remote server, the FDW cannot push it down.</p>
</li>
<li><p><strong>Sort after aggregation:</strong> Sorting aggregated results is often pushable as long as the aggregate itself is pushable. But when grouping occurs locally, the sort remains local.</p>
</li>
<li><p><strong>DISTINCT behaves like GROUP BY.</strong> If the distinct expression list is shippable, the FDW can push it down. If you write <code>SELECT DISTINCT ON (col1) col2, col3 FROM ...</code> and col3 is not part of the <code>DISTINCT</code> list, Postgres will treat this as <code>GROUP BY</code> and may push it. Be aware that <code>DISTINCT ON</code> semantics differ from plain <code>DISTINCT</code> and may not be pushable in older Postgres versions.</p>
</li>
</ul>
<h3 id="heading-window-functions-1">Window functions</h3>
<p>Window functions (for example, <code>ROW_NUMBER() OVER (PARTITION BY ...), RANK(), LAG(), LEAD()</code>) rely on ordering and partitioning across rows.</p>
<p>Postgres has not yet taught <code>postgres_fdw</code> how to push window functions. When you see a WindowAgg node in your plan, it’s almost always local. The FDW will fetch the rows, and the local server will sort, partition, and compute the window. If you need to run window functions on remote data, plan to transfer the data locally.</p>
<h3 id="heading-versionspecific-quirks">Version‑specific quirks</h3>
<p>The exact pushdown capabilities vary by release. When planning migrations or deciding whether to rely on a pushdown behavior, check the release notes:</p>
<ul>
<li><p><strong>9.6:</strong> first version to support pushdown of joins and sorts, and remote updates and deletes.</p>
</li>
<li><p><strong>10:</strong> introduced aggregate pushdown <a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a>, significantly reducing network use for <code>GROUP BY</code> queries.</p>
</li>
<li><p><strong>11:</strong> improved partition pruning and join ordering for foreign tables.</p>
</li>
<li><p><strong>12:</strong> expanded <code>ORDER BY</code> and <code>LIMIT</code> pushdown <a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a>.</p>
</li>
<li><p><strong>15:</strong> added pushdown for simple <code>CASE</code> expressions and additional built‑in functions.</p>
</li>
<li><p><strong>17</strong> (development at the time of writing) continues to expand shippable constructs. Always test on your target version because subtle improvements can change what the FDW can ship.</p>
</li>
</ul>
<h2 id="heading-common-antipatterns-and-how-to-avoid-them">Common Anti‑Patterns and How to Avoid Them</h2>
<p>Everyone has run into FDW queries that seemed reasonable but turned out to be bottlenecks. Here are a few of the most common mistakes and how to correct them. These examples are deliberately simplified – so you can adapt them to your schema.</p>
<h3 id="heading-using-volatile-functions-in-predicates">Using volatile functions in predicates</h3>
<p><strong>Anti‑pattern:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> audit_logs
<span class="hljs-keyword">WHERE</span> event_ts &gt;= now() - <span class="hljs-type">interval</span> <span class="hljs-string">'1 day'</span>;
</code></pre>
<p><code>now()</code> is a volatile function, so the FDW refuses to push this predicate. It pulls all rows from audit_logs and filters them locally.</p>
<p><strong>Better:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> audit_logs
<span class="hljs-keyword">WHERE</span> event_ts &gt;= <span class="hljs-meta">$1</span>;
</code></pre>
<p>Compute <code>$1</code> (a timestamp) in your application or upstream query. Or compute it once in a CTE:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">WITH</span> cutoff <span class="hljs-keyword">AS</span> (<span class="hljs-keyword">SELECT</span> now() - <span class="hljs-type">interval</span> <span class="hljs-string">'1 day'</span> <span class="hljs-keyword">AS</span> ts) <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> audit_logs, cutoff <span class="hljs-keyword">WHERE</span> event_ts &gt;= cutoff.ts;
</code></pre>
<p>The FDW sees a constant and pushes the predicate.</p>
<h3 id="heading-joining-local-and-foreign-data-first">Joining local and foreign data first</h3>
<p><strong>Anti‑pattern:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> u.email, ua.<span class="hljs-keyword">value</span>
<span class="hljs-keyword">FROM</span> users u
<span class="hljs-keyword">LEFT JOIN</span> user_attribute ua <span class="hljs-keyword">ON</span> u.id = ua.user_id <span class="hljs-keyword">AND</span> ua.name = <span class="hljs-string">'favorite_movie'</span>;
</code></pre>
<p>This uses a local table (users) to drive a join to a foreign table (user_attribute). The FDW receives 10,000 individual remote queries if users have 10,000 rows. Each call fetches one or zero rows from user_attribute.</p>
<p><strong>Better:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Fetch all favorite movies remotely and join locally</span>
<span class="hljs-keyword">WITH</span> remote_movies <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> ua.user_id, ua.<span class="hljs-keyword">value</span>
  <span class="hljs-keyword">FROM</span> user_attribute ua
  <span class="hljs-keyword">WHERE</span> ua.name = <span class="hljs-string">'favorite_movie'</span>
)
<span class="hljs-keyword">SELECT</span> u.email, rm.<span class="hljs-keyword">value</span>
<span class="hljs-keyword">FROM</span> users u
<span class="hljs-keyword">LEFT JOIN</span> remote_movies rm <span class="hljs-keyword">ON</span> u.id = rm.user_id;
</code></pre>
<p>Now the FDW issues one query to fetch all relevant attributes, and the join is done locally in one pass.</p>
<h3 id="heading-crossserver-joins-without-materialization">Cross‑server joins without materialization</h3>
<p><strong>Anti‑pattern:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> remote_db1.orders o
<span class="hljs-keyword">JOIN</span> remote_db2.customers c <span class="hljs-keyword">ON</span> o.customer_id = c.id;
</code></pre>
<p>This is not pushable because the two tables are on different foreign servers. Postgres will fetch orders and customers separately and join them locally. If orders have 1 million rows and customers have 50,000 rows, you will transfer 1.05 million rows.</p>
<p><strong>Better:</strong> Replicate or materialize one side on the other server (or locally) before joining. For example, create a materialized view m_customers on remote_db1 containing just the id and name of the customers you need, then join orders and m_customers on the same server. Alternatively, copy customers into a temporary table on the local server and join there.</p>
<h3 id="heading-complex-expressions-on-join-keys">Complex expressions on join keys</h3>
<p><strong>Anti‑pattern:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> remote_table a
<span class="hljs-keyword">JOIN</span> remote_table b <span class="hljs-keyword">ON</span> CAST(a.key <span class="hljs-keyword">AS</span> <span class="hljs-type">text</span>) = b.key_text;
</code></pre>
<p>Casting a numeric key to text prevents pushdown. The remote server cannot use indexes and must return both tables. The local server performs the join and cast.</p>
<p><strong>Better:</strong> Align your schemas so that the join columns use the same type. If you cannot change the schema, create a computed column on the remote server with the appropriate type and use it in the join.</p>
<h3 id="heading-ignoring-collation-and-type-mismatches">Ignoring collation and type mismatches</h3>
<p><strong>Anti‑pattern:</strong></p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> remote_table
<span class="hljs-keyword">WHERE</span> citext_col = <span class="hljs-string">'abc'</span>;
</code></pre>
<p>If the remote server doesn’t have the citext extension installed, the comparison semantics will differ, and the FDW will refuse to ship the filter. This appears harmless until you see the plan and realize all rows were fetched.</p>
<p><strong>Better:</strong> Install the same extensions and collations on the remote server, or convert the column to a base type like text on both sides.</p>
<h2 id="heading-extending-tuning-calibrating-cost-models">Extending Tuning: Calibrating Cost Models</h2>
<p>Earlier, we discussed <code>fetch_size</code>, <code>use_remote_estimate</code>, and the cost knobs. This section expands on how to use them strategically.</p>
<h3 id="heading-balancing-fetch-size-and-memory">Balancing fetch size and memory</h3>
<p><code>fetch_size</code> controls how many rows the FDW asks for in each round trip <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[9]</a>. Think of it as the batch size. The default (100) works well for small result sets. If you expect to retrieve tens of thousands of rows, a higher fetch size reduces the overhead of many network requests. But there are trade‑offs:</p>
<ul>
<li><p><strong>Memory consumption:</strong> Each foreign scan buffers rows until they are consumed. A huge fetch size (for example, 10,000) may allocate more memory than you expect, especially when multiple scans run concurrently. Monitor memory usage as you increase this setting.</p>
</li>
<li><p><strong>Latency hiding:</strong> If network latency is high, overlapping network requests with local processing can hide some latency. But <code>postgres_fdw</code> does not pipeline multiple fetches – it waits for one batch before requesting the next. This means that a larger batch size reduces the number of waits, but cannot overlap them. If you operate across data centers, consider using a connection pooler or caching layer instead of just increasing fetch_size.</p>
</li>
</ul>
<h3 id="heading-remote-estimates-vs-local-estimates">Remote estimates vs. local estimates</h3>
<p>The planner uses statistics to estimate how many rows each node will produce, which in turn influences join order. When <code>use_remote_estimate</code> is false (the default), the planner guesses based on local stats collected by <code>ANALYZE</code> on the foreign table. This can be wrong if the remote table has a different distribution than the local sample, or if the table has changed since the last <code>ANALYZE</code>.</p>
<p>Setting <code>use_remote_estimate</code> to true instructs the FDW to run <code>EXPLAIN</code> on the remote server during planning to obtain row counts and cost estimates <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>. This can improve join ordering, especially when joining multiple foreign tables or mixing local and foreign tables. The downside is increased planning time because each remote estimate runs an extra query.</p>
<p>In practice:</p>
<ul>
<li><p>Enable <code>use_remote_estimate</code> on queries with complex joins where the planner picks obviously wrong join orders. If enabling it improves the plan, consider leaving it on for that server or table.</p>
</li>
<li><p>Use <code>ANALYZE</code> on foreign tables periodically if your remote data is relatively static. This populates local stats and can avoid the overhead of remote estimates.</p>
</li>
<li><p>Don’t enable <code>use_remote_estimate</code> indiscriminately on simple lookups. The cost of additional round-trip remote flights may outweigh the benefit.</p>
</li>
</ul>
<h3 id="heading-tuning-cost-parameters">Tuning cost parameters</h3>
<p><code>fdw_startup_cost</code> and <code>fdw_tuple_cost</code> control how much the planner thinks it costs to start a foreign scan and fetch each row <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a>. If these are too low, the planner may choose a nested loop that generates many small remote calls. If they are too high, the planner might avoid remote scans even when they are efficient.</p>
<p>You can adjust these parameters based on empirical measurement:</p>
<ul>
<li><p>Increase <code>fdw_startup_cost</code> to discourage the planner from using nested loops that call the remote table repeatedly. You might set it to the average cost of a round-trip remote.</p>
</li>
<li><p>Increase <code>fdw_tuple_cost</code> if network bandwidth is limited or expensive. This indicates to the planner that each remote row incurs higher fetch costs than a local row. The planner will prefer plans that filter early on the remote side.</p>
</li>
</ul>
<p>Always adjust these settings gradually and observe the effect on the plan. Keep separate settings per foreign server if network conditions differ.</p>
<h3 id="heading-when-to-analyze-foreign-tables">When to analyze foreign tables</h3>
<p>Running <code>ANALYZE</code> on a foreign table collects sample statistics by pulling a subset of rows from the remote server. This helps the planner estimate row counts when <code>use_remote_estimate</code> is off. It also helps decide whether to use an index on the remote side. You should analyze foreign tables when:</p>
<ul>
<li><p>The remote table is large and static, and you want accurate local estimates without the overhead of remote estimates.</p>
</li>
<li><p>You have just defined a foreign table, and the default stats are empty.</p>
</li>
<li><p>You changed the extensions allow‑list to enable more pushdown and want the planner to see the effect.</p>
</li>
</ul>
<p>Conversely, if the remote data changes constantly, <code>ANALYZE</code> results will quickly become stale. In that case, rely on use_remote_estimate instead.</p>
<h2 id="heading-further-case-studies-and-practical-examples">Further Case Studies and Practical Examples</h2>
<p>The Keycloak coverage example is not the only place where pushdown matters. The following scenarios illustrate other patterns you may encounter.</p>
<h3 id="heading-reporting-on-a-sharded-logging-system">Reporting on a sharded logging system</h3>
<p>Imagine you store application logs across multiple shards, each a separate Postgres database. You want to produce a report of the number of error logs per service per day.</p>
<p>A naïve approach might join all shards in one query:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> shard, service, date_trunc(<span class="hljs-string">'day'</span>, log_time) <span class="hljs-keyword">AS</span> day, COUNT(*)
<span class="hljs-keyword">FROM</span> shard1.logs
<span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
<span class="hljs-keyword">SELECT</span> shard, service, date_trunc(<span class="hljs-string">'day'</span>, log_time) <span class="hljs-keyword">AS</span> day, COUNT(*)
<span class="hljs-keyword">FROM</span> shard2.logs
...;
</code></pre>
<p>This approach will fetch all log rows to the local server and aggregate them locally. A better solution is to push the grouping to each shard:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> shard, service, day, sum(count)
<span class="hljs-keyword">FROM</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> shard, service, date_trunc(<span class="hljs-string">'day'</span>, log_time) <span class="hljs-keyword">AS</span> day, COUNT(*) <span class="hljs-keyword">AS</span> count
  <span class="hljs-keyword">FROM</span> shard1.logs
  <span class="hljs-keyword">WHERE</span> log_time &gt;= <span class="hljs-meta">$1</span> <span class="hljs-keyword">AND</span> log_time &lt; <span class="hljs-meta">$2</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> service, day
  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
  <span class="hljs-keyword">SELECT</span> <span class="hljs-number">2</span> <span class="hljs-keyword">AS</span> shard, service, date_trunc(<span class="hljs-string">'day'</span>, log_time) <span class="hljs-keyword">AS</span> day, COUNT(*)
  <span class="hljs-keyword">FROM</span> shard2.logs
  <span class="hljs-keyword">WHERE</span> log_time &gt;= <span class="hljs-meta">$1</span> <span class="hljs-keyword">AND</span> log_time &lt; <span class="hljs-meta">$2</span>
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> service, day
  ...
) x
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> shard, service, day;
</code></pre>
<p>Here, each foreign server returns a small set of aggregated rows instead of raw logs. The outer aggregation sums across shards. This pattern generalizes: push grouping and filtering to the remote side, then combine locally.</p>
<h3 id="heading-combining-remote-and-local-data-for-analytics">Combining remote and local data for analytics</h3>
<p>Suppose you have a local table <code>users</code> and a remote table <code>orders</code>. You want to compute the average order amount per user segment. A naïve query might look like:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> u.segment, AVG(o.amount)
<span class="hljs-keyword">FROM</span> users u
<span class="hljs-keyword">JOIN</span> orders o <span class="hljs-keyword">ON</span> o.user_id = u.id
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> u.segment;
</code></pre>
<p>This is a local join driving a remote nested loop. The better approach is to aggregate orders remotely by user_id and join on the small result:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">WITH</span> remote_totals <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> user_id, SUM(amount) <span class="hljs-keyword">AS</span> total, COUNT(*) <span class="hljs-keyword">AS</span> n
  <span class="hljs-keyword">FROM</span> orders
  <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> user_id
)
<span class="hljs-keyword">SELECT</span> u.segment, AVG(rt.total / rt.n)
<span class="hljs-keyword">FROM</span> users u
<span class="hljs-keyword">JOIN</span> remote_totals rt <span class="hljs-keyword">ON</span> u.id = rt.user_id
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> u.segment;
</code></pre>
<p>This pushes the heavy aggregation to the remote and transfers only one row per user. The local join then groups by segment. As with other examples, the key is to reduce remote rows before they cross the network.</p>
<h3 id="heading-avoiding-pushdown-for-correctness">Avoiding pushdown for correctness</h3>
<p>There are legitimate cases where you should <em>prevent</em> pushdown because of semantic differences. Postgres allows you to do this by adding <code>OFFSET 0</code> or wrapping the foreign table in a CTE.</p>
<p>For example, if a built‑in function behaves differently on the remote due to a version mismatch, you can force local evaluation:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">WITH</span> local_eval <span class="hljs-keyword">AS</span> (<span class="hljs-keyword">SELECT</span>  <span class="hljs-keyword">FROM</span> remote_table)  <span class="hljs-comment">-- CTE prevents pushdown</span>
<span class="hljs-keyword">SELECT</span> 
<span class="hljs-keyword">FROM</span> local_eval
<span class="hljs-keyword">WHERE</span> some_complex_expression(local_eval.col) &gt; <span class="hljs-number">0</span>;
</code></pre>
<p>Alternatively, a <code>WHERE</code> clause like <code>random() &lt; 0.1</code> will not push down because <code>random()</code> is volatile – you don't need to force it. But adding <code>OFFSET 0</code> is a simple hack that prevents any pushdown:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> remote_table <span class="hljs-keyword">OFFSET</span> <span class="hljs-number">0</span>;
</code></pre>
<p>Knowing how to disable pushdown intentionally helps you debug. If a query returns different results when pushdown occurs, suspect type/collation mismatches or remote session settings <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a>.</p>
<h2 id="heading-monitoring-diagnostics-and-regression-testing">Monitoring, Diagnostics, and Regression Testing</h2>
<p>Monitoring doesn't end at counting remote rows. To make pushdown reliable in production, you need to set up mechanisms to detect regressions and gather evidence when performance changes.</p>
<h3 id="heading-automate-explain-regression-tests">Automate EXPLAIN regression tests</h3>
<p>In addition to unit tests and integration tests, you can add tests that assert the shape of your plans. For instance, if a mission‑critical report must always push down a <code>WHERE</code> clause, you can write a test that runs <code>EXPLAIN (VERBOSE)</code> and checks that the Remote SQL contains the filter. You might even parse loops and assert that it is 1. When a developer inadvertently adds a non‑immutable function or changes a join, the test will fail. This is akin to snapshot testing for SQL.</p>
<h3 id="heading-monitor-pgstatstatements-across-servers">Monitor pg_stat_statements across servers</h3>
<p>Enable <code>pg_stat_statements</code> on both the local and remote servers. On the local side, track the total time, planning time, and rows for each FDW query. On the remote side, track which queries are being executed.</p>
<p>Look for outliers: a query whose remote calls spike or whose average remote rows jump from hundreds to thousands. Those are early signs of pushdown failure.</p>
<h3 id="heading-log-remote-sql-with-autoexplain">Log remote SQL with auto_explain</h3>
<p>Setting <code>auto_explain.log_min_duration_statement</code> (for example, to 500ms) causes Postgres to automatically log slow queries with their plans. Combine this with <code>auto_explain.log_verbose = true</code> and <code>auto_explain.log_nested_statements = true</code> to capture remote SQL as well. When a federated query slows down, the log will show you exactly what remote SQL was executed and how often. This is invaluable in production, where you cannot always run EXPLAIN interactively.</p>
<h3 id="heading-use-connection-pooling-and-prepare-statements">Use connection pooling and prepare statements</h3>
<p><code>postgres_fdw</code> maintains a connection pool keyed on the user mapping. It reuses connections between queries, but you can also use connection pooling at the network level (for example, pgbouncer or pgcat).</p>
<p>Keeping connections warm reduces the startup cost, as captured by <code>fdw_startup_cost</code>. Meanwhile, preparing statements on the remote server (via <code>PREPARE</code> and <code>EXECUTE</code>) can save parse time when the same remote SQL is executed frequently. <code>postgres_fdw</code> can use server‑side prepared statements for parameterized scans.</p>
<h3 id="heading-regression-testing-after-version-upgrades">Regression testing after version upgrades</h3>
<p>Every major Postgres release brings improvements to postgres_fdw pushdown semantics. But new releases also change planner heuristics and remote SQL generation. After an upgrade, rerun your key queries with EXPLAIN (VERBOSE), compare the Remote SQL, and benchmark them.</p>
<p>In some cases, a release may push down something previously local, revealing a latent type mismatch or a function difference. In other cases, pushdown may be withheld due to a new rule. Don’t assume that an upgrade automatically improves performance – test it.</p>
<h2 id="heading-extended-guidelines-for-advanced-dbas">Extended Guidelines for Advanced DBAs</h2>
<p>To close this handbook, here are consolidated guidelines distilled from the previous sections. They go beyond simple bullet points to capture nuances. Keep them handy for reference or print them out for your team.</p>
<ol>
<li><p><strong>Respect the FDW safety model.</strong> Immutable functions and built‑in operators are your friends. Anything outside that scope must be explicitly allowed or evaluated locally. Understand which items belong to each category and plan accordingly.</p>
</li>
<li><p><strong>Always read the Remote SQL.</strong> Don’t trust your intuition about what is being pushed down. The Remote SQL string is the only source of truth. It indicates whether a predicate, join, sort, or limit operation is occurring remotely. It also shows parameter placeholders (for example, $1) that correspond to values passed from the local plan.</p>
</li>
<li><p><strong>Reduce before you fetch.</strong> The network is the highest cost. If the remote can reduce rows through filtering, grouping, or limiting, let it. If it cannot, structure your query to enable it. Avoid queries that require pulling large raw tables and processing them locally.</p>
</li>
<li><p><strong>Beware of join order.</strong> The planner sometimes chooses a nested loop with a foreign table as the inner side, resulting in repeated remote calls. Examine loops: if you see a high number, consider rewriting the query or adjusting cost parameters.</p>
</li>
<li><p><strong>Use CTEs strategically.</strong> A CTE can isolate remote scans and let you control whether they are materialized once or inlined. Use <code>MATERIALIZED</code> to avoid repeated remote scans when a CTE is referenced multiple times. Use <code>NOT MATERIALIZED</code> to allow optimizations across CTE boundaries.</p>
</li>
<li><p><strong>Instrument, monitor, iterate.</strong> Good FDW performance is not a one‑off fix. Monitor queries and plans. Use tests to catch regressions. Adjust tuning knobs and indexes as your data or workload changes. Document your reasoning so others can understand why a particular plan is expected.</p>
</li>
<li><p><strong>Educate your team.</strong> Federated queries invite subtle bugs and performance traps. Share the high‑level rules – immutable functions only, cross‑server joins are local, always check remote SQL – so engineers write safer queries by default. A 30‑minute training can save hours of debugging later.</p>
</li>
</ol>
<h2 id="heading-bringing-it-all-together">Bringing it All Together</h2>
<p>This handbook has covered a lot of ground: from the high‑level principle that pushdown is about data movement, to the nitty‑gritty of join conditions and tuning knobs, to troubleshooting steps and case studies. It is intentionally opinionated and personal: these are the patterns and pitfalls encountered in real systems, not abstract guidelines. By sharing specific examples, I hoped to make the rules memorable and show how they interplay with actual workloads.</p>
<p>The goal is not just to tell you what to do, but to show you how to think and problem solve: review the plan, trace data movement, and determine whether the query is doing the heavy work in the right place.</p>
<p>That thinking process, practiced enough times, becomes second nature. When you write a new query, you'll automatically consider whether your predicates are immutable, whether the join can be shipped, and whether you are about to trigger an N+1 pattern. When you review plans, you'll start from the Foreign Scan nodes and remote SQL, not the top‑level node. When you tune, you'll know which knobs to twist and in which order.</p>
<p>Keep experimenting. Use the examples here as starting points. Try different structures in a test environment and measure the difference. The more you play with pushdown, the more comfortable you'll become with its constraints and superpowers.</p>
<p>If this handbook helps you avoid one performance incident or saves you from shipping a broken query, it has done its job. Enjoy exploring the federated world of Postgres.</p>
<h2 id="heading-references">References</h2>
<p><a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=It%20is%20generally%20recommended%20that,differently%20from%20the%20local%20server">[1]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[2]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=This%20option%2C%20which%20can%20be,false">[3]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=In%20the%20remote%20sessions%20opened,their%20expected%20search%20path%20environment">[4]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=functions%20in%20such%20clauses%20must,to%20reduce%20the%20risk%20of">[5]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=When%20,clauses">[6]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=">[9]</a> <a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html#:~:text=,Extension">[10]</a> PostgreSQL: Documentation: 18: F.38. postgres_fdw – access data stored in external PostgreSQL servers (<a target="_blank" href="https://www.postgresql.org/docs/current/postgres-fdw.html">https://www.postgresql.org/docs/current/postgres-fdw.html</a>)</p>
<p><a target="_blank" href="https://www.postgresql.org/docs/release/10.0/#:~:text=,Jeevan%20Chalke%2C%20Ashutosh%20Bapat">[7]</a> PostgreSQL: Release Notes (<a target="_blank" href="https://www.postgresql.org/docs/release/10.0/">https://www.postgresql.org/docs/release/10.0/</a>)</p>
<p><a target="_blank" href="https://www.postgresql.org/docs/release/12.0/#:~:text=,Etsuro%20Fujita%29%20%C2%A7%20%C2%A7">[8]</a> PostgreSQL: Release Notes (<a target="_blank" href="https://www.postgresql.org/docs/release/12.0/">https://www.postgresql.org/docs/release/12.0/</a>)</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Cryptography Handbook: Exploring RSA PKCSv1.5, OAEP, and PSS ]]>
                </title>
                <description>
                    <![CDATA[ The RSA algorithm was introduced in 1978 in the seminal paper, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems". Over the decades, as RSA became integral to secure communications, various vulnerabilities and attacks have emerg... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-cryptography-handbook-rsa-algorithm/</link>
                <guid isPermaLink="false">67edb47680c0ce2ff2faebb9</guid>
                
                    <category>
                        <![CDATA[ Cryptography ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Hamdaan Ali ]]>
                </dc:creator>
                <pubDate>Wed, 02 Apr 2025 22:04:38 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743630655223/f7e0c094-2103-42cd-97bd-be79d14fff67.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The RSA algorithm was introduced in 1978 in the seminal paper, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems". Over the decades, as RSA became integral to secure communications, various vulnerabilities and attacks have emerged, underscoring the importance of understanding and implementing RSA correctly.</p>
<p>This handbook will help you understand the internal workings of the RSA algorithm, how they have evolved over the years, and the schemes defined under various RFCs. This knowledge will help you make informed choices about the most suitable RSA schemes depending on your business requirements.</p>
<p>In this handbook, we’ll begin by exploring the foundational principles of the RSA algorithm. By examining its mathematical underpinnings and historical evolution, you will gain insight into the diverse array of attacks that have emerged over the years.</p>
<p>The narrative unfolds as an evolutionary journey: from the original, straightforward (textbook) RSA implementation, through the discovery of vulnerabilities, to the development of effective countermeasures, and further refinements as new challenges were encountered. This progression illuminates how RSA has transformed over time and also demonstrates how modern cryptographic libraries have integrated these advancements to achieve secure implementations in today’s applications.</p>
<p>You can also watch the associated video here:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/jpcLbsuHWbU" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-alice-bob-paradigm">The Alice-Bob Paradigm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-birth-of-the-rsa-cryptosystem">The Birth of the RSA Cryptosystem</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-prime-numbers-and-composite-moduli">Prime Numbers and Composite Moduli</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-euler-totient-function">The Euler Totient Function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-computing-the-keys">Computing the Keys</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-rsa-operations">RSA Operations</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-encryption">Encryption</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-decryption">Decryption</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-digital-signatures">Digital Signatures</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-issues-with-eulers-totient-function-in-rsa">Issues with Euler’s Totient Function in RSA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-carmichael-function">The Carmichael Function</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-mathematical-implication-of-the-carmichael-function">Mathematical Implication of The Carmichael function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-carmichael-function-in-modern-implementations">The Carmichael Function in Modern Implementations</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-issues-with-raw-rsa">Issues with Raw RSA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exploiting-textbook-rsas-determinism-and-malleability">Exploiting Textbook RSA’s Determinism and Malleability</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-key-generation-setup">Key Generation (Setup)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-encryption-process">Encryption Process</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-determinism-exploit-ciphertext-guessing-attack">Determinism Exploit (Ciphertext Guessing Attack)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-malleability-exploit-ciphertext-manipulation-attack">Malleability Exploit (Ciphertext Manipulation Attack)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-low-exponent-attacks">Low-Exponent Attacks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hastads-broadcast-attack-low-exponent-meets-multiple-recipients">Håstad’s Broadcast Attack: Low Exponent Meets Multiple Recipients</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-introduction-to-padding-schemes-in-rsa">Introduction to Padding Schemes in RSA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-public-key-cryptography-standards-pkcs1-v15">Public Key Cryptography Standards (PKCS#1 v1.5)</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-the-mathematics-behind-pkcs1-v15">The Mathematics Behind PKCS#1 v1.5</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-bleichenbacher-attack">The Bleichenbacher Attack</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-optimal-asymmetric-encryption-padding-oaep">Optimal Asymmetric Encryption Padding (OAEP)</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-the-mathematics-behind-oaep">The Mathematics Behind OAEP</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-why-sha-1-or-md5-are-safe-in-rsa-oaep">Why SHA-1 or MD5 Are Safe in RSA-OAEP</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-label-hashing">Label Hashing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-mask-generation-function-mgf1">Mask Generation Function (MGF1)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-adoption-in-cryptographic-libraries-pkcs1-v15-vs-oaep">Adoption in Cryptographic Libraries (PKCS#1 v1.5 vs OAEP)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-enhancing-digital-signatures-the-transition-to-pss">Enhancing Digital Signatures: The Transition to PSS</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-problems-with-early-rsa-signature-schemes">Problems with Early RSA Signature Schemes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-birth-of-the-probabilistic-signature-scheme-pss">Birth of the Probabilistic Signature Scheme (PSS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-mathematics-behind-pss">The Mathematics Behind PSS</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-road-ahead-assessing-rsas-long-term-viability">The Road Ahead: Assessing RSA’s Long-Term Viability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ol>
<li><strong>Linear Algebra:</strong> A foundational understanding of Linear Algebra and Modular Arithmetic will help you understand certain sections of the handbook, though it is not an absolute requirement. This handbook provides comprehensive explanations of mathematical expressions and their underlying concepts as they arise.</li>
</ol>
<p>For a concise and relevant introduction to the Chinese Remainder Theorem (CRT) in the context of the handbook, you may find this resource helpful: <a target="_blank" href="https://www.youtube.com/watch?v=Mt9v7-xBuaA">CRT, RSA, and Low Exponent Attacks | YouTube</a>.</p>
<ol start="2">
<li><strong>Patience (and a Sense of Adventure):</strong> RFCs can sometimes get dull to read, and research papers can feel intimidating at first glance. This handbook is designed to make standard cryptographic concepts accessible to everyone, guiding you through each step with clarity and intuition. Every concept is reinforced with clear, step-by-step examples, ensuring not only a thorough understanding but also familiarity with widely used standard notations. So take your time, take a deep breath, and embrace the journey.</li>
</ol>
<p>For visual learners, the associated video may offer a more engaging experience.</p>
<h2 id="heading-the-alice-bob-paradigm"><strong>The Alice-Bob Paradigm</strong></h2>
<p>Throughout this handbook, you will come across numerous sequence diagrams and mathematical proofs that use the Alice-Bob Paradigm.</p>
<p>The Alice-Bob paradigm is a common convention in cryptography where two generic entities, often named Alice and Bob, are used to illustrate various scenarios, protocols, or cryptographic principles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742677993632/c9312974-4cb9-4496-8b23-b6d0d61c0a45.png" alt="The Alice Bob Paradigm" class="image--center mx-auto" width="818" height="483" loading="lazy"></p>
<p>These characters represent two parties engaged in communication, with Alice typically representing the sender or initiator, and Bob representing the receiver or responder.</p>
<p>We often introduce Eve as a third party, symbolizing an eavesdropper or potential attacker, adding an element of security risk, and illustrating scenarios where external entities might attempt to intercept or manipulate the communication.</p>
<h2 id="heading-the-birth-of-the-rsa-cryptosystem">The Birth of the RSA Cryptosystem</h2>
<p>The year 1978 witnessed the birth of a new era in cryptography with the introduction of the RSA cryptosystem, named after its inventors (Rivest, Shamir, and Adleman).</p>
<p>This development, introduced in the paper "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", provided a method for secure digital communication and laid the foundation for modern public-key cryptography.  </p>
<p>At the heart of RSA lies elementary number theory – specifically, the properties of prime numbers and modular arithmetic. Let’s first understand how these key concepts form its mathematical foundations.</p>
<h3 id="heading-prime-numbers-and-composite-moduli">Prime Numbers and Composite Moduli</h3>
<p>The algorithm starts by selecting two large prime numbers, denoted as <em>p</em> and <em>q</em>. Their product (\(n = p \times q\)) forms the modulus for both the public and private keys.  </p>
<p>The security of RSA depends heavily on the fact that, while multiplying these primes is computationally straightforward, factoring the resulting large composite number <em>n</em> is considered infeasible for sufficiently large primes.  </p>
<p>At this point, it’s important to note that p and q must be large prime numbers to ensure RSA’s security. Fortunately, modern libraries handle this automatically by using well-established prime-generation algorithms. As a result, you can focus on higher-level aspects of your applications without having to manage the low-level details of prime selection.</p>
<p>For instance, let’s have a look at OpenSSL’s RSA key generation routine which performs several checks to ensure that the resulting modulus \(n = p \times q \) meets the desired bit-length requirements:</p>
<p>The below snippet right-shifts the product of the generated primes (stored in <code>r1</code>) by <code>bitse - 4</code> bits to isolate the top 4 bits, which are then checked to ensure that the modulus meets the desired size criteria.</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (!BN_rshift(r2, r1, bitse - <span class="hljs-number">4</span>))
    <span class="hljs-keyword">goto</span> err;
bitst = BN_get_word(r2);
</code></pre>
<p>The extracted bits (<code>bitst</code>) are then compared against a predefined range (from <code>0x9</code> to <code>0xF</code>). This range ensures that the most significant byte of the modulus isn’t too small or too large.</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (bitst &lt; <span class="hljs-number">0x9</span> || bitst &gt; <span class="hljs-number">0xF</span>) {
    bitse -= bitsr[i];
</code></pre>
<p>If the significant bits do not fall within the desired range, the bit length is adjusted and the prime-generation process is retried. If the number of retries exceeds a set limit, the entire process is restarted.</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (!BN_GENCB_call(cb, <span class="hljs-number">2</span>, n++))
    <span class="hljs-keyword">goto</span> err;
<span class="hljs-keyword">if</span> (primes &gt; <span class="hljs-number">4</span>) {
    <span class="hljs-keyword">if</span> (bitst &lt; <span class="hljs-number">0x9</span>)
        adj++;
    <span class="hljs-keyword">else</span>
        adj--;
} <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (retries == <span class="hljs-number">4</span>) {
    i = <span class="hljs-number">-1</span>;
    bitse = <span class="hljs-number">0</span>;
    sk_BIGNUM_pop_free(factors, BN_clear_free);
    factors = sk_BIGNUM_new_null();
    <span class="hljs-keyword">if</span> (factors == <span class="hljs-literal">NULL</span>)
        <span class="hljs-keyword">goto</span> err;
    <span class="hljs-keyword">continue</span>;
}
retries++;
<span class="hljs-keyword">goto</span> redo;
</code></pre>
<p>To ensure that the numbers are necessarily primes, these libraries use a combination of probabilistic tests, including the Rabin-Miler Primality Testing, and sieving methods to quickly eliminate non-prime candidates.</p>
<h3 id="heading-the-euler-totient-function">The Euler Totient Function</h3>
<p>For a number <em>n</em> that is the product of two primes, the Euler totient function is given by:</p>
<p>$$\varphi(n) = (p-1)(q-1)$$</p><p>This function counts the number of integers less than \(n\) that are co-prime to \(n\). Euler’s theorem, which states that for any integer <em>a</em> co-prime to <em>n</em>, \( a^{\varphi(n)} \equiv 1 \pmod{n}\) plays a central role in proving why RSA’s operations are reversible.</p>
<p>But most modern RSA cryptosystems use the Carmichael function instead of the Euler’s Totient Function. We will examine the reasoning behind this shift in the next few sections.</p>
<h3 id="heading-computing-the-keys">Computing the Keys</h3>
<p>Now we select an integer \(e\) such that \(1 &lt; e &lt; \varphi(n)\)and \(\gcd(e, \varphi(n)) = 1\). This \(e\) becomes the public exponent you see as a parameter in the RSA function calls you make.</p>
<p>With that done, now let’s determine \(d\) as the modular multiplicative inverse of \(e \, \, modulo \, \varphi(n)\). In other words, \(d\) is computed such that:</p>
<p>$$e \times d \equiv 1 \pmod{\varphi(n)}$$</p><p>This step is the mathematical linchpin ensuring that decryption is the inverse operation of encryption.</p>
<p>In the 1978 paper, the authors explicitly provided these formulas and steps. They showed that if you encrypt a message m using \(c = m^e \mod n\) and then decrypt using \(m = c^d \mod n \) , the original message is recovered – thanks to the properties of modular exponentiation and Euler’s theorem. This mathematical framework was novel at the time and immediately set the stage for a new era in cryptography.</p>
<h2 id="heading-rsa-operations">RSA Operations</h2>
<p>Now that the mathematical foundations are laid, the RSA algorithm can be seen as a set of three core operations: Encryption, Decryption, and Signing. Throughout this handbook's next sections, we will critically analyze these operations and learn about several pitfalls in each. Then we will examine how these were averted with the birth of new schemes, each to solve a new issue discovered on the way.</p>
<h3 id="heading-encryption">Encryption</h3>
<p>With the public key \((n, e)\) available to everyone, any user can encrypt a message \(m\) (where \(m\) is first encoded as an integer in the range \(0 \leq m &lt; n\) ) using the formula:</p>
<p>$$c = m^e \mod n$$</p><p>Here, c is the ciphertext. Because the operation is based on modular exponentiation, even if m is known, recovering m from c without knowing d is computationally hard.</p>
<h3 id="heading-decryption">Decryption</h3>
<p>The intended recipient, who possesses the private key \(d\), decrypts the cipher text \(c\) by computing:</p>
<p>$$m = c^d \bmod n$$</p><p>Using the relationship (\(e \times d \equiv 1 \pmod{\varphi(n)}\)) and properties from Euler’s theorem, the above operation exactly inverts the encryption step, recovering the original message \(m\).</p>
<p>This ensures that only the holder of the private key can read the encrypted message. This is the backbone of RSA’s use in secure communication.</p>
<p>The sequence diagram below wraps up our discussion so far:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742754978876/9b007639-8595-4d11-93ff-355820cb98c7.png" alt="Sequence Diagram: Textbook RSA Encryption" class="image--center mx-auto" width="732" height="540" loading="lazy"></p>
<h3 id="heading-digital-signatures">Digital Signatures</h3>
<p>Digital signatures fulfill a different security goal: authenticity and integrity rather than confidentiality. While encryption and decryption use the public key for “locking” and the private key for “unlocking,” digital signatures reverse these roles.</p>
<h4 id="heading-1-signing">1. Signing</h4>
<p>The author of a message uses their private key \(d\) to compute a signature \(s\) on the message \(m\), guided by the formula mentioned below:</p>
<p>$$s = m^d \bmod n$$</p><p>This can later be verified by others using the corresponding public key. The purpose here is not to recover a secret message but to create a proof of authenticity.</p>
<h4 id="heading-2-verification">2. Verification:</h4>
<p>Anyone with the public key \((n, e)\) can verify that the signature s indeed belongs to the message \(m\) by computing:</p>
<p>$$m \equiv s^e \bmod n$$</p><p>If the equivalence holds, it confirms two key points: That the message has not been tampered with (that is, integrity), and that the signature must have been generated using the private key d (that is, authenticity).<br>As long as \(d\) is kept secret, only the legitimate signer can produce a valid signature. Take at look at the sequence diagram below to understand the complete process.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742755268516/6dea4239-f214-42c4-96c7-5fc55c7249d9.png" alt="Sequence Diagram: Textbook RSA Signatures" class="image--center mx-auto" width="732" height="540" loading="lazy"></p>
<h2 id="heading-issues-with-eulers-totient-function-in-rsa">Issues with Euler’s Totient Function in RSA</h2>
<p>While using Euler’s Totient Function works well in theory, implementers of the scheme realized its practical downsides. Simply put, the primary issue was that Euler’s Totient Function can lead to a larger private exponent \(d\) than what was necessary.</p>
<p>To completely appreciate this fact, let’s take a step back to understand why the size of the private exponent \(d\) matters in RSA.</p>
<p>RSA decryption (or signing) involves computing \(m^d ~~mod ~n\) which is done via modular exponentiation. The time complexity of exponentiation algorithms (like square-and-multiply) grows with the number of bits in \(d\). A larger \(d\) means more multiplications and squarings, that is slower decryption.</p>
<p>In practice, if using the Euler’s Totient Function makes \(d\) roughly twice as large as what is required, then decryption can be almost twice as slow compared to using the minimal \(d\). This inefficiency is especially noticeable when \(e\) is small (common public exponents like 3 or 65537). A small \(e\) leads to a very large \(d\) under \(φ(n)\).</p>
<p>Beyond performance, having an unnecessarily large \(d\) can increase storage size slightly (a few more bytes for the key). This can also lead to interoperability quirks, which is why standards and protocols such as FIPS 186-4 [1] and RFC 8017 [2] expect \(d\) to be below a certain size. We will take a detailed look at this in the next section.</p>
<p>To combat these issues, cryptographers utilized the Carmichael function to generate RSA keys. Before we dive into how the Carmichael function helps our case, let’s quickly understand what the Carmichael function actually is.</p>
<h2 id="heading-the-carmichael-function">The Carmichael Function</h2>
<p>The Carmichael Function, represented by \(λ(n)\), also known as the reduced totient or least universal exponent, is defined as the smallest positive integer \(m\) such that for every integer \(a\) co-prime to \(n\), \( a^m ≡ 1 (mod n)\).</p>
<p>To put this in easy terms, \(λ(n)\) is the exponent of the multiplicative group modulo \(n\) (the least common multiple of the orders of all elements). For RSA-style moduli (product of primes), the Carmichael function is guided by the formula:</p>
<p>$$\lambda(n) = \operatorname{lcm}(p-1,\,q-1)$$</p><p>where \(n = p . q\) with \(p\) and \(q\) being the large primes.</p>
<p>You may now understand the Carmichael function better if we put it in the following way: \(λ(n)\) is the least common multiple of \(λ(n)\) of each prime power dividing n. So for a prime \(p\), \(λ(p) = φ(p) = p – 1\), and for two primes, we take the \(lcm\) of \(p-1 \) and \(q-1.\)</p>
<h3 id="heading-mathematical-implication-of-the-carmichael-function">Mathematical Implication of The Carmichael function</h3>
<p>The Carmichael function \(λ(n)\) is a “tighter” bound. What this means is that \(λ(n)\) divides \(φ(n)\) (since the exponent of a finite group always divides the group order by Lagrange’s Theorem [3])</p>
<p>If \(p\) and \(q\) are both odd primes, then \(p–1\) and \(q–1 \) are even, so their least common multiple is roughly half of \((p–1)(q–1)\). Mathematically:</p>
<p>$$λ(n) = \dfrac{(p–1)(q–1)} {gcd(p–1, q–1)}$$</p><p>We can observe that this \(λ(n)\) is lesser than or equal to \(φ(n)\) and often considerably smaller. This means \(λ(n)\) provides the minimal exponent needed for RSA’s correctness, whereas \(φ(n)\)might be a larger number that still works but isn’t necessary.</p>
<p>When you choose two large random primes \(p\) and \(q\), you have:</p>
<p>$$\varphi(n) = (p-1)(q-1) \approx n,$$</p><p>because for large primes, the subtracted ones make only a small difference compared to \(p\) and \(q\) themselves.</p>
<p>Now, since both \(p-1\) and \(q-1 \) are even, they each have a factor of 2. If those are their only common factors (which is often the case for random primes), then:</p>
<p>$$\lambda(n) = \mathrm{lcm}(p-1, q-1) \approx \frac{\varphi(n)}{2}.$$</p><p>When you compute the private exponent \(d\) as the modular inverse of \(e\) (a small number) modulo \( \varphi(n)\) versus modulo \(\lambda(n)\), the range from which \(d\) is chosen is roughly twice as large in the former case. That means the typical \(d\) when computed modulo \(\varphi(n)\) can be about twice as large as when computed modulo \(\lambda(n)\). A larger \(d\) means that during decryption (or signing) the modular exponentiation \(c^d \mod n\) takes slightly more time.</p>
<p>Intuitively, using \(λ(n)\) ensures we don’t “overshoot” the exponent required for the modular arithmetic to cycle back to 1.</p>
<p>A smaller \(d\) makes every RSA decryption and signature operation faster. For instance, if \(λ(n)\) is roughly half of \(φ(n)\), then \(d\) will have one less bit than it would otherwise, cutting the exponentiation work by about 50%. This is a free performance gain, as we aren’t changing the security assumptions or the key size \(n\), just using the mathematically tight value for the exponent. The RSA algorithm’s security is not weakened by this and now the \(d\) is different but functionally equivalent.</p>
<h3 id="heading-the-carmichael-function-in-modern-implementations">The Carmichael Function in Modern Implementations</h3>
<p>The critical property for RSA (\(e·d ≡ 1 ~mod ~~λ(n)\)) is both necessary and sufficient for correct decryption, thanks to Carmichael’s theorem. So there’s no need for \(d\) to also satisfy the stronger condition modulo \(φ(n)\).</p>
<p>By switching to computing \(d ~ modulo ~~ λ(n)\) (i.e., \(d = e^{-1} ~mod ~~λ(n)\)), we directly get the smallest working private exponent. Ronald Rivest himself noted this optimization in his 1999 seminal paper [4], stating that solving for \(d\) using \( λ(n)\) instead of \(φ(n)\) is slightly preferable because it can result in a smaller value for d.</p>
<p>Over time, the use of \( λ(n)\) in RSA moved from an academic suggestion to an industry standard. Today’s cryptographic standards explicitly acknowledge or require the \(λ(n)\) approach.</p>
<p>For example, the official RSA standard (PKCS #1 v2.2, RFC 8017 [2]) defines the RSA key generation in terms of \(λ(n)\). It specifies that the private exponent \(d\) is chosen such that \(e·d ≡ 1 (mod λ(n))\) (with \(λ(n) = lcm(p–1, q–1)\)). In other words, PKCS #1 expects the Carmichael function to be used for the modulus of the exponent. Likewise, NIST’s FIPS 186-4 (Digital Signature Standard) mandates that \(d\) be less than \(λ(n)\).</p>
<p>Any RSA key where \(d\) is larger than \(λ(n)\) is considered non-compliant in those strict contexts. This effectively forces implementations to use the smaller \(λ(n)\)-based exponent, since any “oversized” \(d\) can be reduced \(mod ~~λ(n)\) to meet the criterion.</p>
<p>Standards such as FIPS 186-4 [1] (the Digital Signature Standard) and RFC 8017 [2] (which specifies PKCS#1 v2.2 for RSA Cryptography) include requirements or recommendations that imply the private exponent \(d\) should be as small as possible and ideally less than \( \lambda(n)\). Using \(\lambda(n)\) (the least common multiple of \(p-1\) and \(q-1\)) directly produces the smallest valid \(d\), whereas using \(\varphi(n)\) often results in a \(d\) that is larger than necessary. This not only improves performance (by reducing the number of modular multiplications needed during decryption/signing) but also helps maintain interoperability with protocols that expect d to be below a certain size.</p>
<p>The Python cryptography library (PyCA cryptography) explicitly documents [5] that it uses Carmichael’s totient to generate the “smallest working value of \(d\),” noting that older implementations (including the original RSA paper) used Euler’s totient and ended up with larger exponents. OpenSSL also uses the Carmichael function in their low-level RSA APIs [6].</p>
<p>This shift to the Carmichael function ensures that under the hood your RSA key is a bit more efficient than the ones from the late 1970s while providing the same level of security.</p>
<h2 id="heading-issues-with-raw-rsa">Issues with Raw RSA</h2>
<p>Raw or “Textbook” RSA soon turned out to be insecure when two major weaknesses were discovered.</p>
<p>The operations involved in RSA are entirely deterministic, which means that for a given plaintext \(m\), encryption always produces the same cipher text \(C = m^e \mod n\).</p>
<p>An eavesdropper or an attacker, say Eve, can guess or derive plain texts by exploiting the predictability of outputs. Since RSA encryption is a public operation, an attacker can encrypt likely messages and compare results to a target cipher text – a trivial chosen plaintext <em>attack</em>.</p>
<p>Besides this, textbook RSA is also malleable. This means that its algebraic structure allows attackers to manipulate cipher texts in meaningful ways. For instance, given a cipher text \(C = RSA(M)\), an attacker can multiply it by the encryption of a known value (say, r) to produce a new cipher text \(C’ = C · r^e ~~mod ~n\), which decrypts to the plaintext \(M·r\). When the legitimate receiver decrypts \(C'\), the result is \(M·r\), from which the attacker can often recover \(M\).</p>
<p>Let’s understand these vulnerabilities with a small practical example.</p>
<h2 id="heading-exploiting-textbook-rsas-determinism-and-malleability">Exploiting Textbook RSA’s Determinism and Malleability</h2>
<h3 id="heading-key-generation-setup"><strong>Key Generation (Setup)</strong></h3>
<p>For our toy example, we’ll choose small prime numbers and generate an RSA key pair:</p>
<p>Let’s select the values of \(p =3\) and \(q=11\). Both of these values are prime. Now, compute the modulus and Totient Function as follows:</p>
<p>$$\begin{gather} \begin{split} n = p × q = 3 × 11 = 33 \\ φ(n) = (p – 1) × (q – 1) = 2 × 10 = 20 \end{split} \end{gather}$$</p><p>Now choose the public exponent. Let’s consider \(e=3\) since it is coprime with \( φ(n) = 20\), and \(gcd(3, 20) = 1\).</p>
<p>Now let’s compute the private exponent. We know that d is the modular inverse of \(e ~~mod ~φ(n)\). We need to find d such that \((d × e) ≡ 1~~ (mod ~20)\). Using this knowledge we can compute \(d = 7\) as \(3 × 7 = 21 ≡ 1 ~~ (mod~ 20)\).</p>
<p>Finally, the public key is \((n = 33, ~ e = 3)\) and the private key (secret) is \(d = 7\).</p>
<h3 id="heading-encryption-process">Encryption Process</h3>
<p>Now, let’s encrypt a simple message using the above key. Let us select our plaintext to be \(M = 4\). The cipher text in this case would be:</p>
<p>$$\begin{gather} \begin{split} C = 4^3 ~~mod ~33 \\ C = 64 ~~mod ~33 \\ C = 64 – 33×1 = 31 \end{split} \end{gather}$$</p><p>To consolidate the findings so far, if we encrypt message \(4\) with the public key \((e=3, n=33)\), we will produce the cipher text \(31\). Now, let’s try the exploits.</p>
<h3 id="heading-determinism-exploit-ciphertext-guessing-attack">Determinism Exploit (Ciphertext Guessing Attack)</h3>
<p>Textbook RSA is deterministic – the same plaintext always yields the same ciphertext (with no randomness involved). An attacker who intercepts the ciphertext \(C=31\) can exploit this by encrypting likely plaintext guesses and comparing results:</p>
<p>The adversary, say Eve, will try encrypting candidate plaintexts with the public key and see which one produces \(31\). They may pick randomized values to increase their efficiency:</p>
<p>$$\begin{gather} \begin{aligned} Guess~ M = 1 ⇒ 1^3~~ mod ~33 = 1 \\ Guess~ M = 2 ⇒ 2^3~~ mod ~33 = 8 \\ Guess~ M = 3 ⇒ 3^3~~ mod ~33 = 27 \\ Guess~ M = 4 ⇒ 4^3~~ mod ~33 = 31 \\ \end{aligned} \end{gather}$$</p><p>By simply comparing ciphertexts, the attacker finds that encrypting \(4\) yields 31, which matches the intercepted ciphertext. Thus, the attacker learns the original plaintext \(M\) was \(4\). This is possible because there’s no randomization in textbook RSA – an eavesdropper can identify a message by trial encryption of guesses, breaking confidentiality if the message space is small or guessable.</p>
<h3 id="heading-malleability-exploit-ciphertext-manipulation-attack">Malleability Exploit (Ciphertext Manipulation Attack)</h3>
<p>Raw RSA is also malleable. This means an attacker can take a ciphertext and modify it in a way that results in a predictable change in the decrypted plaintext. Let’s understand how this works.</p>
<p>RSA has a multiplicative property, that is, multiplying two ciphertexts corresponds to multiplying their plaintexts before encryption:</p>
<p>$$E(M_1) \cdot E(M_2) \mod n = (M_1^e \mod n)\times(M_2^e \mod n) \mod n = (M_1 \cdot M_2)^e \mod n$$</p><p>The sequence diagram below explains how the malleability exploit works in naive RSA.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741314973046/6be306c5-3ca6-4ea8-8daf-d1937b6459df.png" alt="Sequence Diagram: Malleability Exploit" class="image--center mx-auto" width="828" height="640" loading="lazy"></p>
<p>Alice sends a ciphertext to Bob after the initialization phase. Note that by this point, n and e are public knowledge. Eve intercepts this ciphertext by using mechanisms such as a MiTM (Man in the Middle) attack.</p>
<p>Now, Eve picks a known value to manipulate the message. Let’s say the attacker chooses \(X = 2\) (with the intent to double the original plaintext).</p>
<p>Then they compute the encryption of X using the public key:</p>
<p>$$E(X) = 2^3 \mod 33 = 8.$$</p><p>Now, Eve multiplies the original ciphertext by this value (mod n) to get a new ciphertext:</p>
<p>$$\begin{gather} \begin{split} C{\prime} = C \times E(X) \mod n = 31 \times 8 \mod 33 \\ C{\prime} = 248~~ mod~ 33 = 248 – 33×7 = 248 – 231 = 17 \end{split} \end{gather}$$</p><p>This new ciphertext \(C{\prime}\) is the encryption of the product of the original plaintext and \(2\). If we directly encrypted \(M \times X = 4 \times 2 = 8\) with RSA, we would get \(8^3 \mod 33 = 512 \mod 33 = 17\). This means that \(C′\) corresponds to the plaintext \(8\), which is the original message \(4\) multiplied by \(2\).</p>
<p>In a real-world chosen ciphertext attack, the attacker may have access to a decryption oracle or observe a system response that reveals information about \(M{\prime}\). The decryption result \(8\) is exactly \(M \times 2\) (the original message multiplied by the attacker’s chosen factor). Knowing the factor \(X = 2\), the attacker can deduce the original message by dividing: \(8/ 2 = 4\).</p>
<p>Note that Eve has not broken the mathematical foundations behind RSA here. They have only used the public key to compute an encryption of \(2\), and then combined it with the intercepted ciphertext. They don’t know the original plaintext yet, but they have manipulated the ciphertext in a way that they know the new plaintext is twice the original message.</p>
<h2 id="heading-low-exponent-attacks">Low-Exponent Attacks</h2>
<p>Beyond determinism and malleability exploits, textbook RSA is also vulnerable to Low-Exponent Attacks. Using a small public exponent like \(e = 3\) (or sometimes \(17\)) was popular because it used to speed up encryption and signature verification. But this soon turned out to be a security concern.</p>
<p>When RSA uses a small public exponent (say, \(e = 3\)) and the plaintext is very short (so that \(M^3\) is smaller than the modulus \(n\)), the encryption does not “wrap around” modulo \(n\). Mathematically:</p>
<p>$$c = M^3 \mod n = M^3 \quad \text{(if \( M^3 &lt; n \))}$$</p><p>Let’s understand this with an easy example:</p>
<p>Consider our plaintext to be: \(M = 5\). We compute \(M^3\) as \(M^3 = 5^3 = 125\).</p>
<p>Now assume \(n\) is a \(4096\)‑bit number which is large compared to \(125\). In this case, the ciphertext is simply \(c = 125\). Eve intercepting \(c = 125\) can compute the cube root of \(125\) to get the plaintext: \(\sqrt[3]{125} = 5\) thus recovering \(M\) directly.</p>
<p>This shows that if \(M\) is small enough, the ciphertext leaks the plaintext when \(e\) is low.</p>
<h2 id="heading-hastads-broadcast-attack-low-exponent-meets-multiple-recipients">Håstad’s Broadcast Attack: Low Exponent Meets Multiple Recipients</h2>
<p>In 1985, Johan Håstad’s highlighted the broadcast attack that illustrates the danger of a low exponent, \(e\), when the same message is sent to multiple parties as a broadcast.</p>
<p>Imagine Alice wants to send the same plaintext message M to three different recipients. Each recipient has their own RSA public key with modulus \(N_1, N_2, N_3,\) but for speed all use \(e = 3\) (a common practice historically). Alice encrypts \(M\) with each public key, yielding ciphertexts:</p>
<p>$$\begin{gather} \begin{split} C_1 = M^3 \bmod N_1 \\ C_2 = M^3 \bmod N_2 \\ C_3 = M^3 \bmod N_3 \end{split} \end{gather}$$</p><p>Eve, who intercepts all three \(C_1, C_2, C_3\) can recover <em>M</em> without breaking any single RSA key.</p>
<p>Since each \(N_i \) is different (and we assume they are pairwise coprime, as RSA keys should be), the attacker can use the Chinese Remainder Theorem (CRT) to combine the three congruences \(x \equiv C_i \pmod{N_i}\). Note that at this point Eve only has \(C_1\), \(C_2\) and \(C_3\). They do not have the plaintext \(M\) or \(M^3\) and yet they can reconstruct \(M^3\) with the intercepted data. To understand the Chinese Remainder Theorem and this reconstruction, you may follow this: <a target="_blank" href="https://www.youtube.com/watch?v=Mt9v7-xBuaA">CRT, RSA, and Low Exponent Attacks | Youtube</a>.</p>
<p>There is a unique solution modulo \(N_1N_2N_3\) for \(x\), and that solution turns out to be an integer, \(x = M^3\) (because the true integer \(M^3\) is smaller than the product \(N_1N_2N_3\) of each \(M &lt; N_i \) ). In essence, CRT lets Eve reconstruct \(M^3\) exactly. Once they have \(M^3\) as an ordinary integer, they simply take the cube root to find \(M\). There’s no need to factor any modulus or invert the RSA function – the math falls out due to the low exponent.</p>
<p>The sequence diagram below aims to provide a high-level understanding of the attack:</p>
<p><img src="https://mermaid.ink/img/pako:eNqNlN9P2zAQx_-VmyWkIpWqSeostTQkFvawh-6h7AFNEcgk19RSY3e2A3RV__dd-gNoExB5is_f-_jum3PWLDcFMsEc_q1R53itZGlllWmgZymtV7laSu3haqFyhN4N6gLteXv_u3mA3hRppZCWQYckldYs3orCDtG1fMS3mqhD86ORXBWPaJ20KxLsJL-MRzAUPSlW7NdPlOvAG3AUhwqdkyXCpIn4uUUEezjVDQ7MYxJcXF62OxXQbDugyBytx2cPvfQ-gG8wuYugMgXo-4MfHbwOW7qJ4RExfJ_Y9rAbGB0Boxcjz862Fivt0ea4JNP2ZjnoFcZ7LEBaa57cvoST7wEX79j08xVI_nyQ22nJUXb4QXZX-0fJUXtiTkACKN1o522dU0rvdufUOdRO6RLS6e8DolUBFdCCpaZa1p7sS-sHhKkxHswMbj9RRhOgsWw2aXTnJLOqVFouXuf3S6ZZn1VoK6kKusrrBpoxEleYMUGvBc5kvfAZy_SGpLL25malcyaoPeyzellIf7j5TMzkwlGUbhoTa_bMRMCHAx7xMU84j5I4HPfZiolRMhgNR-NkxOM45FE8ijd99s8YIgwH4zCMEp4EQTLkX0MebHF_tpu7M7FQ3tjJ7uez_Qf1mTV1OX85v7RNNzu13Q53amrtmUjGm_8gLoH2?type=png" alt="Sequence Diagram: Håstad’s Broadcast Attack" width="1329" height="664" loading="lazy"></p>
<p>Now let’s see this attack in action with a sample:</p>
<p>Suppose three different RSA public keys all use exponent \(e=3\), with moduli \( n_b = 187\) (for Bob),<br>\(n_c = 115 \) (for Carol), and \(n_d = 87\)  (for Dave).</p>
<p>These \(n_i\) are pairwise coprime (\(gcd\) of each pair is \(1\)). Now assume the same plaintext message \(M\) is encrypted with each public key. Let’s take a concrete \(M\). For example with \(M=42\), we will have:</p>
<p>$$\begin{gather} \begin{split} c_b = M^3 \bmod n_b \\ c_c = M^3 \bmod n_c \\ c_d = M^3 \bmod n_d \\ \end{split} \end{gather}$$</p><p>On calculating these, we have:</p>
<p>$$\begin{gather} \begin{split} c_b = 42^3 \bmod 187 = 36 \\ c_c = 42^3 \bmod 115 = 28 \\ c_d = 42^3 \bmod 87 = 51 \\ \end{split} \end{gather}$$</p><p>So the three ciphertexts observed are \(36\), \(28\), and \(51\), respectively. Eve who knows \(n_b, n_c, n_d\) and these ciphertexts can now recover \(M\) as follows:</p>
<ol>
<li><p>Eve will compute the total modulus \(N = n_b \cdot n_c \cdot n_d = 187 \times 115 \times 87 = 1,870,935.\) (This is the modulus for the combined system of congruences).</p>
</li>
<li><p>Now Eve will compute the partial products for each congruence:</p>
</li>
</ol>
<p>$$\begin{gather} \begin{split} N_b = \frac{N}{n_b} = \frac{1,870,935}{187} = 10,005 \\ N_c = \frac{N}{n_c} = \frac{1,870,935}{115} = 16,269 \\ N_d = \frac{N}{n_d} = \frac{1,870,935}{87} = 21,505 \end{split} \end{gather}$$</p><ol start="3">
<li><p>At this point, Eve needs the inverses of each \(N_i\) modulo its corresponding \(n_i\):</p>
<ul>
<li><p>First Eve computes \(M_b = (N_b)^{-1} \bmod n_b\), i.e. the number \(M_b\) such that \(N_b \cdot M_b \equiv 1 \pmod{187}\). In this case, \(N_b = 10005\). Using the extended Euclidean algorithm, Eve can find \(M_b = 2\) (since \(10005 \times 2 = 20010 \equiv 1 \pmod{187}\)).</p>
</li>
<li><p>Then Eve computes \(M_c = (N_c)^{-1} \bmod n_c\). Here \(N_c = 16269\). The inverse mod \(115\) turns out to be \(M_c = 49\) (For verification: \(16269 \times 49 \equiv 1 \pmod{115}\)).</p>
</li>
<li><p>Next up, Eve computes \(M_d = (N_d)^{-1} \bmod n_d\). For \(N_d = 21505\), the inverse mod \(87\) is \(M_d = 49\) as well (coincidentally the same value in this case, since \(21505 \times 49 \equiv 1 \pmod{87}\)).</p>
</li>
</ul>
</li>
</ol>
<p>Now Eve reconstructs the combined value using the Chinese Remainder Theorem for three congruencies. The construction of this formula is beyond the scope of this handbook, but to completely understand how this springs into action, you may go through this video: <a target="_blank" href="https://www.youtube.com/watch?v=Mt9v7-xBuaA">CRT, RSA and Low Exponent Attacks | Youtube</a>.</p>
<p>$$C \;=\; c_b \cdot N_b \cdot M_b \;+\; c_c \cdot N_c \cdot M_c \;+\; c_d \cdot N_d \cdot M_d \pmod{N}$$</p><p>On substituting the numbers:</p>
<p>$$C = 36 \cdot 10005 \cdot 2 \;+\; 28 \cdot 16269 \cdot 49 \;+\; 51 \cdot 21505 \cdot 49 \pmod{1,870,935}$$</p><p>Let’s carefully evaluate each term:</p>
<p>$$\begin{gather} \begin{split} 36 \cdot 10005 \cdot 2 = 720,360 \\ 28 \cdot 16269 \cdot 49 = 22,341,348 \\ 51 \cdot 21505 \cdot 49 = 5,37,40,995 \\ \end{split} \end{gather}$$</p><p>Summing these gives a raw total of \(7,20,360 +  2,23,21,068 + 5,37,40,995 = 7,67,82,423\). Now reduce this modulo \(N = 1,870,935\):</p>
<p>$$\begin{align} \begin{split} C \equiv 7,67,82,423 \pmod{1,870,935}\\ C = 74,088 \\ \end{split} \end{align}$$</p><p>Now Eve will simply take the cube root of \(C: \sqrt[3]{74088} = 42\), which is the original plaintext.<br>Eve has successfully recovered \(M\).</p>
<p>The key takeaway from these attacks is that without proper defenses. RSA alone does not satisfy modern definitions of security. It is not resistant to chosen-plaintext or chosen-cipher text attacks. This gap between the theoretical one-way function (RSA’s trapdoor permutation) and a secure encryption scheme became evident as implementers found that naive RSA could be “broken” by various clever tricks.</p>
<p>To counter these weaknesses, standards bodies introduced padding schemes to strengthen RSA encryption. In the following sections, you will learn about each of these paddings schemes and how they’ve been exploited over the years.</p>
<h2 id="heading-introduction-to-padding-schemes-in-rsa">Introduction to Padding Schemes in RSA</h2>
<p>Before we dive into the padding schemes and how it helps our case, let’s quickly recap the need for padding in RSA.</p>
<p>Textbook RSA encryption is deterministic. The same plaintext always produces the same ciphertext under a given public key. This determinism makes raw RSA insecure. An attacker can guess possible messages, encrypt them with the public key, and compare with the target ciphertext to see which guess matches.</p>
<p>Beyond determinism, small-exponent attacks illustrate why padding is critical. If the message \(m\) is too small relative to the modulus, raising it to a small public exponent (like \(e=3\)) might not wrap around \(N\). Padding the plaintext with random data before encryption remedies these problems by making the ciphertext unpredictable and ensuring \(m^e\) spans the modulus’ range.</p>
<h2 id="heading-public-key-cryptography-standards-pkcs1-v15"><strong>Public Key Cryptography Standards (PKCS#1 v1.5)</strong></h2>
<p>In 1998, Kaliski and RSA Laboratories introduced PKCS#1 v1.5 to the world in a public publication [7]. In PKCS#1 v1.5, every RSA‐encrypted message is wrapped inside a special “encryption block” \(EB\). This block ensures that the raw message is both the right size for RSA and padded in a way that’s hard to tamper with.</p>
<p>In this scheme, the plaintext is padded to the size of the modulus \(N\) (in bytes) as:</p>
<p>$$EB = 00 ~||~ BT ~||~ PS ~||~ 00 ~||~ M$$</p><p>Here, \(0x00\) (Leading Zero Byte) is always at the front. It ensures that, when the concatenated string \(EB\) is converted to a big‐endian integer, the value is less than the RSA modulus (that is, we don’t end up with a number too large for RSA to handle). You will better appreciate this fact when we dive into the mathematics behind this.</p>
<p>The next octet is the Block Type, \(BT\), which tells us the “type” of padding being used. The standard defines three possible \(BT\) values: \(00, 01, \) and \(02\)- to support different operations. For example, \(BT=00\) and \(BT = 01\) is used for private-key operations (such as digital signatures) and \(BT = 02\) is used for public-key operations. For encryption under PKCS#1 v1.5, this is always \(0x02\). It’s basically a label that says, “This is an encryption block, not something else”.</p>
<p>The next block is the Padding String \(PS\). This is a string of nonzero random bytes. This is crucial for security because it introduces randomness into each encryption. If the same message is encrypted multiple times, these random bytes ensure that each ciphertext looks different, foiling many simple attacks that rely on seeing repeated patterns.</p>
<p>The next octet, \(0x00\), is a Delimiter<strong>.</strong> This single zero byte marks the end of the padding. During decryption, this helps the recipient quickly identify where the padding stops and the real message begins.</p>
<p>Finally, we have the actual data you want to protect – \(M\). Once the recipient has verified the padding, they know exactly where to find this message.</p>
<p>This mechanism helped solve the deterministic issue of naive RSA. In the next sections, let’s understand the mathematics involved in PKCS#1 v1.5 padding and its security implications.</p>
<h3 id="heading-the-mathematics-behind-pkcs1-v15">The Mathematics Behind PKCS#1 v1.5</h3>
<p>Before we begin, let’s get our symbols and abbreviations correct. We will use upper-case symbols (such as \(EB\)) to denote octet strings and bit strings. We will use lower-case symbols (such as \(n\)) to denote integers.</p>
<p>In PKCS#1 v1.5, we will use \(k\) to represents the length of the RSA modulus \(n\) in bytes. For example, if you have a \(1024\)-bit RSA key, then the RSA modulus \(n\) is a \(1024\)-bit number. Since there are \(8\) bits in a byte, if your RSA modulus is \(L\) bits long, then:</p>
<p>$$k = \left\lceil \frac{L}{8} \right\rceil = \frac{1024}{8} = 128 \text{ bytes}$$</p><p>The total length of the encryption block will be equal to this RSA key length \(k\) (in bytes). Now here the length of the data \(M\) shall not be more than \(k-11\) octets, since the 11 bytes are consumed by the blocks – \(0x00  ~||~ 0x02 ~||~ PS ~||~ 0x00\). This limitation guarantees that the length of the padding string \(PS\) is at least eight octets, which is a security condition in PKCS#1v1.5:</p>
<p>$$∣PS∣=k~−∣M∣−~3$$</p><p>For example, with a \(1024\)-bit RSA modulus, the value of \(k\) comes out to be \(128\). Here Alice could encrypt up to \(128 - 11 = 117\) bytes of data. The \(11\) bytes are used for the \(0x00  ~||~ 0x02 ~||~ PS ~||~ 0x00\) structure. The random \(PS \) ensures that each encryption of the same message produces a different ciphertext, preventing the deterministic encryption problem.</p>
<p>RSA doesn’t directly operate on the bytes. Once the padded string \(EB\) is ready, it needs to be converted into an integer guided by the Octet String to Integer Primitive (OS2IP) formula:</p>
<p>$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EB}_i$$</p><p>where \(EB_i\) are the octets of \(EB\) from first to last. In other words, \(EB_1\) (the first byte) is the most significant byte, and \(EB_k\) (the last byte) is the least significant. Now Alice can simply encrypt this block using \(C = x^c \mod n\).</p>
<p>To solidify our learnings so far, let’s apply this to a sample plaintext and find the padded blocks.</p>
<p>Let’s assume the RSA modulus is \(8\) bytes long (\(k=8\)). Suppose we want to encrypt a message \(M\) that is \(2\) bytes long. Then the padding string \(PS\) must fill the remaining space:</p>
<p>$$Total ~ bytes=k=8=1(0x00)+1(BT)+∣PS∣+1(delimiter)+∣M∣$$</p><p>Since \(∣M∣=2\) and there are \(∣M∣=2∣\) fixed bytes, can find the required length of the padding string:</p>
<p>$$∣PS∣=8−3−2=3 ~ bytes$$</p><p>Let’s pick 3 arbitrary nonzero bytes for \(PS\), say - \(0xA3, ~0x5F, ~0xC2\). And let’s say the message is the ASCII text “Hi”. In hexadecimal, that’s: \(0x48\) for 'H' and \(0x69\) for 'i'.</p>
<p>Thus, the complete encryption block becomes:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742368983011/f682532c-6664-4197-8e77-60ea034f82c5.png" alt="Sample Encryption Block in PKCS#1 v1.5" class="image--center mx-auto" width="1191" height="437" loading="lazy"></p>
<p>Now we will convert this octet string to an integer using the OS2IP formula we discussed above:</p>
<p>$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EB}_i$$</p><p>For our example, with \(k=8\) the conversion is:</p>
<p>$$x=  0x00×256^7+0x02×256^6+0xA3×256^5+0x5F×256^4+0xC2×256^3+0x00×256^2+0x48×256^1+0x69×256^0$$</p><p>Note that the hexadecimal values can be converted to decimal as needed. For instance, \(0xA3 = 163, 0x5F = 95, 0xC2 = 194, 0x48 = 72,\) and \(0x69 = 105\).</p>
<p>There is an interesting observation in the application of this formula. Because the first two bytes are fixed (\(0x00\) and \(0x02\)), the integer \(x\) has a known lower bound. The contribution of the first two bytes is:</p>
<p>$$0×256^ 7 +2×256^ 6 =2×256^ 6$$</p><p>The rest of the bytes (\(PS\), the delimiter, and \(M\)) add some value that is at least \(0\) and at most just less than \(256^6\) (since the second byte is fixed as \(0x02\) and cannot be \(0x03\)). Thus, \(x\) is in the range:</p>
<p>$$2×256 ^ 6 ≤x&lt;3×256 ^ 6$$</p><p>This property which makes the range predictable, paved the way for the Bleichenbacher attack (also known as the “padding oracle” attack). If a system reveals whether a decrypted block is “correctly padded,” an attacker can systematically probe different ciphertexts and narrow down the plaintext – because the attacker knows it must lie in that narrow range. Let’s take a detailed look at the Bleichenbacher attack in the next sections and understand how the exploit works.</p>
<h2 id="heading-the-bleichenbacher-attack">The Bleichenbacher Attack</h2>
<p>In 1998, Daniel Bleichenbacher published a seminal paper [8] demonstrating an adaptive chosen-ciphertext attack against RSA with PKCS#1 v1.5 padding. The Bleichenbacher Attack, also dubbed as the “million messages” attack, demonstrated that if an attacker has access to an oracle that tells whether a submitted ciphertext decrypts to a properly padded plaintext (that is, whether the PKCS#1 v1.5 formatting is correct), the attacker can gradually recover the full plaintext. Let’s break down how this attack works:</p>
<p>First, Eve needs an Oracle. The attack assumes the attacker can query a system, such as an SSL/TLS server, and find out if a given ciphertext \(C\) is PKCS#1 v1.5 conformant. In the 1998 paper, Bleichenbacher exploited the fact that a TLS server, when presented with an improperly padded RSA-encrypted premaster secret, would respond with a specific error alert if the padding was wrong. Essentially, the server acted as an oracle: it would decrypt \(C\) with its private key and simply tell the attacker “padding OK” or “padding error” (the error could be timing-based or an explicit alert).</p>
<p>Note that the oracle does not reveal the plaintext. It only reveals a single bit of information at a time: “valid padding or not.” This might seem harmless, but Bleichenbacher showed that it’s enough to eventually recover the plaintext.</p>
<p>To quickly recap, the attacker’s goal is to find the unknown message integer \(m\) (the PKCS#1-padded plaintext as an integer) given its ciphertext \(C = m^e \bmod N\), using the oracle. We know that if \(m\) is properly padded, it lies in a specific numeric range: \(2B \le m &lt; 3B\) where \(B = 2^{8*(k-2)}\), as defined earlier.</p>
<p>If \(k=128\) bytes, then \(B=2^{8*126}\), and a correctly padded \(m\) will start with \(0x00 ~||~0x02\), so it’s between \(2B\) and \(3B\). The attacker, Eve, initially only knows that \(m\) is in the range \([2B, 3B)\).</p>
<p>In the Bleichenbacher Attack, Eve will exploit RSA’s multiplicative property. They will choose a number \(s\) (called the multiplier) and compute a new ciphertext \(C' = (C  s^e) \bmod N\). This \(C'\) here corresponds to a new plaintext: \(m' = m  s \bmod N\) (because \(C' \equiv m^e * s^e \equiv (ms)^e \pmod{N}\)).</p>
<p>To begin the attack, Eve finds some \(s_0\) such that \(C_0 = C * (s_0)^e \mod N\) yields a valid padding. This is referred to as the Blinding step. This is usually easy – for example, \(s_0\) can be chosen so that \(m * s_0\) is just slightly above \(N\), which almost certainly will wrap around and land in \([2B,3B)\). The attacker does not know \(m\) to verify this directly. They rely on the padding oracle’s yes/no response to infer that the blinded plaintext \((m×s_0)\mod  N\) falls in the correct range.</p>
<p>If the oracle returns “valid padding” for a given \( s_0\), it tells the attacker that \(s_0 \mod N\)lies between \(2B\)and \(3B\). Mathematically:</p>
<p>$$2B≤(m×s_0)~mod  N&lt;3B$$</p><p>Now, Eve will try to try to narrow down this range in a loop, which is often referred to as the interval having step. Initially, Eve had one wide interval \([a, b] = [2B, 3B)\) that contains \(m\). In each iteration, Eve tries increasing values of \(s\) (starting from a certain minimum) until the oracle returns “padding OK” for \(C' = C_0 * s^e\). Suppose this happens at some \(s = s_i\). Given this feedback, Eve now knows:</p>
<p>$$2𝐵 ≤  (𝑚 × 𝑠_i) ~ mod 𝑁 &lt; 3𝐵$$</p><p>This congruence implies there exists some integer \(r\) such that:</p>
<p>$$2B  ≤ ( m×s_i)−rN  &lt;  3B$$</p><p>Rearranging, we get a constraint on \(m\):</p>
<p>$$\frac{2B+rN}{s_i}  ≤  m  &lt;  \frac{3B+rN}{s_i}$$</p><p>Eve doesn’t know \(r\) outright, but they can solve for the possible range of \(r\) by considering the current interval \([a,b]\) for \(m\). Essentially, Eve uses the previous bounds on \(m\) to guess which \(r\) would make the inequality true, then updates the new bounds \([a, b]\) as the intersection of all possible solutions for \(m\). This dramatically shrinks the interval.</p>
<p>Each oracle query yields such a constraint. Eventually, the interval \([a,b]\) collapses to a single value, \([a,a]\). Now, Eve can find the plaintext using:</p>
<p>$$m = (a × s_i^{-1}) ~ mod N$$</p><p>At that point, Eve has recovered the entire padded plaintext \(m\), and by stripping off the padding, the original message itself.</p>
<p>The sequence diagram below consolidates our learning of the attack:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742498318544/6e297215-ca3e-451d-9574-117c0f8a12cb.png" alt="Sequence Diagram: The Bleichenbacher’s Attack" class="image--center mx-auto" width="819" height="640" loading="lazy"></p>
<p>The Bleichenbacher attack showed that the format of the padding in PKCS#1 v1.5 leaked just enough info to enable a full private-key operation (decrypting the message) without ever factoring N. The attack leveraged the fact that it’s possible to craft ciphertexts that will decrypt to a valid-looking plaintext without knowing the plaintext​. In essence, PKCS#1 v1.5 padding allowed about \(1\) in \(2^{16}\) chance (roughly) for a random blob to appear as “valid padding.” That was enough for an adaptive attack to succeed with feasible queries.</p>
<p>This is precisely what later padding designs like OAEP fixed. OAEP’s design makes such random valid ciphertexts astronomically unlikely (plaintext aware). We will learn about RSA-OAEP in the next sections.</p>
<p>To mitigate the Bleichenbacher attack without immediately changing the padding scheme, practitioners implemented defensive measures. For example, TLS should treat all decryption failures the same way (so an attacker can’t distinguish padding vs. other errors), and servers would generate a fake premaster secret on padding failure to continue the handshake and avoid timing leaks. Nonetheless, the safest course has been to deprecate PKCS#1 v1.5 encryption in favor of schemes like RSA-OAEP.<a target="_blank" href="https://archiv.infsec.ethz.ch/education/fs08/secsem/bleichenbacher98.pdf#:~:text=plaintext%20is%20PKCS%20conforming,chosen%20ciphertexts%3B%20thus%2C%20we%20show">​</a></p>
<h2 id="heading-optimal-asymmetric-encryption-padding-oaep">Optimal Asymmetric Encryption Padding (OAEP)</h2>
<p>By the end of 1995, Bellare and Rogaway proposed Optimal Asymmetric Encryption Padding (OAEP) with the goal of achieving provable security. This padding aimed to make RSA encryption resistant not just to passive attacks but also to adaptive chosen-ciphertext attacks. In other words, even if an attacker can trick a system into decrypting chosen ciphertexts (as an “oracle”), they should learn nothing useful about the plaintext. OAEP was subsequently standardized in PKCS#1 v2.0 (published as RFC 2437 in 1998) and later versions.</p>
<p>Compared to PKCS#1 v1.5, OAEP has a more complex encoding that uses hash functions and a mask generation function (MGF) to thoroughly randomize the plaintext before RSA encryption, providing stronger guarantees.</p>
<p>OAEP’s design can be viewed as a two-layer Feistel-like network using a random seed. It takes the input message and randomizes it in a way that is reversible only with the correct seed. The scheme was proven plaintext-aware in the random oracle model which means that an adversary cannot concoct a valid ciphertext without knowing the corresponding plaintext. If an attacker tries to forge or tamper with ciphertexts, they almost surely produce an <em>invalid</em> padding that will be rejected. This property directly counters padding-oracle attacks.</p>
<p>OAEP (with a proper hash/MGF) is semantically secure against adaptive chosen ciphertext attacks, assuming RSA is hard to invert and treating the hash functions as random oracles. Unlike PKCS#1 v1.5, which lacked a formal proof, OAEP comes with a proof sketch that breaking RSA-OAEP is as hard as breaking RSA itself.</p>
<p>In practice, this means OAEP drastically reduces the risk of any padding oracle: an attacker can no longer easily find ciphertexts that slip through the padding check except by brute force which has a \(2^{-hLen*8}\) success probability. For example, the success probability with SHA-1 would be \(2^{-160}\).</p>
<p>The block diagram below is a visual representation of the OAEP encoding schema:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742663541136/1c418939-80f6-45ea-8667-cacdc5cdab2b.png" alt="Optimal Asymmetric Encryption Padding" class="image--center mx-auto" width="1434" height="1102" loading="lazy"></p>
<p>Let’s understand what these mathematical notions mean and the workings of RSA-OEAP, up next.</p>
<h3 id="heading-the-mathematics-behind-oaep">The Mathematics Behind OAEP</h3>
<p>Optimal Asymmetric Encryption Padding requires a hash function for two operations we will discuss in this section. We will choose SHA-1 as a hash function in OAEP and \(hLen\) denotes the length in octets of the hash function output. We will later demonstrate why even MD5 or SHA-1 is a secure choice for OAEP even if it is not collision resistant.</p>
<p>Before we dive into the mathematics, let’s recap a few notations and define the main pieces we’ll be using:</p>
<p>In RSA, \(N\)is the modulus, and \(k\) is the size of \(N\) in <em>bytes</em>. For a \(2048\)-bit modulus, \(k=256\) bytes.<br>\(M \) is the message or plaintext to be encrypted. This plaintext must be short enough to fit into the padded block (at most \(k−2⋅hLen−2\) bytes). In our notation, \(Hash\) refers to the cryptographic hash function (for example, SHA-1, SHA-256) of output length \(hLen\). For example: If using SHA-1, \(hLen=20\) bytes.</p>
<p>We will also use an optional string associated with the message (often empty). This is the Label \(L\). If this label is empty, its hash is a fixed value. (For example: the SHA-1 of an empty string).</p>
<p>The hash of this label \(L\) is represented by \(lHash\), where \(lHash=Hash(L)\). As mentioned earlier, if \(L\) is empty, \(lHash\) is simply \(Hash('')\). This means that in any case \(lHash\) will hold a value.</p>
<p>We will also use a Mask Generation Function, \(MGF\), which is often mentioned as \(MGF1\). This function takes an input (seed or masked data) and produces an output of a specified length by iterating the underlying hash function. We’ll write \(MGF(input,length)\) to indicate “generate a mask of \(length\) bytes from \(input\)”.</p>
<p>Now that you are familiar with all the necessary notations, we are ready to begin the encoding step.</p>
<h4 id="heading-step-1-constructing-the-data-block-db">Step 1: Constructing the Data Block (DB)</h4>
<p>We will compute \(lHash=Hash(L)\). If \(L\) is empty, \(lHash\) is a constant (For example, the SHA-1 of the empty string).</p>
<p>Form the padding string \(PS\), the length of \(PS\) is chosen so that the entire block \(DB\) has length \((k−hLen−1)\) bytes. Numerically, \(PS\) has \((k−mLen−2⋅hLen−2)\) bytes of \(0x00\), where \(mLen\) is the length of the message \(M\).</p>
<p>Now we simply concatenate the blocks to generate the octet string for the Data Block (\(DB\)):</p>
<p>$$DB=lHash~∣∣~PS~∣∣~0x01~∣∣~M$$</p><p>Here the single byte \(0x01\) acts as a delimiter which marks where the zero padding ends and the actual message \(M\) begins. \(DB\) ends up being \((k−hLen−1)\) bytes.</p>
<h4 id="heading-step-2-generating-a-mask-for-the-data-block">Step 2: Generating a Mask for the Data Block</h4>
<p>First, we pick a random string called \(seed\) of length \(hLen\) bytes. For example, when using SHA-1 where \(hLen=20\), then we say that the seed consists of \(20\) random bytes.</p>
<p>Now we use the mask generation function, \(MGF\), on the \(seed\) to create a mask the same length as \(DB\):</p>
<p>$$dbMask=MGF(seed,k−hLen−1)$$</p><p>The idea is to spread the randomness of the seed across the entire \(DB\).</p>
<h4 id="heading-step-3-mask-the-data-block">Step 3: Mask the Data Block</h4>
<p>Now, we will Combine \(DB\) and \(dbMask\) with the bitwise \(XOR\) operation:</p>
<p>$$maskedDB=DB \oplus dbMask$$</p><p>This step “scrambles” \(DB\) with the random seed.</p>
<h4 id="heading-step-4-generate-a-mask-for-the-seed">Step 4: Generate a Mask for the Seed</h4>
<p>Next, we will produce a mask for the seed itself, based on \(maskedDB\):</p>
<p>$$seedMask=MGF(maskedDB,hLen)$$</p><p>This step simply ensures that the seed is not left in the clear.</p>
<h4 id="heading-step-5-mask-the-seed">Step 5: Mask the Seed</h4>
<p>Now we will combine the original seed and the new mask with an \(XOR\) operation:</p>
<p>$$maskedSeed=seed \oplus seedMask$$</p><p>Now the seed is also “scrambled” by the data block.</p>
<h4 id="heading-step-6-form-the-final-encoded-message-em">Step 6: Form the Final Encoded Message (EM)</h4>
<p>We are now ready to build our final block. Simply concatenate everything into a \(k\)-byte string:</p>
<p>$$EM=0x00~∣∣~maskedSeed~∣∣~maskedDB$$</p><p>The leading \(0x00\) byte ensures that when \(EM\) is interpreted as an integer, it’s less than the RSA modulus \(N\). At this point, \(EM\) is your OAEP-padded message of length \(k\).</p>
<h4 id="heading-step-7-covert-concatenated-string-to-integer">Step 7: Covert concatenated String to Integer</h4>
<p>Remember from our discussion before on PKCS#1v1.5 that RSA cannot directly operate on this concatenated string of bytes. We need to convert the \(EM\) block to a non-negative integer using the OS2IP formula:</p>
<p>$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EB}_i$$</p><h4 id="heading-step-8-perform-rsa-encryption">Step 8: Perform RSA Encryption</h4>
<p>Now that we have the encoded message (\(EM\)) as an integer \(x\), we are ready to perform RSA guided by the formula:</p>
<p>$$C =x^e \bmod N$$</p><p>where \((e,N)\) is the public key. The thus computed \(C\) is our ciphertext generated using RSA-OAEP.</p>
<p>When decrypting, the process is reversed: the recipient uses their private key \(d\) to compute \(m = c^d \bmod N\), recovers the \(EM\), then splits it into the \(0x00\), \(maskedSeed\), and \(maskedDB\), and uses the same \(MGF\) and hash function to unravel the \(XORs\) in reverse order​. Finally, they check that the recovered \(lHash'\) matches the expected hash and that the block contains the proper structure​ (\(...||0x01||...\)).</p>
<p>If any check fails, the padding is invalid. Only if everything checks out is the message \(M\) returned. The result is that an invalid ciphertext will almost always be detected and rejected without giving an attacker any useful information.</p>
<p>By design, OAEP effectively foiled the padding oracle problem. The chance that a random guess produces a valid OAEP encoding is negligible: on the order of \(2^{-hLen*8}\)). In fact, Daniel Bleichenbacher (after breaking PKCS#1 v1.5) advocated for exactly such a “plaintext-aware” padding where forging a valid padding is infeasible.</p>
<h2 id="heading-why-sha-1-or-md5-are-safe-in-rsa-oaep"><strong>Why SHA-1 or MD5 Are Safe in RSA-OAEP</strong></h2>
<p>Earlier in the section above, we mentioned that we’d be using SHA-1 for our mathematical formulation and examples. When you see SHA-1 or MD5 used in the context of RSA-OAEP, don’t let the fact that these hash functions are considered broken for collision resistance alarm you. If you notice carefully in the previous section, the hash functions serve two very specific roles that do not rely on their collision resistance. Let’s break them down one by one:</p>
<h3 id="heading-label-hashing"><strong>Label Hashing</strong></h3>
<p>The hash function is used to compute a fixed-length hash of an optional label \(L\) (often empty).</p>
<p>Now let’s see why is this safe in the context. This hash, called \(lHash\), acts as a domain separator. Its job is simply to ensure that the label is correctly associated with the ciphertext during decryption. As long as the label is chosen wisely (that is, not built from adversary-controlled parts), collision resistance isn’t critical here.</p>
<h3 id="heading-mask-generation-function-mgf1"><strong>Mask Generation Function (MGF1)</strong></h3>
<p>The hash function is also used inside \(MGF1\) to create a pseudorandom mask. This mask is applied both to the data block \(DB\) and to the random seed used in the encoding process.</p>
<p>In this context, the hash function is treated as a random oracle. The job is to spread the randomness of the seed across a larger block of data. For this purpose, properties like length extension or collision resistance are not relevant. What matters is that the output appears random, and even SHA-1 or MD5 can deliver that when used in this controlled, fixed-input scenario.</p>
<h2 id="heading-adoption-in-cryptographic-libraries-pkcs1-v15-vs-oaep">Adoption in Cryptographic Libraries (PKCS#1 v1.5 vs OAEP)</h2>
<p>After the Bleichenbacher attack, standards and libraries migrated to OAEP or at least added support for it, while treating PKCS#1 v1.5 as a legacy option. Modern cryptographic libraries and protocols reflect these lessons.</p>
<p>In 1998, the RSA standard was updated. PKCS#1 v2.0 introduced RSAES-OAEP as the new recommended encryption scheme, and by PKCS#1 v2.1 and v2.2 (RFC 3447 and RFC 8017), OAEP is required for new applications, with PKCS#1 v1.5 included only for backward compatibility.</p>
<p>OpenSSL discourages users from using PKCS#1 v1.5 as it leaks information that can potentially be used to mount a Bleichenbacher padding oracle attack [10]. The documentation clearly mentions that it is highly recommended to use <code>RSA_PKCS1_OAEP_PADDING</code> in new applications.</p>
<p>The Python cryptography library (PyCA cryptography) also asks developers to use OAEP for encryption instead of PKCS#1 v1.5 [11].</p>
<p>After Bleichenbacher’s 1998 attack, it was impractical to instantly replace PKCS#1 v1.5 everywhere. Instead, protocol designers issued countermeasures.</p>
<p>TLS, for example, responded by changing the error handling: the server would not reveal a padding failure distinctly. It would generate a fake premaster secret and proceed to prevent timing clues, and always return a generic handshake failure at a later stage, making it harder for the attacker to distinguish why decryption failed.</p>
<p>These countermeasures reduced the oracle’s fidelity but were tricky to get right across different implementations. In fact, not everyone got it right – the Bleichenbacher attack continued to resurface in various forms when implementations made mistakes in error handling.</p>
<p>In 2018, researchers discovered the ROBOT attack (Return Of Bleichenbacher’s Oracle Threat): several TLS implementations had subtle bugs that recreated a padding oracle, allowing the attack to succeed 19 years later. The ROBOT paper showed that even with countermeasure guidelines, the complexity of uniformly handling errors led to slip-ups in popular products.</p>
<p>This underscores that patching an insecure scheme is often error-prone – a design that is secure by construction (like OAEP) is preferable.</p>
<p>PKCS#1 v1.5 continues to exist because of these patchwork security measures and the fact that it cannot be abruptly removed from all existing systems. It is generally regarded as "legacy" or maintained "for compatibility" purposes. The collective wisdom is clear: use OAEP for RSA encryption whenever possible.</p>
<h2 id="heading-enhancing-digital-signatures-the-transition-to-pss">Enhancing Digital Signatures: The Transition to PSS</h2>
<p>Now that you understand how OAEP transformed RSA encryption by mitigating vulnerabilities in deterministic padding, it’s time to turn our attention to RSA digital signatures – a critical function for ensuring message integrity and authenticity.</p>
<p>Early RSA signature schemes suffered from similar problems as raw encryption: their deterministic nature made them prone to forgery and replay attacks. This vulnerability paved the way for an improvement: the Probabilistic Signature Scheme (PSS).</p>
<p>Before we dive into PSS itself, let’s quickly understand the pain points with early RSA signatures.</p>
<h3 id="heading-problems-with-early-rsa-signature-schemes">Problems with Early RSA Signature Schemes</h3>
<p>Traditional RSA signatures were generated by simply applying the RSA decryption function on a message digest (often with minimal formatting):</p>
<p>$$s=m^d \bmod N$$</p><p>where \(m\) is the hash (or encoded hash) of the message. This approach was deterministic which meant that each time the same message was signed, the exact signature was produced. Such determinism had two major drawbacks:</p>
<ol>
<li><h4 id="heading-predictability-and-replay">Predictability and Replay</h4>
<p> Since the signature for a given message was always identical, an attacker could replay a captured signature with impunity or forge signatures if they could deduce patterns in the signature scheme.</p>
</li>
<li><h4 id="heading-forgery-risks">Forgery Risks</h4>
<p> In a deterministic setting, if an attacker finds any structure or mathematical relationship in the signature, they might be able to forge a valid signature for a new message. In certain scenarios, weak formatting could allow an adversary to create a “signature transformation” that produces a valid signature without having access to the private key.</p>
</li>
</ol>
<p>These issues highlighted that a signature scheme must be probabilistic to be secure against adaptive forgery attempts and to ensure non-repudiation. This means that the signer should not be able to repudiate a signature because it is bound to a random value known only at signing time.</p>
<h3 id="heading-birth-of-the-probabilistic-signature-scheme-pss">Birth of the Probabilistic Signature Scheme (PSS)</h3>
<p>Towards the end of 1998, Bellare and Rogaway also proposed a scheme to overcome the inherent limitations of deterministic RSA signatures [12]. The core idea was to introduce randomness into the signature generation process so that even when signing the same message twice, the resulting signatures would be different. This randomness comes from a salt value and a carefully designed encoding process. The result is a signature method with strong, provable security guarantees.</p>
<p>This randomness prevents attackers from exploiting patterns in the signature process. The probabilistic Signature Scheme was designed to be provably secure in the random oracle model, meaning that forging a signature would be as hard as breaking RSA itself under certain assumptions [13].</p>
<p>The block diagram below is a visual representation of the PSS encoding schema:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742669558156/8137f535-deb7-4437-887a-53cf7a412089.png" alt="Probabilistic Signature Scheme" class="image--center mx-auto" width="1580" height="1258" loading="lazy"></p>
<p>Let’s understand what these mathematical notions mean as well as the workings of RSA-PSS, up next.</p>
<h3 id="heading-the-mathematics-behind-pss">The Mathematics Behind PSS</h3>
<p>Before diving into the mechanics of RSA-PSS, it’s helpful to define the notations and terms you’ll see in the steps ahead.</p>
<p>In RSA, \(N\)is the modulus, a large integer that is the product of two primes. \(k\) is the length of \(N\) in <em>bytes</em>. For an \(2048\)-bit key, \(k=256\) bytes.</p>
<p>\(M\)represents the message data or document you want to sign. In RSA-PSS, you’ll typically first compute a hash of \(M\). \(Hash\) refers to a cryptographic hash function (for example, SHA-256) that maps data to a fixed-size output. The output length is denoted \(hLen\). For SHA-256, \(hLen=32\) bytes.</p>
<p>We will use a salt, \(S\), randomly generated string of fixed length (often the same as \(hLen\)). This randomness is essential in ensuring that each signature is unique, even for the same message.</p>
<p>\(H\) or \(mHash\) is the hash of the message \(M\)and \(H'\) is a secondary hash that includes both \(M\) and the salt \(S\). This appears in the PSS encoding step.</p>
<p>The Mask Generation Function, \(MGF\), is a function that uses the hash internally to produce a pseudorandom output of arbitrary length. In PSS, it is used to “mask” parts of the data block so that the signature is hard to forge.</p>
<p>A fixed byte, \(0xbc\) (in hex) is appended at the end of the encoded message to mark the boundary of the PSS structure. This serves as a simple integrity check during decoding. After a successful encoding we receive an encoded message \(EM\) which is an octet string of length \(emLen = \left\lceil{\frac{emBits}{8}}\right\rceil\).</p>
<p>Now that you are familiar with all the necessary notations, we are ready to begin the encoding step.</p>
<h4 id="heading-step-1-message-hashing-and-salt-generation">Step 1: Message Hashing and Salt Generation</h4>
<p>We compute the hash of the message as \(H~( mHash)=Hash(M)\) where \(M\) is our message. We will also create a random salt \(S\) (of fixed length, say 20 bytes if you use SHA-1).</p>
<h4 id="heading-step-2-encoding-the-hash-with-the-salt-pss-encode">Step 2: Encoding the Hash with the Salt (PSS-Encode)</h4>
<p>We will construct a Data Block, \(DB\), by combining a padding with the hash and the salt. The padding is a sequence of \(0\)’s that fills space and ensures a fixed length. Mathematically:</p>
<p>$$M' = (0x)~00 ~00 ~00 ~00 ~00 ~00 ~00 ~00 ~||~ mHash ~||~ salt$$</p><p>Now we compute the Hash of this block as \(H' = Hash(M')\). We will generate another octet string \(PS\) and concatenate it with the salt and \(0x01\) as a delimiter:</p>
<p>$$DB = PS ~||~ 0x01 ~||~ salt$$</p><p>Note that DB is an octet string of length \(emLen - hLen - 1\). The mask that you see in the visual representation above must be of this length. Mathematically:</p>
<p>$$dbMask = MGF(H, emLen - hLen - 1)$$</p><p>We will then apply this mask on the \(DB\) block using an \(XOR\) operation to produce our \(maskedDB\):</p>
<p>$$maskedDB = DB \oplus dbMask$$</p><p>Recollect that \(emLen\) is the intended length of the Encoded Message \(EM\) and \(hLen\) is the length of the hash output. Now we append a fixed trailer field \(0xbc\) and produce the encoded message in its octet string representation:</p>
<p>$$EM = maskedDB ~||~ H ~||~ 0xbc$$</p><p>This encoding process ensures that both the salt and the hash are mixed together in a non-reversible, pseudorandom manner. The randomness from the salt is “spread” over the data block by the \(MGF\), making it extremely difficult for any adversary to manipulate the signature.</p>
<h4 id="heading-step-3-rsa-signature-generation">Step 3: RSA Signature Generation</h4>
<p>Once you have the encoded message \(EM\), the RSA signature is produced by using the RSA private key. First, convert the Octet String to its integer representation using the OS2IP method we’ve discussed before. Then apply the RSA Private Key Operation:</p>
<p>$$s=m^d \bmod N$$</p><p>where \(d\) is the private exponent and \(N\) is the RSA modulus.</p>
<h4 id="heading-step-4-signature-verification">Step 4: Signature Verification</h4>
<p>At the receiver end, when any recipient wants to verify a signature, they reverse the process:</p>
<p>$$m′= s^e \bmod N$$</p><p>and convert \(m'\) back to an encoded message \(EM\). The verifier then extracts the components \((MaskedDB, H′, trailer)\) and recomputes \(H'\) from the message and salt. The verifier confirms that the hash and salt embedded in \(EM\) match what is expected. If everything checks out, the signature is valid.</p>
<h2 id="heading-the-road-ahead-assessing-rsas-long-term-viability"><strong>The Road Ahead: Assessing RSA’s Long-Term Viability</strong></h2>
<p>In 1994, Peter Shor’s algorithm [14], demonstrated that a quantum computer can factor large integers in polynomial time, thereby efficiently breaking RSA’s underlying hard problem – the difficulty of factoring \(N = p \times q\).</p>
<p>Although experimental quantum computers have made progress, they remain far from having the number of stable qubits required to break RSA keys of practical sizes (2048 or 4096 bits).</p>
<p>In anticipation of large-scale quantum computers, the cryptographic community is actively developing and standardizing algorithms believed to be resistant to quantum attacks. These include lattice-based schemes (such as CRYSTALS-Kyber and NTRU), code-based cryptography (such as the McEliece cryptosystem), hash-based signatures (such as XMSS), and multivariate polynomial cryptosystems.</p>
<p>It’s important to note that while OAEP and PSS improve the security of RSA against classical attacks, they do not protect RSA from quantum attacks. In a post-quantum world, even the most secure classical padding will not prevent a quantum computer from breaking RSA using Shor’s algorithm.</p>
<p>In the near term, RSA remains in widespread use and, when implemented with padding schemes such as OAEP and PSS, continues to provide strong security against classical adversaries. But looking ahead, it’s expected that organizations will gradually migrate to post-quantum algorithms as they mature and become standardized.</p>
<h2 id="heading-references">References</h2>
<p>[1] FIPS 186-5: <a target="_blank" href="https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-5.pdf">Digital Signature Standard (DSS)</a></p>
<p>[2] RFC 8017 PKCS #1: <a target="_blank" href="https://www.rfc-editor.org/rfc/rfc8017.html">RSA Cryptography Specifications</a></p>
<p>[3] <a target="_blank" href="https://en.wikipedia.org/wiki/Lagrange%27s_theorem_\(number_theory\)">Lagrange's theorem</a></p>
<p>[4] Ronald L. Rivest, Robert D. Silverman: <a target="_blank" href="https://people.csail.mit.edu/rivest/pubs/pubs/RS01.version-1999-11-22.pdf">Are Strong Primes Needed for RSA</a>?</p>
<p>[5] <a target="_blank" href="https://cryptography.io/en/latest/hazmat/primitives/asymmetric/rsa/">pyca/cryptography</a></p>
<p>[6] <a target="_blank" href="https://github.com/openssl/openssl/blob/85cabd94958303859b1551364a609d4ff40b67a5/crypto/rsa/rsa_chk.c">OpenSSL Github</a>: <code>rsa_chk.c</code></p>
<p>[7] RFC 2313: <a target="_blank" href="https://www.rfc-editor.org/rfc/rfc2313.html">PKCS #1: RSA Encryption</a></p>
<p>[8 ] Daniel Bleichenbacher: <a target="_blank" href="https://archiv.infsec.ethz.ch/education/fs08/secsem/bleichenbacher98.pdf">Chosen Ciphertext Attacks Against Protocols Based on the RSA Encryption Standard PKCS #1</a></p>
<p>[9] RFC 8017: <a target="_blank" href="https://www.rfc-editor.org/rfc/rfc8017#section-7.1">PKCS #1 RSA Cryptography Specifications Version 2.2</a></p>
<p>[10] RSA_public_encrypt: <a target="_blank" href="https://docs.openssl.org/3.5/man3/RSA_public_encrypt/#warnings">Warnings</a></p>
<p>[11] <a target="_blank" href="https://cryptography.io/en/latest/hazmat/primitives/asymmetric/rsa/#cryptography.hazmat.primitives.asymmetric.padding.PKCS1v15">pyca/PKCS1v1</a></p>
<p>[12] <a target="_blank" href="https://en.wikipedia.org/wiki/Probabilistic_signature_scheme">Probabilistic signature scheme</a></p>
<p>[13] RFC 8017: <a target="_blank" href="https://www.rfc-editor.org/rfc/rfc8017#section-8.1">RSASSA-PSS</a></p>
<p>[14] <a target="_blank" href="https://ieeexplore.ieee.org/abstract/document/365700/">Algorithms for quantum computation</a>: discrete logarithms and factoring</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Protect Data in Transit using HMAC and Diffie-Hellman in Node.js [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ Data integrity refers to the assurance that data will remain accurate, unaltered, and consistent throughout its lifecycle. In communication, data integrity is important in safeguarding against unintended alterations and malicious interventions during... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/hmac-diffie-hellman-in-node/</link>
                <guid isPermaLink="false">66c4c5351b22d2d8d9040ebd</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Hamdaan Ali ]]>
                </dc:creator>
                <pubDate>Mon, 18 Mar 2024 23:00:22 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/03/Protect-Data-in-Transit-Handbook-v2--1-.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Data integrity refers to the assurance that data will remain accurate, unaltered, and consistent throughout its lifecycle. In communication, data integrity is important in safeguarding against unintended alterations and malicious interventions during data transmission.</p>
<p>The integrity of Digital Data is accomplished using Hashing Algorithms. The <code>crypto</code> module in Node provides various built-in vetted library functions to provide means to not only verify the integrity of data but also the authenticity of its origin.</p>
<p>This handbook aims to highlight the internal workings of the functions in the <code>crypto</code> library and give you some insights into the internal workings of HMAC and Diffie-Hellman Key Exchange. This will help you make informed decisions about hash algorithms and key lengths depending on your business requirements.</p>
<p>The primary focus of this handbook is to emphasize the crucial aspect of data integrity rather than discussing the various encryption algorithms available. </p>
<p>Encryption is used to protect information by converting it into a secure format, which ensures its confidentiality. But data integrity is concerned with ensuring that the data remains accurate and unaltered.</p>
<p>You can also watch the associated video here:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/FAfzQo_eJHI" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></li>
<li><a class="post-section-overview" href="#heading-the-alice-bob-paradigm">The Alice-Bob Paradigm</a></li>
<li><a class="post-section-overview" href="#heading-message-detection-code-mdc">Message Detection Code (MDC)</a></li>
<li><a class="post-section-overview" href="#heading-message-authentication-code-mac">Message Authentication Code (MAC)</a></li>
<li><a class="post-section-overview" href="#heading-hash-based-macs-hmac">Hash-based MACs (HMAC)</a></li>
<li><a class="post-section-overview" href="#heading-the-diffie-hellman-merkle-protocol">The Diffie-Hellman-Merkle Protocol</a></li>
<li><a class="post-section-overview" href="#heading-connecting-the-dots">Connecting the Dots</a></li>
<li><a class="post-section-overview" href="#heading-invoking-the-apis">Invoking the APIs</a></li>
<li><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></li>
<li><a class="post-section-overview" href="#heading-references">References</a></li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<ol>
<li><strong>Node and Express:</strong> We'll create a TypeScript sample application using the Express framework. A basic understanding of the framework would be helpful. You will need the <a target="_blank" href="https://nodejs.org/en/download/">Node Runtime Environment</a> to execute the scripts.</li>
<li><strong>Postman Client:</strong> To make an API request and to test out the sample application, you will need a tool to make HTTP Requests. You may use your web browser's "Edit and Send" feature under the Networks tab, but since not all browsers allow this, it's best to use a tool like <a target="_blank" href="https://www.postman.com/downloads/">Postman</a> which provides a better UI to observe responses.</li>
</ol>
<h2 id="heading-the-alice-bob-paradigm">The Alice-Bob Paradigm</h2>
<p>Throughout this handbook you will come across numerous sequence diagrams and mathematical proofs that use the Alice-Bob Paradigm. </p>
<p>The Alice-Bob paradigm is a common convention in cryptography where two generic entities, often named Alice and Bob, are used to illustrate various scenarios, protocols, or cryptographic principles. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/alice-bob-1.svg" alt="The Alice-Bob Paradigm" width="600" height="400" loading="lazy">
<em>The Alice-Bob Paradigm</em></p>
<p>These characters represent two parties engaged in communication, with Alice typically representing the sender or initiator, and Bob representing the receiver or responder. </p>
<p>We often introduce Eve as a third party, symbolizing an eavesdropper or potential attacker, adding an element of security risk and illustrating scenarios where external entities might attempt to intercept or manipulate the communication.  </p>
<p>The sample application shown in the later sections models after this Alice-Bob Paradigm to use Boost Inc. and Account Aggregator (AA) as the parties engaged in communication.</p>
<h2 id="heading-message-detection-code-mdc">Message Detection Code (MDC)</h2>
<p>When Alice needs to send critical data to Bob over the internet, the data changes hands, jumping between routers and servers, each step carrying the potential risk of unintended alterations. </p>
<p>If Eve manages to get their hands on Alice's data, they might modify it. So the integrity of the data becomes questionable, emphasizing that its original state may have been compromised during transmission.</p>
<p>Note that we are talking about the integrity and not the confidentiality of the data. Say even after Alice encrypts the data, it doesn't inherently guarantee that the data hasn't been tampered with during transit.</p>
<p>Consider this scenario: even though Eve may be unable to decrypt the encrypted message, they might attempt to modify the ciphertext in transit. This could involve altering bits, rearranging packets, or injecting malicious code, potentially leading to unintended consequences upon decryption.</p>
<p>This is where a Message Detection Code (MDC) or a hash comes in picture. A modification detection code (MDC) is a message digest or a checksum that can prove the integrity of a message: that the message has not been changed [1].<br>The figure below explains how MDC is used to verify the integrity of a message:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/MDC.png" alt="Image" width="600" height="400" loading="lazy">
<em>Modification Detection Code [1]</em></p>
<p>A Hash Function is used to generate the digest for any given message. This hash function processes the entire content of the message, producing a fixed-size string of characters that uniquely represents the message's contents. This is called the message digest or MDC.</p>
<p>Note that any hash function, such as SHA-256, SHA-3, or MD5, can be used depending on your specific security requirements and preferences. </p>
<p>Once the digest is generated, it serves as a unique fingerprint for the original message. When Alice sends both the message and its corresponding digest to Bob, they can independently apply the same hash function to the received message. If the calculated digest matches the one received from Alice, it serves as irrefutable evidence that the message has not undergone any modifications during transmission.</p>
<h2 id="heading-message-authentication-code-mac">Message Authentication Code (MAC)</h2>
<p>While MDC or the checksum is typically transferred over a safe channel, it may so happen that the safety of the channel or the trusted party itself is compromised. In such a case Eve can easily modify both the message and the digest and Bob will never know if the message actually came from Alice as intended.</p>
<p>What MDC lacks is a definitive guarantee of the message origin, leaving a potential vulnerability in confirming the true sender. </p>
<p>This is where Message Authentication Code (MAC) comes in. MACs not only ensure the integrity of the message, detecting any unauthorized alterations, but they also provide a mechanism for authenticating the origin of the data. In other words, MACs offer assurance that the message is indeed originating from Alice and not by someone else.</p>
<p>The figure below explains how MAC can help authenticate the origin of a message besides providing integrity check:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/MAC.png" alt="Image" width="600" height="400" loading="lazy">
<em>Message Authentication Code [1]</em></p>
<p>Notice that the difference between a MDC and a MAC is that MAC also includes a secret key (K) between Alice and Bob. The hash function also takes in a key (K) along with the message (M) to generate a MAC.</p>
<p>                                                          $$ h (K | M) = MAC $$</p>
<p>Now both the message and MAC can be sent over the same insecure channel. When Bob receives this ( M + MAC ), he can separate out the message M and compute the MAC for it using the same hash function and the secret key (K). </p>
<p>Bob will then compare the newly computed MAC with the one he received. If the two MACs match, the message is authentic and has not been modified by an adversary.</p>
<p>$$ Alice: S(K,M) = MAC \\ Bob: V(M, K, MAC) = Accept/ Reject $$</p><p>Since Eve does not have this secret key (K), they cannot modify the message and generate a valid MAC. Consequently, the resulting MAC becomes a unique fingerprint, signifying not only the integrity of the message but also authenticating its origin.</p>
<h2 id="heading-hash-based-macs-hmac">Hash-based MACs (HMAC)</h2>
<p>While MAC do provide a guarantee of authentication of the origin of a message, it is still falls short in ensuring unforgeability. It is easy for Eve to perform a Man in The Middle (MiTM) attack, intercept the MAC + Message pair and then perform the Length Extension Attack.</p>
<p>Given ( S = h( K || M) ) and the message (M), Eve can extend (M) to (M' = M || Pad || w) and create (MAC(M')); where (MAC(M')) is evaluated as<br>( S = h( K || M || Pad || w) ).</p>
<p>Eve does not require knowledge of the secret key (K) to extend the message (M) to (M'). When Alice receives this modified (M') and (MAC(M')), they are unable to determine the modification. </p>
<p>HMAC or a Hash-based MAC is a specific method for constructing a MAC algorithm out of a collision resistant hash function. HMAC uses two passes of hash computation and provides a better immunity against length extension attacks. The figure below explains the construction of HMACs. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/HMAC-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Hash-based Message Authentication Code [1]</em></p>
<p>There are several steps involved in the implementation of HMACs [1]:</p>
<ol>
<li>Divide the message into N blocks, each of b bits</li>
<li>Select a secret key and left-padded with 0’s to create a b-bit key and exclusive-ored with a constant called (ipad) (input pad). </li>
<li>Use the same secret key and XOR it with an another constant called (opad).<br>The value of (ipad) and (opad) are fixed constants as defined in the HMAC Standards [3]. The value of (ipad) is taken as b/8 repetition of the sequence<br>00110110 (hex: 36) and the value of (opad) is taken as b/8 repetition of the sequence 01011100 (hex: 5C).<br>These values are defined in such a way to have the most "non-regular" <a target="_blank" href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a> from each other.<br>The Hamming distance between (ipad) and (opad) 4, meaning exactly half of the bits are flipped.</li>
<li>Prepend the result produced in Step 2 to the message block. Use the hash function on this (N+1) block to create a n bit message digest called the intermediate HMAC.</li>
<li>The intermediate HMAC is prepended with (0)s to make a b bit block and then the result of Step 3 is prepended to this block.</li>
<li>Use the hash function again on the result of step 5 to get a final n bit HMAC.</li>
</ol>
<p>Mathematically, this can be represented as: </p>
<p>$$ S(k, m) = H(k \oplus \text{opad}  ||  H(k \oplus \text{ipad}  || m)) $$</p><p>Now if Eve tries to extend (M) to (M' = M || Pad || w), the resulting HMAC construction this would be:</p>
<p>$$ HMAC(K, M')=H(K||opad, H(K||ipad, M || Pad || w)) $$</p><p>Due to the unique application of (opad) in the outer hash, the attacker cannot construct (H(K||opad, &lt;...&gt; )) without knowing the key (K). The outer padding disrupts the internal state for any additional input, thwarting the attacker's attempt.              </p>
<h2 id="heading-the-diffie-hellman-merkle-protocol">The Diffie-Hellman-Merkle Protocol</h2>
<p>One of the main challenges in Symmetric-key Ciphers is the distribution of keys. A fundamental question naturally arises: How will Bob know what keys Alice has used?</p>
<p>A very intuitive answer to this problem could be to use a Key Exchange or a Key Distribution Center (KDC). However, the utilization of a KDC or a Key Exchange introduces a notable caveat: the requirement of a secure channel for transmitting keys. </p>
<p>The security of a system employing a Key Distribution Center (KDC), such as in the case of the Kerberos authentication protocol, is heavily dependent on the security of the KDC itself. If the KDC is compromised, the cryptographic keys it manages and distributes can be exposed, leading to potential security vulnerabilities throughout the system as seen in a Golden Ticket Attack.</p>
<p>In the year 1979, <a target="_blank" href="https://en.wikipedia.org/wiki/Ralph_Merkle">Ralph Merkle</a>, <a target="_blank" href="https://en.wikipedia.org/wiki/Whitfield_Diffie">Whitfield Diffie</a> and <a target="_blank" href="https://en.wikipedia.org/wiki/Martin_Hellman">Martin Hellman</a> came up with a way to Securely exchange Cryptographic Keys over Public Insecure Channels. </p>
<p>The Diffie-Hellman-Merkle Protocol provides a way for two parties to agree upon a shared secret key over an insecure channel without directly exchanging that key. The <code>crypto</code> module in Node.js contains the <code>DiffieHellman</code> class, which is a utility for creating Diffie-Hellman key exchanges. </p>
<p>Before we go through all of the functions defined in this class, it is important to understand the mathematics that goes around in The Diffie-Hellman-Merkle Protocol. The UML Sequence Diagram below explains the steps involved in The Diffie-Hellman-Merkle Protocol:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/dh-1.svg" alt="Image" width="600" height="400" loading="lazy">
<em>The Diffie-Hellman-Merkle Protocol</em></p>
<p>The process begins with either of the party who wants to establish a secure communication with the other. In this case, Alice wants to start the communication. </p>
<p>Alice will first pick a randomly chosen Generator <code>g</code> and a large prime number <code>p</code>. Increasing the length of the prime number results in heightened security, as it amplifies the difficulty for adversaries to execute certain cryptographic attacks.</p>
<p>However, enlarging the prime number also comes with computational costs. Longer prime numbers require more computational resources to perform the key generation.</p>
<p>Now, Alice needs to select a Private <code>a</code> and compute a modular exponentiation:</p>
<p>$$ A = g^a  (\text{mod} , p) $$</p><p>Alice will send over the Generator <code>g</code>, the large prime <code>p</code> and Alice's Public Key <code>A</code> to Bob. At this point, Bob has all the values he needs to evaluate his own modular exponentiation of:</p>
<p>$$ A = g^b  (\text{mod} , p) $$</p><p>He will send back this Public Key <code>B</code> to Alice.</p>
<p>Note that up until this point, all communication are occurring over insecure channel. The values <code>g</code>, <code>p</code>, <code>A</code> and <code>B</code> "might" as well be sent as plaintext. The Actual Secret Key is evaluated when Alice and Bob use these data to compute what is known as a "Shared Secret". </p>
<p>Shared Secret computed by A:</p>
<p>$$ S = A^b  (\text{mod} , p) \\  S = g^{\left(ab\right)}  (\text{mod} , p) $$</p><p>Shared Secret computed by B:</p>
<p>$$ S = B^a  (\text{mod} , p) \\  S = g^{\left(ab\right)}  (\text{mod} , p) $$</p><p>Notice how the Shared Secret computed by both parties at their end are the same. </p>
<p>This symmetrical outcome is the essence of the Diffie-Hellman key exchange, where each party independently computes the shared secret using their private key and the public key received from the other party. This ensures that both Alice and Bob arrive at an identical Shared Secret, establishing a secure foundation for further encrypted communication.</p>
<h3 id="heading-why-is-the-shared-secret-secure">Why is the Shared Secret Secure?</h3>
<p>Diffie-Hellman key exchange relies on the mathematical principles of discrete logarithm, primitive roots and Modular exponentiation.</p>
<p>Modular exponentiation is the problem of computing (a^b  mod n), where (a), (b), and (n) are known integers. Discrete logarithm is the problem of finding (x) such that (a^x  mod n = b), where (a), (b), and (n) are known integers and (a) is a primitive root modulo (n).</p>
<p>The security of Diffie-Hellman is rooted in the computational complexity of calculating discrete logarithms. </p>
<p>For example, given <code>g</code>, <code>p</code> and <code>a</code>, it's easy to compute <code>A</code> as Modular exponentiation is in P, meaning that there is a polynomial-time algorithm to solve it. </p>
<p>But, the other way can't be said true. Given <code>g</code>, <code>p</code>, and <code>A</code>, computing <code>a</code> requires solving the discrete logarithm problem, which is widely believed to be a computationally infeasible task [2].</p>
<p>Remember that both parties will compute the Shared Secret at their end and there is no need to send over this secret to the other party. This eliminates the risk of the Shared Secret getting intercepted by Eve and the only option they are left with is to solve the discrete logarithm problem.</p>
<h2 id="heading-connecting-the-dots">Connecting the Dots</h2>
<p>The key (K) that we provide in an HMAC has to be the same for both Alice and Bob. Now that we know how a Diffie-Hellman-Merkle key exchange works, it becomes intuitive that we can plug in the shared secret as the key for an HMAC.</p>
<p>Alice can use the shared key (S) in the HMAC function as a parameter and Bob can use the same shared secret (S), computed at their end, in the verification algorithm. </p>
<p>The <code>crypto</code> module in Node.js provides various built-in functions to implement cryptographic constructs such as HMACs and Diffie-Hellman Key Exchange. It is always recommended to use vetted cryptographic libraries and avoid implementing cryptographic algorithms yourselves over the concerns of <a target="_blank" href="https://en.wikipedia.org/wiki/Side-channel_attack">Side Channel Attacks</a> or a <a target="_blank" href="https://en.wikipedia.org/wiki/Heartbleed">Heartbleed</a>.</p>
<p>Let's create a TypeScript/ Node.js application to understand the implementation and prototypes of these functions. The two entities involved in communication in this application would be Boost Inc. and Account Aggregator. Boost needs to send a critical data over to the Account Aggregator. </p>
<p>We will first utilize the <code>DiffieHellman</code> class to create Secret Keys for both entities. Boost will then use the Secret Key to create a HMAC using the <code>Hmac</code> Class in Node. Account Aggregator will recieve this HMAC along with the message. They will verify this HMAC against the newly generated HMAC from the message they received.</p>
<p>Note that the code at Account Aggregator's end will be simulated and we will create API endpoints for each operation to show separation of concerns in this sample application.</p>
<p>The following sequence diagram explains what the application does:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/Sample-Application-1.svg" alt="Image" width="600" height="400" loading="lazy">
<em>UML Sequence Diagram for the sample application</em></p>
<h3 id="heading-project-setup">Project Setup</h3>
<p>In the root of your workspace, install Express, Axios, type definitions of Node, and type definitions of Express using the following command:</p>
<pre><code class="lang-bash">npm init -y | npm install axios express
npm install -D nodemon ts-node @types/express @types/node typescript
</code></pre>
<p>Configure <code>tsconfig</code> as per your liking and create a file called <code>cryto.utils.ts</code> under <code>src/utils</code>.  Let's create an interface and import all necessary modules from the <code>crypto</code> library:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { createHmac, createDiffieHellman, DiffieHellman, KeyObject, BinaryLike } <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">interface</span> KeyPair {
  publicKey: Buffer;
  privateKey: Buffer;
  generator: Buffer;
  prime: Buffer;
  diffieHellman: DiffieHellman;
}
</code></pre>
<p>This interface will function as a blueprint for managing cryptographic key pairs throughout this application. It encapsulates the public and private keys, generator, prime, and a Diffie-Hellman object. </p>
<p>By using this interface we will ensure a structured and standardized approach to handle cryptographic key pair information, thus promoting clarity and consistency in cryptographic operations within a Node.js environment.</p>
<h3 id="heading-the-creatediffiehellman-function">The createDiffieHellman Function</h3>
<p>Next, we will define the function <code>generateKeyPair</code> which will allow us to generate the private and public keys, (A) and (B) along with the large prime (p) and the generator (g) using the <code>createDiffieHellman</code> and <code>generateKeys</code> functions.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">generateKeyPair</span>(<span class="hljs-params">prime?: <span class="hljs-built_in">any</span>, generator?: <span class="hljs-built_in">any</span></span>): <span class="hljs-title">KeyPair</span> </span>{
  <span class="hljs-keyword">const</span> diffieHellman = prime &amp;&amp; generator ? createDiffieHellman(prime, <span class="hljs-string">'hex'</span>, generator, <span class="hljs-string">'hex'</span>) : createDiffieHellman(<span class="hljs-number">2048</span>);
  diffieHellman.generateKeys();

  <span class="hljs-keyword">return</span> {
    publicKey: diffieHellman.getPublicKey(),
    privateKey: diffieHellman.getPrivateKey(),
    generator: diffieHellman.getGenerator(),
    prime: diffieHellman.getPrime(),
    diffieHellman,
  };
}
</code></pre>
<p>Notice that the parameters to this function – <code>prime</code> and <code>generator</code> – are optional. This is because the underlying <code>createDiffieHellman</code> has five defined overloads:</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createDiffieHellman</span>(<span class="hljs-params">primeLength: <span class="hljs-built_in">number</span>, generator?: <span class="hljs-built_in">number</span></span>): <span class="hljs-title">DiffieHellman</span></span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createDiffieHellman</span>(<span class="hljs-params">
    prime: <span class="hljs-built_in">ArrayBuffer</span> | NodeJS.ArrayBufferView,
    generator?: <span class="hljs-built_in">number</span> | <span class="hljs-built_in">ArrayBuffer</span> | NodeJS.ArrayBufferView,
</span>): <span class="hljs-title">DiffieHellman</span></span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createDiffieHellman</span>(<span class="hljs-params">
    prime: <span class="hljs-built_in">ArrayBuffer</span> | NodeJS.ArrayBufferView,
    generator: <span class="hljs-built_in">string</span>,
    generatorEncoding: BinaryToTextEncoding,
</span>): <span class="hljs-title">DiffieHellman</span></span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createDiffieHellman</span>(<span class="hljs-params">
    prime: <span class="hljs-built_in">string</span>,
    primeEncoding: BinaryToTextEncoding,
    generator?: <span class="hljs-built_in">number</span> | <span class="hljs-built_in">ArrayBuffer</span> | NodeJS.ArrayBufferView,
</span>): <span class="hljs-title">DiffieHellman</span></span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createDiffieHellman</span>(<span class="hljs-params">
    prime: <span class="hljs-built_in">string</span>,
    primeEncoding: BinaryToTextEncoding,
    generator: <span class="hljs-built_in">string</span>,
    generatorEncoding: BinaryToTextEncoding,
</span>): <span class="hljs-title">DiffieHellman</span></span>;
</code></pre>
<p>The first function creates a Diffie-Hellman object with a randomly generated prime of the specified length. The <code>createDiffieHellman(2048);</code> creates a Diffie-Hellman object where the length of the randomly generated prime is 2048 bits. </p>
<p>When no generator value is provided to this function, it takes the default value of 2. The length of the prime necessarily has to be large and if you select a small value Node will throw an error signifying that this length will not make a secure key.</p>
<p>Instead of passing in the length of the prime, we can pass the prime as a buffer. This is what Account Aggregator will to at their end when Boost sends over the necessary details. </p>
<p>Similarly you can use the other function declarations as per your use case to pass the prime and generator as <code>ArrayBuffer</code> or <code>ArrayBufferView</code> types or as <code>string</code> with a specified encoding.</p>
<h3 id="heading-the-computesecret-function">The computeSecret Function</h3>
<p>Now, let's define a method <code>generateSharedSecret</code> that takes in a Key pair and a public key as parameter and computes the shared secret (S): </p>
<pre><code class="lang-ts"><span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">generateSharedSecret</span>(<span class="hljs-params">keyPair: KeyPair, publicKey: Buffer</span>): <span class="hljs-title">Buffer</span> </span>{
  <span class="hljs-keyword">return</span> keyPair.diffieHellman.computeSecret(publicKey);
}
</code></pre>
<p>The <code>computeSecret</code> function also has four overrides, which allows you to either provide the Public key parameter as <code>string</code> or <code>ArrayBufferView</code> as well as options to specify the <code>inputEncoding</code> and <code>outputEncoding</code>.</p>
<h3 id="heading-the-createhmac-function">The createHmac Function</h3>
<p>Now that we've computed our shared secret, let's create a function <code>generateHMAC</code> that consumes this shared secret and generates a digest against it. </p>
<pre><code class="lang-ts"><span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">generateHMAC</span>(<span class="hljs-params">data: <span class="hljs-built_in">any</span>, secretKey: KeyObject | BinaryLike</span>): <span class="hljs-title">any</span> </span>{
  data = <span class="hljs-built_in">JSON</span>.stringify(data);
  <span class="hljs-keyword">const</span> hmac = createHmac(<span class="hljs-string">'sha256'</span>, secretKey);
  hmac.update(data);
  <span class="hljs-keyword">return</span> hmac.digest(<span class="hljs-string">'hex'</span>);
}
</code></pre>
<p>The first parameter of the <code>createHmac</code> function takes an algorithm. This is where you need to specify what underlying hash function do you want to use. </p>
<p>Remember that the security of HMAC relies on various factors, including the cryptographic strength of the underlying hash function, the size of its hash output, and the quality and size of the key.</p>
<p>The options given to you under this algorithms parameter depends on the available algorithms supported by the OpenSSL version on the platform. To check what algorithms are available to you, execute the following command in the terminal:</p>
<pre><code class="lang-bash">openssl list -digest-algorithms
</code></pre>
<p>This will give you a list from which you can select your desired algorithm for the underlying hash function:</p>
<pre><code class="lang-bash">RSA-MD4 =&gt; MD4
RSA-MD5 =&gt; MD5
RSA-MDC2 =&gt; MDC2
RSA-RIPEMD160 =&gt; RIPEMD160
RSA-SHA1 =&gt; SHA1
RSA-SHA1-2 =&gt; RSA-SHA1
RSA-SHA224 =&gt; SHA224
RSA-SHA256 =&gt; SHA256
...
</code></pre>
<p>The secret key that the <code>createHmac</code> function takes could either be of type <code>KeyObject</code> or of type <code>BinaryLike</code>. Note that the type <code>BinaryLike</code> is a union type in TypeScript. It is a type that can be either a <code>string</code> or a <code>NodeJS.ArrayBufferView</code>.</p>
<p>The <code>createHmac</code> function's <code>data</code> parameter is designed to accepts <code>strings</code>, <code>Buffer</code>, <code>TypedArray</code> and <code>DataView</code>. To simplify the developer experience and minimize complexity, we intentionally set the <code>data</code> parameter type in the generateHMAC function as <code>any</code>. Internally, we handle the conversion to a string using <code>JSON.stringify</code>.</p>
<h3 id="heading-initializing-communication">Initializing communication</h3>
<p>Now on Boost's end create a file <code>verification.controller.ts</code> under <code>src/controllers</code>: </p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { generateKeyPair, generateSharedSecret, generateHMAC, KeyPair } <span class="hljs-keyword">from</span> <span class="hljs-string">'@boost/v1/utils/crypto.utils'</span>;
<span class="hljs-keyword">import</span> { KeyObject, BinaryLike } <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;

<span class="hljs-keyword">const</span> boostKeyPair: KeyPair = generateKeyPair();

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">shareKeys</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> boostPublicKey: Buffer = boostKeyPair.publicKey;
    <span class="hljs-keyword">const</span> boostPrivateKey: Buffer = boostKeyPair.privateKey;
    <span class="hljs-keyword">const</span> boostGenerator: Buffer = boostKeyPair.generator;
    <span class="hljs-keyword">const</span> boostPrime: Buffer = boostKeyPair.prime;
    <span class="hljs-keyword">const</span> boostDiffieHellman = boostKeyPair.diffieHellman;

    <span class="hljs-keyword">return</span> {
        boostPublicKey,
        boostPrivateKey,
        boostGenerator,
        boostPrime,
        boostDiffieHellman,
    };
}

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">hmacDigest</span>(<span class="hljs-params">data: <span class="hljs-built_in">any</span>, secretKey: KeyObject | BinaryLike</span>): <span class="hljs-title">any</span> </span>{
    <span class="hljs-keyword">return</span> generateHMAC(<span class="hljs-built_in">JSON</span>.stringify(data), secretKey);
}
</code></pre>
<p>This file imports the interface and all necessary modules from <code>cryto.utils.ts</code> and defines two wrapper functions – <code>shareKeys</code> and <code>hmacDigest</code>. <code>shareKeys</code> will only serve as a wrapper around <code>generateKeyPair</code> which will allow developers at Boost to send only the required keys over to the Account Aggregator. </p>
<h3 id="heading-setting-up-the-account-aggregator">Setting up the Account Aggregator</h3>
<p>At the Account Aggregator's end, we need to set up a function that computes AA's public key and sends it over to Boost Inc. We will also need a function to verify the received HMAC of a data by comparing it against one that AA generates:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { generateKeyPair, generateSharedSecret, generateHMAC, KeyPair } <span class="hljs-keyword">from</span> <span class="hljs-string">'../utils/crypto.utils'</span>;  
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">let</span> sharedSecret: Buffer;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">sendAAPublicKey</span>(<span class="hljs-params"></span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">Buffer</span>&gt; </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(<span class="hljs-string">'http://localhost:3000/init'</span>);

    <span class="hljs-keyword">const</span> boostPublicKey: Buffer = Buffer.from(response.data.boostPublicKey, <span class="hljs-string">'hex'</span>);
    <span class="hljs-keyword">const</span> boostGenerator: Buffer = Buffer.from(response.data.boostGenerator, <span class="hljs-string">'hex'</span>);
    <span class="hljs-keyword">const</span> boostPrime: Buffer = Buffer.from(response.data.boostPrime, <span class="hljs-string">'hex'</span>);

    <span class="hljs-keyword">const</span> AA: KeyPair = generateKeyPair(boostPrime, boostGenerator);
    sharedSecret = generateSharedSecret(AA, boostPublicKey);

    <span class="hljs-keyword">return</span> AA.publicKey;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error sending AA public key:'</span>, (error <span class="hljs-keyword">as</span> <span class="hljs-built_in">Error</span>).message);
    <span class="hljs-keyword">throw</span> error;
  }
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">verifyData</span>(<span class="hljs-params">data: <span class="hljs-built_in">any</span>, hmac: <span class="hljs-built_in">string</span></span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">string</span>&gt; </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> calculatedHMAC = generateHMAC(<span class="hljs-built_in">JSON</span>.stringify(data), sharedSecret);
    <span class="hljs-keyword">return</span> calculatedHMAC === hmac ? <span class="hljs-string">"Integrity and authenticity verified"</span> : <span class="hljs-string">"Integrity or authenticity compromised"</span>;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error verifying data:'</span>, (error <span class="hljs-keyword">as</span> <span class="hljs-built_in">Error</span>).message);
    <span class="hljs-keyword">throw</span> error;
  }
}
</code></pre>
<p>We make an Axios request to the <code>/init</code> endpoint defined at Boost and fetch (p), (g) and (A). Once we've computed the public key, we'll send that back to Boost. We will also compute our shared secret here which we'll use while verifying the HMAC in the <code>verifyData</code> method.</p>
<h3 id="heading-setting-up-the-express-apis">Setting up the Express APIs</h3>
<p>Now that all the controllers and utility functions are in place, we'll create a few endpoints to facilitate communication between Boost Inc. and the Account Aggregator.</p>
<h4 id="heading-boost">Boost:</h4>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> { hmacDigest, shareKeys } <span class="hljs-keyword">from</span> <span class="hljs-string">'@boost/v1/controllers/verification.controller'</span>;
<span class="hljs-keyword">import</span> { KeyPair, generateSharedSecret } <span class="hljs-keyword">from</span> <span class="hljs-string">'@boost/v1/utils/crypto.utils'</span>;
<span class="hljs-keyword">import</span> { DiffieHellman } <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">const</span> appBoost = express();
<span class="hljs-keyword">const</span> PORT_BOOST = <span class="hljs-number">3000</span>;

<span class="hljs-keyword">let</span> boostPublicKey: Buffer, boostPrivateKey: Buffer;
<span class="hljs-keyword">let</span> boostGenerator: Buffer, boostPrime: Buffer;
<span class="hljs-keyword">let</span> sharedSecret: Buffer;
<span class="hljs-keyword">let</span> boostKeyPair: KeyPair, boostDiffieHellman: DiffieHellman;

appBoost.get(<span class="hljs-string">'/init'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
    ({ boostPublicKey, boostPrivateKey, boostGenerator, boostPrime, boostDiffieHellman } = shareKeys());
    res.send({ boostPublicKey, boostGenerator, boostPrime });
});

<span class="hljs-comment">// Simulated Data</span>
<span class="hljs-keyword">const</span> data = {
    name: <span class="hljs-string">'Boost User 1'</span>,
    phone: <span class="hljs-string">'1234567890'</span>,
};

appBoost.get(<span class="hljs-string">'/fetchData'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
    <span class="hljs-keyword">const</span> hmac = hmacDigest(data, sharedSecret);
    res.send({ data, hmac });
});

appBoost.listen(PORT_BOOST, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Boost server is running on http://localhost:<span class="hljs-subst">${PORT_BOOST}</span>`</span>);
});
</code></pre>
<p>The <code>/init</code> endpoint, hosted by Boost, is invoked by AA within its <code>sendAAPublicKey</code> function. When the shared secret is calculated, AA will invoke the endpoint <code>/fetchData</code> to retrieve the critical information.</p>
<h4 id="heading-account-aggregator-aa">Account Aggregator (AA):</h4>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> { sendAAPublicKey, verifyData } <span class="hljs-keyword">from</span> <span class="hljs-string">'@AA/v1/controllers/aa.controller'</span>;
<span class="hljs-keyword">import</span> { KeyPair, generateSharedSecret } <span class="hljs-keyword">from</span> <span class="hljs-string">'@boost/v1/utils/crypto.utils'</span>;
<span class="hljs-keyword">import</span> { DiffieHellman } <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">const</span> appAA = express();
<span class="hljs-keyword">const</span> PORT_AA = <span class="hljs-number">3001</span>;

<span class="hljs-keyword">let</span> boostPublicKey: Buffer, boostPrivateKey: Buffer;
<span class="hljs-keyword">let</span> boostGenerator: Buffer, boostPrime: Buffer;
<span class="hljs-keyword">let</span> AAPublicKey: Buffer;
<span class="hljs-keyword">let</span> sharedSecret: Buffer;
<span class="hljs-keyword">let</span> boostKeyPair: KeyPair, boostDiffieHellman: DiffieHellman;

appAA.get(<span class="hljs-string">'/fetchAAPublicKey'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
    AAPublicKey = <span class="hljs-keyword">await</span> sendAAPublicKey();
    res.send({ AAPublicKey: AAPublicKey.toString(<span class="hljs-string">'hex'</span>) });

    boostKeyPair = {
        publicKey: boostPublicKey,
        privateKey: boostPrivateKey,
        generator: boostGenerator,
        prime: boostPrime,
        diffieHellman: boostDiffieHellman,
    }

    sharedSecret = generateSharedSecret(boostKeyPair, AAPublicKey);
});

appAA.get(<span class="hljs-string">'/verifyData'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(<span class="hljs-string">'http://localhost:3000/fetchData'</span>);
    <span class="hljs-keyword">const</span> { data, hmac } = response.data;
    <span class="hljs-keyword">const</span> verified = <span class="hljs-keyword">await</span> verifyData(data, hmac);
    res.send({ verified });
});

appAA.listen(PORT_AA, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`AA server is running on http://localhost:<span class="hljs-subst">${PORT_AA}</span>`</span>);
});
</code></pre>
<p>The <code>fetchAAPublicKey</code> endpoint, hosted as AA's end, will be invoked by Boost when it wants to evaluate the Shared Secret. The <code>verifyData</code> method is encapsulated within a <code>GET</code> request, enabling either party to confirm the integrity of the transmitted data.</p>
<h2 id="heading-invoking-the-apis">Invoking the APIs</h2>
<p>Head over to your Postman Client to test out these APIs. Since the <code>sendAAPublicKey</code> method takes care of the initiation, we need to start our communication using the <code>/fetchAAPublicKey</code> endpoint:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/image-35.png" alt="Image" width="600" height="400" loading="lazy">
<em>Postman Client: fetchAAPublicKey Endpoint</em></p>
<p>You will observe the AA's public key as a response. Now, Boost Inc. will use this Public Key and evaluate the Shared Secret.</p>
<p>Once that is done, it will use the Shared Secret to compute the message digest in the <code>/fetchData</code> endpoint. Since <code>/verifyData</code> invokes the former endpoint, we'll check this in action on our Postman Client:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/03/image-36.png" alt="Image" width="600" height="400" loading="lazy">
<em>Postman Client: verifyData Endpoint</em></p>
<p>You will notice that the <code>/verifyData</code> response declares the successful verification of both integrity and authenticity. This acknowledgment ensures that the transmitted data remains untampered and originates from the authenticated source, providing a layer of security for communication between the two entities.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>And there you have it: by utilizing HMACs and the Diffie-Hellman-Merkle Key Exchange, you can verify the integrity and authenticity of your transmitted data, enhancing the security of your applications and ensuring a reliable API communication framework for developers. </p>
<p>By understanding the intricacies and mathematical underpinnings of these practices, you can now make informed decisions, fortifying your system against tampering threats.</p>
<p>Find the complete code snippets here — <a target="_blank" href="https://gist.github.com/HamdaanAliQuatil/8e0942eddfe708aafd4f95b739802c0c">GitHub Gist | HamdaanAliQuatil</a>.<br>You may find me on X (formerly Twitter) – <a target="_blank" href="https://twitter.com/violinblackeye">Hamdaan Ali Quatil</a>.</p>
<h3 id="heading-references">References</h3>
<p>[1] Behrouz A. Forouzan – Introduction to Cryptography and Network Security</p>
<p>[2] New Directions in Cryptography, Whitfield Diffie and Martin E. Hellman <a target="_blank" href="https://www.cs.jhu.edu/~rubin/courses/sp03/papers/diffie.hellman.pdf">diffie.hellman.pdf (jhu.edu)</a></p>
<p>[3] Keying Hash Functions for Message Authentication, Mihir Bellare, Ran Canetti, Hugo Krawczyk <a target="_blank" href="https://cseweb.ucsd.edu/~mihir/papers/kmd5.pdf">https://cseweb.ucsd.edu/~mihir/papers/kmd5.pdf</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Defend Against Server-Side Request Forgery ]]>
                </title>
                <description>
                    <![CDATA[ Server-Side Request Forgery (SSRF) has been a consistent issue in application security and is among the OWASP Top 10 vulnerabilities. In this walkthrough, you'll first learn what Server-Side Request Forgery is and how it differs from Client-Side Requ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/defending-against-ssrf-attacks/</link>
                <guid isPermaLink="false">66c4c5321b22d2d8d9040ebb</guid>
                
                    <category>
                        <![CDATA[ Back end development  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ information security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Hamdaan Ali ]]>
                </dc:creator>
                <pubDate>Fri, 05 Jan 2024 17:21:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730833444910/4e483988-c1f3-4637-af6c-fcf2fbedbbb6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Server-Side Request Forgery (SSRF) has been a consistent issue in application security and is among the OWASP Top 10 vulnerabilities.</p>
<p>In this walkthrough, you'll first learn what Server-Side Request Forgery is and how it differs from Client-Side Request Forgery. We will create a sample application to gain a better understanding of how Server-Side Request Forgery attacks work, and explore various methods to safeguard our application against SSRF vulnerabilities.</p>
<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-server-side-request-forgery">What is Server-Side Request Forgery?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-does-ssrf-differ-from-csrf">How Does SSRF Differ from CSRF?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-identifying-code-smells">Identifying Code Smells</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-pain-points">Understanding the Pain Points</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-exploit-the-vulnerability">How to Exploit the Vulnerability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-defend-against-ssrf-attacks">How to Defend Against SSRF Attacks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ol>
<li><p><strong>Node and Express:</strong> We'll create a JavaScript sample application using the Express framework. A basic understanding of the framework would be helpful. You will need the <a target="_blank" href="https://nodejs.org/en/download/">Node Runtime Environment</a> to execute the scripts.</p>
</li>
<li><p><strong>Postman Client:</strong> To make an API request and to exploit the vulnerability, you will need a tool to make HTTP Requests. You may use your web browser's "Edit and Send" feature under the Networks tab, but since not all browsers allow this, it's best to use a tool like <a target="_blank" href="https://www.postman.com/downloads/">Postman</a> which provides a better UI to observe responses.</p>
</li>
</ol>
<h2 id="heading-what-is-server-side-request-forgery">What is Server-Side Request Forgery?</h2>
<p>Server-Side Request Forgery, or SSRF, is a security vulnerability that allows malicious actors to manipulate the server into making unintended requests on behalf of the server itself.</p>
<p>SSRF provides a window for such malicious actors to make requests "from" the server when they should be making requests "to" the server.</p>
<p>To appreciate what this means, let's look at a normal request execution using the sequence diagrams below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/12/image-104.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>UML Sequence Diagram for normal request execution</em></p>
<p>In a typical scenario, a server processes incoming requests from clients. Users or external systems initiate these requests, and the server responds accordingly. This is a standard client-server interaction where the server acts upon the requests it receives.</p>
<p>Now let's look at what SSRF looks like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/image-7.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>UML Sequence Diagram for SSRF attacks</em></p>
<p>In applications vulnerable to SSRF, attackers exploit the server's ability to make HTTP requests to resources that should not be directly accessible from the public internet. These resources may include internal protected resources, APIs, websites, or databases that can only be accessed from the server.</p>
<p>Attackers achieve this by tricking the server into making unintended requests to various destinations, including internal APIs, internal HTML pages, and internal databases.</p>
<h2 id="heading-how-does-ssrf-differ-from-csrf">How Does SSRF Differ from CSRF?</h2>
<p>SSRF is an attack where an attacker can make the server perform requests on their behalf. This involves manipulating the server to make requests to internal resources, which can result in unauthorized actions or information disclosure.</p>
<p>On the other hand, in CSRF, or Client-Side Request Forgery, the attacker tricks a user's browser into making unintended requests to a specific web application for which the user is already authenticated. This means that actions are performed on behalf of the user without their consent.</p>
<p>Backend Developers must be aware of SSRF to make secure applications. In contrast, front-end developers must be mindful of and implement client-side security measures to prevent CSRF attacks.</p>
<h2 id="heading-identifying-code-smells">Identifying Code Smells</h2>
<p>SSRF attacks often occur when web applications improperly mishandle user-controlled input, leading to network requests based on inadequately sanitized user input. Processing un-sanitized URLs in API requests is a common entry point for SSRF attacks.</p>
<p>Another common giveaway to identifying SSRF vulnerabilities in your applications is to check for instances where XML parsing occurs without adequate validation of external entities. Applications that fail to validate and secure their XML parsers properly may inadvertently expose themselves to SSRF risks.</p>
<p>In this walkthrough, you will make a server that takes a URL and uses it to make network requests without proper validation and sanitization. You will then see ways to mitigate this issue.</p>
<h2 id="heading-understanding-the-pain-points">Understanding the Pain Points</h2>
<p>To better understand the issue of SSRF attacks, lets create a sample application using Express and JavaScript. Below is a Mermaid Sequence Diagram where we explain what the code base does:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/12/image-169.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>UML Sequence Diagram for the sample application</em></p>
<p>We will create an Express app with two endpoints — <code>/fetch</code>, a <code>GET</code> request designed to fetch content from a specified URL, and <code>/admin</code>, another <code>GET</code> request, which is an internal API within the organization that accesses an internally protected resource.</p>
<p>We will discover a security vulnerability associated with Server-Side Request Forgery (SSRF) in implementing the first <code>GET</code> request.</p>
<p>We will also create another helper function at the <code>/uploads</code> endpoint to allow our clients to fetch and view their recently uploaded content.</p>
<h2 id="heading-project-setup">Project Setup</h2>
<p>To get started, let's quickly set up our repository and install all the required packages. In the root of your workspace, install Express and Axios using the following command:</p>
<pre><code class="lang-bash">npm init -y | npm i axios express
</code></pre>
<p>Executing this command will create a <code>package.json</code> file with default settings and install the specified packages.</p>
<p>To simulate the internal protected resource, let's create a <code>data.json</code> in the root of your workspace:</p>
<pre><code class="lang-json">{   
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Hamdaan Ali Quatil"</span>,
    <span class="hljs-attr">"password"</span>: <span class="hljs-string">"violinblackeye"</span>
}
</code></pre>
<p>Now, create a file called <code>app.js</code> in the root of your repository. Here, we will define all of our endpoints. Import all required packages like this:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> express = <span class="hljs-built_in">require</span>(<span class="hljs-string">'express'</span>);
<span class="hljs-keyword">const</span> axios = <span class="hljs-built_in">require</span>(<span class="hljs-string">'axios'</span>);
<span class="hljs-keyword">const</span> fs = <span class="hljs-built_in">require</span>(<span class="hljs-string">'fs'</span>).promises;
</code></pre>
<p>We use the <code>fs</code> (File System) module to interact with the local file system. Within the Express application, we use <code>fs.promises</code> to read the contents of a file. The <code>fetchPrivateResource</code> function asynchronously reads the contents of the <code>data.json</code> file, which is an internal resource.</p>
<p>Let's create an instance of the Express app to handle HTTP requests and define the <code>fetchPrivateResource</code> method. In the sample application, only the admin should be able to fetch this internal resource, but you will observe how a malicious actor can access this using an SSRF attack.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> port = <span class="hljs-number">3000</span>;

<span class="hljs-comment">// Function to fetch private resource</span>
<span class="hljs-keyword">const</span> fetchPrivateResource = <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> content = <span class="hljs-keyword">await</span> fs.readFile(<span class="hljs-string">'data.json'</span>, <span class="hljs-string">'utf-8'</span>);
    <span class="hljs-keyword">return</span> content;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error reading private resource:'</span>, error.message);
    <span class="hljs-keyword">throw</span> error;
  }
};
</code></pre>
<h3 id="heading-the-fetch-endpoint">The Fetch Endpoint</h3>
<p>Now, let's define our first endpoint, <code>/fetch</code> which expects a query parameter <code>url</code> containing the target URL. Upon receiving a request, the server uses the Axios library to make a GET request to the specified URL.</p>
<pre><code class="lang-js">app.get(<span class="hljs-string">"/fetch"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> url = req.query.url;

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(url);
    <span class="hljs-keyword">const</span> responseData = <span class="hljs-built_in">JSON</span>.stringify(response.data);

    <span class="hljs-keyword">const</span> filename = path.basename(url);
    <span class="hljs-keyword">const</span> textFilePath = path.join(__dirname, <span class="hljs-string">"uploads"</span>, <span class="hljs-string">"upload-data.txt"</span>);

    <span class="hljs-keyword">await</span> fs.writeFile(textFilePath, responseData, <span class="hljs-string">"utf-8"</span>);

    res.send(<span class="hljs-string">"Upload Successful"</span>);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">"Error:"</span>, error.message);
    res.status(<span class="hljs-number">500</span>).send(<span class="hljs-string">"Internal Server Error"</span>);
  }
});
</code></pre>
<p>The <code>axios.get</code> method is used to perform the HTTP GET request, and the response data is then converted to a JSON string. The resulting string is written to a text file named <code>upload-data.txt</code> in the <code>uploads</code> folder of the server. Finally, a success message or an error message is sent back to the client, depending on the outcome of the operation.</p>
<h3 id="heading-the-uploads-endpoint">The Uploads Endpoint</h3>
<p>With that done, let's create an endpoint to allow users to access and verify their uploaded files. The server will check if the requested file exists, and if so, it sends the file to the client. When a file cannot be found, the server returns a 404 error.</p>
<pre><code class="lang-js">app.get(<span class="hljs-string">"/uploads/:filename"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> filename = req.params.filename;
  <span class="hljs-keyword">const</span> filePath = path.join(__dirname, <span class="hljs-string">"uploads"</span>, filename);
  <span class="hljs-built_in">console</span>.log(filePath);

  <span class="hljs-keyword">try</span> {
    <span class="hljs-comment">// Check if file exists</span>
    <span class="hljs-keyword">await</span> fs.access(filePath);

    <span class="hljs-comment">// If file exists, send it to the client</span>
    res.sendFile(filePath);

  } <span class="hljs-keyword">catch</span> (error) {
    res.status(<span class="hljs-number">404</span>).send(<span class="hljs-string">"File not found: "</span> + error);
  }
});
</code></pre>
<h3 id="heading-the-admin-endpoint">The Admin Endpoint</h3>
<p>Now, we need to make an internal API – the <code>/admin</code> route – which is intentionally shielded from public access. The objective is to ensure this API is only accessible from localhost or the local machine (127.0.0.1).</p>
<p>We can do this by implementing a middleware that acts as a protective barrier, permitting requests to proceed to the <code>/admin</code> route only if they originate from the local host.</p>
<p>The middleware checks whether the <code>req.hostname</code> property, which represents the hostname specified in the HTTP request, matches <code>localhost</code> or <code>127.0.0.1</code>. If the request is from a different host, the middleware responds with a <code>403</code> Forbidden status, thereby restricting access.</p>
<pre><code class="lang-js"><span class="hljs-comment">// middleware to protect admin API</span>
app.use(<span class="hljs-string">'/admin'</span>, <span class="hljs-keyword">async</span> (req, res, next) =&gt; {
  <span class="hljs-keyword">const</span> isLocalhost = req.hostname === <span class="hljs-string">'localhost'</span> || req.hostname === <span class="hljs-string">'127.0.0.1'</span>;

  <span class="hljs-keyword">if</span> (isLocalhost) {
    next();
  } <span class="hljs-keyword">else</span> {
    res.status(<span class="hljs-number">403</span>).send(<span class="hljs-string">'Forbidden'</span>);
  }
});

<span class="hljs-comment">// Route to access the admin API</span>
app.get(<span class="hljs-string">'/admin'</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> content = <span class="hljs-keyword">await</span> fetchPrivateResource();
    res.send(content);
  } <span class="hljs-keyword">catch</span> (error) {
    res.status(<span class="hljs-number">500</span>).send(<span class="hljs-string">'Internal Server Error'</span>);
  }
});
</code></pre>
<p>Once all routes are configured, we start the server using the <code>app.listen</code> method, and it begins listening on port 3000 for incoming requests.</p>
<pre><code class="lang-js">app.listen(port, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Server is running on http://localhost:<span class="hljs-subst">${port}</span>`</span>);
});
</code></pre>
<p>With our <code>app.js</code> now set up to process incoming requests, let's run the sample application using <code>nodemon</code>:</p>
<pre><code class="lang-bash">npm i -D nodemon | nodemon app.js
</code></pre>
<p>The server has started on the port <code>3000</code>. Now, we are ready to test our sample application and look for code smells that may lead to SSRF attacks. You may find the complete code here — <a target="_blank" href="https://gist.github.com/HamdaanAliQuatil/c7db6f3dd0666bd9396a7f4e6ebe6665">GitHub Gist | HamdaanAliQuatil</a>.</p>
<h2 id="heading-how-to-exploit-the-vulnerability">How to Exploit the Vulnerability</h2>
<p>Let's try to make a <code>GET</code> request to the fetch API. We are simulating the process of uploading a text file using the URL to the file. In this demonstration, we will fetch the contents of an example file and save it on our servers. Here is the <a target="_blank" href="https://example-files.online-convert.com/document/txt/example.txt">link to the text file</a>.</p>
<p>Open your Postman Client and execute a <code>GET</code> request with the URL <code>http://localhost:3000/fetch?url=https://example-files.online-convert.com/document/txt/example.txt</code>. We are adding the link to the file as a Query Parameter in the <code>/fetch</code> endpoint. When you hit send, you will see a response <code>"Upload Successful"</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/image-3.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Postman Client: Fetch Endpoint</em></p>
<p>You'll see that your repository now has a newly created file in the <code>uploads</code> directory. Clients can now access their uploaded information using the <code>/uploads</code> API endpoint to view their files.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/image-4.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Postman Client: Uploads Endpoint</em></p>
<p>Now, let's send a malicious request by changing our Query param to <code>http://120.0.07/admin</code> in the same request to the <code>/fetch</code> endpoint. The updated URL will now look like this: <code>http://localhost:3000/fetch?url=http://127.0.0.1:3000/admin</code>.</p>
<p>In the Query parameter, <code>127.0.0.1</code> is a Loopback Address. A loopback address is a reserved IP address used to establish network connections with the same host (the local machine) for testing and communication within the device.</p>
<p>The malicious actor is attempting is to make a request to the server's <code>/admin</code> route from the server itself using the loopback address. This simulates an internal resource access scenario.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/image-5.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Postman Client: Admin Endpoint</em></p>
<p>You'll notice that an <code>"Upload Successful"</code> message comes as a response to this request. Now try accessing your uploaded file again using the <code>GET</code> request at the <code>/upload</code> endpoint.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/image-6.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Postman Client: Uploads Endpoint</em></p>
<p>You'll see that the contents of the uploaded file have been altered. This alteration highlights a successful SSRF (Server-Side Request Forgery) attack, where a malicious actor took advantage of the server's capability to initiate internal requests.</p>
<p>The file, which initially contained specific data, has now been tampered with. This showcases the potential for unauthorized access and manipulation of sensitive information through SSRF exploits.</p>
<h2 id="heading-how-to-defend-against-ssrf-attacks">How to Defend Against SSRF Attacks</h2>
<p>Now, let’s see the ways in which we can fix our application's vulnerability to SSRF. The most intuitive solution that comes to your mind could be to never allow a client to enter a URL. This is certainly the most powerful defense. The server should create a URL it needs.</p>
<p>But many times, allowing URLs in your business logic becomes an absolute necessity. In such cases, our goal is to prevent the attack or at least reduce the risk if an attack occurs.</p>
<p>If you really must allow a URL as it is, here are some precautionary steps you can take:</p>
<h3 id="heading-sanitization-and-validation">Sanitization and Validation</h3>
<p>As with most vulnerabilities, a pain-point in SSRF attacks is the use of untrusted data. Always treat any data coming from the client side as untrusted.</p>
<p>Sanitizing and validating the client-supplied data should go a long way to defend against SSRF attacks. A very intuitive validation is to restrict any URL containing localhost or the loopback address.</p>
<p>Let's create a helper function <code>isValidUrl</code> and call it in the function for the <code>/fetch</code> endpoint.</p>
<pre><code class="lang-js"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">isValidUrl</span>(<span class="hljs-params">url</span>) </span>{
  <span class="hljs-comment">// Restrict URLs to HTTP only. This blocks FTP and other protocols</span>
  <span class="hljs-keyword">const</span> validUrlRegex = <span class="hljs-regexp">/^http:\/\/\S+$/</span>;

  <span class="hljs-keyword">if</span> (!validUrlRegex.test(url)) {
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
  }

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> parsedUrl = <span class="hljs-keyword">new</span> URL(url);

    <span class="hljs-comment">// Check if the host is localhost or a loopback IP address</span>
    <span class="hljs-keyword">const</span> isLocalhost = parsedUrl.hostname === <span class="hljs-string">'localhost'</span>;
    <span class="hljs-keyword">const</span> isLocalIP = <span class="hljs-regexp">/^127\.\d+\.\d+\.\d+$/g</span>.test(parsedUrl.hostname);

    <span class="hljs-keyword">return</span> !(isLocalhost || isLocalIP);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
  }
}
</code></pre>
<p>Your updated function for <code>/fetch</code> endpoint should look like this:</p>
<pre><code class="lang-js">app.get(<span class="hljs-string">"/fetch"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> url = req.query.url;

  <span class="hljs-keyword">if</span> (!isValidUrl(url)) {
    res.status(<span class="hljs-number">400</span>).send(<span class="hljs-string">"Loopback URLs are not allowed"</span>);
    <span class="hljs-keyword">return</span>;
  }

  <span class="hljs-keyword">try</span> {
    ...
    res.send(<span class="hljs-string">"Upload Successful"</span>);
  } <span class="hljs-keyword">catch</span> (error) {
    ...
  }
});
</code></pre>
<p>Now, go back to the Postman Client and resend the malicious request. You will observe that previously uploaded file is not tampered and you receive <code>"Loopback URLs are not allowed"</code> in the response.</p>
<h3 id="heading-whitelisting-via-an-allow-list">Whitelisting via an Allow List</h3>
<p>You may create a positive allow list to only allow certain trusted IP Addresses, URL Schema, and Port. Let's implement an allow list and improve our <code>isValidUrl</code> function:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> whitelist = [<span class="hljs-string">"boost.com"</span>, <span class="hljs-string">"boost.in"</span>, <span class="hljs-string">"trustedDomain3.com"</span>];
<span class="hljs-keyword">const</span> allowedPorts = [<span class="hljs-string">'80'</span>, <span class="hljs-string">'443'</span>];
</code></pre>
<p>Now use your declared <code>whitelist</code> in the <code>isValidUrl</code> function:</p>
<pre><code class="lang-js"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">isValidUrl</span>(<span class="hljs-params">url</span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> parsedUrl = <span class="hljs-keyword">new</span> URL(url);

    <span class="hljs-keyword">if</span> (!whitelist.includes(parsedUrl.hostname)) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-keyword">if</span> (!allowedPorts.includes(parsedUrl.port)) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
  }
}
</code></pre>
<p>Notice how we've removed the need for regex. This brings us to another mitigation technique that you must avoid:</p>
<h3 id="heading-dont-use-a-deny-list">Don't Use a Deny List</h3>
<p>You must never mitigate SSRF vulnerabilities using a deny list or regex. Restricting the use of IP Addresses is not straightforward. To understand why we must avoid a deny list, look at the following example.</p>
<p>A Loopback Address is typically represented using <code>127.0.0.1</code> . Its quite easy to spot this address and reject it. But a problem arises when a malicious request is sent using any other forms of this Loopback address that also points to the local machine. For example, <code>127.1</code>, <code>::1</code>, <code>localhost</code> ,<code>::ffff:7f00:1</code> all point to the local machine.</p>
<p>A regular expression to spot all such variations is much more complex. Malicious actors can easily bypass a deny list by passing an octal representation of decimal encoding of the IP address.</p>
<h3 id="heading-enforce-a-url-scheme">Enforce a URL Scheme</h3>
<p>In absence of this measure, a client might send requests that use any protocols other than the intended ones. To replace our <code>validUrlRegex</code>, we will use a <code>allowedSchemes</code> list. We will restrict our application to only process requests when the protocols are either <code>https:</code> or <code>http</code>. Not allowing any requests with protocols <code>file:</code> and <code>ftp:</code> will safe-guard our sample application.</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> allowedSchemes = [<span class="hljs-string">'http:'</span>, <span class="hljs-string">'https:'</span>];
</code></pre>
<p>The updated <code>isValidUrl</code> function will look like this:</p>
<pre><code class="lang-js"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">isValidUrl</span>(<span class="hljs-params">url</span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> parsedUrl = <span class="hljs-keyword">new</span> URL(url);

    <span class="hljs-keyword">if</span> (!whitelist.includes(parsedUrl.hostname)) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-keyword">if</span> (!allowedPorts.includes(parsedUrl.port)) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-keyword">if</span> (!allowedSchemes.includes(parsedUrl.protocol)) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
  }
}
</code></pre>
<h3 id="heading-disable-redirects">Disable Redirects</h3>
<p>Redirects are a mechanism used by web applications to forward a user's browser from one URL to another. If a server follows redirects automatically, an attacker could exploit this behavior to make the server inadvertently access internal resources, leading to data exposure or unauthorized actions.</p>
<p>To restrict redirects in Axios, pass in an Axios Configuration object in the second parameter:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(url, { <span class="hljs-attr">maxRedirects</span>: <span class="hljs-number">0</span> });
</code></pre>
<p>To learn more about Axios Config, check this guide: <a target="_blank" href="https://axios-http.com/docs/req_config">Axios | Request Config</a>.</p>
<h3 id="heading-send-filtered-data-to-the-client">Send Filtered Data to the Client</h3>
<p>Avoid sending raw response bodies directly from your server to the client. Ensure that the responses reaching the client are carefully curated and conform to expected formats.</p>
<p>By implementing this practice, you shield your application from potential security vulnerabilities associated with exposing unfiltered or sensitive information. Always validate, filter, and format responses to align with your application's anticipated data structures.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>And there you have it: by implementing a few well-established methodologies and best practices, you can effectively detect and mitigate SSRF attacks in your applications and create secure APIs as developers.</p>
<p>Find the complete code snippets here — <a target="_blank" href="https://gist.github.com/HamdaanAliQuatil/c7db6f3dd0666bd9396a7f4e6ebe6665">GitHub Gist | HamdaanAliQuatil</a>.<br>You may find me on X (formerly Twitter) - <a target="_blank" href="https://twitter.com/violinblackeye">Hamdaan Ali Quatil</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
