<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Healthcare AI - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Healthcare AI - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 23 Jun 2026 22:44:26 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/healthcare-ai/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ The Hidden PHI Problem in Medical Images: Building a Synthetic Dataset for AI De-Identification ]]>
                </title>
                <description>
                    <![CDATA[ In this article, you'll learn how my team built a synthetic PHI generation pipeline to create privacy-safe training and validation data for medical imaging AI. The Problem Imagine you’re building an A ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-synthetic-dataset-for-ai-de-identification/</link>
                <guid isPermaLink="false">6a357b2a9d624935c947cccf</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Healthcare AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Medical Imaging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dicom ]]>
                    </category>
                
                    <category>
                        <![CDATA[ synthetic data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ healthtech ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Lakshmi Mahabaleshwara ]]>
                </dc:creator>
                <pubDate>Fri, 19 Jun 2026 17:23:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/74f053ea-3efc-4ef0-932b-d423dccba44a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, you'll learn how my team built a synthetic PHI generation pipeline to create privacy-safe training and validation data for medical imaging AI.</p>
<h3 id="heading-the-problem">The Problem</h3>
<p>Imagine you’re building an AI system that removes patient information from medical images.</p>
<p>The model needs thousands of examples showing where Protected Health Information (PHI) appears and what it looks like. The more examples it sees, the better it becomes at finding and removing sensitive information.</p>
<p>But there is a problem:</p>
<p><strong>The data you need to train the model is the same data you’re not allowed to share freely.</strong></p>
<p>Healthcare organizations must protect patient privacy. Regulations like HIPAA require that patient identifiers are removed before medical images can be shared for research, AI development, or external collaboration.</p>
<p>This creates an interesting engineering challenge: How do you build and test de-identification systems when the data needed to train those systems can't be easily used?</p>
<p>One practical solution is <strong>Synthetic PHI.</strong></p>
<p>In this article, I’ll show why synthetic PHI is valuable, explain the hidden PHI problem inside medical images, and walk through a pipeline my team built that generates realistic ultrasound datasets with fully controlled synthetic patient information.</p>
<h2 id="heading-what-youll-learn-in-this-tutorial">What You'll Learn in This Tutorial</h2>
<p>By the end of this tutorial, you'll understand:</p>
<ul>
<li><p>The hidden PHI challenges in medical imaging data.</p>
</li>
<li><p>Why synthetic PHI is useful for building and testing healthcare AI systems.</p>
</li>
<li><p>How to generate realistic synthetic patient identities using Python and Faker.</p>
</li>
<li><p>How to inject PHI into both image pixels and DICOM metadata.</p>
</li>
<li><p>How to create ground-truth labels for AI model training and evaluation.</p>
</li>
<li><p>How to validate synthetic medical imaging datasets before using them in downstream workflows.</p>
</li>
</ul>
<h2 id="heading-what-well-cover"><strong>What We'll Cover:</strong></h2>
<ul>
<li><p><a href="#heading-source-images-openpocus">Source Images: OpenPOCUS</a></p>
</li>
<li><p><a href="#heading-the-iceberg-problem-most-phi-is-hidden">The Iceberg Problem: Most PHI Is Hidden</a></p>
</li>
<li><p><a href="#heading-why-synthetic-phi-matters">Why Synthetic PHI Matters</a></p>
<ul>
<li><p><a href="#heading-challenge-1-privacy-regulations">Challenge 1: Privacy Regulations</a></p>
</li>
<li><p><a href="#heading-challenge-2-annotation-at-scale">Challenge 2: Annotation at Scale</a></p>
</li>
<li><p><a href="#heading-challenge-3-validation">Challenge 3: Validation</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-synthetic-phi-solves-all-three-problems">Synthetic PHI Solves All Three Problems</a></p>
</li>
<li><p><a href="#heading-building-a-synthetic-phi-pipeline">Building a Synthetic PHI Pipeline</a></p>
</li>
<li><p><a href="#heading-pipeline-architecture">Pipeline Architecture</a></p>
</li>
<li><p><a href="#heading-safety-checks-before-burning">Safety Checks Before Burning</a></p>
<ul>
<li><p><a href="#heading-step-1-generate-synthetic-patient-identities">Step 1: Generate Synthetic Patient Identities</a></p>
</li>
<li><p><a href="#heading-step-2-burn-phi-into-image-pixels">Step 2: Burn PHI into Image Pixels</a></p>
</li>
<li><p><a href="#heading-step-3-add-phi-to-dicom-headers">Step 3: Add PHI to DICOM Headers</a></p>
</li>
<li><p><a href="#heading-step-4-identity-mapping-the-de-identified-patientid">Step 4: Identity Mapping: The De-Identified PatientID</a></p>
</li>
<li><p><a href="#heading-step-5-ground-truth-structured-csv-output">Step 5: Ground Truth: Structured CSV Output</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-three-tier-dicom-validation">Three-Tier DICOM Validation</a></p>
</li>
<li><p><a href="#heading-a-surprising-bug-monai-vs-pil">A Surprising Bug: MONAI vs PIL</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-source-images-openpocus">Source Images: OpenPOCUS</h2>
<p>The synthetic PHI generation uses lung point-of-care ultrasound (POCUS) frames from <a href="https://github.com/kumarandre/OpenPOCUS">OpenPOCUS</a>, an openly licensed collection of real ultrasound images contributed by the POCUS community.</p>
<p>These images carry no real PHI. OpenPOCUS provides clinically authentic ultrasound images while avoiding patient privacy concerns. This makes it an ideal foundation for synthetic PHI generation because we can focus entirely on creating and tracking identifiers without risking exposure of real patient information.</p>
<h2 id="heading-the-iceberg-problem-most-phi-is-hidden">The Iceberg Problem: Most PHI Is Hidden</h2>
<p>When people think about PHI in medical images, they usually think about visible text overlays.</p>
<p>These include:</p>
<pre><code class="language-plaintext">Patient name
Medical Record Number (MRN)
Date of birth
Study date
</code></pre>
<p>These identifiers are often burned directly into image pixels by ultrasound, X-ray, CT, and MRI systems.</p>
<p>But visible text is only the tip of the iceberg. Much of the remaining PHI lives inside the DICOM header, a collection of metadata fields that describe the image and the study. These fields contain identifiers such as <code>PatientName</code>, <code>PatientID</code>, <code>StudyDate</code>, <code>InstitutionName</code>, and other sensitive information.</p>
<p>Unlike burned-in text, header PHI isn't visible when looking at the image itself, but it travels with the file and must also be removed during de-identification.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/4f1036fd-009f-4be7-944a-af5380dfdfcb.png" alt="Iceberg illustration showing visible PHI in image pixels and hidden PHI in DICOM metadata." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>A de-identification system must handle both.</p>
<p>Removing visible text while leaving PHI inside DICOM metadata still creates a privacy risk. Likewise, stripping metadata while leaving patient names burned into image pixels is equally problematic.</p>
<p>This hidden PHI challenge makes testing de-identification software much harder than it first appears.</p>
<h2 id="heading-why-synthetic-phi-matters">Why Synthetic PHI Matters</h2>
<p>At first glance, it seems hospitals already have plenty of real-world data available. So why not simply use that?</p>
<p>The answer comes down to three challenges.</p>
<h3 id="heading-challenge-1-privacy-regulations">Challenge 1: Privacy Regulations</h3>
<p>Medical images often contain patient identifiers.</p>
<p>Sharing those images outside secure clinical environments introduces significant legal and compliance risk.</p>
<p>The more institutions involved, the more difficult governance becomes.</p>
<h3 id="heading-challenge-2-annotation-at-scale">Challenge 2: Annotation at Scale</h3>
<p>Modern AI systems require labeled examples.</p>
<p>Someone must identify:</p>
<ul>
<li><p>Where PHI appears</p>
</li>
<li><p>What type of PHI it is</p>
</li>
<li><p>Which DICOM tags contain PHI</p>
</li>
</ul>
<p>Creating these annotations manually is expensive and time-consuming.</p>
<h3 id="heading-challenge-3-validation">Challenge 3: Validation</h3>
<p>Suppose you’re evaluating a de-identification tool. How do you know whether it successfully removed every identifier?</p>
<p>With real patient data, you often don’t know exactly where every piece of PHI exists. Without ground truth, measuring accuracy becomes difficult.</p>
<h2 id="heading-synthetic-phi-solves-all-three-problems">Synthetic PHI Solves All Three Problems</h2>
<p>Instead of starting with real patient identifiers, we can generate realistic fake identities and intentionally inject them into medical images.</p>
<p>Because the pipeline creates the PHI itself, we know:</p>
<ul>
<li><p>Every identifier value</p>
</li>
<li><p>Every pixel location</p>
</li>
<li><p>Every DICOM tag</p>
</li>
<li><p>Every expected output</p>
</li>
</ul>
<p>This gives us perfect ground truth.</p>
<p>Now, a de-identification system can be evaluated objectively. If a patient name remains after processing, we know it failed. If clinical content is accidentally removed, we know that too.</p>
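<p>As a rough illustration of what an objective evaluation can look like, the sketch below compares the header of a de-identified file against the planted ground truth. It uses plain dictionaries instead of real DICOM objects, and the function name <code>find_leaked_tags()</code> and the sample values are hypothetical, not part of the pipeline described here.</p>
<pre><code class="language-python">def find_leaked_tags(planted, deidentified):
    """Return the tags whose planted PHI value survived de-identification."""
    return sorted(
        tag for tag, value in planted.items()
        if deidentified.get(tag) == value
    )


# Ground truth written by the generator (hypothetical sample values).
planted = {
    "PatientName": "Doe^Jane^Q",
    "PatientBirthDate": "19800101",
    "InstitutionName": "Example General Hospital",
}

# Header as it might look after a de-identification tool has run.
deidentified = {
    "PatientName": "ANONYMIZED",
    "PatientBirthDate": "",
    "InstitutionName": "Example General Hospital",
}

leaked = find_leaked_tags(planted, deidentified)
print(leaked)  # ['InstitutionName']
</code></pre>
<p>An exact-match comparison like this only covers header values. Burned-in pixel text would need a separate check, for example running OCR inside the recorded bounding boxes.</p>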
<p>Synthetic PHI creates a privacy-safe dataset that can be used for:</p>
<ul>
<li><p>Training AI models</p>
</li>
<li><p>Benchmarking de-identification software</p>
</li>
<li><p>Regression testing</p>
</li>
<li><p>Validation before deployment</p>
</li>
</ul>
<h2 id="heading-building-a-synthetic-phi-pipeline">Building a Synthetic PHI Pipeline</h2>
<p>To explore this problem, my team built a pipeline that generates synthetic PHI for lung Point-of-Care Ultrasound (POCUS) images.</p>
<p>The goal was to:</p>
<ol>
<li><p>Start with ultrasound images containing no patient information.</p>
</li>
<li><p>Generate realistic synthetic patient identities.</p>
</li>
<li><p>Burn PHI into image pixels.</p>
</li>
<li><p>Insert matching PHI into DICOM metadata.</p>
</li>
<li><p>Automatically generate ground truth labels.</p>
</li>
<li><p>Validate the resulting DICOM files.</p>
</li>
</ol>
<p>The output looks realistic from the perspective of a de-identification system while containing no real patient information.</p>
<h2 id="heading-pipeline-architecture"><strong>Pipeline Architecture</strong></h2>
<p>The workflow looks like this (we'll go over each step in detail below):</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/f293fb38-b09f-451c-b6ee-75c71e9a7e66.png" alt="Workflow for generating synthetic PHI in ultrasound images and DICOM files." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>Each stage produces artifacts consumed by the next stage. Failures are quarantined rather than silently ignored.</p>
<h2 id="heading-safety-checks-before-burning">Safety Checks Before Burning</h2>
<p>Before writing synthetic PHI onto an image, the pipeline performs a safety check to ensure that the region selected for PHI insertion lies outside the ultrasound fan.</p>
<p>The top-left corner of a lung POCUS image usually falls outside the imaging fan, in a dark border where PHI can be burned in without obscuring clinical content.</p>
<p>To confirm that this assumption holds for every image, the pipeline runs two checks per image:</p>
<ul>
<li><p><strong>Brightness check:</strong> If the average intensity of the configured burn region exceeds a threshold, the region likely overlaps the ultrasound fan rather than the dark border.</p>
</li>
<li><p><strong>Boundary check:</strong> The pipeline verifies that the configured burn region fits entirely within the image. Images that are smaller than the expected burn area are quarantined.</p>
</li>
</ul>
<p>If either check fails, the image is quarantined and the reason is recorded in the manifest. There are no partial burns, no overwritten clinical content, and no silent corruption of test data.</p>
<p>This prevents synthetic identifiers from accidentally obscuring anatomy.</p>
<pre><code class="language-python">def burn_region_is_safe(arr):
    """Check the burn region is dark enough to be outside the fan."""
    h, w = arr.shape
    y2 = min(BURN_REGION_Y + BURN_REGION_H, h)
    x2 = min(BURN_REGION_X + BURN_REGION_W, w)
    region = arr[BURN_REGION_Y:y2, BURN_REGION_X:x2]
    if region.size == 0:
        return False, float("nan")
    mean = float(region.mean())
    return mean &lt;= BRIGHTNESS_SKIP_THRESHOLD, mean
</code></pre>
<p>The function extracts the configured burn region and computes its average brightness. If the region is too bright, it likely overlaps the ultrasound fan rather than the border.</p>
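<p>Because <code>burn_region_is_safe()</code> clamps the region to the image bounds, the function as shown only rejects an image on size when the clamped region is empty. One way to express the stricter boundary check described above is to compare the clamped extent with the configured one. This is a minimal sketch rather than the pipeline's actual code: the constant values and the helper name <code>burn_region_fits()</code> are assumptions.</p>
<pre><code class="language-python"># Hypothetical burn-region configuration, in pixels.
BURN_REGION_X, BURN_REGION_Y = 10, 10
BURN_REGION_W, BURN_REGION_H = 300, 80


def burn_region_fits(image_h, image_w):
    """Return True only if the whole configured region lies inside the image."""
    # Clamp the region to the image, the same way a NumPy slice would.
    clamped_h = min(BURN_REGION_Y + BURN_REGION_H, image_h) - BURN_REGION_Y
    clamped_w = min(BURN_REGION_X + BURN_REGION_W, image_w) - BURN_REGION_X
    # Any shrinkage means part of the region falls outside the image.
    return (clamped_h, clamped_w) == (BURN_REGION_H, BURN_REGION_W)


print(burn_region_fits(480, 640))  # True
print(burn_region_fits(60, 640))   # False
</code></pre>
<p>An image that fails this check would be quarantined before the brightness check runs, so a partially clipped region is never averaged or burned.</p>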
<h3 id="heading-step-1-generate-synthetic-patient-identities">Step 1: Generate Synthetic Patient Identities</h3>
<p>The synthetic identity is produced by <a href="https://faker.readthedocs.io/">Faker</a> and seeded per case, so the same image always yields the same fake patient.</p>
<p>Determinism matters because:</p>
<ul>
<li><p>Reproducing a test result requires reproducing the test data.</p>
</li>
<li><p>Debugging downstream tools is easier when the input doesn't change between runs.</p>
</li>
<li><p>Comparing two de-identification tools fairly requires both to see the same planted PHI.</p>
</li>
</ul>
<pre><code class="language-python">import hashlib
import random
import uuid

from faker import Faker


def case_seed(global_seed: int, source_id: str) -&gt; int:
    """Per-image deterministic seed derived from global seed and source ID."""
    h = hashlib.sha256(f"{global_seed}|{source_id}".encode()).hexdigest()
    return int(h[:8], 16)


def generate_phi(seed: int) -&gt; dict:
    fake = Faker()
    fake.seed_instance(seed)  # seed only this instance, not Faker's shared generator
    rng = random.Random(seed)

    last = fake.last_name()
    first = fake.first_name()
    middle = fake.random_letter().upper()
    mrn = f"{rng.randint(1000000, 9999999)}"
    dob = fake.date_of_birth(minimum_age=18, maximum_age=95)
    study_date = fake.date_time_this_decade()
    institution = rng.choice(INSTITUTION_POOL)

    return {
        "case_uuid": f"SYNTH-{uuid.UUID(int=rng.getrandbits(128))}",
        "patient_name_display": f"{last}, {first} {middle}.",
        "patient_name_dicom": f"{last}^{first}^{middle}",   # DICOM PN VR format
        "patient_id": mrn,
        "dob": dob,
        "study_date": study_date,
        "institution_name": institution,
    }
</code></pre>
<p>The <code>case_seed()</code> function derives a deterministic seed from the global seed and the source image identifier. That seed then drives both Faker and a local <code>random.Random</code> instance to create a synthetic identity.</p>
<p>Because the seed is repeatable, the same input image always receives the same synthetic patient information. This makes debugging and benchmarking reproducible.</p>
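<p>The seeding step can be checked in isolation, without Faker. The sketch below repeats <code>case_seed()</code> as defined above (without type hints) and confirms that the seed depends only on its inputs. The source identifiers are made-up examples.</p>
<pre><code class="language-python">import hashlib


def case_seed(global_seed, source_id):
    """Per-image deterministic seed derived from global seed and source ID."""
    digest = hashlib.sha256(f"{global_seed}|{source_id}".encode()).hexdigest()
    return int(digest[:8], 16)  # first 32 bits of the hash


# Hypothetical source identifiers.
a = case_seed(42, "patient_001/zone_1")
b = case_seed(42, "patient_001/zone_1")
c = case_seed(42, "patient_001/zone_2")

print(a == b)  # True: same inputs, same seed
print(a == c)  # compares against the seed of a different zone
</code></pre>
<p>Truncating the hash to eight hex digits keeps the seed within 32 bits. That is plenty for telling cases apart, although two distinct inputs can in rare cases share a seed.</p>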
<h3 id="heading-step-2-burn-phi-into-image-pixels">Step 2: Burn PHI into Image Pixels</h3>
<p>Rendering text onto an image is comparatively expensive. For a single zone containing 30+ frames, repeating that work per frame is wasteful.</p>
<p>The pipeline instead renders the PHI overlay onto a blank canvas once per zone and reuses it for every frame. This mirrors how many ultrasound systems operate in practice, where patient information remains fixed while the underlying image content changes from frame to frame.</p>
<pre><code class="language-python">def make_phi_overlay(shape, phi):
    """Render PHI ONCE onto a canvas. Returns (overlay_array, overlays_meta)."""
    h, w = shape
    canvas = Image.new("L", (w, h), 0)  # blank canvas
    draw = ImageDraw.Draw(canvas)

    overlays, x, y = [], BURN_REGION_X, BURN_REGION_Y
    for entry in _phi_text_block(phi):
        # textbbox returns the absolute ink box of text anchored at (x, y)
        x0, y0, x1, y1 = draw.textbbox((x, y), entry["line"], font=FONT)
        tw, th = x1 - x0, y1 - y0

        if x1 &gt; w or y1 &gt; h:
            raise ValueError(
                f"rendered PHI overflows image: '{entry['line']}' "
                f"at ({x},{y}) size ({tw}x{th}), image {w}x{h}"
            )

        draw.text((x, y), entry["line"], font=FONT, fill=TEXT_COLOR)
        overlays.append({
            "phi_category": entry["phi_category"],
            "rendered_text": entry["line"],
            "phi_value": entry["value"],
            "bbox": [x0, y0, tw, th],  # ink box; may start slightly below y
            "dicom_tag": entry["dicom_tag"],
        })
        y += th + LINE_GAP
    return np.array(canvas), overlays
</code></pre>
<p>The <code>make_phi_overlay()</code> function creates a blank canvas and renders each PHI line onto it. At the same time, it records metadata such as the rendered text, bounding box coordinates, and corresponding DICOM tag.</p>
<p>The function returns both the image overlay and the annotation metadata, ensuring that the ground truth always matches the pixels that were actually drawn.</p>
<p>Rendering once and reusing the overlay provides several advantages:</p>
<ul>
<li><p>Faster processing</p>
</li>
<li><p>Consistent PHI placement across frames</p>
</li>
<li><p>Simplified ground-truth generation</p>
</li>
<li><p>Behavior that more closely matches real ultrasound devices</p>
</li>
</ul>
<p>An additional benefit is that the pipeline automatically records the location of every burned identifier.</p>
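<p>How the rendered overlay is combined with each frame is not shown above. One plausible approach, assuming 8-bit grayscale frames stacked as a <code>(frames, height, width)</code> array and bright text on a black canvas, is a per-pixel maximum. The function name <code>apply_overlay()</code> is hypothetical.</p>
<pre><code class="language-python">import numpy as np


def apply_overlay(frames, overlay):
    """Burn one 2-D overlay into every frame of a (N, H, W) uint8 stack."""
    if frames.shape[1:] != overlay.shape:
        raise ValueError("overlay and frames must share height and width")
    # Broadcasting applies the same overlay to all N frames. Bright text
    # pixels win over darker image pixels; background zeros change nothing.
    return np.maximum(frames, overlay)


frames = np.full((3, 4, 6), 20, dtype=np.uint8)   # three dark dummy frames
overlay = np.zeros((4, 6), dtype=np.uint8)
overlay[0, :2] = 255                              # stand-in for rendered text

burned = apply_overlay(frames, overlay)
print(burned.shape)       # (3, 4, 6)
print(burned[:, 0, 0])    # [255 255 255]
</code></pre>
<p>Since the overlay is identical for every frame, the bounding boxes recorded once by <code>make_phi_overlay()</code> apply to the whole zone.</p>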
<h3 id="heading-step-3-add-phi-to-dicom-headers">Step 3: Add PHI to DICOM Headers</h3>
<p>The DICOM standard supports two ways to represent a cine ultrasound loop: as a sequence of single-frame DICOMs that share a series UID, or as one multi-frame DICOM where the pixel data holds every frame stacked together.</p>
<p>The pipeline uses the multi-frame approach because:</p>
<ul>
<li><p>It matches how real ultrasound devices write cine loops.</p>
</li>
<li><p>One header serves all frames — no duplication of patient metadata.</p>
</li>
<li><p>Storage and transfer are more efficient.</p>
</li>
</ul>
<pre><code class="language-python">ds.PatientName = phi["patient_name_dicom"]
ds.PatientID = deid_patient_id
ds.PatientBirthDate = phi["dob"].strftime("%Y%m%d")

ds.StudyInstanceUID = study_uid
ds.StudyDate = phi["study_date"].strftime("%Y%m%d")
ds.InstitutionName = phi["institution_name"]
</code></pre>
<p>These fields populate the DICOM header with the same synthetic identity used in the image overlay. This ensures that visible PHI and hidden metadata remain consistent, producing realistic test data.</p>
<p>A few requirements that validators enforce but that are easy to miss when reading the DICOM standard:</p>
<ul>
<li><p><code>StudyID</code> is required and must be a short string, distinct from <code>StudyInstanceUID</code>. It's easy to forget.</p>
</li>
<li><p><code>ImageType</code> must be present. <code>["DERIVED", "SECONDARY"]</code> is the honest value for synthetic data because it wasn't acquired by a device.</p>
</li>
<li><p><code>Manufacturer</code> is part of the General Equipment module and is required even though the data is synthetic. Setting it to a clearly synthetic value (<code>SYNTHETIC-DEID-TUTORIAL</code>) makes the origin unambiguous.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/f6c75149-c287-479a-895b-5572fbd6afbf.png" alt="Synthetic ultrasound DICOM containing generated PHI in image overlays and metadata." style="display:block;margin:0 auto" width="1506" height="757" loading="lazy">

<h3 id="heading-step-4-identity-mapping-the-de-identified-patientid">Step 4: Identity Mapping: The De-Identified PatientID</h3>
<p>To support downstream evaluation, every source patient receives a stable identifier such as <code>DEID-0001</code>. A mapping file links source patients, synthetic studies, and generated DICOM objects. This allows evaluators to compare a de-identification tool’s output against the original ground truth.</p>
<pre><code class="language-plaintext">source_patient,deid_patient_id,study_instance_uid
patient_001,DEID-0001,1.2.826.0.1.3680043.8.498.1234...
patient_002,DEID-0002,1.2.826.0.1.3680043.8.498.5678...
</code></pre>
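<p>A mapping like this can be produced deterministically by numbering the source patients in sorted order. The sketch below is an assumption about how such IDs could be assigned, not the pipeline's own code. <code>assign_deid_ids()</code> and the sample patient names are hypothetical, and the study UID column is left out for brevity.</p>
<pre><code class="language-python">import csv
import io


def assign_deid_ids(source_patients):
    """Map each source patient to a stable ID such as DEID-0001."""
    # Sorting first makes the numbering independent of discovery order.
    unique = sorted(set(source_patients))
    return {name: f"DEID-{i:04d}" for i, name in enumerate(unique, start=1)}


mapping = assign_deid_ids(["patient_002", "patient_001", "patient_002"])

# Write the mapping as CSV (to an in-memory buffer for this example).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["source_patient", "deid_patient_id"])
writer.writerows(sorted(mapping.items()))
print(buffer.getvalue())
</code></pre>
<p>With this scheme, the IDs stay stable only while the set of source patients is unchanged. Adding a patient that sorts earlier would shift the later numbers, so persisting the mapping file and reusing it is the safer choice across runs.</p>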
<h3 id="heading-step-5-ground-truth-structured-csv-output">Step 5: Ground Truth: Structured CSV Output</h3>
<p>One major advantage of synthetic PHI is automatic label generation. Because the pipeline creates every identifier, it already knows the text value, bounding box coordinates, and corresponding DICOM tag.</p>
<p>These annotations are exported as structured CSV files and become the ground truth used for training and evaluation.</p>
<pre><code class="language-python">def build_overlay_rows(*, case_uuid, sop_instance_uid, source_id, source_relpath, output_dicom_relpath, overlays,
                      image_shape):
    h, w = image_shape
    rows = []
    for ov in overlays:
        x, y, ow, oh = ov["bbox"]
        rows.append({
            "case_uuid": case_uuid,
            "sop_instance_uid": sop_instance_uid,
            "source_id": source_id,
            "source_relpath": source_relpath,
            "output_dicom_relpath": output_dicom_relpath,
            "image_h": h,
            "image_w": w,
            "region": "top_left_banner",
            "phi_category": ov["phi_category"],
            "phi_value": ov["phi_value"],
            "rendered_text": ov["rendered_text"],
            "bbox_x": x, "bbox_y": y,
            "bbox_w": ow, "bbox_h": oh,
            "dicom_tag": ov["dicom_tag"],
            "seed": SEED,
            "pipeline_version": PIPELINE_VERSION,
            "run_id": RUN_ID,
        })
    return rows
</code></pre>
<p>The <code>build_overlay_rows()</code> function converts each overlay into a row of structured metadata. Along with the text and bounding box coordinates, it records identifiers and reproducibility information such as the pipeline version and random seed.</p>
<p>At the end of the run, the accumulated rows are grouped by de-identified patient ID and written into per-patient CSV files. Each patient folder receives its own <code>phi_overlays.csv</code> covering all of that patient's zones, alongside a <code>run_manifest.csv</code> summarizing zone-level status (processed, quarantined, failed) and paths.</p>
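<p>The grouping step at the end of the run could look roughly like the following. It assumes each row carries a <code>deid_patient_id</code> key, which <code>build_overlay_rows()</code> as shown above does not add, and the function name and sample rows are illustrative only.</p>
<pre><code class="language-python">import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_per_patient_csvs(rows, output_root):
    """Group overlay rows by patient and write one phi_overlays.csv each."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["deid_patient_id"]].append(row)

    written = []
    for patient_id, patient_rows in sorted(grouped.items()):
        patient_dir = Path(output_root) / patient_id
        patient_dir.mkdir(parents=True, exist_ok=True)
        csv_path = patient_dir / "phi_overlays.csv"
        with csv_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(patient_rows[0]))
            writer.writeheader()
            writer.writerows(patient_rows)
        written.append(csv_path)
    return written


# Hypothetical rows, reduced to a few columns for brevity.
rows = [
    {"deid_patient_id": "DEID-0001", "phi_category": "name", "bbox_x": 10},
    {"deid_patient_id": "DEID-0002", "phi_category": "name", "bbox_x": 10},
    {"deid_patient_id": "DEID-0001", "phi_category": "dob", "bbox_x": 10},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_per_patient_csvs(rows, tmp)
    print([p.parent.name for p in paths])  # ['DEID-0001', 'DEID-0002']
</code></pre>
<p>Writing one file per patient folder keeps each patient's ground truth next to that patient's DICOM output, so a single folder can be handed to an evaluator on its own.</p>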
<h2 id="heading-three-tier-dicom-validation">Three-Tier DICOM Validation</h2>
<p>A synthetic DICOM file is only useful if it actually conforms to the DICOM standard. Otherwise, downstream tools that consume it will fail or, worse, silently mishandle it.</p>
<p>The pipeline uses a three-tier validation chain that gracefully degrades depending on what's available in the environment:</p>
<ol>
<li><p><code>dciodvfy</code> from dicom3tools: the most rigorous standards-conformance validator, written by David Clunie. It checks files against the full DICOM IOD definitions but isn't pip-installable. If it's available on <code>PATH</code>, this is the preferred check.</p>
</li>
<li><p><a href="https://pypi.org/project/dicom-validator/"><code>dicom-validator</code></a> CLI: a pip-installable alternative. It downloads the DICOM standard definitions on first run, then validates IOD compliance. It's used when <code>dciodvfy</code> isn't available.</p>
</li>
<li><p><code>pydicom</code> re-read: the minimal fallback. It confirms that every file can be re-opened, decoded, and that pixel data round-trips correctly. It doesn't check standards compliance, but catches gross corruption.</p>
</li>
</ol>
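<p>The fallback order can be expressed as a small selection function. In this sketch the executable names are parameters, because the exact command names on a given system are an assumption: <code>dciodvfy</code> is named above, while <code>validate_iods</code> as the command installed by <code>dicom-validator</code> should be checked against that package's documentation. The function <code>choose_validator()</code> itself is hypothetical.</p>
<pre><code class="language-python">import shutil


def choose_validator(which=shutil.which,
                     preferred="dciodvfy",
                     secondary="validate_iods"):
    """Pick the most rigorous validator available on this machine."""
    # `which` returns a path when the executable is on PATH, else None.
    if which(preferred):
        return "dciodvfy"
    if which(secondary):
        return "dicom-validator"
    # Last resort: re-read each file with pydicom.
    return "pydicom-reread"


print(choose_validator())
</code></pre>
<p>Passing <code>which</code> as a parameter makes the selection logic testable without installing any of the tools.</p>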
<h2 id="heading-a-surprising-bug-monai-vs-pil"><strong>A Surprising Bug: MONAI vs PIL</strong></h2>
<p>Originally, I planned to use MONAI for image loading because it's widely used in medical imaging workflows.</p>
<p>During testing, I discovered an issue: MONAI’s image loading conventions caused non-square images to appear rotated when downstream code assumed traditional image layouts.</p>
<p>At the same time, many ultrasound images contained EXIF orientation metadata that required correction.</p>
<p>Switching to PIL solved both issues.</p>
<pre><code class="language-python">from PIL import Image, ImageOps

img = Image.open(path)
img = ImageOps.exif_transpose(img)
</code></pre>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Synthetic PHI does not replace real-world testing, but it provides something healthcare AI teams rarely have: a safe, shareable, and fully labeled dataset with known answers.</p>
<p>By generating realistic identifiers and embedding them into both image pixels and DICOM metadata, we can build reproducible benchmarks for de-identification systems without exposing real patient data.</p>
<p>As AI systems become increasingly responsible for handling sensitive medical information, synthetic PHI may become one of the most important tools for building trustworthy healthcare AI workflows.</p>
<p>The complete implementation is available as a Jupyter notebook in the <a href="https://github.com/Project-MONAI/wg-ultrasound/tree/main/annotation_and_anonymization">MONAI Ultrasound Working Group</a> repository. You can explore the notebook and experiment with the pipeline yourself.</p>
<p>Sometimes the safest way to test whether a system can remove PHI is to create the PHI yourself.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging ]]>
                </title>
                <description>
                    <![CDATA[ I built a clean, well-structured deep learning pipeline using MONAI (Medical Open Network for AI) on a public abdominal ultrasound dataset. The pipeline included: proper subject-grouped train/validat ]]>
                </description>
                <link>https://www.freecodecamp.org/news/why-your-deep-learning-model-isn-t-learning-data-problems-in-medical-imaging/</link>
                <guid isPermaLink="false">6a19aed9b55c6a731d1d7c06</guid>
                
                    <category>
                        <![CDATA[ Medical Imaging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Healthcare AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Dataanalysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Lakshmi Mahabaleshwara ]]>
                </dc:creator>
                <pubDate>Fri, 29 May 2026 15:20:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36be814e-4189-4905-9470-1cb5860e7124.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I built a clean, well-structured deep learning pipeline using <a href="https://project-monai.github.io/">MONAI</a> (Medical Open Network for AI) on a public abdominal ultrasound dataset.</p>
<p>The pipeline included:</p>
<ul>
<li><p>proper subject-grouped train/validation splits</p>
</li>
<li><p>robust preprocessing</p>
</li>
<li><p>carefully decoded segmentation masks</p>
</li>
<li><p>sensible loss functions</p>
</li>
<li><p>consistent evaluation</p>
</li>
</ul>
<p>And the model still struggled to learn.</p>
<p>The interesting part isn't that the model underperformed. What mattered was the diagnosis: a series of simple checks that traced the problem back to the dataset, not the model.</p>
<p>Those checks are useful far beyond medical imaging. They apply to almost any machine learning project.</p>
<p>If you're new to ML, this is a lesson worth carrying into every project: <strong>understand your data before you tune your model.</strong></p>
<p>I set out to build a medical image segmentation tutorial. I ended up learning a more valuable lesson: no amount of careful engineering can rescue a model from a dataset that can't support the task.</p>
<p>By the end of this article, you'll understand:</p>
<ul>
<li><p>How to evaluate whether a dataset can actually support your task</p>
</li>
<li><p>Why "the model isn't learning" is often a data problem</p>
</li>
<li><p>How to rule out engineering bugs before blaming the data</p>
</li>
<li><p>Practical diagnostics you can run in minutes</p>
</li>
<li><p>Why synthetic training data often struggles in real-world deployment</p>
</li>
<li><p>When to stop tuning and walk away from a dataset</p>
</li>
</ul>
<p>This is not a beginner introduction to deep learning – it assumes familiarity with concepts like UNet architectures and training loops. But the data-quality lessons apply broadly to many ML projects.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-step-1-rule-out-the-pipeline-before-blaming-the-data">Step 1: Rule Out the Pipeline Before Blaming the Data</a></p>
<ul>
<li><p><a href="#heading-subject-grouped-splits">Subject-grouped splits</a></p>
</li>
<li><p><a href="#heading-decoding-masks-correctly">Decoding masks correctly</a></p>
</li>
<li><p><a href="#heading-loss-design-and-class-weighting">Loss design and class weighting</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-2-the-model-still-struggled">Step 2: The Model Still Struggled</a></p>
</li>
<li><p><a href="#heading-step-3-interrogating-the-dataset">Step 3: Interrogating the Dataset</a></p>
<ul>
<li><p><a href="#heading-diagnostic-1-what-does-the-dataset-actually-contain">Diagnostic 1: What Does the Dataset Actually Contain?</a></p>
</li>
<li><p><a href="#heading-diagnostic-2-do-synthetic-and-real-images-look-similar">Diagnostic 2: Do Synthetic and Real Images Look Similar?</a></p>
</li>
<li><p><a href="#heading-diagnostic-3-can-the-gap-be-fixed-by-adding-real-data">Diagnostic 3: Can the Gap Be Fixed by Adding Real Data?</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-4-knowing-when-to-stop">Step 4: Knowing When to Stop</a></p>
</li>
<li><p><a href="#heading-a-practical-dataset-evaluation-checklist">A Practical Dataset Evaluation Checklist</a></p>
</li>
<li><p><a href="#heading-what-i-would-try-next">What I Would Try Next</a></p>
</li>
<li><p><a href="#heading-the-bigger-lesson">The Bigger Lesson</a></p>
</li>
</ul>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>I used the <a href="https://www.kaggle.com/datasets/ignaciorlando/ussimandsegm">US Simulation &amp; Segmentation dataset</a>, a public collection of abdominal ultrasound images with organ segmentation labels from Kaggle.</p>
<p>It contains:</p>
<ul>
<li><p><strong>926 synthetic ultrasound images</strong> — generated by a ray-casting simulator from CT scans, with full organ annotations</p>
</li>
<li><p><strong>617 real ultrasound images</strong> — from an actual ultrasound scanner</p>
</li>
<li><p><strong>Labels for 8 organs</strong> — liver, kidney, gallbladder, pancreas, spleen, bones, vessels, and adrenals</p>
</li>
</ul>
<p>At first glance, the dataset looked ideal:</p>
<ul>
<li><p>more than 1,500 images</p>
</li>
<li><p>multiple organ classes</p>
</li>
<li><p>both synthetic and real ultrasound data</p>
</li>
</ul>
<p>Whether it actually supported the task was a different question.</p>
<h2 id="heading-step-1-rule-out-the-pipeline-before-blaming-the-data">Step 1: Rule Out the Pipeline Before Blaming the Data</h2>
<p>Ground rule: you should always rule out the pipeline before blaming the data. A model failing on buggy code looks exactly like a model failing on bad data. The engineering needs to be trustworthy.</p>
<h3 id="heading-subject-grouped-splits">Subject-Grouped Splits</h3>
<p>A common mistake in medical imaging is randomly splitting images into train and test sets.</p>
<p>That approach is problematic because many frames come from the same patient. Those frames share anatomy, scanner settings, and noise patterns.</p>
<p>If frames from the same patient appear in both the train and test sets, the model can partially memorize patient-specific patterns. Test scores look artificially good, even though the model may fail on truly unseen patients.</p>
<p>This is called <strong>subject leakage</strong>.</p>
<p>The fix is to split by patient instead of by image:</p>
<pre><code class="language-python">from sklearn.model_selection import GroupShuffleSplit

def assign_splits(manifest, val_fraction=0.15, seed=42):
    """Split the training rows of a pandas DataFrame by subject, not by image."""
    train_data = manifest[manifest["orig_split"] == "train"]
    groups = train_data["subject_id"].values

    gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(gss.split(X=train_data, y=None, groups=groups))

    train_subjects = set(train_data.iloc[train_idx]["subject_id"].unique())
    val_subjects = set(train_data.iloc[val_idx]["subject_id"].unique())

    # Crash loudly if leakage ever sneaks in
    assert train_subjects.isdisjoint(val_subjects), "Subject leak detected!"
    return train_subjects, val_subjects
</code></pre>
<p><strong>That assertion matters.</strong> If the split logic ever breaks, the pipeline fails loudly instead of silently producing misleading metrics.</p>
<h3 id="heading-decoding-masks-correctly">Decoding Masks Correctly</h3>
<p>The dataset stores labels as color-coded masks. Each organ corresponds to a different RGB color.</p>
<p>Training requires converting those colors into integer class labels.</p>
<p>A naïve implementation uses exact color matching, but resizing with interpolation (anything other than nearest-neighbor) can blend colors at mask boundaries, producing pixels that match no palette entry.</p>
<p>A more robust approach maps each pixel to its nearest palette color:</p>
<pre><code class="language-python">import numpy as np

PALETTE = np.array([
    [0, 0, 0],
    [100, 0, 100],
    [255, 255, 255],
    [0, 255, 0],
    [255, 255, 0],
    [0, 0, 255],
    [255, 0, 0],
    [255, 0, 255],
    [0, 255, 255],
], dtype=np.int32)

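def decode_mask(mask_rgb):
    """Map an (H, W, 3) RGB mask to an (H, W) array of PALETTE indices."""
    h, w = mask_rgb.shape[:2]
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    # Squared distance from every pixel to every palette color: (pixels, colors)
    d2 = (
        (flat[:, None, :] - PALETTE[None, :, :]) ** 2
    ).sum(-1)
    # The nearest palette entry's index becomes the class label
    classes = d2.argmin(axis=1).astype(np.uint8)
    return classes.reshape(h, w)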
</code></pre>
<p>Before training, it’s worth visually checking a few decoded masks against the original images. This catches issues like incorrect palettes, RGB/BGR channel swaps, or resizing artifacts that silently corrupt labels.</p>
<p>These bugs rarely throw errors. Instead, the model simply learns poorly. And “<em>trained on wrong labels</em>” looks exactly like “<em>the model can’t learn the data.</em>”</p>
<p>Verifying masks early removes that uncertainty.</p>
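<p>A quick numeric check can complement the visual one. The sketch below measures how many mask pixels fail to match any palette color exactly. It is an illustration, not the project's actual code: <code>off_palette_fraction</code> is a hypothetical helper, and the two-color palette and toy mask are made up for the demo.</p>
<pre><code class="language-python">import numpy as np

def off_palette_fraction(mask_rgb, palette):
    """Fraction of pixels whose color is not an exact palette entry."""
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    # Compare every pixel against every palette color: shape (pixels, colors)
    exact = (flat[:, None, :] == palette[None, :, :]).all(axis=-1)
    return 1.0 - exact.any(axis=1).mean()

# Toy 2x2 mask: three exact palette colors and one blended boundary pixel
demo_palette = np.array([[0, 0, 0], [255, 0, 0]], dtype=np.int32)
demo_mask = np.array(
    [[[0, 0, 0], [255, 0, 0]],
     [[255, 0, 0], [128, 0, 0]]],
    dtype=np.uint8,
)
fraction = off_palette_fraction(demo_mask, demo_palette)
print(f"off-palette pixels: {fraction:.2%}")
</code></pre>
<p>A small fraction concentrated at organ boundaries would be consistent with resize interpolation. A large fraction would more likely point to a wrong palette or a channel-order swap.</p>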
<h3 id="heading-loss-design-and-class-weighting">Loss Design and Class Weighting</h3>
<p>For training, I used standard MONAI segmentation losses. The goal wasn’t to aggressively maximize performance, but to establish a stable and trustworthy baseline.</p>
<p>The training curves below show that the model optimized normally: the loss decreased consistently, and the validation Dice stabilized rather than diverging. This helped rule out optimization instability as the primary cause of poor final performance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/841346d4-d3df-48a9-bc4d-31a5dd0d9bb0.png" alt="Two training curves from a MONAI liver segmentation experiment. The left plot shows training loss steadily decreasing across 50 epochs, while the right plot shows validation Dice scores stabilizing around 0.55–0.60 after initial fluctuations, indicating stable optimization but limited segmentation performance." style="display:block;margin:0 auto" width="1594" height="448" loading="lazy">

<p>Three choices were deliberate:</p>
<ul>
<li><p><strong>Dice + Cross-Entropy combined:</strong> Cross-entropy keeps learning stable early on, while Dice directly rewards good region overlap. Together they balance each other.</p>
</li>
<li><p><code>include_background=False</code> <strong>for binary segmentation:</strong> In a single-organ task, background can be 85–90% of the pixels. Counting it in the loss drowns out the signal for the organ you actually care about, so it's better left out.</p>
</li>
<li><p><strong>Class weighting for multi-class segmentation:</strong> With organs of very different sizes, an unweighted loss lets the model ignore the small, rare ones and still score well. Weighting rare-class mistakes more heavily pushes back against that.</p>
</li>
</ul>
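<p>To make the class-weighting idea concrete, here is a rough sketch that derives inverse-frequency weights from label maps. This is one common heuristic, not necessarily the scheme used in this project. <code>inverse_frequency_weights</code> is a hypothetical helper and the toy label map is invented for the demo.</p>
<pre><code class="language-python">import numpy as np

def inverse_frequency_weights(label_maps, num_classes):
    """Weight each class by the inverse of its pixel frequency."""
    counts = np.zeros(num_classes, dtype=np.float64)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    # Floor the frequency so a class that never appears cannot divide by zero
    weights = 1.0 / np.maximum(freq, 1e-6)
    # Normalize so the weights average to 1
    return weights / weights.mean()

# Toy label map: class 0 fills 90 pixels, class 1 fills 8, class 2 fills 2
toy = np.zeros(100, dtype=np.uint8)
toy[90:98] = 1
toy[98:] = 2
weights = inverse_frequency_weights([toy.reshape(10, 10)], num_classes=3)
print(weights.round(3))
</code></pre>
<p>The rarest class ends up with the largest weight. Weights like these could then be handed to the cross-entropy part of a combined loss, though in practice they often need softening (for example, by taking a square root) to keep training stable.</p>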
<h2 id="heading-step-2-the-model-still-struggled">Step 2: The Model Still Struggled</h2>
<p>The first experiment focused on liver segmentation — the simplest single-organ task in the dataset.</p>
<table>
<thead>
<tr>
<th>Test set</th>
<th>Liver Dice</th>
</tr>
</thead>
<tbody><tr>
<td>Synthetic test set</td>
<td>~0.68</td>
</tr>
<tr>
<td>Real ultrasound test set</td>
<td>~0.48</td>
</tr>
</tbody></table>
<p>Dice scores range from 0 (no overlap) to 1 (perfect overlap).</p>
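<p>For a concrete sense of the metric, here is a minimal NumPy sketch of the Dice score for two binary masks. <code>dice_score</code> and the toy arrays are illustrative, not the evaluation code behind the table above. The small <code>eps</code> term is an assumption added so that two empty masks don't cause a division by zero.</p>
<pre><code class="language-python">import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2 * overlap / (size of pred + size of target) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: the prediction covers 2 of 3 target pixels plus 1 false positive
target = np.array([[1, 1, 1, 0]])
pred = np.array([[1, 1, 0, 1]])
score = dice_score(pred, target)
print(round(score, 3))
</code></pre>
<p>Here the overlap is 2 pixels and each mask contains 3, so the score is 4/6, or about 0.667.</p>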
<p>Qualitatively, the predictions often captured the rough liver region but were inaccurate at boundaries and inconsistent across real scans.</p>
<p>Especially important:</p>
<ul>
<li><p>the model struggled even on synthetic in-domain data</p>
</li>
<li><p>performance dropped further on real ultrasound images</p>
</li>
</ul>
<p>At this point, two explanations were possible:</p>
<ol>
<li><p>the model or pipeline was flawed</p>
</li>
<li><p>the dataset itself was limiting performance</p>
</li>
</ol>
<p>Because the engineering had been carefully validated, the second possibility became worth investigating seriously.</p>
<p>That's where the real lesson began.</p>
<h2 id="heading-step-3-interrogating-the-dataset">Step 3: Interrogating the Dataset</h2>
<p>Rather than endlessly tuning the model, the productive move is to turn the diagnostic lens on the dataset.</p>
<p>Three simple checks revealed the real problem. None required retraining or expensive experiments.</p>
<h3 id="heading-diagnostic-1-what-does-the-dataset-actually-contain">Diagnostic 1: What Does the Dataset Actually Contain?</h3>
<p>The first step was simply plotting the dataset composition.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/d2855b12-b416-4a76-b743-971bf4389628.png" alt="Bar chart showing the composition of the ultrasound segmentation dataset. The dataset contains 926 labeled synthetic ultrasound images, 60 labeled real ultrasound images, and 557 unlabeled real ultrasound images, for a total of 1,543 images. Labeled real data represents only 3.9% of the dataset." style="display:block;margin:0 auto" width="1574" height="932" loading="lazy">

<ul>
<li><p><strong>926 labeled synthetic images</strong> (the bulk of training data)</p>
</li>
<li><p><strong>Only 60 labeled real images</strong> — less than 4% of the dataset</p>
</li>
<li><p><strong>557 unlabeled real images</strong> — real data exists, but without labels it can't be used for supervised training</p>
</li>
</ul>
<p>This immediately changed the interpretation of the dataset.</p>
<p>Although the dataset contains many real ultrasound scans, almost all labeled training data is synthetic.</p>
<p>The model is effectively trained on synthetic ultrasound and expected to generalize to real ultrasound.</p>
<p>That's a difficult transfer problem from the start.</p>
<p>The limitation is simple: the real images mostly don't have labels, so supervised training has very little real-world data to learn from.</p>
<p><strong>Lesson:</strong> Before training anything, chart the dataset composition. A headline image count can be misleading. "1,500 images" sounds large until you discover that only a tiny fraction are labeled examples from the target domain.</p>
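<p>A composition check like this takes only a few lines. The sketch below tallies images by domain and label status. The manifest rows and the field names <code>domain</code> and <code>has_mask</code> are hypothetical stand-ins, built here to match the counts reported above.</p>
<pre><code class="language-python">from collections import Counter

# Hypothetical manifest rows, built to match the counts reported above
manifest_rows = (
    [{"domain": "synthetic", "has_mask": True}] * 926
    + [{"domain": "real", "has_mask": True}] * 60
    + [{"domain": "real", "has_mask": False}] * 557
)

# Tally by (domain, label status) rather than trusting the headline count
composition = Counter(
    (row["domain"], "labeled" if row["has_mask"] else "unlabeled")
    for row in manifest_rows
)
total = sum(composition.values())
for (domain, status), count in composition.most_common():
    print(f"{domain:10s} {status:10s} {count:5d} {count / total:6.1%}")
</code></pre>
<p>Grouping by both fields at once is the point: a per-domain count alone would show 617 real images and hide the fact that only 60 of them are usable for supervised training.</p>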
<h3 id="heading-diagnostic-2-do-synthetic-and-real-images-look-similar">Diagnostic 2: Do Synthetic and Real Images Look Similar?</h3>
<p>The next question was whether the synthetic and real ultrasound images actually followed similar visual distributions.</p>
<p>Plotting intensity histograms showed a clear mismatch.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/baac5168-292e-45f8-ab9c-fd468dc63b46.png" alt="Histogram comparing pixel intensity distributions between synthetic and real ultrasound images. Synthetic images cluster heavily around lower intensity values, while real ultrasound images show a broader mid-range distribution. The figure also reports summary statistics including mean intensity, standard deviation, and percentile ranges for both datasets." style="display:block;margin:0 auto" width="1705" height="951" loading="lazy">

<ul>
<li><p>synthetic images clustered heavily near darker intensities</p>
</li>
<li><p>real ultrasound images had broader mid-range intensity distributions</p>
</li>
</ul>
<p>The synthetic simulator captured anatomical geometry reasonably well, but it didn't reproduce the texture and noise characteristics of real ultrasound:</p>
<ul>
<li><p>speckle patterns</p>
</li>
<li><p>intensity falloff</p>
</li>
<li><p>scanner-specific artifacts</p>
</li>
</ul>
<p>This is the classic <strong>synthetic-to-real domain gap.</strong></p>
<p>The model learned features tuned to synthetic images and then encountered a substantially different distribution during evaluation. Poor transfer performance became expected rather than surprising.</p>
<p><strong>Lesson:</strong> Whenever training and deployment happen on different domains — synthetic → real, scanner A → scanner B, hospital A → hospital B — measure the distribution shift directly. Simple histogram comparisons can reveal major problems in minutes.</p>
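<p>One simple way to put a number on such a comparison is histogram intersection. The sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, and the two image sets are random stand-in arrays (one dark, one mid-range), not the actual ultrasound data.</p>
<pre><code class="language-python">import numpy as np

def intensity_histogram(images, bins=32):
    """Normalized histogram of 8-bit pixel intensities across a set of images."""
    pixels = np.concatenate([img.ravel() for img in images])
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()

def histogram_overlap(hist_a, hist_b):
    """Histogram intersection: 1.0 means identical, 0.0 means disjoint."""
    return float(np.minimum(hist_a, hist_b).sum())

# Stand-in data only: random arrays mimicking a dark set and a mid-range set
rng = np.random.default_rng(0)
dark_set = [rng.normal(40, 15, (64, 64)).clip(0, 255) for _ in range(4)]
mid_set = [rng.normal(120, 40, (64, 64)).clip(0, 255) for _ in range(4)]

dark_hist = intensity_histogram(dark_set)
mid_hist = intensity_histogram(mid_set)
overlap = histogram_overlap(dark_hist, mid_hist)
print(f"histogram overlap: {overlap:.2f}")
</code></pre>
<p>A low overlap between training and target domains is a warning sign worth investigating before any tuning. Intensity histograms ignore texture, so a high overlap doesn't prove the domains match.</p>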
<h3 id="heading-diagnostic-3-can-the-gap-be-fixed-by-adding-real-data">Diagnostic 3: Can the Gap Be Fixed by Adding Real Data?</h3>
<p>The obvious next idea was: why not include some real labeled data during training?</p>
<p>But before implementing that approach, it's worth checking how many distinct patients actually had labels.</p>
<pre><code class="language-plaintext">Labeled real images: 60
Distinct subjects (labeled real): 4

Frames per subject:
  subject h: 26
  subject a: 16
  subject g: 10
  subject b: 8
</code></pre>
<p>Only <strong>four</strong> patients.</p>
<p>That result fundamentally changed the situation.</p>
<p>Proper medical imaging evaluation requires subject-grouped train/test splits. But with only four patients, any evaluation becomes statistically unstable.</p>
<p>Training on two or three patients and testing on one or two patients would produce highly unreliable metrics that depend heavily on which patient happened to be held out.</p>
<p>At that point, the dataset simply couldn't support trustworthy real-world evaluation.</p>
<p><strong>Lesson:</strong> In medical imaging, count subjects, not images. The true size of a dataset is bounded by the number of independent patients, not the number of files.</p>
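<p>To see why four subjects is too few, consider what a leave-one-subject-out evaluation would look like with the frame counts reported above. This is a back-of-the-envelope sketch, not an experiment that was run.</p>
<pre><code class="language-python"># Frames per labeled real subject, as reported in the output above
frames_per_subject = {"h": 26, "a": 16, "g": 10, "b": 8}
total_frames = sum(frames_per_subject.values())

# Leave-one-subject-out: each fold holds out every frame from one patient
folds = []
for held_out, n_test in frames_per_subject.items():
    n_train = total_frames - n_test
    folds.append((held_out, n_train, n_test))
    print(f"hold out subject {held_out}: train on {n_train} frames, test on {n_test}")
</code></pre>
<p>The test set swings from 8 to 26 frames depending on the fold, and every fold's score reflects the anatomy of a single patient. Averaging four such numbers gives little basis for a claim about unseen patients.</p>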
<h2 id="heading-step-4-knowing-when-to-stop">Step 4: Knowing When to Stop</h2>
<p>At this point, additional tuning no longer made sense.</p>
<p>The bottleneck was not the architecture, optimizer, or learning rate. The bottleneck was the dataset itself.</p>
<p>The pipeline was still valuable and reusable. But this particular dataset couldn't reliably support the intended segmentation task.</p>
<p>That distinction matters: sometimes a problem is difficult but solvable, and sometimes the data simply can't support the conclusion you want to draw.</p>
<p>Learning to recognize the difference is an important ML skill.</p>
<h2 id="heading-a-practical-dataset-evaluation-checklist">A Practical Dataset Evaluation Checklist</h2>
<p>Before committing weeks to model development, these checks are worth running on any dataset:</p>
<ol>
<li><p><strong>Chart the dataset composition</strong> — labeled vs unlabeled, class distribution, domain distribution</p>
</li>
<li><p><strong>Count subjects, not images</strong> — independent patients matter more than frame count</p>
</li>
<li><p><strong>Check class balance</strong> — rare classes are often ignored without weighting or sampling strategies</p>
</li>
<li><p><strong>Compare train and deployment distributions</strong> — especially for cross-domain problems</p>
</li>
<li><p><strong>Verify labels visually</strong> — catch preprocessing or annotation errors early</p>
</li>
<li><p><strong>Look for published baselines</strong> — low published performance may indicate dataset limitations</p>
</li>
</ol>
<p>These checks take minutes and can save weeks of unnecessary tuning.</p>
<h2 id="heading-what-i-would-try-next">What I Would Try Next</h2>
<p>Improving results would likely require better data rather than a larger model. The next steps I'd prioritize:</p>
<ul>
<li><p>collecting more labeled real ultrasound scans, from more distinct patients</p>
</li>
<li><p>improving annotation consistency</p>
</li>
<li><p>semi-supervised learning to make use of the unlabeled real images</p>
</li>
<li><p>domain adaptation between synthetic and real ultrasound</p>
</li>
</ul>
<p>All of these target the actual bottleneck: data quality and data diversity.</p>
<h2 id="heading-the-bigger-lesson">The Bigger Lesson</h2>
<p>In machine learning, it's easy to focus most of our attention on architectures, hyperparameters, optimization tricks, and newer models.</p>
<p>But the dataset quietly defines the ceiling.</p>
<p>A sophisticated model on weak data often disappoints, while a simpler model on strong data performs surprisingly well.</p>
<p>That was the real lesson from this project.</p>
<p>The most valuable skill wasn't building the pipeline. It was diagnosing why the model couldn't succeed and being willing to trust what the data was saying.</p>
<p>The workflow — checking dataset composition, counting subjects, comparing distributions, ruling out engineering bugs, and deciding when to stop — transfers to almost any ML project.</p>
<p>In many projects, better judgment about the data matters more than a better model.</p>
<p>The pipeline code and diagnostic notebooks are available at the <a href="https://github.com/lakshmi-mahabaleshwara/wg-ultrasound/tree/abdomen_simulation_segmentation/data_and_tutorials/abdomen_us_multiorgan_segmentation">MONAI Ultrasound Working Group repository</a>. Questions, corrections, and improvements are always welcome.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
