<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Lakshmi Mahabaleshwara - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Lakshmi Mahabaleshwara - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 22 May 2026 17:38:55 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/lakshmi-mahabalesh/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build an AI-Powered Medical Image De-Identification Pipeline for Clinical Research ]]>
                </title>
                <description>
                    <![CDATA[ Medical imaging is transforming healthcare. Researchers are training deep learning models to detect pneumonia from chest X-rays, estimate cardiac function from echocardiograms, and identify tumors fro ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-ai-image-de-identification-for-clinical-research/</link>
                <guid isPermaLink="false">6a1070e71f237623ea06ca2d</guid>
                
                    <category>
                        <![CDATA[ healthcare ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ open source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Lakshmi Mahabaleshwara ]]>
                </dc:creator>
                <pubDate>Fri, 22 May 2026 15:06:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/04f2b51b-5590-4bde-9d2c-4af3a6d4237c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Medical imaging is transforming healthcare. Researchers are training deep learning models to detect pneumonia from chest X-rays, estimate cardiac function from echocardiograms, and identify tumors from MRI scans. But before any of these images can be shared with researchers or used to train machine learning models, one critical challenge must be solved.</p>
<p><em><strong>How Do We Protect Patient Privacy?</strong></em></p>
<p>Medical images often contain sensitive information such as patient names, dates of birth, hospital identifiers, and accession numbers. Some of this information is stored in DICOM (<strong>Digital Imaging and Communications in Medicine</strong>) metadata, but much of it is also burned directly into the image pixels.</p>
<p>In this tutorial, you’ll learn how to build an AI-powered de-identification pipeline that removes protected health information (PHI) from both metadata and image pixels. Along the way, we’ll explore OCR (Optical Character Recognition), NER (Named Entity Recognition), and standards-based DICOM processing.</p>
<p>At the end, I’ll show how I combined these ideas into an open-source PyTorch project called Aegis.</p>
<ul>
<li><p><a href="#heading-what-youll-build">What You’ll Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-why-privacy-matters-in-medical-imaging">Why Privacy Matters in Medical Imaging</a></p>
</li>
<li><p><a href="#heading-understanding-phi-hipaa-and-dicom">Understanding PHI, HIPAA, and DICOM</a></p>
<ul>
<li><p><a href="#heading-what-is-phi">What Is PHI?</a></p>
</li>
<li><p><a href="#heading-what-is-hipaa">What Is HIPAA?</a></p>
</li>
<li><p><a href="#heading-what-is-dicom">What Is DICOM?</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-why-metadata-anonymization-is-not-enough-in-dicom-format">Why Metadata Anonymization Is Not Enough in DICOM format</a></p>
</li>
<li><p><a href="#heading-ocr-and-ai-for-identifying-phi">OCR and AI for Identifying PHI</a></p>
<ul>
<li><p><a href="#heading-step-1-optical-character-recognition-ocr">Step 1: Optical Character Recognition (OCR)</a></p>
</li>
<li><p><a href="#heading-step-2-determine-whether-the-text-is-phi">Step 2: Determine Whether the Text Is PHI</a></p>
</li>
<li><p><a href="#heading-step-3-named-entity-recognition">Step 3: Named Entity Recognition</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-pixel-redaction-and-dicom-scrubbing">Pixel Redaction and DICOM Scrubbing</a></p>
<ul>
<li><a href="#heading-dicom-metadata-scrubbing">DICOM Metadata Scrubbing</a></li>
</ul>
</li>
<li><p><a href="#heading-building-the-complete-pipeline">Building the Complete Pipeline</a></p>
</li>
<li><p><a href="#heading-challenges-and-lessons-learned">Challenges and Lessons Learned</a></p>
</li>
<li><p><a href="#heading-how-i-built-aegis">How I Built Aegis</a></p>
</li>
<li><p><a href="#heading-key-design-decisions">Key Design Decisions</a></p>
</li>
<li><p><a href="#heading-future-directions">Future Directions</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-youll-build">What You’ll Build</h2>
<p>In this tutorial, you’ll build a custom MONAI (PyTorch) preprocessing pipeline that automatically de-identifies medical images before they are used for clinical research or AI model training.</p>
<p>The pipeline will:</p>
<ul>
<li><p>Discover DICOM studies</p>
</li>
<li><p>Load metadata and pixel data</p>
</li>
<li><p>Detect burned-in text using OCR</p>
</li>
<li><p>Classify text as PHI or non-PHI</p>
</li>
<li><p>Redact sensitive pixel regions</p>
</li>
<li><p>Remove PHI from DICOM metadata</p>
</li>
<li><p>Save privacy-safe images for downstream AI workflows</p>
</li>
</ul>
<p>By the end, you’ll have a reusable MONAI transform that can be integrated directly into any medical imaging workflow to prepare privacy-safe datasets for research and deep learning.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow this tutorial, you should have:</p>
<ul>
<li><p>Intermediate Python experience</p>
</li>
<li><p>Basic understanding of PyTorch</p>
</li>
<li><p>Familiarity with medical imaging concepts</p>
</li>
<li><p>Python 3.10 or later</p>
</li>
</ul>
<p>We’ll use:</p>
<ul>
<li><p>MONAI</p>
</li>
<li><p>pydicom</p>
</li>
<li><p>EasyOCR</p>
</li>
<li><p>NumPy</p>
</li>
<li><p>Transformers</p>
</li>
<li><p>The Stanford de-identifier NER model (<code>StanfordAIMI/stanford-deidentifier-base</code> on Hugging Face)</p>
</li>
</ul>
<p><em><strong>Set Up the Environment</strong></em></p>
<pre><code class="language-bash"># Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # On Windows: venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install the core libraries used in this tutorial
pip install \
    monai \
    pydicom \
    easyocr \
    numpy \
    transformers \
    torch 

# Download the Stanford medical de-identification model from Hugging Face
python -c "
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'StanfordAIMI/stanford-deidentifier-base'
AutoTokenizer.from_pretrained(model_name)
AutoModelForTokenClassification.from_pretrained(model_name)
print('Stanford NER model downloaded successfully.')
"
</code></pre>
<h2 id="heading-why-privacy-matters-in-medical-imaging"><strong>Why Privacy Matters in Medical Imaging</strong></h2>
<p>Healthcare organizations generate enormous volumes of imaging data every day. These datasets are invaluable for:</p>
<ul>
<li><p>Clinical research</p>
</li>
<li><p>Multi-center collaborations</p>
</li>
<li><p>Regulatory submissions</p>
</li>
<li><p>Artificial intelligence model development</p>
</li>
<li><p>Educational datasets</p>
</li>
</ul>
<p>But privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States require that Protected Health Information (PHI) be removed before data can be shared. This creates a significant bottleneck.</p>
<p>Many hospitals still rely on manual review to inspect thousands of images, searching for patient identifiers hidden in metadata and image annotations. This process is slow, expensive, and prone to human error.</p>
<p>Automated de-identification addresses this bottleneck by combining software engineering, computer vision, and natural language processing to find and remove identifiers at scale.</p>
<h2 id="heading-understanding-phi-hipaa-and-dicom">Understanding PHI, HIPAA, and DICOM</h2>
<h3 id="heading-what-is-phi"><strong>What Is PHI?</strong></h3>
<p>Protected Health Information (PHI) is health information that can be linked to an individual patient. In medical imaging, it commonly appears as identifiers such as:</p>
<pre><code class="language-plaintext">Name
Medical record number
Date of birth
Study date
Hospital ID
Accession number
</code></pre>
<h3 id="heading-what-is-hipaa"><strong>What Is HIPAA?</strong></h3>
<p>The Health Insurance Portability and Accountability Act (HIPAA) defines rules for safeguarding patient data. One common approach to de-identification is the Safe Harbor method, which requires removing 18 specified categories of identifiers, such as names, most date elements, and medical record numbers, before data is shared.</p>
<h3 id="heading-what-is-dicom"><strong>What Is DICOM?</strong></h3>
<p>Medical images such as <strong>Computed Tomography (CT)</strong>, <strong>Magnetic Resonance Imaging (MRI)</strong>, and <strong>Ultrasound (US)</strong> are commonly stored in the DICOM <strong>(Digital Imaging and Communications in Medicine)</strong> format, the international standard for storing and exchanging medical imaging data.</p>
<p>Unlike ordinary image formats such as JPEG or PNG, a DICOM file contains both the image itself and a rich set of structured metadata that describes the patient, the study, and the imaging procedure.</p>
<p>A typical DICOM file contains two main components:</p>
<ol>
<li><p><strong>Pixel Data</strong> – the actual medical image, such as a CT slice, MRI volume, or ultrasound frame.</p>
</li>
<li><p><strong>Metadata</strong> – structured fields that may include:</p>
<ul>
<li><p>Patient name and medical record number</p>
</li>
<li><p>Date of birth</p>
</li>
<li><p>Study and acquisition dates</p>
</li>
<li><p>Imaging modality (CT, MRI, US)</p>
</li>
<li><p>Scanner manufacturer and technical acquisition parameters</p>
</li>
</ul>
</li>
</ol>
<p>This combination makes DICOM far more than just an image format. It serves as a standardized container that allows imaging devices, hospital systems, and research software to exchange data reliably and consistently.</p>
<p>DICOM metadata often contains protected health information (PHI), and identifiers may also be burned directly into the image pixels, particularly in ultrasound studies. De-identification must therefore address both the metadata and the pixel data before images can be safely shared for clinical research or AI development.</p>
<h2 id="heading-why-metadata-anonymization-is-not-enough-in-dicom-format"><strong>Why Metadata Anonymization Is Not Enough in DICOM format</strong></h2>
<p>Many tools remove PHI only from metadata. For example, deleting the PatientName tag may appear sufficient.</p>
<p>But in modalities such as ultrasound, fluoroscopy, and some X-ray workflows, identifying information is often burned directly into the image.</p>
<p>Common examples include:</p>
<pre><code class="language-plaintext">NAME: JOHN DOE
DOB: 01/01/1980
MRN: 123456
HOSPITAL: ABC
</code></pre>
<p>If these annotations remain, privacy is still compromised. This means a complete solution must inspect both:</p>
<ul>
<li><p>DICOM metadata</p>
</li>
<li><p>Image pixels</p>
</li>
</ul>
<h2 id="heading-ocr-and-ai-for-identifying-phi"><strong>OCR and AI for Identifying PHI</strong></h2>
<p>To detect PHI embedded in pixels, we first need to find all visible text.</p>
<h3 id="heading-step-1-optical-character-recognition-ocr"><strong>Step 1: Optical Character Recognition (OCR)</strong></h3>
<p>OCR converts image text into machine-readable strings.</p>
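<pre><code class="language-python">import easyocr
# Load the English recognition model (downloaded on first use)
reader = easyocr.Reader(['en'])
# Each result is a (bounding_box, text, confidence) tuple
results = reader.readtext('ultrasound.png')
</code></pre>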
<p>Each OCR result typically includes:</p>
<ul>
<li><p>Bounding box coordinates – where the text appears in the image</p>
</li>
<li><p>Extracted text – the recognized characters</p>
</li>
<li><p>Confidence score – how certain the model is about the result</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="language-python">[
  ([[10, 20], [120, 20], [120, 45], [10, 45]], 'JOHN DOE', 0.98)
]
</code></pre>
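<p>Downstream steps are easier to write against plain rectangles than four-point polygons. The following is a minimal sketch, assuming results shaped like the example above; the helper names <code>polygon_to_box</code> and <code>to_detections</code> and the 0.5 threshold are illustrative choices, not part of EasyOCR.</p>
<pre><code class="language-python">def polygon_to_box(polygon):
    """Convert a 4-point OCR polygon into an axis-aligned (x1, y1, x2, y2) box."""
    xs = [int(point[0]) for point in polygon]
    ys = [int(point[1]) for point in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def to_detections(results, min_confidence=0.5):
    """Turn (polygon, text, confidence) tuples into dicts with a review flag."""
    detections = []
    for polygon, text, confidence in results:
        detections.append({
            "box": polygon_to_box(polygon),
            "text": text.strip(),
            # Low-confidence text is flagged rather than dropped, because
            # a misread string can still be a patient identifier.
            "needs_review": min_confidence > confidence,
        })
    return detections


# Hypothetical OCR output in the same shape as the example above
sample_results = [
    ([[10, 20], [120, 20], [120, 45], [10, 45]], "JOHN DOE", 0.98),
    ([[200, 300], [260, 300], [260, 318], [200, 318]], "MRN 1Z3", 0.31),
]
detections = to_detections(sample_results)
print(detections[0]["box"])  # (10, 20, 120, 45)
</code></pre>
<p>Flagging uncertain detections instead of discarding them is a conservative choice: in a privacy workflow, redacting a harmless label usually costs less than leaking an identifier.</p>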
<h3 id="heading-step-2-determine-whether-the-text-is-phi"><strong>Step 2: Determine Whether the Text Is PHI</strong></h3>
<p>Not all detected text should be removed.</p>
<p>Medical images also contain clinically relevant labels such as:</p>
<pre><code class="language-plaintext">LEFT VENTRICLE
APICAL VIEW
B-MODE
</code></pre>
<p>To distinguish PHI from legitimate clinical text, we can combine:</p>
<ol>
<li><p>Allowlists of known clinical terms</p>
</li>
<li><p>Regular-expression heuristics</p>
</li>
<li><p>Named Entity Recognition (NER)</p>
</li>
</ol>
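<p>The first two ingredients can be sketched with the standard library alone. The allowlist entries and regular expressions below are deliberately small, illustrative assumptions; a production system would need far broader coverage and site-specific tuning. The function names are hypothetical.</p>
<pre><code class="language-python">import re

# Illustrative allowlist; real deployments maintain a much larger vocabulary.
CLINICAL_ALLOWLIST = {"LEFT VENTRICLE", "APICAL VIEW", "B-MODE"}

# Dates such as 01/01/1980, 1-1-80, or 1980-01-01
DATE_PATTERN = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b"
)
# Labelled fields such as "MRN: 123456", or any run of six or more digits
IDENTIFIER_PATTERN = re.compile(
    r"\b(?:MRN|ID|ACC|DOB|NAME)\b\s*[:#]|\b\d{6,}\b", re.IGNORECASE
)


def is_allowlisted(text):
    """Compare case-insensitively after collapsing repeated whitespace."""
    return " ".join(text.upper().split()) in CLINICAL_ALLOWLIST


def looks_like_date(text):
    return DATE_PATTERN.search(text) is not None


def looks_like_identifier(text):
    return IDENTIFIER_PATTERN.search(text) is not None


def rule_verdict(text):
    """Return 'keep', 'phi', or None when the rules cannot decide."""
    if is_allowlisted(text):
        return "keep"
    if looks_like_date(text) or looks_like_identifier(text):
        return "phi"
    return None  # hand the text to the NER model


for sample in ["LEFT VENTRICLE", "DOB: 01/01/1980", "JOHN DOE"]:
    print(sample, rule_verdict(sample))
</code></pre>
<p>Returning <code>None</code> for undecided text keeps the rules cheap and predictable while leaving names and other free-form strings to the NER step.</p>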
<h3 id="heading-step-3-named-entity-recognition"><strong>Step 3: Named Entity Recognition</strong></h3>
<p>NER models identify entities such as:</p>
<pre><code class="language-plaintext">PERSON
DATE
LOCATION
ID
</code></pre>
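<pre><code class="language-python">def contains_phi(text):
    # Cheap pattern checks run first; looks_like_date and
    # looks_like_identifier are rule-based helpers defined elsewhere.
    if looks_like_date(text):
        return True
    if looks_like_identifier(text):
        return True
    # Fall back to the NER model for names and other free-text entities.
    return ner_model.predict(text)
</code></pre>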
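<p>This hybrid approach reduces false negatives, because the rules catch structured identifiers such as dates and record numbers while the NER model catches names and other free-text entities. Pairing it with the clinical allowlist from Step 2 reduces false positives by keeping legitimate labels out of the redaction set.</p>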
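<p>One way to supply the NER step is the Hugging Face <code>pipeline</code> API with the model downloaded during setup. This is a sketch under assumptions: the function names and the 0.5 score threshold are hypothetical, and treating every returned entity as PHI is a conservative simplification rather than a documented behaviour of the model.</p>
<pre><code class="language-python">def load_ner(model_name="StanfordAIMI/stanford-deidentifier-base"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities
    # and omits tokens labelled as non-entities.
    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",
    )


def ner_flags_phi(text, ner, min_score=0.5):
    """Return True when the NER callable reports an entity at or above min_score."""
    entities = ner(text)
    return any(entity["score"] >= min_score for entity in entities)


# Usage (requires the transformers and torch packages):
#   ner = load_ner()
#   ner_flags_phi("JOHN DOE", ner)
</code></pre>
<p>Passing the pipeline in as an argument keeps the expensive model load in one place and makes the decision logic easy to test with a stand-in callable.</p>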
<h2 id="heading-pixel-redaction-and-dicom-scrubbing"><strong>Pixel Redaction and DICOM Scrubbing</strong></h2>
<p><strong>Pixel Redaction</strong></p>
<p>Once PHI is detected, the corresponding image regions can be masked.</p>
<pre><code class="language-python">image[y1:y2, x1:x2] = 0
</code></pre>
<p>This replaces the sensitive area with black pixels.</p>
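<p>The same idea extends to several boxes and to multi-frame studies. The sketch below assumes grayscale pixel arrays shaped <code>(rows, cols)</code> or <code>(frames, rows, cols)</code> and boxes given as <code>(x1, y1, x2, y2)</code>; the function name and the small padding default are illustrative.</p>
<pre><code class="language-python">import numpy as np


def redact_boxes(pixels, boxes, padding=2, fill_value=0):
    """Return a copy of pixels with every (x1, y1, x2, y2) box filled in.

    Assumes a grayscale layout: (rows, cols) for a single frame or
    (frames, rows, cols) for a cine loop.
    """
    redacted = np.array(pixels, copy=True)
    rows, cols = redacted.shape[-2], redacted.shape[-1]
    for x1, y1, x2, y2 in boxes:
        # Pad the box slightly and clip it to the image bounds.
        top, bottom = max(0, y1 - padding), min(rows, y2 + padding)
        left, right = max(0, x1 - padding), min(cols, x2 + padding)
        # The ellipsis applies the same mask to every frame.
        redacted[..., top:bottom, left:right] = fill_value
    return redacted


frames = np.full((3, 64, 128), 255, dtype=np.uint8)
clean = redact_boxes(frames, [(10, 20, 50, 30)], padding=0)
print(int((clean == 0).sum()))  # 3 frames * 10 rows * 40 cols = 1200
</code></pre>
<p>Working on a copy leaves the source array untouched, which makes it easier to compare the original and redacted images during validation.</p>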
<h3 id="heading-dicom-metadata-scrubbing"><strong>DICOM Metadata Scrubbing</strong></h3>
<p>Using pydicom, metadata fields can be modified or removed.</p>
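<pre><code class="language-python">import pydicom

ds = pydicom.dcmread('study.dcm')
ds.PatientName = 'ANONYMIZED'
# Optional attributes may be absent, so check before deleting
if 'PatientBirthDate' in ds:
    del ds.PatientBirthDate
# Write to a new file so the original stays untouched
ds.save_as('study_deid.dcm')
</code></pre>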
<p>Additional steps may include:</p>
<ul>
<li><p>Removing private tags</p>
</li>
<li><p>Replacing UIDs</p>
</li>
<li><p>Recursively processing nested sequences</p>
</li>
</ul>
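<p>These additional steps can be sketched with pydicom's <code>remove_private_tags()</code>, <code>walk()</code>, and <code>generate_uid()</code>. The keyword set below is a small illustrative subset, not a complete confidentiality profile, and the function names are hypothetical.</p>
<pre><code class="language-python">from pydicom.dataset import Dataset
from pydicom.sequence import Sequence
from pydicom.uid import generate_uid

# Illustrative subset only; a real profile covers many more attributes.
PHI_KEYWORDS = {"PatientName", "PatientID", "PatientBirthDate", "AccessionNumber"}


def scrub_element(dataset, element):
    """Callback for Dataset.walk(): drop PHI attributes and remap instance UIDs."""
    if element.keyword in PHI_KEYWORDS:
        del dataset[element.tag]
    elif element.VR == "UI" and element.keyword.endswith("InstanceUID"):
        # Hashing the old UID gives the same replacement every time it appears.
        element.value = generate_uid(entropy_srcs=[str(element.value)])


def scrub_dataset(ds):
    ds.remove_private_tags()   # private (odd-group) vendor attributes
    ds.walk(scrub_element)     # recurses into nested sequences by default
    return ds


# Small in-memory example with a private tag and a nested sequence
item = Dataset()
item.PatientID = "123456"
ds = Dataset()
ds.PatientName = "DOE^JOHN"
ds.StudyInstanceUID = "1.2.3.4"
ds.add_new(0x00091001, "LO", "vendor note")
ds.ReferencedStudySequence = Sequence([item])
scrub_dataset(ds)
print("PatientName" in ds, "PatientID" in ds.ReferencedStudySequence[0])  # False False
</code></pre>
<p>Deriving each new UID from the old one keeps references between files consistent after scrubbing. Note that <code>walk()</code> covers the main dataset only, so UIDs stored in the file meta header need separate handling.</p>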
<p>Together, metadata scrubbing and pixel redaction provide comprehensive de-identification.</p>
<h2 id="heading-building-the-complete-pipeline"><strong>Building the Complete Pipeline</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/74b371d7-cb4a-47b5-afa6-d9e39331d03f.png" alt="Step-by-step workflow for medical image de-identification: discover files, load DICOM metadata, run OCR, classify PHI, redact pixels, scrub metadata, and save de-identified output." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The overall workflow looks like this:</p>
<ol>
<li><p>Discover medical image files</p>
</li>
<li><p>Load DICOM metadata and pixel data</p>
</li>
<li><p>Run OCR on annotation regions</p>
</li>
<li><p>Classify text as PHI or non-PHI</p>
</li>
<li><p>Redact sensitive pixel regions</p>
</li>
<li><p>Remove PHI from metadata</p>
</li>
<li><p>Save the de-identified output</p>
</li>
</ol>
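<p>The first step, file discovery, needs only the standard library. A DICOM Part 10 file starts with a 128-byte preamble followed by the four bytes <code>DICM</code>, so checking for that marker is more reliable than trusting file extensions. The function names below are illustrative, and files written without a preamble would be skipped by this sketch.</p>
<pre><code class="language-python">from pathlib import Path


def is_dicom_file(path):
    """Check for the 'DICM' marker that follows the 128-byte preamble."""
    try:
        with open(path, "rb") as handle:
            handle.seek(128)
            return handle.read(4) == b"DICM"
    except OSError:
        # Unreadable files are treated as non-DICOM rather than aborting the scan.
        return False


def discover_dicom_files(root):
    """Recursively yield files under root that carry the DICOM marker."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and is_dicom_file(path):
            yield path


# Usage:
#   for dicom_path in discover_dicom_files("incoming_studies"):
#       print(dicom_path)
</code></pre>
<p>Yielding paths lazily lets the later stages start processing before a large archive has been fully scanned.</p>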
<h2 id="heading-challenges-and-lessons-learned"><strong>Challenges and Lessons Learned</strong></h2>
<p>Building a production-ready de-identification system involves many practical challenges.</p>
<p><strong>Clinical Terminology</strong></p>
<p>OCR may detect legitimate labels that should not be removed.</p>
<p><strong>OCR Errors</strong></p>
<p>Low-contrast text and ultrasound overlays can produce inaccurate detections.</p>
<p><strong>Nested DICOM Sequences</strong></p>
<p>PHI may appear in deeply nested metadata structures.</p>
<p><strong>Multi-Frame Studies</strong></p>
<p>Ultrasound cine loops may contain dozens or hundreds of frames.</p>
<p><strong>Deterministic Pseudonymization</strong></p>
<p>Researchers often need the same patient to receive the same replacement identifier across studies.</p>
<p>These challenges require thoughtful engineering rather than a single machine learning model.</p>
<h2 id="heading-how-i-built-aegis"><strong>How I Built Aegis</strong></h2>
<p>While exploring this problem, I developed an open-source project called <strong>Aegis</strong>, built on MONAI and PyTorch.</p>
<p>Aegis combines:</p>
<ul>
<li><p>OCR-based text detection</p>
</li>
<li><p>AI-driven PHI classification</p>
</li>
<li><p>Pixel-level redaction</p>
</li>
<li><p>Standards-based DICOM de-identification</p>
</li>
<li><p>Batch processing for research workflows</p>
</li>
</ul>
<h2 id="heading-key-design-decisions"><strong>Key Design Decisions</strong></h2>
<p><strong>Standards First</strong></p>
<p>I aligned metadata scrubbing with the attribute confidentiality profiles in the DICOM standard (PS3.15, Annex E) so that the tool follows established healthcare standards.</p>
<p><strong>Hybrid AI + Rules</strong></p>
<p>Clinical allowlists, heuristics, and NER models work together to improve accuracy.</p>
<p><strong>Ultrasound-Specific Optimization</strong></p>
<p>Aegis uses the DICOM <code>SequenceOfUltrasoundRegions</code> attribute (0018,6011) to focus OCR on annotation areas instead of scanning the entire image.</p>
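<p>As a rough illustration of how that attribute can be read, the sketch below pulls the region rectangles out of a dataset and checks whether an OCR box sits inside one. It is not taken from the Aegis source: the function names are hypothetical, and the idea that text outside every scan region belongs to the annotation overlay is an assumption that should be validated per device.</p>
<pre><code class="language-python">from types import SimpleNamespace


def ultrasound_region_boxes(ds):
    """Return (x0, y0, x1, y1) for each item in SequenceOfUltrasoundRegions."""
    regions = getattr(ds, "SequenceOfUltrasoundRegions", None) or []
    return [
        (
            int(region.RegionLocationMinX0),
            int(region.RegionLocationMinY0),
            int(region.RegionLocationMaxX1),
            int(region.RegionLocationMaxY1),
        )
        for region in regions
    ]


def box_inside_any_region(box, regions):
    """True when an OCR box lies entirely within one of the scan regions."""
    x1, y1, x2, y2 = box
    return any(
        x1 >= rx0 and y1 >= ry0 and rx1 >= x2 and ry1 >= y2
        for rx0, ry0, rx1, ry1 in regions
    )


# Stand-in object for demonstration; a pydicom Dataset exposes the same attributes.
region = SimpleNamespace(
    RegionLocationMinX0=50, RegionLocationMinY0=60,
    RegionLocationMaxX1=600, RegionLocationMaxY1=450,
)
demo = SimpleNamespace(SequenceOfUltrasoundRegions=[region])
regions = ultrasound_region_boxes(demo)
print(box_inside_any_region((10, 20, 120, 45), regions))  # False
</code></pre>
<p>In this example the box from the earlier OCR output falls outside the scan region, which is where overlay text such as patient names typically appears.</p>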
<p><strong>Deterministic Identity Management</strong></p>
<p>Consistent pseudonyms enable longitudinal research while protecting privacy.</p>
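<p>A keyed hash is one common way to get consistent pseudonyms. The sketch below uses HMAC-SHA256 from the standard library; it is an illustration of the idea rather than the Aegis implementation, and the function name, prefix, and output length are assumptions.</p>
<pre><code class="language-python">import hashlib
import hmac


def pseudonymize(patient_id, secret_key, prefix="ANON-", length=12):
    """Map a patient identifier to a stable pseudonym using a keyed hash.

    The same (patient_id, secret_key) pair always yields the same output,
    while the original ID cannot be recomputed without the key.
    """
    digest = hmac.new(
        secret_key, patient_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return prefix + digest[:length].upper()


key = b"replace-with-a-secret-key"  # placeholder; load a real key from secure storage
first = pseudonymize("MRN-123456", key)
second = pseudonymize("MRN-123456", key)
print(first == second)  # True
</code></pre>
<p>Using a secret key instead of a plain hash matters because identifiers such as record numbers come from a small space that could otherwise be enumerated and matched.</p>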
<p><strong>Open Source Architecture</strong></p>
<p>The project is modular, testable, and designed to integrate with research pipelines.</p>
<p>You can explore the full implementation in the Aegis GitHub repository:</p>
<p><a href="https://github.com/lakshmi-mahabaleshwara/aegis">https://github.com/lakshmi-mahabaleshwara/aegis</a></p>
<h2 id="heading-future-directions"><strong>Future Directions</strong></h2>
<p>Automated de-identification continues to evolve.</p>
<p>Future enhancements may include:</p>
<ul>
<li><p>Multilingual OCR</p>
</li>
<li><p>Handwriting recognition</p>
</li>
<li><p>Vision-language models</p>
</li>
<li><p>Human-in-the-loop review</p>
</li>
<li><p>Cloud-native deployment</p>
</li>
<li><p>Integration with AI training pipelines</p>
</li>
</ul>
<p>As healthcare AI expands, privacy-preserving data preparation will become even more important.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Clinical research depends on access to high-quality medical imaging data.</p>
<p>But privacy regulations require that patient identifiers be removed from both DICOM metadata and image pixels.</p>
<p>By combining OCR, named entity recognition, pixel redaction, and standards-based DICOM processing, we can automate this task and dramatically reduce the burden of manual review.</p>
<p>The techniques covered in this tutorial are applicable far beyond a single project.</p>
<p>Whether you’re building a hospital data pipeline, preparing research datasets, or training the next generation of healthcare AI models, automated de-identification is a foundational capability.</p>
<p>To put these ideas into practice, I built Aegis as an open source reference implementation.</p>
<p>More importantly, the underlying concepts can help developers and researchers create privacy-safe workflows that accelerate innovation while respecting patient confidentiality.</p>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p><a href="https://pydicom.github.io/">https://pydicom.github.io/</a></p>
</li>
<li><p><a href="https://project-monai.github.io/">https://project-monai.github.io/</a></p>
</li>
<li><p><a href="https://www.dicomstandard.org/">https://www.dicomstandard.org/</a></p>
</li>
<li><p><a href="https://www.hhs.gov/hipaa/">https://www.hhs.gov/hipaa/</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
