Lakshmi Mahabaleshwara - freeCodeCamp.org

The Hidden PHI Problem in Medical Images: Building a Synthetic Dataset for AI De-Identification

Lakshmi Mahabaleshwara — Fri, 19 Jun 2026 17:23:54 +0000

In this article, you'll learn how my team built a synthetic PHI generation pipeline to create privacy-safe training and validation data for medical imaging AI.

The Problem

Imagine you’re building an AI system that removes patient information from medical images.

The model needs thousands of examples showing where Protected Health Information (PHI) appears and what it looks like. The more examples it sees, the better it becomes at finding and removing sensitive information.

But there is a problem:

The data you need to train the model is the same data you’re not allowed to share freely.

Healthcare organizations must protect patient privacy. Regulations like HIPAA require that patient identifiers are removed before medical images can be shared for research, AI development, or external collaboration.

This creates an interesting engineering challenge: How do you build and test de-identification systems when the data needed to train those systems can't be easily used?

One practical solution is Synthetic PHI.

In this article, I’ll show why synthetic PHI is valuable, explain the hidden PHI problem inside medical images, and walk through a pipeline my team built that generates realistic ultrasound datasets with fully controlled synthetic patient information.

What You'll Learn in This Tutorial

By the end of this tutorial, you'll understand:

The hidden PHI challenges in medical imaging data.
Why synthetic PHI is useful for building and testing healthcare AI systems.
How to generate realistic synthetic patient identities using Python and Faker.
How to inject PHI into both image pixels and DICOM metadata.
How to create ground-truth labels for AI model training and evaluation.
How to validate synthetic medical imaging datasets before using them in downstream workflows.

Source Images: OpenPOCUS

The synthetic PHI generation uses lung point-of-care ultrasound (POCUS) frames from OpenPOCUS, an openly licensed collection of real ultrasound images contributed by the POCUS community.

These images carry no real PHI. OpenPOCUS provides clinically authentic ultrasound images while avoiding patient privacy concerns. This makes it an ideal foundation for synthetic PHI generation because we can focus entirely on creating and tracking identifiers without risking exposure of real patient information.

The Iceberg Problem: Most PHI Is Hidden

When people think about PHI in medical images, they usually think about visible text overlays.

These include:

Patient name
Medical Record Number (MRN)
Date of birth
Study date

These identifiers are often burned directly into image pixels by ultrasound, X-ray, CT, and MRI systems.

But visible text is only the tip of the iceberg. Much of the remaining PHI lives inside the DICOM header, a collection of metadata fields that describe the image and the study. These fields contain identifiers such as PatientName, PatientID, StudyDate, institution names, and other sensitive information.

Unlike burned-in text, header PHI isn't visible when looking at the image itself, but it travels with the file and must also be removed during de-identification.

A de-identification system must handle both.

Removing visible text while leaving PHI inside DICOM metadata still creates a privacy risk. Likewise, stripping metadata while leaving patient names burned into image pixels is equally problematic.

This hidden PHI challenge makes testing de-identification software much harder than it first appears.

Why Synthetic PHI Matters

At first glance, it seems hospitals already have plenty of real-world data available. So why not simply use that?

The answer comes down to three challenges.

Challenge 1: Privacy Regulations

Medical images often contain patient identifiers.

Sharing those images outside secure clinical environments introduces significant legal and compliance risk.

The more institutions involved, the more difficult governance becomes.

Challenge 2: Annotation at Scale

Modern AI systems require labeled examples.

Someone must identify:

Where PHI appears
What type of PHI it is
Which DICOM tags contain PHI

Creating these annotations manually is expensive and time-consuming.

Challenge 3: Validation

Suppose you’re evaluating a de-identification tool. How do you know whether it successfully removed every identifier?

With real patient data, you often don’t know exactly where every piece of PHI exists. Without ground truth, measuring accuracy becomes difficult.

Synthetic PHI Solves All Three Problems

Instead of starting with real patient identifiers, we can generate realistic fake identities and intentionally inject them into medical images.

Because the pipeline creates the PHI itself, we know:

Every identifier value
Every pixel location
Every DICOM tag
Every expected output

This gives us perfect ground truth.

Now, a de-identification system can be evaluated objectively. If a patient name remains after processing, we know it failed. If clinical content is accidentally removed, we know that too.
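
The header side of that check can be sketched in a few lines. This is a minimal illustration rather than the pipeline's own evaluator: find_residual_phi and the dictionary layout are hypothetical, and the header values are assumed to have been read into a plain dict already.

```python
def find_residual_phi(planted: dict, output_header: dict) -> list:
    """Return the tag names whose planted value still appears in the output.

    planted:       tag name -> synthetic value the generator inserted
    output_header: tag name -> value found after de-identification
    (Both layouts are assumptions made for this sketch.)
    """
    leaked = []
    for tag, value in planted.items():
        remaining = str(output_header.get(tag, ""))
        # A tag counts as leaked if the planted value survives anywhere in it
        if value and str(value) in remaining:
            leaked.append(tag)
    return leaked


planted = {"PatientName": "Doe^Jane^Q", "PatientID": "1234567"}
cleaned = {"PatientName": "ANONYMIZED", "PatientID": "1234567"}
print(find_residual_phi(planted, cleaned))  # ['PatientID']
```

An empty list means no planted header value survived. A fuller evaluator would also search for each value in every other tag and in the pixel data.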

Synthetic PHI creates a privacy-safe dataset that can be used for:

Training AI models
Benchmarking de-identification software
Regression testing
Validation before deployment

Building a Synthetic PHI Pipeline

To explore this problem, my team built a pipeline that generates synthetic PHI for lung Point-of-Care Ultrasound (POCUS) images.

The goal was to:

Start with ultrasound images containing no patient information.
Generate realistic synthetic patient identities.
Burn PHI into image pixels.
Insert matching PHI into DICOM metadata.
Automatically generate ground truth labels.
Validate the resulting DICOM files.

The output looks realistic from the perspective of a de-identification system while containing no real patient information.

Pipeline Architecture

The workflow follows the stages listed above in sequence (we'll go over each step in detail below).

Each stage produces artifacts consumed by the next stage. Failures are quarantined rather than silently ignored.

Safety Checks Before Burning

Before writing synthetic PHI onto an image, the pipeline checks that the region chosen for the PHI lies outside the ultrasound fan.

The top-left corner of a lung POCUS image usually falls outside the imaging fan. It is part of the dark border, so PHI can be burned there without obscuring clinical content.

To confirm that this assumption holds for every image, the pipeline runs two checks per image:

Brightness check: If the average intensity of the configured burn region exceeds a threshold, the region likely overlaps the ultrasound fan rather than the dark border.
Boundary check: The pipeline verifies that the configured burn region fits entirely within the image. Images that are smaller than the expected burn area are quarantined.

In either case, the image is quarantined and the reason is recorded in the manifest. There are no partial burns, no overwritten clinical content, and no silent corruption of test data.

This prevents synthetic identifiers from accidentally obscuring anatomy.

def burn_region_is_safe(arr):
    """Check the burn region is dark enough to be outside the fan."""
    h, w = arr.shape
    y2 = min(BURN_REGION_Y + BURN_REGION_H, h)
    x2 = min(BURN_REGION_X + BURN_REGION_W, w)
    region = arr[BURN_REGION_Y:y2, BURN_REGION_X:x2]
    if region.size == 0:
        return False, float("nan")
    mean = float(region.mean())
    return mean <= BRIGHTNESS_SKIP_THRESHOLD, mean

The function extracts the configured burn region and computes its average brightness. If the region is too bright, it likely overlaps the ultrasound fan rather than the border.
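
The function above clamps the region to the image bounds, so the boundary check described earlier has to happen separately. A minimal sketch of such a check follows. The function name, the (x, y, width, height) tuple layout, and the reason strings are assumptions for illustration, not the pipeline's actual code.

```python
def burn_region_fits(image_shape, region):
    """Return (ok, reason) for a burn region given as (x, y, width, height).

    image_shape is (height, width). The tuple layout and the reason
    strings are assumptions made for this sketch.
    """
    h, w = image_shape
    x, y, rw, rh = region
    if x < 0 or y < 0 or rw <= 0 or rh <= 0:
        return False, "invalid_region"
    if x + rw > w or y + rh > h:
        return False, "region_exceeds_image"
    return True, ""


print(burn_region_fits((480, 640), (10, 10, 300, 80)))  # (True, '')
print(burn_region_fits((60, 200), (10, 10, 300, 80)))   # (False, 'region_exceeds_image')
```

Returning a reason string alongside the boolean makes it straightforward to write the quarantine reason into a manifest row.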

Step 1: Generate Synthetic Patient Identities

The synthetic identity is produced by Faker and seeded per case, so the same image always yields the same fake patient.

Determinism matters because:

Reproducing a test result requires reproducing the test data.
Debugging downstream tools is easier when the input doesn't change between runs.
Comparing two de-identification tools fairly requires both to see the same planted PHI.

import hashlib
import random
import uuid

from faker import Faker


def case_seed(global_seed: int, source_id: str) -> int:
    """Per-image deterministic seed derived from global seed and source path."""
    h = hashlib.sha256(f"{global_seed}|{source_id}".encode()).hexdigest()
    return int(h[:8], 16)


def generate_phi(seed: int) -> dict:
    fake = Faker()
    Faker.seed(seed)
    rng = random.Random(seed)

    last = fake.last_name()
    first = fake.first_name()
    middle = fake.random_letter().upper()
    mrn = f"{rng.randint(1000000, 9999999)}"
    dob = fake.date_of_birth(minimum_age=18, maximum_age=95)
    study_date = fake.date_time_this_decade()
    institution = rng.choice(INSTITUTION_POOL)

    return {
        "case_uuid": f"SYNTH-{uuid.UUID(int=rng.getrandbits(128))}",
        "patient_name_display": f"{last}, {first} {middle}.",
        "patient_name_dicom": f"{last}^{first}^{middle}",   # DICOM PN VR format
        "patient_id": mrn,
        "dob": dob,
        "study_date": study_date,
        "institution_name": institution,
    }

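The case_seed() function derives a deterministic seed from the global seed and the source image identifier. That seed is then used to seed Faker and a local random generator, which together create the synthetic identity.

Because the seed is repeatable, the same input image receives the same synthetic name, MRN, and institution on every run, which makes debugging and benchmarking reproducible. One caveat: Faker's date_of_birth() and date_time_this_decade() compute their ranges relative to the current date, so the generated dates can shift between runs on different days unless they are pinned to a fixed reference date.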
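
Faker's date helpers work relative to the day the code runs, so one way to keep dates repeatable over time is to draw them against a fixed reference date. The sketch below uses only the standard library. derive_seed mirrors the idea of case_seed(), while REFERENCE_DATE, synthetic_dates, and the age and offset ranges are hypothetical choices made for illustration.

```python
import hashlib
import random
from datetime import date, timedelta

# Hypothetical fixed anchor so generated dates do not depend on the run date
REFERENCE_DATE = date(2025, 1, 1)


def derive_seed(global_seed: int, source_id: str) -> int:
    """Same idea as case_seed(): hash the global seed with the source id."""
    digest = hashlib.sha256(f"{global_seed}|{source_id}".encode()).hexdigest()
    return int(digest[:8], 16)  # first 32 bits of the hash


def synthetic_dates(seed: int, reference: date = REFERENCE_DATE) -> dict:
    """Draw a date of birth and a study date relative to a fixed reference."""
    rng = random.Random(seed)
    age_days = rng.randint(18 * 365, 95 * 365)  # roughly 18 to 95 years old
    study_offset = rng.randint(0, 5 * 365)      # study within ~5 years before reference
    return {
        "dob": reference - timedelta(days=age_days),
        "study_date": reference - timedelta(days=study_offset),
    }


seed = derive_seed(42, "patient_001/zone_1")
print(synthetic_dates(seed) == synthetic_dates(seed))  # True
```

Because nothing here reads the system clock, the same seed gives the same dates on any day the pipeline runs.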

Step 2: Burn PHI into Image Pixels

Rendering text onto an image is comparatively expensive. For a single zone containing 30+ frames, repeating that work per frame is wasteful.

The pipeline instead renders the PHI overlay onto a transparent canvas one time per zone. This mirrors how many ultrasound systems operate in practice, where patient information remains fixed while the underlying image content changes from frame to frame.

def make_phi_overlay(shape, phi):
    """Render PHI ONCE onto a canvas. Returns (overlay_array, overlays_meta)."""
    h, w = shape
    canvas = Image.new("L", (w, h), 0)  # blank canvas
    draw = ImageDraw.Draw(canvas)

    overlays, x, y = [], BURN_REGION_X, BURN_REGION_Y
    for entry in _phi_text_block(phi):
        # textbbox returns the box of the ink actually drawn, which can be
        # offset from the (x, y) anchor by the font's bearing
        x0, y0, x1, y1 = draw.textbbox((x, y), entry["line"], font=FONT)
        tw, th = x1 - x0, y1 - y0

        if x1 > w or y1 > h:
            raise ValueError(
                f"rendered PHI overflows image: '{entry['line']}' "
                f"at ({x0},{y0}) size ({tw}x{th}), image {w}x{h}"
            )

        draw.text((x, y), entry["line"], font=FONT, fill=TEXT_COLOR)
        overlays.append({
            "phi_category": entry["phi_category"],
            "rendered_text": entry["line"],
            "phi_value": entry["value"],
            "bbox": [x0, y0, tw, th],  # x, y, width, height of the drawn text
            "dicom_tag": entry["dicom_tag"],
        })
        y = y1 + LINE_GAP  # start the next line below the drawn text
    return np.array(canvas), overlays

The make_phi_overlay() function creates a blank canvas and renders each PHI line onto it. At the same time, it records metadata such as the rendered text, bounding box coordinates, and corresponding DICOM tag.

The function returns both the image overlay and the annotation metadata, ensuring that the ground truth always matches the pixels that were actually drawn.

Rendering once and reusing the overlay provides several advantages:

Faster processing
Consistent PHI placement across frames
Simplified ground-truth generation
Behavior that more closely matches real ultrasound devices

An additional benefit is that the pipeline automatically records the location of every burned identifier.
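
One way the recorded boxes could be used downstream is to score a detector's predicted box against the recorded one with intersection-over-union (IoU). The sketch below assumes boxes in the same [x, y, width, height] layout as the overlay metadata. bbox_iou and the sample boxes are illustrative and not part of the pipeline.

```python
def bbox_iou(box_a, box_b) -> float:
    """Intersection-over-union for two boxes in [x, y, width, height] form."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clamped at zero when the boxes do not touch
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


recorded = [10, 10, 100, 20]   # box stored by the generator
predicted = [10, 10, 50, 20]   # box reported by a hypothetical detector
print(bbox_iou(recorded, predicted))  # 0.5
```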

Step 3: Add PHI to DICOM Headers

The DICOM standard supports two ways to represent a cine ultrasound loop: as a sequence of single-frame DICOMs that share a series UID, or as one multi-frame DICOM where the pixel data holds every frame stacked together.

The pipeline uses the multi-frame approach because:

It matches how real ultrasound devices write cine loops.
One header serves all frames — no duplication of patient metadata.
Storage and transfer are more efficient.

ds.PatientName = phi["patient_name_dicom"]
ds.PatientID = deid_patient_id
ds.PatientBirthDate = phi["dob"].strftime("%Y%m%d")

ds.StudyInstanceUID = study_uid
ds.StudyDate = phi["study_date"].strftime("%Y%m%d")
ds.InstitutionName = phi["institution_name"]

These fields populate the DICOM header with the same synthetic identity used in the image overlay. This ensures that visible PHI and hidden metadata remain consistent, producing realistic test data.

A few details that the DICOM standard requires but that are easy to overlook:

StudyID is required and must be a short string, distinct from StudyInstanceUID. It's easy to forget.
ImageType must be present. ["DERIVED", "SECONDARY"] is the honest value for synthetic data because it wasn't acquired by a device.
Manufacturer is part of the General Equipment module and is required even though the data is synthetic. Setting it to a clearly synthetic value (SYNTHETIC-DEID-TUTORIAL) makes the origin unambiguous.

Step 4: Identity Mapping: The De-Identified PatientID

To support downstream evaluation, every source patient receives a stable identifier such as DEID-0001. A mapping file links source patients, synthetic studies, and generated DICOM objects. This allows evaluators to compare a de-identification tool’s output against the original ground truth.

source_patient,deid_patient_id,study_instance_uid
patient_001,DEID-0001,1.2.826.0.1.3680043.8.498.1234...
patient_002,DEID-0002,1.2.826.0.1.3680043.8.498.5678...
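
A mapping like the one above could be produced along these lines. This is a sketch under assumptions: assign_deid_ids and write_mapping are hypothetical helpers, and the study UIDs are taken as given rather than generated here.

```python
import csv
import io


def assign_deid_ids(source_patients):
    """Map each source patient to a DEID-#### identifier.

    Sorting first means the assignment does not depend on input order.
    """
    unique = sorted(set(source_patients))
    return {src: f"DEID-{i:04d}" for i, src in enumerate(unique, start=1)}


def write_mapping(mapping, study_uids, handle):
    """Write the mapping rows; study_uids maps source patient -> study UID."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["source_patient", "deid_patient_id", "study_instance_uid"])
    for src, deid in mapping.items():
        writer.writerow([src, deid, study_uids.get(src, "")])


mapping = assign_deid_ids(["patient_002", "patient_001", "patient_002"])
buffer = io.StringIO()
write_mapping(mapping, {}, buffer)
print(buffer.getvalue())
```

Note that numbering by sorted position is only stable for a fixed set of patients. If patients are added later, persisting the mapping and appending new IDs avoids renumbering the existing ones.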

Step 5: Ground Truth: Structured CSV Output

One major advantage of synthetic PHI is automatic label generation. Because the pipeline creates every identifier, it already knows the text value, bounding box coordinates, and corresponding DICOM tag.

These annotations are exported as structured CSV files and become the ground truth used for training and evaluation.

def build_overlay_rows(*, case_uuid, sop_instance_uid, source_id, source_relpath, output_dicom_relpath, overlays,
                      image_shape):
    h, w = image_shape
    rows = []
    for ov in overlays:
        x, y, ow, oh = ov["bbox"]
        rows.append({
            "case_uuid": case_uuid,
            "sop_instance_uid": sop_instance_uid,
            "source_id": source_id,
            "source_relpath": source_relpath,
            "output_dicom_relpath": output_dicom_relpath,
            "image_h": h,
            "image_w": w,
            "region": "top_left_banner",
            "phi_category": ov["phi_category"],
            "phi_value": ov["phi_value"],
            "rendered_text": ov["rendered_text"],
            "bbox_x": x, "bbox_y": y,
            "bbox_w": ow, "bbox_h": oh,
            "dicom_tag": ov["dicom_tag"],
            "seed": SEED,
            "pipeline_version": PIPELINE_VERSION,
            "run_id": RUN_ID,
        })
    return rows

The build_overlay_rows() function converts each overlay into a row of structured metadata. Along with the text and bounding box coordinates, it records identifiers and reproducibility information such as the pipeline version and random seed.

At the end of the run, the accumulated rows are grouped by de-identified patient ID and written into per-patient CSV files. Each patient folder receives its own phi_overlays.csv covering all of that patient's zones, alongside a run_manifest.csv summarizing zone-level status (processed, quarantined, failed) and paths.
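
The grouping step might look like the sketch below. It assumes each row carries a deid_patient_id key, which the rows built above do not include, so treat that column name and the write_per_patient_csvs helper as hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def write_per_patient_csvs(rows, out_root, filename="phi_overlays.csv"):
    """Group rows by 'deid_patient_id' and write one CSV per patient folder.

    The 'deid_patient_id' key is an assumption; use whichever column
    identifies the patient in your rows.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["deid_patient_id"]].append(row)

    written = []
    for patient_id, patient_rows in sorted(grouped.items()):
        folder = Path(out_root) / patient_id
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / filename
        # newline="" lets the csv module control line endings
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(patient_rows[0].keys()))
            writer.writeheader()
            writer.writerows(patient_rows)
        written.append(path)
    return written
```

The function returns the paths it wrote, which is convenient for recording them in a run manifest.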

Three-Tier DICOM Validation

A synthetic DICOM file is only useful if it actually conforms to the DICOM standard. Otherwise, downstream tools that consume it will fail or, worse, silently mishandle it.

The pipeline uses a three-tier validation chain that gracefully degrades depending on what's available in the environment:

dciodvfy from dicom3tools: the most rigorous standards-conformance validator, written by David Clunie. It's not pip-installable. It checks against the full DICOM IOD definitions. If it's available on PATH, this is the preferred check.
dicom-validator CLI: this is pip-installable. It downloads the DICOM standard definitions on first run, then validates IOD compliance. It's used when dciodvfy isn't available.
pydicom re-read: the minimal fallback. It confirms that every file can be re-opened, decoded, and that pixel data round-trips correctly. It doesn't check standards compliance, but catches gross corruption.
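
The tier selection can be sketched as a simple lookup on PATH. This is an illustration rather than the notebook's code. It assumes the dicom-validator package installs a command named validate_iods, and it only chooses a tier without running any validation.

```python
import shutil


def pick_validator(which=shutil.which):
    """Return the name of the strictest validator tier that is available.

    'validate_iods' is assumed to be the command installed by the
    dicom-validator package; adjust it if your install differs.
    The 'which' parameter exists so the lookup can be stubbed in tests.
    """
    if which("dciodvfy"):
        return "dciodvfy"
    if which("validate_iods"):
        return "dicom-validator"
    return "pydicom-reread"  # minimal fallback when neither tool is on PATH


print(pick_validator())
```

Recording which tier was used alongside each result helps when comparing runs from different environments, since a pass from the fallback tier says much less than a pass from dciodvfy.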

A Surprising Bug: MONAI vs PIL

Originally, I planned to use MONAI for image loading because it's widely used in medical imaging workflows.

During testing, I discovered an issue: MONAI’s image loading conventions caused non-square images to appear rotated when downstream code assumed traditional image layouts.

At the same time, many ultrasound images contained EXIF orientation metadata that required correction.

Switching to PIL solved both issues.

from PIL import Image, ImageOps

img = Image.open(path)
img = ImageOps.exif_transpose(img)

Final Thoughts

Synthetic PHI does not replace real-world testing, but it provides something healthcare AI teams rarely have: a safe, shareable, and fully labeled dataset with known answers.

By generating realistic identifiers and embedding them into both image pixels and DICOM metadata, we can build reproducible benchmarks for de-identification systems without exposing real patient data.

As AI systems become increasingly responsible for handling sensitive medical information, synthetic PHI may become one of the most important tools for building trustworthy healthcare AI workflows.

The complete implementation is available as a Jupyter notebook in the MONAI Ultrasound Working Group repository. You can explore the notebook and experiment with the pipeline yourself.

Sometimes the safest way to test whether a system can remove PHI is to create the PHI yourself.

How to Preprocess Medical Images for Machine Learning – A Guide Using Chest X-Rays

Lakshmi Mahabaleshwara — Thu, 04 Jun 2026 17:13:59 +0000

Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different once your data becomes medical images.

In this article, you’ll learn how to prepare a real-world medical imaging dataset for machine learning, from initial data validation to a complete preprocessing pipeline.

We’ll use the Chest X-Ray Pneumonia dataset as our running example, but the lessons apply broadly to healthcare imaging data, including ultrasound, MRI, CT, and dermatology images.

What You'll Learn in This Article

By the end of this article, you'll know how to:

Approach healthcare data preprocessing differently from preprocessing structured data, and recognize where standard techniques fall short
Validate a medical imaging dataset before training to catch corrupted files, mislabels, and data leakage between train and test
Apply six core preprocessing techniques for medical images
Build a complete preprocessing pipeline for chest X-rays using Python with OpenCV

What We'll Cover:

Why Preprocessing Data Matters More in Healthcare
The Dataset
Before Preprocessing: Validate the Dataset
The Six Pillars of Healthcare Imaging Preprocessing
Pillar 1: Scaling — Making the Numbers Play Fair
Pillar 2: Normalization — Centering the Data
Pillar 3: Guiding the Model's Attention
Pillar 4: Handling Missing Data
Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame
Pillar 6: Denoising & Artifact Handling — Cleaning the Window
Putting it All Together: A Complete Pipeline
Try it Yourself
Conclusion

Why Preprocessing Data Matters More in Healthcare

Imagine handing a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed together. The toddler can't solve it, but that isn't really the toddler's fault.

The same thing happens when raw, messy data gets fed into a machine learning model. A bad prediction on a clinical image can mean a missed diagnosis.

Healthcare data tends to be messier than what most ML practitioners are used to:

Images come from different machines, hospitals, and acquisition protocols
Labels are inconsistent, sometimes missing, sometimes wrong
Patient data is incomplete
Image sizes, contrast levels, and orientations vary across sources

Poor preprocessing often leads to models that perform well on benchmark datasets but struggle to generalize to data collected from different hospitals or imaging devices.

The Dataset

This guide uses the Chest X-Ray Pneumonia dataset by Paul Mooney on Kaggle. It's a strong choice for learning preprocessing because:

It contains around 5,800 pediatric chest X-rays
It has two clear classes — Normal and Pneumonia
It's already organized into train, validation, and test folders
The images are recognizable without specialized medical training
It exhibits almost every preprocessing challenge worth learning

The dataset is available at Kaggle: Chest X-Ray Pneumonia.

Folder Structure

After downloading, the dataset is organized like this:

chest_xray/
├── train/
│   ├── NORMAL/
│   └── PNEUMONIA/
├── val/
│   ├── NORMAL/
│   └── PNEUMONIA/
└── test/
    ├── NORMAL/
    └── PNEUMONIA/

Side-by-side comparison — Normal vs Pneumonia chest X-ray:

A quick first look at one of the images:

import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import cv2

DATA_DIR = "chest_xray"
TRAIN_DIR = os.path.join(DATA_DIR, "train")

# Peek at a sample image (sorted so the same file is picked on every run)
normal_dir = os.path.join(TRAIN_DIR, "NORMAL")
sample_path = os.path.join(normal_dir, sorted(os.listdir(normal_dir))[0])
sample_image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

print(f"Image shape: {sample_image.shape}")
print(f"Pixel range: {sample_image.min()} to {sample_image.max()}")
print(f"Data type: {sample_image.dtype}")

The output reveals a few useful things right away: most images are large (often around 1500×2000 pixels), pixel values fall in the 0–255 range, and image sizes vary across the dataset. Each of these observations will inform a preprocessing step.

Before Preprocessing: Validate the Dataset

Before applying any transformations, it's worth checking that the data itself is intact. This step alone catches issues that would otherwise cause training to fail silently or produce misleading results.

A simple validation function:

def validate_dataset(data_dir):
    """Scan a dataset folder and flag common data quality issues."""
    corrupted = []
    too_small = []
    nearly_black = []
    total = 0
    
    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for fname in os.listdir(class_path):
            fpath = os.path.join(class_path, fname)
            total += 1
            try:
                img = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    corrupted.append(fpath)
                    continue
                if img.shape[0] < 100 or img.shape[1] < 100:
                    too_small.append(fpath)
                if img.mean() < 5:
                    nearly_black.append(fpath)
            except Exception:
                corrupted.append(fpath)
    
    print(f"Total files scanned: {total}")
    print(f"Corrupted: {len(corrupted)}")
    print(f"Too small: {len(too_small)}")
    print(f"Nearly black: {len(nearly_black)}")
    return corrupted, too_small, nearly_black

validate_dataset(TRAIN_DIR)

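Common issues to look for at this stage (the function above checks the first three, while duplicates and mislabels need separate checks such as file hashing and manual review):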

Corrupted files — files that won't open at all
Empty or nearly-black images — failed acquisitions or saved-as-blank files
Wrong dimensions — thumbnails or partial downloads mixed in
Duplicate images — the same scan appearing in both train and test (this causes data leakage)
Mislabeled images — a normal X-ray placed in the pneumonia folder

⚠️ This step is critical. One corrupted file can crash a training loop hours into a run. One duplicate between train and test can inflate accuracy scores without anyone noticing.
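
Duplicates across splits can be found by hashing file contents. The sketch below is a minimal version that only catches byte-identical files. Re-encoded or resized copies would need a perceptual hash instead. The helper names are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cross_split_duplicates(train_dir, test_dir):
    """Return (train_path, test_path) pairs whose file contents are identical."""
    train_hashes = {}
    for path in sorted(Path(train_dir).rglob("*")):
        if path.is_file():
            # Keep the first train file seen for each distinct content hash
            train_hashes.setdefault(file_digest(path), path)
    duplicates = []
    for path in sorted(Path(test_dir).rglob("*")):
        if path.is_file():
            match = train_hashes.get(file_digest(path))
            if match is not None:
                duplicates.append((match, path))
    return duplicates
```

For this dataset, the call would be find_cross_split_duplicates("chest_xray/train", "chest_xray/test"), and any returned pair is a candidate for removal from one of the two splits.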

The Six Pillars of Healthcare Imaging Preprocessing

Preprocessing for medical images can be organized around six core concerns. Two of them carry over directly from preprocessing structured data. Two need to be adapted because the mechanics change when the input is an image. And two are entirely new: they only exist once the data becomes pictures of human bodies.

Pillar 1: Scaling — Making the Numbers Play Fair

Imagine two children comparing their collections. One has 3 seashells. The other has 3,000 stickers. Asking who has more makes the answer seem obvious, but the scales are completely different. Comparing them meaningfully means putting both collections on the same measuring system.

In medical images, pixels usually range from 0 to 255 in 8-bit images, or 0 to 65,535 in some 16-bit medical DICOM images. Neural networks tend to train faster and more reliably when input values are small numbers close to zero.

The fix: Divide every pixel by its maximum possible value, bringing everything into the 0-to-1 range.

image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

# Scale to [0, 1]
image_scaled = image.astype(np.float32) / 255.0

print(f"Before scaling: {image.min()} to {image.max()}")
print(f"After scaling:  {image_scaled.min():.3f} to {image_scaled.max():.3f}")

Takeaway: Pixel scaling follows the same principle as scaling any numerical feature. The values simply happen to be arranged as an image rather than a column.
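
The divisor depends on bit depth, which matters once images are no longer 8-bit. The small sketch below shows the arithmetic. The helper names are illustrative, and it assumes unsigned pixel values.

```python
def max_pixel_value(bits: int) -> int:
    """Largest unsigned value representable with the given bit depth."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    return (1 << bits) - 1


def scale_pixel(value: int, bits: int) -> float:
    """Map a raw unsigned pixel value into the [0, 1] range."""
    return value / max_pixel_value(bits)


print(max_pixel_value(8))      # 255
print(max_pixel_value(16))     # 65535
print(scale_pixel(65535, 16))  # 1.0
```

Some 16-bit files only use 12 bits of each pixel. Dividing those by 65,535 would squeeze every value into roughly the bottom 6% of the range, so it is worth checking the effective bit depth before choosing the divisor.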

Pillar 2: Normalization — Centering the Data

Imagine a teacher asks a class to rate a movie from 1 to 10. One child always gives 9s and 10s. Another spreads ratings evenly from 1 to 10. Comparing their opinions fairly requires adjusting each child's score relative to their own average.

In medical imaging, even after scaling to 0–1, the overall brightness of images can vary. Some X-rays are taken with stronger exposure than others. Normalization shifts and rescales each image (or each channel) so the values are centered around zero with a standard deviation of one.

The fix: Subtract the mean, divide by the standard deviation.

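# Compute mean and std from the TRAINING set only, never from validation or test
def compute_train_stats(train_dir, sample_limit=1000, seed=0):
    """Compute pixel mean and std over a random sample of training images."""
    paths = []
    for class_name in sorted(os.listdir(train_dir)):
        class_path = os.path.join(train_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        paths += [os.path.join(class_path, f) for f in sorted(os.listdir(class_path))]

    # Shuffle so the sample spans every class instead of only the first folder
    np.random.default_rng(seed).shuffle(paths)

    # Running sums avoid holding every pixel of every image in memory at once
    total, total_sq, n = 0.0, 0.0, 0
    for path in paths[:sample_limit]:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        px = img.astype(np.float64) / 255.0
        total += px.sum()
        total_sq += np.square(px).sum()
        n += px.size
    if n == 0:
        raise ValueError("no readable images found in training directory")
    mean = total / n
    std = np.sqrt(max(total_sq / n - mean ** 2, 0.0))
    return float(mean), float(std)

train_mean, train_std = compute_train_stats(TRAIN_DIR)
image_normalized = (image_scaled - train_mean) / train_std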

⚠️ Avoid this common mistake: Statistics for normalization should be computed from the training set only, never from validation or test. Including those in the calculation leaks information from the evaluation data into the model. The same statistics should then be applied to validation, test, and any new data at inference time.

Takeaway: Centering and scaling each image around the dataset's statistics is the imaging equivalent of standardizing a feature column. The pixels are now comparable across images, regardless of how bright or dim each scan happened to be.

Pillar 3: Guiding the Model's Attention

Imagine a child walking into a crowded pet store. Instead of describing every animal in sight, a parent points to the features that matter: “Look at the soft fur, the fluffy tail, and the nice small size.” The child learns where to focus their attention.

Medical image preprocessing does something similar. It highlights the regions and features most relevant to the diagnostic task.

Region-of-interest (ROI) cropping — focus on the lung field and discard the patient's arms, machine borders, and any imprinted text
Contrast enhancement — use techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) to make subtle lung textures more visible
Channel selection: for images stored as RGB but containing grayscale information, convert to single-channel input to remove the redundant channels

CLAHE applied to an X-ray:

# CLAHE enhances local contrast — useful for X-rays
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
image_enhanced = clahe.apply(image)

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(image_enhanced, cmap='gray')
axes[1].set_title('After CLAHE')
plt.show()

Takeaway: The goal of teaching the model what to look at hasn't changed. With structured data, the answer is in new columns. With images, the answer is in cropping, enhancement, and emphasizing the regions that carry diagnostic signal.

Pillar 4: Handling Missing Data

Imagine reading a storybook with a few damaged pages. You don’t throw away the entire book. Instead, you decide whether to skip the page, infer what might be missing, or mark it for review.

In medical imaging, missing data can mean corrupted files, missing labels, or incomplete studies rather than empty spreadsheet cells.

The same three strategies — drop, impute, flag — still apply, just with different mechanics:

# Strategy 1: Drop — remove unreadable or empty images
def is_valid_image(path):
    try:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            return False
        if img.mean() < 5:           # nearly black
            return False
        if img.shape[0] < 50 or img.shape[1] < 50:  # too small
            return False
        return True
    except Exception:
        return False

# Strategy 2: Impute - rare for images, but possible (e.g., inpainting to fill in missing patches).
#   Generally avoided for diagnostic data.

# Strategy 3: Flag — track which patients are missing which modalities,
#   and let the model condition on availability. Common in multi-modal healthcare ML.

Takeaway: "Missing" in imaging data is rarely just a NaN. It can be a broken file, an unlabeled scan, an absent modality, or a black corner inside an image. The same three strategies still apply.
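
The flag strategy can be made concrete with a small availability table. The sketch below uses placeholder modality names and a hypothetical availability_flags helper. It only shows how the flags could be built, not how a model would consume them.

```python
def availability_flags(patient_modalities, expected=("xray", "ct", "report")):
    """Return {patient_id: {modality: 0 or 1}} marking what each patient has.

    The modality names are placeholders for this sketch.
    """
    return {
        patient_id: {m: int(m in present) for m in expected}
        for patient_id, present in patient_modalities.items()
    }


flags = availability_flags({"p1": {"xray", "ct"}, "p2": {"xray"}})
print(flags["p2"])  # {'xray': 1, 'ct': 0, 'report': 0}
```

Keeping the flags explicit means a missing modality is recorded as information rather than silently dropped or filled in.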

Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame

Imagine displaying children’s drawings on a classroom wall. If every drawing is a different size, they won’t fit neatly into the display. You resize them while preserving their proportions.

Medical images must often be resized to a common input size, but anatomical structures should retain their original shape.

The fix: Resize all images to a common shape. For medical data, how the resizing is done matters.

TARGET_SIZE = (224, 224)

# Simple resize (may distort aspect ratio)
image_resized = cv2.resize(image, TARGET_SIZE)

# Better: preserve aspect ratio with padding
def resize_with_padding(image, target_size):
    h, w = image.shape[:2]
    target_h, target_w = target_size
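    scale = min(target_h / h, target_w / w)
    # round() rather than int() so float error can't drop a pixel; keep at least 1 px
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # INTER_AREA is generally a good choice when shrinking images
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)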
    
    pad_h = target_h - new_h
    pad_w = target_w - new_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                 cv2.BORDER_CONSTANT, value=0)
    return padded

image_clean_resize = resize_with_padding(image, TARGET_SIZE)

⚠️ Why aspect ratio matters in healthcare: Squishing a chest X-ray horizontally makes the lungs look unnatural. Models trained on distorted anatomy often perform worse on real scans. Preserving aspect ratio is generally the safer choice.

Takeaway: Models need a consistent input size, but the geometry of the anatomy needs to be preserved. Resize, but resize carefully.
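To make the arithmetic inside resize_with_padding concrete, here is a minimal sketch that computes only the geometry, with no OpenCV required. The function name letterbox_geometry and the 768×1024 example size are illustrative assumptions, not part of the original pipeline.

```python
def letterbox_geometry(h, w, target_size=(224, 224)):
    """Return the resized (height, width) and the (top, bottom, left, right)
    padding that an aspect-preserving resize would use, without touching pixels."""
    target_h, target_w = target_size
    # The smaller ratio guarantees that both sides fit inside the target
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    pad_h, pad_w = target_h - new_h, target_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


# Hypothetical landscape image: 768 pixels tall, 1024 pixels wide
new_size, padding = letterbox_geometry(768, 1024)
print(new_size, padding)  # (168, 224) (28, 28, 0, 0)
```

The width fills the frame, the height shrinks by the same factor (168 / 224 = 768 / 1024), and the leftover 56 rows are split evenly into top and bottom padding.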

Pillar 6: Denoising & Artifact Handling — Cleaning the Window

Imagine looking through a window with dust and smudges on the glass. Cleaning the window makes the view clearer, but scrubbing too aggressively could scratch the glass.

Similarly, medical images often contain noise and acquisition artifacts that should be reduced carefully without removing clinically important details.

For chest X-rays, the most common issues are mild noise and burned-in text or markers. A gentle median or bilateral filter helps with the first, while cropping or masking helps with the second.

# Gentle denoising — careful not to blur away clinical detail
image_denoised = cv2.medianBlur(image, ksize=3)

# Bilateral filter preserves edges better than a median filter
image_bilateral = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)

⚠️ A note of caution: Aggressive denoising can erase the features a model needs to detect a disease. For diagnostic ML, gentle filtering is generally preferred. A useful rule of thumb: if a radiologist would notice clinical detail missing from the cleaned image, the filtering has gone too far.

Takeaway: Imaging data carries noise that structured data doesn't have. The window can be cleaned, but never so aggressively that the view is wiped away with the smudges.
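One way to keep an eye on how much a filter has changed an image is to measure the difference numerically. The sketch below is an assumption on my part rather than a step from the original pipeline: the function name denoise_change_report is hypothetical, the random arrays merely stand in for a real X-ray, and there is no universal cutoff for "too much", so the numbers are best used to compare filter settings against each other.

```python
import numpy as np


def denoise_change_report(original, denoised, max_value=255.0):
    """Summarise how much a filter altered an image."""
    a = original.astype(np.float64)
    b = denoised.astype(np.float64)
    diff = np.abs(a - b)
    mse = float(np.mean((a - b) ** 2))
    # PSNR in decibels: higher means the filtered image is closer to the original
    psnr = float("inf") if mse == 0 else float(10.0 * np.log10(max_value ** 2 / mse))
    return {
        "mean_abs_diff": float(diff.mean()),
        "max_abs_diff": float(diff.max()),
        "psnr_db": psnr,
    }


# Toy data standing in for a real image and two filtered versions of it
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
nudge = rng.integers(-2, 3, size=original.shape)
gentle = np.clip(original.astype(np.int64) + nudge, 0, 255).astype(np.uint8)
harsh = np.full_like(original, int(original.mean()))  # all detail flattened

gentle_report = denoise_change_report(original, gentle)
harsh_report = denoise_change_report(original, harsh)
print(gentle_report)
print(harsh_report)
```

A sharp drop in PSNR between two filter settings is a hint to look at the images side by side before committing to the stronger one.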

Putting it All Together: A Complete Pipeline

Here's how the six pillars combine into a single preprocessing function for chest X-ray images:

import cv2
import numpy as np

def preprocess_xray(image_path, target_size=(224, 224),
                    train_mean=0.482, train_std=0.236):
    """
    Full preprocessing pipeline for chest X-ray images.
    Applies all six pillars in sequence: validate, resize, denoise,
    enhance, scale, normalize. Relies on is_valid_image (Pillar 4)
    and resize_with_padding (Pillar 5) defined earlier.
    """
    # Pillar 4: Validate first and skip corrupted files
    if not is_valid_image(image_path):
        return None
    
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # cv2.imread returns None instead of raising when a file can't be decoded
    if image is None:
        return None
    
    # Pillar 5: Resize with aspect ratio preserved
    image = resize_with_padding(image, target_size)
    
    # Pillar 6: Gentle denoising
    image = cv2.medianBlur(image, 3)
    
    # Pillar 3: Enhance contrast to highlight lung texture
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image = clahe.apply(image)
    
    # Pillar 1: Scale to [0, 1]
    image = image.astype(np.float32) / 255.0
    
    # Pillar 2: Normalize using training set statistics
    image = (image - train_mean) / train_std
    
    return image
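
The train_mean and train_std defaults above are meant to come from the training set. The article doesn't show how those values were computed, so the following is only a minimal sketch of one way to do it. It assumes the images have already been through the same earlier steps and scaled to [0, 1]. The function name running_mean_std and the toy arrays are illustrative.

```python
import numpy as np


def running_mean_std(images):
    """Mean and standard deviation over every pixel of an iterable of images,
    accumulated one image at a time so the whole set never sits in memory."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for img in images:
        x = np.asarray(img, dtype=np.float64)
        n += x.size
        total += float(x.sum())
        total_sq += float(np.square(x).sum())
    if n == 0:
        raise ValueError("no images given")
    mean = total / n
    # Guard against tiny negative values caused by floating-point rounding
    variance = max(total_sq / n - mean ** 2, 0.0)
    return mean, variance ** 0.5


# Toy stand-ins for preprocessed training images scaled to [0, 1]
rng = np.random.default_rng(0)
train_images = [rng.random((224, 224)) for _ in range(4)]
mean, std = running_mean_std(train_images)
print(round(mean, 3), round(std, 3))
```

Computing these statistics on the training split only, and then reusing them unchanged for validation and test images, keeps information from held-out data out of the preprocessing.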

Try it Yourself

Every code snippet in this article is bundled into a runnable Kaggle notebook: Chest X-Ray Preprocessing — Kaggle Notebook. Fork it, attach the dataset, and run all the cells to see each preprocessing pillar in action on real chest X-rays.

Conclusion

Here's a summary of what we've discussed in this article:

Pillar	Purpose	Example
Scaling	Standardize pixel ranges	0-255 → 0-1
Normalization	Center brightness distributions	z-score normalization
Attention Guidance	Highlight diagnostic regions	CLAHE
Missing Data Handling	Remove unusable scans	Corrupted files
Resizing	Consistent input size	224×224
Denoising	Reduce acquisition noise	Median filter

Preprocessing for structured data is about making numbers play fair so a model can see them clearly.

Preprocessing for healthcare imaging is about respecting the messy reality of how medical data is captured, stored, and labeled. Some standard techniques carry over directly. Some need to be adapted. And a few preprocessing concerns only emerge once the data becomes pictures of human bodies.

Stepping back, whether it's a child learning to organize their toy box, or a model learning to spot pneumonia in a chest X-ray, the quality of learning depends on the quality of data preparation. Get the data right.

If this was useful, you can find a related conceptual primer on preprocessing more broadly here: Data Preprocessing for Machine Learning.

Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging

Lakshmi Mahabaleshwara — Fri, 29 May 2026 15:20:57 +0000

I built a clean, well-structured deep learning pipeline using MONAI (Medical Open Network for AI) on a public abdominal ultrasound dataset.

The pipeline included:

proper subject-grouped train/validation splits
robust preprocessing
carefully decoded segmentation masks
sensible loss functions
consistent evaluation

And the model still struggled to learn.

The interesting part isn't that the model underperformed. What mattered was the diagnosis: a series of simple checks that traced the problem back to the dataset, not the model.

Those checks are useful far beyond medical imaging. They apply to almost any machine learning project.

If you're new to ML, this is a lesson worth carrying into every project: understand your data before you tune your model.

I set out to build a medical image segmentation tutorial. I ended up learning a more valuable lesson: no amount of careful engineering can rescue a model from a dataset that can't support the task.

By the end of this article, you'll understand:

How to evaluate whether a dataset can actually support your task
Why "the model isn't learning" is often a data problem
How to rule out engineering bugs before blaming the data
Practical diagnostics you can run in minutes
Why synthetic training data often struggles in real-world deployment
When to stop tuning and walk away from a dataset

This is not a beginner introduction to deep learning – it assumes familiarity with concepts like UNet architectures and training loops. But the data-quality lessons apply broadly to many ML projects.

What We'll Cover:

The Dataset
Step 1: Rule Out the Pipeline Before Blaming the Data
Step 2: The Model Still Struggled
Step 3: Interrogating the Dataset
Step 4: Knowing When to Stop
A Practical Dataset Evaluation Checklist
What I Would Try Next
The Bigger Lesson

The Dataset

I used the US Simulation & Segmentation dataset, a public collection of abdominal ultrasound images with organ segmentation labels from Kaggle.

It contains:

926 synthetic ultrasound images — generated by a ray-casting simulator from CT scans, with full organ annotations
617 real ultrasound images — from an actual ultrasound scanner
Labels for 8 organs — liver, kidney, gallbladder, pancreas, spleen, bones, vessels, and adrenals

At first glance, the dataset looked ideal:

roughly 1,500 images
multiple organ classes
both synthetic and real ultrasound data

Whether it actually supported the task was a different question.

Step 1: Rule Out the Pipeline Before Blaming the Data

Ground rule: you should always rule out the pipeline before blaming the data. A model failing on buggy code looks exactly like a model failing on bad data. The engineering needs to be trustworthy.

Subject-Grouped Splits

A common mistake in medical imaging is randomly splitting images into train and test sets.

That approach is problematic because many frames come from the same patient. Those frames share anatomy, scanner settings, and noise patterns.

If frames from the same patient appear in both the train and test sets, the model can partially memorize patient-specific patterns. Test scores look artificially good, even though the model may fail on truly unseen patients.

This is called subject leakage.

The fix is to split by patient instead of by image:

from sklearn.model_selection import GroupShuffleSplit

def assign_splits(manifest, val_fraction=0.15, seed=42):
    train_data = manifest[manifest["orig_split"] == "train"]
    groups = train_data["subject_id"].values

    gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(gss.split(X=train_data, y=None, groups=groups))

    train_subjects = set(train_data.iloc[train_idx]["subject_id"].unique())
    val_subjects = set(train_data.iloc[val_idx]["subject_id"].unique())

    # Crash loudly if leakage ever sneaks in
    assert train_subjects.isdisjoint(val_subjects), "Subject leak detected!"
    return train_subjects, val_subjects

That assertion matters. If the split logic ever breaks, the pipeline fails loudly instead of silently producing misleading metrics.
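
To see what a grouped split does without any dependencies, here is a minimal plain-Python sketch of the same idea. It is not the scikit-learn implementation: the function name split_subjects and the toy manifest are assumptions made for illustration.

```python
import random


def split_subjects(subject_ids, val_fraction=0.15, seed=42):
    """Assign whole subjects (never individual frames) to train or validation."""
    subjects = sorted(set(subject_ids))
    if len(subjects) < 2:
        raise ValueError("need at least two subjects to make a grouped split")
    random.Random(seed).shuffle(subjects)
    # Keep at least one subject on each side of the split
    n_val = min(len(subjects) - 1, max(1, round(len(subjects) * val_fraction)))
    return set(subjects[n_val:]), set(subjects[:n_val])


# Hypothetical manifest rows: (subject_id, frame_name), three frames per subject
frames = [(f"p{i:02d}", f"frame_{j}.png") for i in range(10) for j in range(3)]
train_subjects, val_subjects = split_subjects(s for s, _ in frames)

# Frames follow their subject, so no patient can end up on both sides
train_frames = [f for f in frames if f[0] in train_subjects]
val_frames = [f for f in frames if f[0] in val_subjects]
print(len(train_subjects), len(val_subjects), len(train_frames), len(val_frames))
```

Because the shuffle happens over subject IDs rather than rows, every frame from a given patient lands in exactly one split.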

Decoding Masks Correctly

The dataset stores labels as color-coded masks. Each organ corresponds to a different RGB color.

Training requires converting those colors into integer class labels.

A naïve implementation uses exact color matching, but resizing operations can slightly alter colors at mask boundaries.

A more robust approach maps each pixel to its nearest palette color:

import numpy as np

PALETTE = np.array([
    [0, 0, 0],
    [100, 0, 100],
    [255, 255, 255],
    [0, 255, 0],
    [255, 255, 0],
    [0, 0, 255],
    [255, 0, 0],
    [255, 0, 255],
    [0, 255, 255],
], dtype=np.int32)

def decode_mask(mask_rgb):
    h, w = mask_rgb.shape[:2]
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    d2 = (
        (flat[:, None, :] - PALETTE[None, :, :]) ** 2
    ).sum(-1)
    classes = d2.argmin(axis=1).astype(np.uint8)
    return classes.reshape(h, w)

Before training, it’s worth visually checking a few decoded masks against the original images. This catches issues like incorrect palettes, RGB/BGR channel swaps, or resizing artifacts that silently corrupt labels.

These bugs rarely throw errors. Instead, the model simply learns poorly. And “trained on wrong labels” looks exactly like “the model can’t learn the data.”

Verifying masks early removes that uncertainty.
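
A tiny worked example shows why nearest-color matching is more forgiving than exact matching. This is a sketch under stated assumptions: the palette is cut down to three colors for brevity, the 2×2 mask is made up, and decode_exact and decode_nearest are illustrative names (the latter mirrors decode_mask above).

```python
import numpy as np

# Illustrative three-color subset of the palette (index = class label)
PALETTE = np.array([[0, 0, 0], [100, 0, 100], [255, 255, 255]], dtype=np.int32)


def decode_exact(mask_rgb, palette=PALETTE, unmatched=255):
    """Exact color matching: pixels matching no palette entry stay `unmatched`."""
    out = np.full(mask_rgb.shape[:2], unmatched, dtype=np.uint8)
    for idx, color in enumerate(palette):
        out[np.all(mask_rgb == color, axis=-1)] = idx
    return out


def decode_nearest(mask_rgb, palette=PALETTE):
    """Nearest-color matching by squared distance in RGB space."""
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    d2 = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).astype(np.uint8).reshape(mask_rgb.shape[:2])


# 2x2 mask whose colors were nudged slightly, as interpolation might do
mask = np.array([[[0, 0, 0], [98, 3, 101]],
                 [[250, 251, 255], [4, 1, 0]]], dtype=np.uint8)

exact = decode_exact(mask)
nearest = decode_nearest(mask)
print(exact.tolist())    # [[0, 255], [255, 255]]
print(nearest.tolist())  # [[0, 1], [2, 0]]
```

Exact matching recovers only the one untouched pixel, while nearest matching recovers all four labels. Resizing masks with nearest-neighbor interpolation avoids blended colors in the first place, so the two safeguards work well together.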

Loss Design and Class Weighting

For training, I used standard MONAI segmentation losses. The goal wasn’t to aggressively maximize performance, but to establish a stable and trustworthy baseline.

The training curves showed that the model optimized normally: the loss decreased consistently, and the validation Dice stabilized rather than diverging. This helped rule out optimization instability as the primary cause of poor final performance.

Three choices were deliberate:

Dice + Cross-Entropy combined: Cross-entropy keeps learning stable early on, while Dice directly rewards good region overlap. Together they balance each other.
include_background=False for binary segmentation: In a single-organ task, background can be 85–90% of the pixels. Counting it in the loss drowns out the signal for the organ you actually care about, so it's better left out.
Class weighting for multi-class segmentation: With organs of very different sizes, an unweighted loss lets the model ignore the small, rare ones and still score well. Weighting rare-class mistakes more heavily pushes back against that.
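
The reasoning behind these choices can be illustrated with a few lines of NumPy. This is a sketch, not the MONAI loss itself: dice_score and inverse_frequency_weights are hypothetical helper names, the masks are toy data, and the "balanced" weighting formula is one common heuristic rather than the scheme used in the project.

```python
import numpy as np


def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks (1.0 means perfect overlap)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def inverse_frequency_weights(labels, num_classes):
    """Per-class weights of n_pixels / (n_classes * class_count); absent classes get 0."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    weights = np.zeros(num_classes)
    present = counts > 0
    weights[present] = counts.sum() / (num_classes * counts[present])
    return weights


# Toy binary task: the organ covers 10% of a 10x10 image
target = np.zeros((10, 10), dtype=np.uint8)
target[4:6, 3:8] = 1
all_background = np.zeros_like(target)   # a model that predicts nothing
shifted = np.zeros_like(target)
shifted[4:6, 4:9] = 1                    # right shape, off by one column

accuracy_bg = float((all_background == target).mean())
dice_bg = dice_score(all_background, target)
dice_shifted = dice_score(shifted, target)
print(accuracy_bg, dice_bg, dice_shifted)

# Toy multi-class labels: 90 background, 8 and 2 pixels for two organs
labels = np.zeros((10, 10), dtype=np.int64)
labels[4:6, 3:7] = 1
labels[0, 0:2] = 2
weights = inverse_frequency_weights(labels, num_classes=3)
print(weights.round(3))
```

Predicting only background scores 90% pixel accuracy but a Dice of essentially zero, which is why a background-dominated term can hide a model that has learned nothing. The weights show the rarest class receiving the largest penalty for mistakes.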

Step 2: The Model Still Struggled

The first experiment focused on liver segmentation — the simplest single-organ task in the dataset.

Test set	Liver Dice
Synthetic test set	~0.68
Real ultrasound test set	~0.48

Dice scores range from 0 (no overlap) to 1 (perfect overlap).

Qualitatively, the predictions often captured rough liver regions but failed at boundaries and consistency across real scans.

Especially important:

the model struggled even on synthetic in-domain data
performance dropped further on real ultrasound images

At this point, two explanations were possible:

the model or pipeline was flawed
the dataset itself was limiting performance

Because the engineering had been carefully validated, the second possibility became worth investigating seriously.

That's where the real lesson began.

Step 3: Interrogating the Dataset

Rather than endlessly tuning the model, the productive move is to turn the diagnostic lens on the dataset.

Three simple checks revealed the real problem. None required retraining or expensive experiments.

Diagnostic 1: What Does the Dataset Actually Contain?

The first step was simply plotting the dataset composition.

926 labeled synthetic images (the bulk of training data)
Only 60 labeled real images — less than 4% of the dataset
557 unlabeled real images — real data exists, but without labels it can't be used for supervised training

This immediately changed the interpretation of the dataset.

Although the dataset contains many real ultrasound scans, almost all labeled training data is synthetic.

The model is effectively trained on synthetic ultrasound and expected to generalize to real ultrasound.

That's a difficult transfer problem from the start.

The limitation is simple: the real images mostly don't have labels, so supervised training has very little real-world data to learn from.

Lesson: Before training anything, chart the dataset composition. A headline image count can be misleading. "1,500 images" sounds large until you discover that only a tiny fraction are labeled examples from the target domain.
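
A composition check like this needs only a few lines. The sketch below uses the counts reported above, with a made-up manifest layout of (domain, labeled) pairs standing in for whatever metadata file a real dataset ships with.

```python
from collections import Counter

# Hypothetical manifest: one (domain, is_labeled) pair per image,
# built from the counts reported for this dataset
manifest = (
    [("synthetic", True)] * 926
    + [("real", True)] * 60
    + [("real", False)] * 557
)

composition = Counter(manifest)
total = sum(composition.values())

for (domain, labeled), n in sorted(composition.items()):
    print(f"{domain:9s} labeled={labeled!s:5s} {n:5d} {n / total:6.1%}")

# Share of the dataset that is both labeled and from the target (real) domain
real_labeled_share = composition[("real", True)] / total
print(f"labeled real share: {real_labeled_share:.1%}")
```

The last figure is the one that matters for supervised training on the target domain, and it is far smaller than the headline image count suggests.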

Diagnostic 2: Do Synthetic and Real Images Look Similar?

The next question was whether the synthetic and real ultrasound images actually followed similar visual distributions.

Plotting intensity histograms showed a clear mismatch.

synthetic images clustered heavily near darker intensities
real ultrasound images had broader mid-range intensity distributions

The synthetic simulator captured anatomical geometry reasonably well, but it didn't reproduce the texture and noise characteristics of real ultrasound:

speckle patterns
intensity falloff
scanner-specific artifacts

This is the classic synthetic-to-real domain gap.

The model learned features tuned to synthetic images and then encountered a substantially different distribution during evaluation. Poor transfer performance became expected rather than surprising.

Lesson: Whenever training and deployment happen on different domains — synthetic → real, scanner A → scanner B, hospital A → hospital B — measure the distribution shift directly. Simple histogram comparisons can reveal major problems in minutes.
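
A histogram comparison of this kind can be sketched as follows. Everything here is illustrative: the random arrays only imitate a darker and a brighter image set, and total variation distance is one simple choice of metric among several, not necessarily the one used for the plots in this project.

```python
import numpy as np


def intensity_histogram(images, bins=32):
    """Normalized intensity histogram pooled over a collection of 8-bit images."""
    pixels = np.concatenate([np.asarray(img).ravel() for img in images])
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 255))
    return hist / hist.sum()


def total_variation(p, q):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    return 0.5 * float(np.abs(p - q).sum())


# Toy stand-ins: a dark-clustered set and a broader, brighter set
rng = np.random.default_rng(0)
synthetic_like = [np.clip(rng.normal(40, 20, (64, 64)), 0, 255) for _ in range(5)]
real_like = [np.clip(rng.normal(110, 45, (64, 64)), 0, 255) for _ in range(5)]

p_synthetic = intensity_histogram(synthetic_like)
p_real = intensity_histogram(real_like)
shift = total_variation(p_synthetic, p_real)
print(round(shift, 3))
```

A value near 0 means the two domains have similar brightness profiles, while a value approaching 1 signals a large gap that is worth investigating before any training run.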

Diagnostic 3: Can the Gap Be Fixed by Adding Real Data?

The obvious next idea was: why not include some real labeled data during training?

But before implementing that approach, it's worth checking how many distinct patients actually had labels.

Labeled real images: 60
Distinct subjects (labeled real): 4

Frames per subject:
  subject h: 26
  subject a: 16
  subject g: 10
  subject b: 8

Only four patients.

That result fundamentally changed the situation.

Proper medical imaging evaluation requires subject-grouped train/test splits. But with only four patients, any evaluation becomes statistically unstable.

Training on two or three patients and testing on one or two patients would produce highly unreliable metrics that depend heavily on which patient happened to be held out.

At that point, the dataset simply couldn't support trustworthy real-world evaluation.

Lesson: In medical imaging, count subjects, not images. The true size of a dataset is bounded by the number of independent patients, not the number of files.
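
Counting subjects takes a few lines, and it also shows how lopsided any held-out evaluation would be. The sketch below rebuilds the per-subject frame counts reported above. The leave-one-subject-out loop is an illustration of the instability, not an experiment that was run in this project.

```python
from collections import Counter

# One entry per labeled real frame, using the counts reported above
frame_subjects = ["h"] * 26 + ["a"] * 16 + ["g"] * 10 + ["b"] * 8

frames_per_subject = Counter(frame_subjects)
n_frames = len(frame_subjects)
print(f"{n_frames} frames from {len(frames_per_subject)} subjects")

# Leave-one-subject-out: each fold tests on a single held-out patient
folds = []
for held_out, n_test in sorted(frames_per_subject.items()):
    n_train = n_frames - n_test
    folds.append((held_out, n_train, n_test))
    print(f"hold out {held_out}: train on {n_train} frames, "
          f"test on {n_test} ({n_test / n_frames:.0%} of the data)")
```

The test set swings from 8 to 26 frames depending on which patient is held out, and each fold measures performance on exactly one person's anatomy.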

Step 4: Knowing When to Stop

At this point, additional tuning no longer made sense.

The bottleneck was not the architecture, optimizer, or learning rate. The bottleneck was the dataset itself.

The pipeline was still valuable and reusable. But this particular dataset couldn't reliably support the intended segmentation task.

That distinction matters: sometimes a problem is difficult but solvable, and sometimes the data simply can't support the conclusion you want to draw.

Learning to recognize the difference is an important ML skill.

A Practical Dataset Evaluation Checklist

Before committing weeks to model development, these checks are worth running on any dataset:

Chart the dataset composition — labeled vs unlabeled, class distribution, domain distribution
Count subjects, not images — independent patients matter more than frame count
Check class balance — rare classes are often ignored without weighting or sampling strategies
Compare train and deployment distributions — especially for cross-domain problems
Verify labels visually — catch preprocessing or annotation errors early
Look for published baselines — low published performance may indicate dataset limitations

These checks take minutes and can save weeks of unnecessary tuning.

What I Would Try Next

Improving results would likely require better data rather than a larger model. The next steps I'd prioritize:

collecting more labeled real ultrasound scans, from more distinct patients
improving annotation consistency
semi-supervised learning to make use of the unlabeled real images
domain adaptation between synthetic and real ultrasound

All of these target the actual bottleneck: data quality and data diversity.

The Bigger Lesson

In machine learning, it's easy to focus most of our attention on architectures, hyperparameters, optimization tricks, and newer models.

But the dataset quietly defines the ceiling.

A sophisticated model on weak data often disappoints, while a simpler model on strong data performs surprisingly well.

That was the real lesson from this project.

The most valuable skill wasn't building the pipeline. It was diagnosing why the model couldn't succeed and being willing to trust what the data was saying.

The workflow — checking dataset composition, counting subjects, comparing distributions, ruling out engineering bugs, and deciding when to stop — transfers to almost any ML project.

In many projects, better judgment about the data matters more than a better model.

The pipeline code and diagnostic notebooks are available at the MONAI Ultrasound Working Group repository. Questions, corrections, and improvements are always welcome.

How to Build an AI-Powered Medical Image De-Identification Pipeline for Clinical Research

Lakshmi Mahabaleshwara — Fri, 22 May 2026 15:06:15 +0000

Medical imaging is transforming healthcare. Researchers are training deep learning models to detect pneumonia from chest X-rays, estimate cardiac function from echocardiograms, and identify tumors from MRI scans. But before any of these images can be shared with researchers or used to train machine learning models, one critical challenge must be solved.

How Do We Protect Patient Privacy?

Medical images often contain sensitive information such as patient names, dates of birth, hospital identifiers, and accession numbers. Some of this information is stored in DICOM (Digital Imaging and Communications in Medicine) metadata, but much of it is also burned directly into the image pixels.

In this tutorial, you’ll learn how to build an AI-powered de-identification pipeline that removes PHI from both metadata and image pixels. Along the way, we’ll explore OCR (Optical Character Recognition), NER (Named Entity Recognition), and standards-based DICOM processing.

At the end, I’ll show how I combined these ideas into an open-source PyTorch project called Aegis.

What You’ll Build
Prerequisites
Why Privacy Matters in Medical Imaging
Understanding PHI, HIPAA, and DICOM
- What Is PHI?
- What Is HIPAA?
- What Is DICOM?
Why Metadata Anonymization Is Not Enough in DICOM format
OCR and AI for Identifying PHI
Pixel Redaction and DICOM Scrubbing
- DICOM Metadata Scrubbing
Building the Complete Pipeline
Challenges and Lessons Learned
How I Built Aegis
Key Design Decisions
Future Directions
Conclusion

What You’ll Build

In this tutorial, you’ll build a custom MONAI (PyTorch) preprocessing pipeline that automatically de-identifies medical images before they are used for clinical research or AI model training.

The pipeline will:

Discover DICOM studies
Load metadata and pixel data
Detect burned-in text using OCR
Classify text as PHI or non-PHI
Redact sensitive pixel regions
Remove PHI from DICOM metadata
Save privacy-safe images for downstream AI workflows

By the end, you’ll have a reusable MONAI transform that can be integrated directly into any medical imaging workflow to prepare privacy-safe datasets for research and deep learning.

Prerequisites

To follow this tutorial, you should have:

Intermediate Python experience
Basic understanding of PyTorch
Familiarity with medical imaging concepts
Python 3.10 or later

We’ll use:

MONAI
pydicom
EasyOCR
NumPy
Transformers
Stanford de-identifier (an NER model hosted on Hugging Face)

Set Up the Environment

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # On Windows: venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install the core libraries used in this tutorial
pip install \
    monai \
    pydicom \
    easyocr \
    numpy \
    transformers \
    torch 

# Download the Stanford medical de-identification model from Hugging Face
python -c "
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'StanfordAIMI/stanford-deidentifier-base'
AutoTokenizer.from_pretrained(model_name)
AutoModelForTokenClassification.from_pretrained(model_name)
print('Stanford NER model downloaded successfully.')
"

Why Privacy Matters in Medical Imaging

Healthcare organizations generate enormous volumes of imaging data every day. These datasets are invaluable for:

Clinical research
Multi-center collaborations
Regulatory submissions
Artificial intelligence model development
Educational datasets

But privacy regulations such as HIPAA (the Health Insurance Portability and Accountability Act) in the United States require that PHI (Protected Health Information) be removed before data can be shared. This creates a significant bottleneck.

Many hospitals still rely on manual review to inspect thousands of images, searching for patient identifiers hidden in metadata and image annotations. This process is slow, expensive, and prone to human error.

Automated de-identification solves this problem by combining software engineering, computer vision, and natural language processing.

Understanding PHI, HIPAA, and DICOM

What Is PHI?

Protected Health Information (PHI) includes any information that can identify a patient, such as:

Name
Medical record number
Date of birth
Study date
Hospital ID
Accession number

What Is HIPAA?

The Health Insurance Portability and Accountability Act (HIPAA) defines rules for safeguarding patient data. One common approach to de-identification is the Safe Harbor method, which requires removing 18 specified categories of identifiers before data is shared.

What Is DICOM?

Medical images such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Ultrasound (US) are commonly stored in the DICOM (Digital Imaging and Communications in Medicine) format, the international standard for storing and exchanging medical imaging data.

Unlike ordinary image formats such as JPEG or PNG, a DICOM file contains both the image itself and a rich set of structured metadata that describes the patient, the study, and the imaging procedure.

A typical DICOM file contains two main components:

Pixel Data – the actual medical image, such as a CT slice, MRI volume, or ultrasound frame.
Metadata – structured fields that may include:
- Patient name and medical record number
- Date of birth
- Study and acquisition dates
- Imaging modality (CT, MRI, US)
- Scanner manufacturer and technical acquisition parameters

This combination makes DICOM far more than just an image format. It serves as a standardized container that allows imaging devices, hospital systems, and research software to exchange data reliably and consistently.

DICOM metadata often contains protected health information (PHI), and identifiers may also be burned directly into the image pixels, particularly in ultrasound studies. Both the metadata and the pixel data must therefore be de-identified before images can be safely shared for clinical research or AI development.

Why Metadata Anonymization Is Not Enough in DICOM format

Many tools remove PHI only from metadata. For example, deleting the PatientName tag may appear sufficient.

But in modalities such as ultrasound, fluoroscopy, and some X-ray workflows, identifying information is often burned directly into the image.

Common examples include:

NAME: JOHN DOE
DOB: 01/01/1980
MRN: 123456
HOSPITAL: ABC

If these annotations remain, privacy is still compromised. This means a complete solution must inspect both:

DICOM metadata
Image pixels

OCR and AI for Identifying PHI

To detect PHI embedded in pixels, we first need to find all visible text.

Step 1: Optical Character Recognition (OCR)

OCR converts image text into machine-readable strings.

import easyocr
reader = easyocr.Reader(['en'])
results = reader.readtext('ultrasound.png')

Each OCR result typically includes:

Bounding box coordinates – where the text appears in the image
Extracted text – the recognized characters
Confidence score – how certain the model is about the result

Example:

[
  ([[10, 20], [120, 20], [120, 45], [10, 45]], 'JOHN DOE', 0.98)
]
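
Each result is a four-corner polygon, so a small helper is usually needed before the text region can be sliced out of an image. The following is a minimal sketch: the function name to_boxes, the 0.5 review threshold, and the second sample detection are assumptions for illustration. Low-confidence detections are flagged rather than dropped, because silently discarding them could let PHI slip through.

```python
def to_boxes(ocr_results, review_below=0.5):
    """Convert (quad, text, confidence) tuples into axis-aligned boxes."""
    boxes = []
    for quad, text, confidence in ocr_results:
        xs = [int(round(point[0])) for point in quad]
        ys = [int(round(point[1])) for point in quad]
        boxes.append({
            "text": text,
            "confidence": confidence,
            "x1": min(xs), "y1": min(ys),
            "x2": max(xs), "y2": max(ys),
            # Flag uncertain reads for a human instead of discarding them
            "needs_review": confidence < review_below,
        })
    return boxes


# First entry is the example above; the second is a made-up low-confidence read
sample_results = [
    ([[10, 20], [120, 20], [120, 45], [10, 45]], "JOHN DOE", 0.98),
    ([[200, 20], [260, 22], [259, 44], [199, 42]], "B-M0DE", 0.41),
]

boxes = to_boxes(sample_results)
for box in boxes:
    print(box)
```

Taking the minimum and maximum over all four corners also handles slightly rotated text, at the cost of a box that is a little larger than the text itself.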

Step 2: Determine Whether the Text Is PHI

Not all detected text should be removed.

Medical images also contain clinically relevant labels such as:

LEFT VENTRICLE
APICAL VIEW
B-MODE

To distinguish PHI from legitimate clinical text, we can combine:

Allowlists of known clinical terms
Regular-expression heuristics
Named Entity Recognition (NER)

Step 3: Named Entity Recognition

NER models identify entities such as:

PERSON
DATE
LOCATION
ID

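def contains_phi(text):
    # Cheap rule-based checks run first
    if looks_like_date(text):
        return True
    if looks_like_identifier(text):
        return True
    # Otherwise defer to the NER model (names, locations, and so on)
    return ner_model.predict(text)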

This hybrid approach reduces both false positives and false negatives.
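
The helpers in that snippet are left undefined, so here is one hedged way they might look. The regular expressions, the allowlist contents, and the ner_predict parameter are all assumptions for illustration, and a real system would need far broader patterns. The NER step is passed in as a plain callable so the sketch runs without downloading a model.

```python
import re

# Illustrative allowlist built from the clinical labels mentioned above
CLINICAL_ALLOWLIST = {"LEFT VENTRICLE", "APICAL VIEW", "B-MODE"}

# Hypothetical patterns: numeric dates and labelled record numbers
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
ID_PATTERN = re.compile(r"\b(?:MRN|ID|ACC(?:ESSION)?)\s*[:#]?\s*\d{4,}\b", re.IGNORECASE)


def looks_like_date(text):
    return bool(DATE_PATTERN.search(text))


def looks_like_identifier(text):
    return bool(ID_PATTERN.search(text))


def contains_phi(text, ner_predict=None):
    """Allowlist first, then rules, then an optional NER callable."""
    normalized = " ".join(text.upper().split())
    if normalized in CLINICAL_ALLOWLIST:
        return False
    if looks_like_date(text) or looks_like_identifier(text):
        return True
    if ner_predict is not None:
        return bool(ner_predict(text))
    return False


def toy_ner(text):
    # Stand-in for a real NER model: treats one known name as a PERSON
    return "JOHN DOE" in text.upper()


for line in ["NAME: JOHN DOE", "DOB: 01/01/1980", "MRN: 123456", "B-MODE"]:
    print(line, contains_phi(line, ner_predict=toy_ner))
```

Rules alone cannot recognize a name, which is why the NER step carries that case. Note that this sketch returns False when no NER callable is supplied, and a production pipeline might prefer to redact by default when it is unsure.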

Pixel Redaction and DICOM Scrubbing

Pixel Redaction

Once PHI is detected, the corresponding image regions can be masked.

image[y1:y2, x1:x2] = 0

This replaces the sensitive area with black pixels.
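
In practice the box coordinates need clamping to the image bounds, and a small margin helps cover character edges that the detector clipped. The sketch below assumes a NumPy image array. The function name redact_box and the 2-pixel default margin are illustrative choices, not values from the original project.

```python
import numpy as np


def redact_box(image, x1, y1, x2, y2, margin=2):
    """Black out a rectangle in place, padded by `margin` and clamped to the image."""
    height, width = image.shape[:2]
    x1, y1 = max(0, x1 - margin), max(0, y1 - margin)
    x2, y2 = min(width, x2 + margin), min(height, y2 + margin)
    image[y1:y2, x1:x2] = 0
    return image


# Toy all-white grayscale frame, 60 rows by 200 columns
frame = np.full((60, 200), 255, dtype=np.uint8)
redact_box(frame, x1=10, y1=20, x2=120, y2=45)

print(int(frame[30, 50]), int(frame[0, 0]))  # 0 255
```

The function modifies the array it is given, so passing a copy is the safer choice when the original pixels are still needed, for example to compare before and after during review.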

DICOM Metadata Scrubbing

Using pydicom, metadata fields can be modified or removed.

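import pydicom

ds = pydicom.dcmread('study.dcm')
ds.PatientName = 'ANONYMIZED'
# Deleting a tag that is absent raises an error, so check first
if 'PatientBirthDate' in ds:
    del ds.PatientBirthDate
# Save under a new (example) filename so the original study stays untouched
ds.save_as('study_deid.dcm')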

Additional steps may include:

Removing private tags
Replacing UIDs
Recursively processing nested sequences

Together, metadata scrubbing and pixel redaction provide comprehensive de-identification.

Building the Complete Pipeline

The overall workflow looks like this:

Discover medical image files
Load DICOM metadata and pixel data
Run OCR on annotation regions
Classify text as PHI or non-PHI
Redact sensitive pixel regions
Remove PHI from metadata
Save the de-identified output

Challenges and Lessons Learned

Building a production-ready de-identification system involves many practical challenges.

Clinical Terminology

OCR may detect legitimate labels that should not be removed.

OCR Errors

Low-contrast text and ultrasound overlays can produce inaccurate detections.

Nested DICOM Sequences

PHI may appear in deeply nested metadata structures.

Multi-Frame Studies

Ultrasound cine loops may contain dozens or hundreds of frames.

Deterministic Pseudonymization

Researchers often need the same patient to receive the same replacement identifier across studies.

These challenges require thoughtful engineering rather than a single machine learning model.
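
Deterministic pseudonymization, for instance, can be sketched with a keyed hash from the standard library. This is one possible approach and not necessarily how Aegis does it: the function name pseudonymize, the ANON- prefix, the 12-character length, and the example key are all assumptions.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key, prefix="ANON-", length=12):
    """Map a patient identifier to a stable pseudonym using HMAC-SHA256.

    The same ID and key always give the same pseudonym; without the key,
    the mapping cannot be recomputed from a list of candidate IDs.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return f"{prefix}{digest.hexdigest()[:length].upper()}"


# Example key only; a real key would come from a secrets store, never source code
key = b"example-project-key"

first_visit = pseudonymize("MRN123456", key)
second_visit = pseudonymize("MRN123456", key)
other_patient = pseudonymize("MRN654321", key)
print(first_visit == second_visit, first_visit == other_patient)
```

A keyed hash is used instead of a plain hash because medical record numbers come from a small space, so an unkeyed hash could be reversed by simply hashing every possible number. Truncating the digest keeps the pseudonym readable but raises the chance of collisions, so the length is a trade-off to choose with the dataset size in mind.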

How I Built Aegis

While exploring this problem, I developed an open-source MONAI (PyTorch based) project called Aegis.

Aegis combines:

OCR-based text detection
AI-driven PHI classification
Pixel-level redaction
Standards-based DICOM de-identification
Batch processing for research workflows

Key Design Decisions

Standards First

I aligned metadata scrubbing with the DICOM confidentiality profile to follow established healthcare standards.

Hybrid AI + Rules

Clinical allowlists, heuristics, and NER models work together to improve accuracy.

Ultrasound-Specific Optimization

Aegis uses SequenceOfUltrasoundRegions to focus OCR on annotation areas instead of scanning the entire image.

Deterministic Identity Management

Consistent pseudonyms enable longitudinal research while protecting privacy.

Open Source Architecture

The project is modular, testable, and designed to integrate with research pipelines.

You can explore the full implementation in the Aegis GitHub repository:

https://github.com/lakshmi-mahabaleshwara/aegis

Future Directions

Automated de-identification continues to evolve.

Future enhancements may include:

Multilingual OCR
Handwriting recognition
Vision-language models
Human-in-the-loop review
Cloud-native deployment
Integration with AI training pipelines

As healthcare AI expands, privacy-preserving data preparation will become even more important.

Conclusion

Clinical research depends on access to high-quality medical imaging data.

But privacy regulations require that patient identifiers be removed from both DICOM metadata and image pixels.

By combining OCR, named entity recognition, pixel redaction, and standards-based DICOM processing, we can automate this task and dramatically reduce the burden of manual review.

The techniques covered in this tutorial are applicable far beyond a single project.

Whether you’re building a hospital data pipeline, preparing research datasets, or training the next generation of healthcare AI models, automated de-identification is a foundational capability.

To put these ideas into practice, I built Aegis as an open source reference implementation.

More importantly, the underlying concepts can help developers and researchers create privacy-safe workflows that accelerate innovation while respecting patient confidentiality.

References