Medical Imaging - freeCodeCamp.org

Build Your Own Healthcare AI Assistant with MedGemma, Ollama, and Open WebUI

Lakshmi Mahabaleshwara — Wed, 08 Jul 2026 23:21:21 +0000

Healthcare data is among the most sensitive data there is. Sending it to a cloud AI service is often not an option because of privacy requirements, regulatory compliance, or both.

In this tutorial, you’ll build a healthcare AI assistant that runs entirely on your own machine using three open-source tools:

MedGemma, Google’s open medical AI model for understanding medical text and images
Ollama, the easiest way to download and run AI models locally
Open WebUI, a ChatGPT-style web interface for interacting with local models

By the end, you’ll be able to chat with a medically tuned AI model, upload medical images such as chest X-rays for analysis, and do it all locally, without sending your data to the cloud.

Important disclaimer before we start: MedGemma is a developer model, not a medical device. Its outputs are not intended to directly inform clinical diagnosis, patient management, or treatment decisions.

Everything you build in this tutorial is for learning, prototyping, and research. Always consult qualified healthcare professionals for real medical questions.

What We'll Cover:

Who is This Tutorial For?
What is MedGemma?
Why Run Models Locally?
Prerequisites
Architecture Diagram
Step 1: Install Ollama
Step 2: Pull MedGemma
Step 3: Test MedGemma from the Terminal
Step 4: Install Open WebUI
- Option A: Docker (recommended)
- Option B: pip (no Docker)
Step 5: Connect Open WebUI to Ollama
Step 6: Start Chatting with MedGemma
Step 7: Upload Medical Images
Example Prompts to Try
Running Larger Models
Troubleshooting Guide
Conclusion

Who is This Tutorial For?

This tutorial is ideal if you’re:

learning healthcare AI
building medical RAG systems
experimenting with radiology assistants
developing medical education tools
researching multimodal models

What is MedGemma?

MedGemma is a collection of open models from Google, built on the Gemma 3 architecture and specifically trained for medical text and image comprehension. Think of it as Gemma after four years of medical school and a radiology residency.

Why MedGemma?

Unlike general-purpose models such as Llama or Mistral, MedGemma is designed specifically for healthcare applications.

Medical image understanding: Its multimodal models are trained on de-identified medical images, including chest X-rays, dermatology, ophthalmology, and pathology images.
Medical language expertise: It has been trained on medical literature and clinical question-answer datasets, enabling it to better understand medical terminology and radiology reports.
Multiple model sizes: MedGemma is available in 4B and 27B variants, both supporting text and image inputs with a 128K context window.
Open weights: You can download, run, fine-tune, and build applications with the model locally under the Health AI Developer Foundation's terms of use.

MedGemma is intended as a foundation model for developers building healthcare applications, medical education tools, research assistants, report summarizers, and other AI-powered medical workflows.

Why Run Models Locally?

You could call a hosted medical model through an API. So why go local? In healthcare, the case is stronger than almost anywhere else.

First, there's the principle of privacy by architecture. When the model runs on your machine, medical text and images never leave your device. There's no API log, no third-party data processor, no data processing agreement to negotiate.

For anyone working near PHI (Protected Health Information), "the data never left the laptop" is the simplest compliance story that exists.

Next, you have zero per-token cost. Experimentation is free once the model is downloaded. You can iterate on prompts hundreds of times without watching a billing dashboard.

You also get offline access. Hospitals, labs, and field clinics often have restricted or air-gapped networks. A local model works without internet after the initial download.

And you have full control over the setup: you choose the model version, you pin it, and it never changes underneath you. No deprecation notices, no silent behavior changes.

Finally, it's a great way to learn. Running models locally demystifies them. You'll develop intuition for context windows, quantization, and memory constraints that you simply don't get from calling an API.

Prerequisites

Here's what you need before starting:

Hardware:

8 GB RAM minimum (16 GB recommended) for the MedGemma 4B model. The download is about 3.3 GB.
32 GB RAM or a 24 GB+ GPU if you want to run the 27B model (a roughly 17 GB download).
Around 15 GB of free disk space to be comfortable (model + Docker images + working room).
Apple Silicon Macs (M1 through M4) are excellent for this. Ollama uses Metal acceleration automatically. On Windows and Linux, an NVIDIA GPU helps a lot but isn't required. A CPU-only inference works, just slower.

Software:

macOS, Linux, or Windows 10/11
Docker Desktop (for the recommended Open WebUI installation), or Python 3.11 if you prefer installing Open WebUI with pip
Basic comfort with the terminal

That's it. No API keys, no accounts, and no GPU cloud credits.

Architecture Diagram

Step 1: Install Ollama

Ollama is a lightweight runtime that handles downloading, quantizing, and serving open models through a simple CLI and a local REST API.

On macOS:

Download the app from ollama.com/download and drag it to Applications, or install via Homebrew:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows:

Download the native Windows installer from ollama.com/download and run it. (Ollama now supports Windows natively, no WSL required.)

Once installed, verify it works:

ollama --version

You should see a version number printed. Ollama also starts a background service that listens on http://localhost:11434. This is the API that Open WebUI will talk to later. You can confirm the server is up with:

curl http://localhost:11434

which should return Ollama is running.

Step 2: Pull MedGemma

MedGemma is available directly in the official Ollama model library, so downloading it is one command:

ollama pull medgemma

This pulls the default 4B multimodal variant, about a 3.3 GB download.

If you want to be explicit about the size (useful when you later experiment with the 27B model):

ollama pull medgemma:4b     # 3.3 GB — multimodal, runs on most laptops
ollama pull medgemma:27b    # 17 GB — multimodal, needs serious hardware

When the download finishes, confirm the model is installed:

ollama list

You should see medgemma in the output along with its size.

Step 3: Test MedGemma from the Terminal

Before adding a UI, let's make sure the model actually works. Start an interactive session:

ollama run medgemma

You'll get a >>> prompt. Try a medical question:

>>> What are the classic radiographic signs of pneumonia on a chest X-ray?

MedGemma should respond with a structured answer covering findings like consolidation, air bronchograms, and silhouette signs — the kind of answer that shows its radiology training.

Try one more to see the clinical reasoning:

>>> Explain the difference between Type 1 and Type 2 diabetes to a first-year medical student.

A few useful commands inside the session:

/bye — exit the session
/clear — clear the conversation context
/show info — display model details (parameters, quantization, context length)

You can also test image input directly from the terminal by passing a file path directly in the prompt:

>>> Describe the key findings in this image. ./chest_xray_sample.png

While this works, uploading images through Open WebUI is much more convenient.

Step 4: Install Open WebUI

Open WebUI gives you a clean, ChatGPT-style interface on top of Ollama: conversation history, model switching, image uploads, and multi-user support, all self-hosted.

Option A: Docker (recommended)

Start by installing Docker.

Make sure Docker Desktop is running, then launch Open WebUI with:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Let's break down what this command does:

-d runs the container in the background
-p 3000:8080 maps port 3000 on your machine to the WebUI's internal port 8080
--add-host=host.docker.internal:host-gateway lets the container reach the Ollama server running on your host machine
-v open-webui:/app/backend/data creates a Docker volume so your chats and settings survive container restarts
--restart always brings the UI back up automatically after reboots

Option B: pip (no Docker)

If you'd rather skip Docker, you can instead install Open WebUI as a Python package (Python 3.11 is the supported version):

pip install open-webui
open-webui serve

This starts the interface at http://localhost:8080 instead of port 3000.

Step 5: Connect Open WebUI to Ollama

Open your browser and go to http://localhost:3000 (or :8080 if you used pip).

On first launch, you'll be asked to create an admin account. This account is stored locally on your machine (it's not a cloud signup).

In most setups, Open WebUI auto-detects Ollama at http://localhost:11434 and you're done.

If your models don't appear, wire up the connection manually:

Click your profile icon and go to Admin Panel then Settings then Connections.
Under Ollama API, set the URL:
- Docker install: http://host.docker.internal:11434
- pip install: http://localhost:11434
Click the refresh icon to verify the connection, then save.

Head back to the main chat screen, and medgemma should now appear in the model dropdown at the top.

You can check the troubleshooting section below if you face any errors.

Step 6: Start Chatting with MedGemma

Select medgemma from the model selector and start a conversation. A good first test might look like this:

Summarize this radiology report in plain language a patient could understand:

"Impression: Mild cardiomegaly. Small right pleural effusion.
No focal consolidation. Degenerative changes of the thoracic spine."

You should get a clear, patient-friendly explanation of each finding. This "clinical language to plain language" translation is one of MedGemma's genuine strengths.

There are a few Open WebUI features worth knowing about:

System prompts: Click the model name and set a system prompt like "You are a medical education assistant. Always explain your reasoning and cite the relevant physiology." This shapes every response in the conversation.
Conversation history: Every chat is saved locally and searchable from the sidebar.
Multiple models: You can add llama3.2, gemma3, or any other Ollama model and compare their answers to the same medical question side by side. This is a great way to see the difference domain training makes.

Step 7: Upload Medical Images

This is where MedGemma really separates itself from general-purpose models. Because its vision encoder was pre-trained on medical imaging, it can meaningfully describe radiographs, skin lesions, fundus photos, and histopathology patches.

To try it:

Start a new chat with medgemma selected.
Click the + (or image) icon in the message box, or simply drag and drop an image file.
Add a prompt alongside the image and hit send.

For sample images you can test with (without touching any real patient data), try public teaching datasets like the NIH ChestX-ray14 dataset, MedPix, or Radiopaedia's teaching cases.

Example workflow with a chest X-ray:

[Upload: chest_xray.png]

You are an expert radiology assistant. Describe this chest X-ray
systematically: technical quality, lungs, heart, mediastinum, bones,
and soft tissues. Then summarize the key findings.

MedGemma will typically walk through the image in the systematic order you asked for, which mirrors how radiologists are trained to read films.

Two important caveats:

Ollama and Open WebUI work with standard image formats (PNG, JPEG). Clinical DICOM files need to be converted to PNG/JPEG first — a one-liner with Python libraries like pydicom + Pillow.
Never upload images containing patient-identifying information (names, MRNs, dates burned into the image) unless the data has been properly de-identified. Even on a local machine, good data hygiene is a habit worth building.

Example Prompts to Try

Here are prompts that showcase different capabilities. Use them as starting points:

Medical education:

Create a comparison table of ACE inhibitors vs ARBs: mechanism, common examples, key side effects, and contraindications.

Clinical documentation:

Convert these shorthand clinic notes into a structured SOAP note:"45F, 3d cough + fever 101F, no SOB, lungs clear, likely viral URI, supportive care, return if worse"

Report translation for patients:

Explain this MRI impression to a worried patient in a reassuring but honest tone: "Small disc protrusion at L4-L5 without significant canal stenosis or nerve root compression."

Image analysis (with an uploaded dermatology photo):

Describe this skin lesion using the ABCDE criteria
(Asymmetry, Border, Color, Diameter, Evolution cannot be assessed from a single image — note that explicitly).

Differential reasoning:

A 60-year-old presents with sudden painless vision loss in one eye. List the top 5 differential diagnoses and the key distinguishing feature of each.

Notice a pattern: the best results come from prompts that give MedGemma a role, a structure to follow, and explicit constraints. That's true of all LLMs, but it matters even more in a domain where precision counts.

Running Larger Models

The 4B model is impressive for its size, but the 27B variant is noticeably stronger at complex clinical reasoning, longer differential diagnoses, and nuanced report interpretation.

The trade-off is hardware:

Model	Download	Realistic RAM/VRAM needed	Best for
`medgemma:4b`	3.3 GB	8 GB+ RAM	Laptops, quick iteration, image Q&A
`medgemma:27b`	17 GB	32 GB RAM or 24 GB VRAM	Deep reasoning, complex cases

To try the 27B model:

ollama pull medgemma:27b
ollama run medgemma:27b

Practical tips for larger models:

Watch your memory: Run ollama ps to see how much RAM/VRAM a loaded model is using and whether it's running on GPU, CPU, or split across both. A model that spills from GPU to CPU gets dramatically slower.
On Apple Silicon, a 32 GB M-series Mac runs the 27B model comfortably.
Free memory between models: Ollama keeps models loaded for a few minutes after use. Unload immediately with ollama stop medgemma:27b if you need the RAM back.
Sanity-check the speed trade-off: If the 27B model generates at 2–3 tokens per second on your machine, the 4B model at 30+ tokens/second may be the better.

You can keep both installed and switch between them in the Open WebUI dropdown — 4B for fast iteration, 27B when you need the deeper reasoning.

Troubleshooting Guide

Error: `registry.ollama.ai/library/medgemma:latest does not support tools`

This is the most common MedGemma-specific error, and it means Open WebUI is sending native tool/function definitions with your request. MedGemma (like base Gemma 3) doesn't support Ollama's tools API, so the request is rejected before the model even sees your message.

Hunt down whatever is attaching tools, in this order:

Model capabilities (most likely culprit): Go to the Admin Panel, then Settings, then Models, then medgemma, then uncheck Builtin Tools, Web Search, Code Interpreter, and Terminal under Capabilities, and make sure every item in the Builtin Tools checklist is unticked. Keep Vision, File Upload, and File Context checked. Newer Open WebUI versions enable builtin tools by default, so a fresh install will hit this immediately.
Task model: Go to Admin Panel, then Settings, then Interface, and make sure neither the local nor external Task Model is set to medgemma. Background jobs like title and follow-up generation use tool calls — route them to llama3.2 or similar.
Function Calling mode: Set to Default (not Native) in the model's Advanced Params and in your user Settings, General, Advanced Parameters.
Global functions/filters: Go to Admin Panel, then Functions, and disable the Global toggle on any active function, since global functions attach to every model.
Per-chat toggles: In the message box, make sure web search and code interpreter toggles are off, and no Tools are attached via the + menu.

Then start a new chat (old chats can carry stale settings) and test. To confirm the model itself is fine, run ollama run medgemma "hello" in your terminal. If that works, the issue is purely Open WebUI configuration.

The container can't reach Ollama. Check that:

Ollama is actually running: curl http://localhost:11434 should return Ollama is running.
The connection URL in Admin Panel, Settings, Connections is http://host.docker.internal:11434 (Docker) — localhost won't work from inside a container because it refers to the container itself.
On Linux, if host.docker.internal doesn't resolve, add --network=host to your docker run command instead and use http://localhost:11434.

`ollama pull medgemma` says model not found

Update Ollama, as MedGemma requires a recent version. Re-run the installer or, on macOS, click the menu bar icon and then Update. Then retry the pull.

Responses are extremely slow

Check ollama ps — if the model shows a large CPU percentage, it doesn't fit in your GPU/unified memory. Switch to the 4B model.
Close memory-hungry apps (browsers with 40 tabs are the usual suspect).
On first message, models take several seconds to load into memory, subsequent messages are much faster.

Image upload doesn't work or the model ignores the image

Make sure you selected medgemma (multimodal) and not a text-only model in the dropdown.
Use PNG or JPEG. DICOM files must be converted first.
Very high-resolution images can cause issues — resize to something reasonable (e.g., 1024px on the long edge) before uploading.

Port 3000 is already in use

Map a different host port: change -p 3000:8080 to -p 3001:8080 and access the UI at http://localhost:3001.

Your machine doesn't have enough free RAM/VRAM. Stick with medgemma:4b, or free memory and try again. There is no shame in the 4B model — it punches well above its weight.

Conclusion

In this tutorial, you built a complete, private healthcare AI assistant from scratch — and it took three tools and a handful of terminal commands.

Let's recap what you accomplished:

Installed Ollama and pulled MedGemma, a medically-tuned multimodal model, onto your own machine
Verified the model from the terminal, then put a full chat interface on top of it with Open WebUI
Configured the model's capabilities correctly so tool-calling features don't break a model that doesn't support them
Chatted with a model that understands radiology reports, clinical terminology, and medical images — and uploaded images for analysis
Learned how to scale up to the 27B model and how to diagnose the most common errors along the way.

You now have a fully private AI assistant running entirely on your own machine. From here, you can extend it with retrieval-augmented generation (RAG), integrate it with medical imaging pipelines, or connect it to de-identified clinical datasets to build more advanced healthcare AI applications.

Happy building!

Further reading:

The Hidden PHI Problem in Medical Images: Building a Synthetic Dataset for AI De-Identification

Lakshmi Mahabaleshwara — Fri, 19 Jun 2026 17:23:54 +0000

In this article, you'll learn how my team built a synthetic PHI generation pipeline to create privacy-safe training and validation data for medical imaging AI.

The Problem

Imagine you’re building an AI system that removes patient information from medical images.

The model needs thousands of examples showing where Protected Health Information (PHI) appears and what it looks like. The more examples it sees, the better it becomes at finding and removing sensitive information.

But there is a problem:

The data you need to train the model is the same data you’re not allowed to share freely.

Healthcare organizations must protect patient privacy. Regulations like HIPAA require that patient identifiers are removed before medical images can be shared for research, AI development, or external collaboration.

This creates an interesting engineering challenge: How do you build and test de-identification systems when the data needed to train those systems can't be easily used?

One practical solution is Synthetic PHI.

In this article, I’ll show why synthetic PHI is valuable, explain the hidden PHI problem inside medical images, and walk through a pipeline my team built that generates realistic ultrasound datasets with fully controlled synthetic patient information.

What You'll Learn in This Tutorial

By the end of this tutorial, you'll understand:

The hidden PHI challenges in medical imaging data.
Why synthetic PHI is useful for building and testing healthcare AI systems.
How to generate realistic synthetic patient identities using Python and Faker.
How to inject PHI into both image pixels and DICOM metadata.
How to create ground-truth labels for AI model training and evaluation.
How to validate synthetic medical imaging datasets before using them in downstream workflows.

Source Images: OpenPOCUS

The synthetic PHI generation uses lung point-of-care ultrasound (POCUS) frames from OpenPOCUS, an openly licensed collection of real ultrasound images contributed by the POCUS community.

These images carry no real PHI. OpenPOCUS provides clinically authentic ultrasound images while avoiding patient privacy concerns. This makes it an ideal foundation for synthetic PHI generation because we can focus entirely on creating and tracking identifiers without risking exposure of real patient information.

The Iceberg Problem: Most PHI Is Hidden

When people think about PHI in medical images, they usually think about visible text overlays.

These include:

Patient name
Medical Record Number (MRN)
Date of birth
Study date

These identifiers are often burned directly into image pixels by ultrasound, X-ray, CT, and MRI systems.

But visible text is only the tip of the iceberg. Much of the remaining PHI lives inside the DICOM header, a collection of metadata fields that describe the image and the study. These fields contains identifiers such as PatientName, PatientID, StudyDate, institution names, and other sensitive information.

Unlike burned-in text, header PHI isn't visible when looking at the image itself, but it travels with the file and must also be removed during de-identification.

A de-identification system must handle both.

Removing visible text while leaving PHI inside DICOM metadata still creates a privacy risk. Likewise, stripping metadata while leaving patient names burned into image pixels is equally problematic.

This hidden PHI challenge makes testing de-identification software much harder than it first appears.

Why Synthetic PHI Matters

At first glance, it seems hospitals already have plenty of real-world data available. So why not simply use that?

The answer comes down to three challenges.

Challenge 1: Privacy Regulations

Medical images often contain patient identifiers.

Sharing those images outside secure clinical environments introduces significant legal and compliance risk.

The more institutions involved, the more difficult governance becomes.

Challenge 2: Annotation at Scale

Modern AI systems require labeled examples.

Someone must identify:

Where PHI appears
What type of PHI is it
Which DICOM tags contain PHI

Creating these annotations manually is expensive and time-consuming.

Challenge 3: Validation

Suppose you’re evaluating a de-identification tool. How do you know whether it successfully removed every identifier?

With real patient data, you often don’t know exactly where every piece of PHI exists. Without ground truth, measuring accuracy becomes difficult.

Synthetic PHI Solves All Three Problems

Instead of starting with real patient identifiers, we can generate realistic fake identities and intentionally inject them into medical images.

Because the pipeline creates the PHI itself, we know:

Every identifier value
Every pixel location
Every DICOM tag
Every expected output

This gives us perfect ground truth.

Now, a de-identification system can be evaluated objectively. If a patient name remains after processing, we know it failed. If clinical content is accidentally removed, we know that too.

Synthetic PHI creates a privacy-safe dataset that can be used for:

Training AI models
Benchmarking de-identification software
Regression testing
Validation before deployment

Building a Synthetic PHI Pipeline

To explore this problem, my team built a pipeline that generates synthetic PHI for lung Point-of-Care Ultrasound (POCUS) images.

The goal was to:

Start with ultrasound images containing no patient information.
Generate realistic synthetic patient identities.
Burn PHI into image pixels.
Insert matching PHI into DICOM metadata.
Automatically generate ground truth labels.
Validate the resulting DICOM files.

The output looks realistic from the perspective of a de-identification system while containing no real patient information.

Pipeline Architecture

The workflow looks like this (we'll go over each step in detail below):

Each stage produces artifacts consumed by the next stage. Failures are quarantined rather than silently ignored.

Safety Checks Before Burning

Before writing synthetic PHI onto an image, the pipeline performs a safety check to ensure that the selected region to insert PHI lies outside the ultrasound fan.

The top-left corner of a lung POCUS image is usually outside the imaging fan, a dark border, safe to burn PHI onto without obscuring clinical content.

To make sure this region holds good for every image, the pipeline runs two checks per image:

Brightness check: If the average intensity of the configured burn region exceeds a threshold, the region likely overlaps the ultrasound fan rather than the dark border.
Boundary check: The pipeline verifies that the configured burn region fits entirely within the image. Images that are smaller than the expected burn area are quarantined.

In either case, the image is quarantined with the reason recorded into the manifest. There are no partial burns, no overwritten clinical content, and no silent corruption of test data.

This prevents synthetic identifiers from accidentally obscuring anatomy.

def burn_region_is_safe(arr):
    """Check the burn region is dark enough to be outside the fan."""
    h, w = arr.shape
    y2 = min(BURN_REGION_Y + BURN_REGION_H, h)
    x2 = min(BURN_REGION_X + BURN_REGION_W, w)
    region = arr[BURN_REGION_Y:y2, BURN_REGION_X:x2]
    if region.size == 0:
        return False, float("nan")
    mean = float(region.mean())
    return mean <= BRIGHTNESS_SKIP_THRESHOLD, mean

The function extracts the configured burn region and computes its average brightness. If the region is too bright, it likely overlaps the ultrasound fan rather than the border.

Step 1: Generate Synthetic Patient Identities

The synthetic identity is produced by Faker and seeded per case, so the same image always yields the same fake patient.

Determinism matters because:

Reproducing a test result requires reproducing the test data.
Debugging downstream tools is easier when the input doesn't change between runs.
Comparing two de-identification tools fairly requires both to see the same planted PHI.

def case_seed(global_seed: int, source_id: str) -> int:
    """Per-image deterministic seed derived from global seed and source path."""
    h = hashlib.sha256(f"{global_seed}|{source_id}".encode()).hexdigest()
    return int(h[:8], 16)


def generate_phi(seed: int) -> dict:
    fake = Faker()
    Faker.seed(seed)
    rng = random.Random(seed)

    last = fake.last_name()
    first = fake.first_name()
    middle = fake.random_letter().upper()
    mrn = f"{rng.randint(1000000, 9999999)}"
    dob = fake.date_of_birth(minimum_age=18, maximum_age=95)
    study_date = fake.date_time_this_decade()
    institution = rng.choice(INSTITUTION_POOL)

    return {
        "case_uuid": f"SYNTH-{uuid.UUID(int=rng.getrandbits(128))}",
        "patient_name_display": f"{last}, {first} {middle}.",
        "patient_name_dicom": f"{last}^{first}^{middle}",   # DICOM PN VR format
        "patient_id": mrn,
        "dob": dob,
        "study_date": study_date,
        "institution_name": institution,
    }

The case_seed() function generates a deterministic seed from the source image path. That seed is then used by Faker to create a synthetic identity.

Because the seed is repeatable, the same input image always receives the same synthetic patient information. This makes debugging and benchmarking reproducible.

Step 2: Burn PHI into Image Pixels

Rendering text onto an image is comparatively expensive. For a single zone containing 30+ frames, repeating that work per frame is wasteful.

The pipeline instead renders the PHI overlay onto a transparent canvas one time per zone. This mirrors how many ultrasound systems operate in practice, where patient information remains fixed while the underlying image content changes from frame to frame.

def make_phi_overlay(shape, phi):
    """Render PHI ONCE onto a canvas. Returns (overlay_array, overlays_meta)."""
    h, w = shape
    canvas = Image.new("L", (w, h), 0)  # blank canvas
    draw = ImageDraw.Draw(canvas)

    overlays, x, y = [], BURN_REGION_X, BURN_REGION_Y
    for entry in _phi_text_block(phi):
        x0, y0, x1, y1 = draw.textbbox((x, y), entry["line"], font=FONT)
        tw, th = x1 - x0, y1 - y0

        if x + tw > w or y + th > h:
            raise ValueError(
                f"rendered PHI overflows image: '{entry['line']}' "
                f"at ({x},{y}) size ({tw}x{th}), image {w}x{h}"
            )

        draw.text((x, y), entry["line"], font=FONT, fill=TEXT_COLOR)
        overlays.append({
            "phi_category": entry["phi_category"],
            "rendered_text": entry["line"],
            "phi_value": entry["value"],
            "bbox": [x, y, tw, th],
            "dicom_tag": entry["dicom_tag"],
        })
        y += th + LINE_GAP
    return np.array(canvas), overlays

The make_phi_overlay() function creates a blank canvas and renders each PHI line onto it. At the same time, it records metadata such as the rendered text, bounding box coordinates, and corresponding DICOM tag.

The function returns both the image overlay and the annotation metadata, ensuring that the ground truth always matches the pixels that were actually drawn.

Rendering once and reusing the overlay provides several advantages:

Faster processing
Consistent PHI placement across frames
Simplified ground-truth generation
Behavior that more closely matches real ultrasound devices

An additional benefit is that the pipeline automatically records the location of every burned identifier.

Step 3: Add PHI to DICOM Headers

The DICOM standard supports two ways to represent a cine ultrasound loop: as a sequence of single-frame DICOMs that share a series UID, or as one multi-frame DICOM where the pixel data holds every frame stacked together.

The pipeline uses the multi-frame approach because:

It matches how real ultrasound devices write cine loops.
One header serves all frames — no duplication of patient metadata.
Storage and transfer are more efficient.

ds.PatientName = phi["patient_name_dicom"]
ds.PatientID = deid_patient_id
ds.PatientBirthDate = phi["dob"].strftime("%Y%m%d")

ds.StudyInstanceUID = study_uid
ds.StudyDate = phi["study_date"].strftime("%Y%m%d")
ds.InstitutionName = phi["institution_name"]

These fields populate the DICOM header with the same synthetic identity used in the image overlay. This ensures that visible PHI and hidden metadata remain consistent, producing realistic test data.

A few details that the DICOM standard enforces but the spec doesn't make obvious:

StudyID is required and must be a short string, distinct from StudyInstanceUID. It's easy to forget.
ImageType must be present. ["DERIVED", "SECONDARY"] is the honest value for synthetic data because it wasn't acquired by a device.
Manufacturer is part of the General Equipment IOD module and is required even though the data is synthetic. Setting it to a clearly synthetic value (SYNTHETIC-DEID-TUTORIAL) makes the origin unambiguous.

Step 4: Identity Mapping: The De-Identified PatientID

To support downstream evaluation, every source patient receives a stable identifier such as DEID-0001. A mapping file links source patients, synthetic studies, and generated DICOM objects. This allows evaluators to compare a de-identification tool’s output against the original ground truth.

source_patient,deid_patient_id,study_instance_uid
patient_001,DEID-0001,1.2.826.0.1.3680043.8.498.1234...
patient_002,DEID-0002,1.2.826.0.1.3680043.8.498.5678...

Step 5: Ground Truth: Structured CSV Output

One major advantage of synthetic PHI is automatic label generation. Because the pipeline creates every identifier, it already knows the text value, bounding box coordinates, and corresponding DICOM tag.

These annotations are exported as structured CSV files and become the ground truth used for training and evaluation.

def build_overlay_rows(*, case_uuid, sop_instance_uid, source_id, source_relpath, output_dicom_relpath, overlays,
                      image_shape):
    h, w = image_shape
    rows = []
    for ov in overlays:
        x, y, ow, oh = ov["bbox"]
        rows.append({
            "case_uuid": case_uuid,
            "sop_instance_uid": sop_instance_uid,
            "source_id": source_id,
            "source_relpath": source_relpath,
            "output_dicom_relpath": output_dicom_relpath,
            "image_h": h,
            "image_w": w,
            "region": "top_left_banner",
            "phi_category": ov["phi_category"],
            "phi_value": ov["phi_value"],
            "rendered_text": ov["rendered_text"],
            "bbox_x": x, "bbox_y": y,
            "bbox_w": ow, "bbox_h": oh,
            "dicom_tag": ov["dicom_tag"],
            "seed": SEED,
            "pipeline_version": PIPELINE_VERSION,
            "run_id": RUN_ID,
        })
    return rows

build_overlay_rows function converts each overlay into a row of structured metadata. Along with the text and bounding box coordinates, it records identifiers and reproducibility information such as the pipeline version and random seed.

These CSV files become the ground truth used for training and evaluating de-identification systems.

At the end of the run, the accumulated rows are grouped by de-identified patient ID and written into per-patient CSV files. Each patient folder receives its own phi_overlays.csv covering all of that patient's zones, alongside a run_manifest.csv summarizing zone-level status (processed, quarantined, failed) and paths.

Three-Tier DICOM Validation

A synthetic DICOM file is only useful if it actually conforms to the DICOM standard. Otherwise, downstream tools that consume it will fail or worse silently mis-handle it.

The pipeline uses a three-tier validation chain that gracefully degrades depending on what's available in the environment:

dciodvfy from dicom3tools: the most rigorous standards-conformance validator, written by David Clunie. It's not pip-installable. It checks against the full DICOM IOD definitions. If it's available on PATH, this is the preferred check.
dicom-validator CLI: this is pip-installable. It downloads the DICOM standard definitions on first run, then validates IOD compliance. it's used when dciodvfy isn't available.
pydicom re-read: the minimal fallback. It confirms that every file can be re-opened, decoded, and that pixel data round-trips correctly. It doesn't check standards compliance, but catches gross corruption.

A Surprising Bug: MONAI vs PIL

Originally, I planned to use MONAI for image loading because it's widely used in medical imaging workflows.

During testing, I discovered an issue: MONAI’s image loading conventions caused non-square images to appear rotated when downstream code assumed traditional image layouts.

At the same time, many ultrasound images contained EXIF orientation metadata that required correction.

Switching to PIL solved both issues.

from PIL import Image, ImageOps

img = Image.open(path)
img = ImageOps.exif_transpose(img)

Final Thoughts

Synthetic PHI does not replace real-world testing, but it provides something healthcare AI teams rarely have: a safe, shareable, and fully labeled dataset with known answers.

By generating realistic identifiers and embedding them into both image pixels and DICOM metadata, we can build reproducible benchmarks for de-identification systems without exposing real patient data.

As AI systems become increasingly responsible for handling sensitive medical information, synthetic PHI may become one of the most important tools for building trustworthy healthcare AI workflows.

The complete implementation is available as a Jupyter notebook in the MONAI Ultrasound Working Group repository. You can explore the notebook and experiment with the pipeline yourself.

Sometimes the safest way to test whether a system can remove PHI is to create the PHI yourself.

How to Preprocess Medical Images for Machine Learning – A Guide Using Chest X-Rays

Lakshmi Mahabaleshwara — Thu, 04 Jun 2026 17:13:59 +0000

Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different once your data becomes medical images.

In this article, you’ll learn how to prepare a real-world medical imaging dataset for machine learning, from initial data validation to a complete preprocessing pipeline.

We’ll use the Chest X-Ray Pneumonia dataset as our running example, but the lessons apply broadly to healthcare imaging data, including ultrasound, MRI, CT, and dermatology images.

What You'll Learn in This Article

By the end of this article, you'll know how to:

Approach healthcare data preprocessing differently from preprocessing structured data, and recognize where standard techniques fall short
Validate a medical imaging dataset before training to catch corrupted files, mislabels, and data leakage between train and test
Apply six core preprocessing techniques for medical images
Build a complete preprocessing pipeline for chest X-rays using Python with OpenCV.

What We'll Cover:

Why Preprocessing Data Matters More in Healthcare
The Dataset
Before Preprocessing: Validate the Dataset
The Six Pillars of Healthcare Imaging Preprocessing
Pillar 1: Scaling — Making the Numbers Play Fair
Pillar 2: Normalization — Centering the Data
Pillar 3: Guiding the Model's Attention
Pillar 4: Handling Missing Data
Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame
Pillar 6: Denoising & Artifact Handling — Cleaning the Window
Putting it All together: A Complete Pipeline
Try it Yourself
Conclusion

Why Preprocessing Data Matters More in Healthcare

Imagine handing a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed together. The toddler can't solve it, but that isn't really the toddler's fault.

The same thing happens when raw, messy data gets fed into a machine learning model. A bad prediction on a clinical image can mean a missed diagnosis.

Healthcare data tends to be messier than what most ML practitioners are used to:

Images come from different machines, hospitals, and acquisition protocols
Labels are inconsistent, sometimes missing, sometimes wrong
Patient data is incomplete
Image sizes, contrast levels, and orientations vary across sources

Poor preprocessing often leads to models that perform well on benchmark datasets but struggle to generalize to data collected from different hospitals or imaging devices.

The Dataset

This guide uses the Chest X-Ray Pneumonia dataset by Paul Mooney on Kaggle. It's a strong choice for learning preprocessing because:

It contains around 5,800 pediatric chest X-rays
It has two clear classes — Normal and Pneumonia
It's already organized into train, validation, and test folders
The images are recognizable without specialized medical training
It exhibits almost every preprocessing challenge worth learning

The dataset is available at Kaggle: Chest X-Ray Pneumonia.

Folder Structure

After downloading, the dataset is organized like this:

chest_xray/
├── train/
│   ├── NORMAL/
│   └── PNEUMONIA/
├── val/
│   ├── NORMAL/
│   └── PNEUMONIA/
└── test/
    ├── NORMAL/
    └── PNEUMONIA/

Side-by-side comparison — Normal vs Pneumonia chest X-ray:

A quick first look at one of the images:

import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import cv2

DATA_DIR = "chest_xray"
TRAIN_DIR = os.path.join(DATA_DIR, "train")

# Peek at a sample image
sample_path = os.path.join(TRAIN_DIR, "NORMAL", os.listdir(os.path.join(TRAIN_DIR, "NORMAL"))[0])
sample_image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

print(f"Image shape: {sample_image.shape}")
print(f"Pixel range: {sample_image.min()} to {sample_image.max()}")
print(f"Data type: {sample_image.dtype}")

The output reveals a few useful things right away: most images are large (often around 1500×2000 pixels), pixel values fall in the 0–255 range, and image sizes vary across the dataset. Each of these observations will inform a preprocessing step.

Before Preprocessing: Validate the Dataset

Before applying any transformations, it's worth checking that the data itself is intact. This step alone catches issues that would otherwise cause training to fail silently or produce misleading results.

A simple validation function:

def validate_dataset(data_dir):
    """Scan a dataset folder and flag common data quality issues."""
    corrupted = []
    too_small = []
    nearly_black = []
    total = 0
    
    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for fname in os.listdir(class_path):
            fpath = os.path.join(class_path, fname)
            total += 1
            try:
                img = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    corrupted.append(fpath)
                    continue
                if img.shape[0] < 100 or img.shape[1] < 100:
                    too_small.append(fpath)
                if img.mean() < 5:
                    nearly_black.append(fpath)
            except Exception:
                corrupted.append(fpath)
    
    print(f"Total files scanned: {total}")
    print(f"Corrupted: {len(corrupted)}")
    print(f"Too small: {len(too_small)}")
    print(f"Nearly black: {len(nearly_black)}")
    return corrupted, too_small, nearly_black

validate_dataset(TRAIN_DIR)

Common issues this catches:

Corrupted files — files that won't open at all
Empty or nearly-black images — failed acquisitions or saved-as-blank files
Wrong dimensions — thumbnails or partial downloads mixed in
Duplicate images — the same scan appearing in both train and test (this causes data leakage)
Mislabeled images — a normal X-ray placed in the pneumonia folder

⚠️ This step is critical, One corrupted file can crash a training loop hours into a run. One duplicate between train and test can inflate accuracy scores by several percentage points without anyone noticing.

The Six Pillars of Healthcare Imaging Preprocessing

Preprocessing for medical images can be organized around six core concerns. Two of them carry over directly from preprocessing structured data. Two need to be adapted because the mechanics change when the input is an image. And two are entirely new, they only exist once the data becomes pictures of human bodies.

Pillar 1: Scaling — Making the Numbers Play Fair

Imagine two children comparing their collections. One has 3 seashells. The other has 3,000 stickers. Asking who has more makes the answer seem obvious, but the scales are completely different. Comparing them meaningfully means putting both collections on the same measuring system.

In medical images, pixels usually range from 0 to 255 in 8-bit images, or 0 to 65,535 in some 16-bit medical DICOM images. Neural networks tend to train faster and more reliably when input values are small numbers close to zero.

The fix: Divide every pixel by its maximum possible value, bringing everything into the 0-to-1 range.

image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

# Scale to [0, 1]
image_scaled = image.astype(np.float32) / 255.0

print(f"Before scaling: {image.min()} to {image.max()}")
print(f"After scaling:  {image_scaled.min():.3f} to {image_scaled.max():.3f}")

Takeaway: Pixel scaling follows the same principle as scaling any numerical feature. The values simply happen to be arranged as an image rather than a column.

Pillar 2: Normalization — Centering the Data

Imagine a teacher asks a class to rate a movie from 1 to 10. One child always gives 9s and 10s. Another spreads ratings evenly from 1 to 10. Comparing their opinions fairly requires adjusting each child's score relative to their own average.

In medical imaging even after scaling to 0–1, the overall brightness of images can vary. Some X-rays are taken with stronger exposure than others. Normalization shifts and rescales each image (or each channel) so the values are centered around zero with a standard deviation of one.

The fix: Subtract the mean, divide by the standard deviation.

# Compute mean and std from the TRAINING set only — never from validation or test
def compute_train_stats(train_dir, sample_limit=1000):
    """Compute pixel mean and std across the training set."""
    pixel_values = []
    count = 0
    for class_name in os.listdir(train_dir):
        class_path = os.path.join(train_dir, class_name)
        for fname in os.listdir(class_path):
            if count >= sample_limit:
                break
            img = cv2.imread(os.path.join(class_path, fname), cv2.IMREAD_GRAYSCALE)
            if img is not None:
                pixel_values.append(img.astype(np.float32).flatten() / 255.0)
                count += 1
    pixels = np.concatenate(pixel_values)
    return pixels.mean(), pixels.std()

train_mean, train_std = compute_train_stats(TRAIN_DIR)
image_normalized = (image_scaled - train_mean) / train_std

⚠️ Avoid this common mistake: Statistics for normalization should be computed from the training set only, never from validation or test. Including those in the calculation leaks information from the evaluation data into the model. The same statistics should then be applied to validation, test, and any new data at inference time.

Takeaway: Centering and scaling each image around the dataset's statistics is the imaging equivalent of standardizing a feature column. The pixels are now comparable across images, regardless of how bright or dim each scan happened to be.

Pillar 3: Guiding the Model's Attention

Imagine a child walking into a crowded pet store. Instead of describing every animal in sight, a parent points to the features that matter: “Look at the soft fur, the fluffy tail, and the nice small size.” The child learns where to focus their attention.

Medical image preprocessing does something similar. It highlights the regions and features most relevant to the diagnostic task.

Region-of-interest (ROI) cropping — focus on the lung field and discard the patient's arms, machine borders, and any imprinted text
Contrast enhancement — use techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) to make subtle lung textures more visible
Channel selection — for images stored as RGB but containing grayscale information, convert to single-channel input to reduce noise

CLAHE applied to an X-ray:

# CLAHE enhances local contrast — useful for X-rays
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
image_enhanced = clahe.apply(image)

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(image_enhanced, cmap='gray')
axes[1].set_title('After CLAHE')
plt.show()

Takeaway: The goal of teaching the model what to look at hasn't changed. With structured data, the answer is in new columns. With images, the answer is in cropping, enhancement, and emphasizing the regions that carry diagnostic signal.

Pillar 4: Handling Missing Data

Imagine reading a storybook with a few damaged pages. You don’t throw away the entire book, you decide whether to skip the page, infer what might be missing, or mark it for review.

In medical imaging, missing data can mean corrupted files, missing labels, or incomplete studies rather than empty spreadsheet cells.

The same three strategies — drop, impute, flag — still apply, just with different mechanics:

# Strategy 1: Drop — remove unreadable or empty images
def is_valid_image(path):
    try:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            return False
        if img.mean() < 5:           # nearly black
            return False
        if img.shape[0] < 50 or img.shape[1] < 50:  # too small
            return False
        return True
    except Exception:
        return False

# Strategy 2: Impute — rare for images, but possible (e.g., in painting to fill in missing patches). Generally avoided for diagnostic data.

# Strategy 3: Flag — track which patients are missing which modalities,
#   and let the model condition on availability. Common in multi-modal healthcare ML.

Takeaway: "Missing" in imaging data is rarely just a NaN. It can be a broken file, an unlabeled scan, an absent modality, or a black corner inside an image. The same three strategies still apply.

Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame

Imagine displaying children’s drawings on a classroom wall. If every drawing is a different size, they won’t fit neatly into the display. You resize them while preserving their proportions.

Medical images must often be resized to a common input size, but anatomical structures should retain their original shape.

The fix: Resize all images to a common shape. For medical data, how the resizing is done matters.

TARGET_SIZE = (224, 224)

# Simple resize (may distort aspect ratio)
image_resized = cv2.resize(image, TARGET_SIZE)

# Better: preserve aspect ratio with padding
def resize_with_padding(image, target_size):
    h, w = image.shape[:2]
    target_h, target_w = target_size
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    resized = cv2.resize(image, (new_w, new_h))
    
    pad_h = target_h - new_h
    pad_w = target_w - new_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                 cv2.BORDER_CONSTANT, value=0)
    return padded

image_clean_resize = resize_with_padding(image, TARGET_SIZE)

⚠️ Why aspect ratio matters in healthcare: Squishing a chest X-ray horizontally makes the lungs look unnatural. Models trained on distorted anatomy often perform worse on real scans. Preserving aspect ratio is generally the safer choice.

Takeaway: Models need a consistent input size, but the geometry of the anatomy needs to be preserved. Resize, but resize carefully.

Pillar 6: Denoising & Artifact Handling — Cleaning the Window

Imagine looking through a window with dust and smudges on the glass. Cleaning the window makes the view clearer, but scrubbing too aggressively could scratch the glass.

Similarly, medical images often contain noise and acquisition artifacts that should be reduced carefully without removing clinically important details.

For chest X-rays, the most common issues are mild noise and burned-in text or markers. A gentle median or bilateral filter helps with the first, while cropping or masking helps with the second.

# Gentle denoising — careful not to blur away clinical detail
image_denoised = cv2.medianBlur(image, ksize=3)

# Bilateral filter preserves edges better than a median filter
image_bilateral = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)

⚠️ A note of caution: Aggressive denoising can erase the features a model needs to detect a disease. For diagnostic ML, gentle filtering is generally preferred. A useful rule of thumb: if a radiologist can't distinguish the cleaned image from the original, the filtering has gone too far.

Takeaway: Imaging data carries noise that structured data doesn't have. The window can be cleaned, but never so aggressively that the view is wiped away with the smudges.

Putting it All Together: A Complete Pipeline

Here's how the six pillars combine into a single preprocessing function for chest X-ray images:

def preprocess_xray(image_path, target_size=(224, 224),
                    train_mean=0.482, train_std=0.236):
    """
    Full preprocessing pipeline for chest X-ray images.
    Applies all six pillars in order.
    """
    # Pillar 4: Validate first — skip corrupted files
    if not is_valid_image(image_path):
        return None
    
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    
    # Pillar 5: Resize with aspect ratio preserved
    image = resize_with_padding(image, target_size)
    
    # Pillar 6: Gentle denoising
    image = cv2.medianBlur(image, 3)
    
    # Pillar 3: Enhance contrast to highlight lung texture
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image = clahe.apply(image)
    
    # Pillar 1: Scale to [0, 1]
    image = image.astype(np.float32) / 255.0
    
    # Pillar 2: Normalize using training set statistics
    image = (image - train_mean) / train_std
    
    return image

Try it Yourself

Every code snippet in this article is bundled into a runnable Kaggle notebook: Chest X-Ray Preprocessing — Kaggle Notebook. Fork it, attach the dataset, and run all the cells to see each preprocessing pillar in action on real chest X-rays.

Conclusion

Here's a summary of what we've discussed in this article:

Pillar	Purpose	Example
Scaling	Standardize pixel ranges	0-255 → 0-1
Normalization	Center brightness distributions	z-score normalization
Attention Guidance	Highlight diagnostic regions	CLAHE
Missing Data Handling	Remove unusable scans	Corrupted files
Resizing	Consistent input size	224×224
Denoising	Reduce acquisition noise	Median filter

Preprocessing for structured data is about making numbers play fair so a model can see them clearly.

Preprocessing for healthcare imaging is about respecting the messy reality of how medical data is captured, stored, and labeled. Some standard techniques carry over directly. Some need to be adapted. And a few preprocessing concerns only emerge once the data becomes pictures of human bodies.

Stepping back, whether it's a child learning to organize their toy box, or a model learning to spot pneumonia in a chest X-ray, the quality of learning depends on the quality of data preparation. Get the data right.

If this was useful, you can find a related conceptual primer on preprocessing more broadly here: Data Preprocessing for Machine Learning.

Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging

Lakshmi Mahabaleshwara — Fri, 29 May 2026 15:20:57 +0000

I built a clean, well-structured deep learning pipeline using MONAI (Medical Open Network for AI) on a public abdominal ultrasound dataset.

The pipeline included:

proper subject-grouped train/validation splits
robust preprocessing
carefully decoded segmentation masks
sensible loss functions
consistent evaluation

And the model still struggled to learn.

The interesting part isn't that the model underperformed. What mattered was the diagnosis: a series of simple checks that traced the problem back to the dataset, not the model.

Those checks are useful far beyond medical imaging. They apply to almost any machine learning project.

If you're new to ML, this is a lesson worth carrying into every project: understand your data before you tune your model.

I set out to build a medical image segmentation tutorial. I ended up learning a more valuable lesson: no amount of careful engineering can rescue a model from a dataset that can't support the task.

By the end of this article, you'll understand:

How to evaluate whether a dataset can actually support your task
Why "the model isn't learning" is often a data problem
How to rule out engineering bugs before blaming the data
Practical diagnostics you can run in minutes
Why synthetic training data often struggles in real-world deployment
When to stop tuning and walk away from a dataset

This is not a beginner introduction to deep learning – it assumes familiarity with concepts like UNet architectures and training loops. But the data-quality lessons apply broadly to many ML projects.

What We'll Cover:

The Dataset
Step 1: Rule Out the Pipeline Before Blaming the Data
Step 2: The Model Still Struggled
Step 3: Interrogating the Dataset
Step 4: Knowing When to Stop
A Practical Dataset Evaluation Checklist
What I Would Try Next
The Bigger Lesson

The Dataset

I used the US Simulation & Segmentation dataset, a public collection of abdominal ultrasound images with organ segmentation labels from Kaggle.

It contains:

926 synthetic ultrasound images — generated by a ray-casting simulator from CT scans, with full organ annotations
617 real ultrasound images — from an actual ultrasound scanner
Labels for 8 organs — liver, kidney, gallbladder, pancreas, spleen, bones, vessels, and adrenals

At first glance, the dataset looked ideal:

thousands of images
multiple organ classes
both synthetic and real ultrasound data

Whether it actually supported the task was a different question.

Step 1: Rule Out the Pipeline Before Blaming the Data

Ground rule: you should always rule out the pipeline before blaming the data. A model failing on buggy code looks exactly like a model failing on bad data. The engineering needs to be trustworthy.

Subject-Grouped Splits

A common mistake in medical imaging is randomly splitting images into train and test sets.

That approach is problematic because many frames come from the same patient. Those frames share anatomy, scanner settings, and noise patterns.

If frames from the same patient appear in both the train and test sets, the model can partially memorize patient-specific patterns. Test scores look artificially good, even though the model may fail on truly unseen patients.

This is called subject leakage.

The fix is to split by patient instead of by image:

from sklearn.model_selection import GroupShuffleSplit

def assign_splits(manifest, val_fraction=0.15, seed=42):
    train_data = manifest[manifest["orig_split"] == "train"]
    groups = train_data["subject_id"].values

    gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(gss.split(X=train_data, y=None, groups=groups))

    train_subjects = set(train_data.iloc[train_idx]["subject_id"].unique())
    val_subjects = set(train_data.iloc[val_idx]["subject_id"].unique())

    # Crash loudly if leakage ever sneaks in
    assert train_subjects.isdisjoint(val_subjects), "Subject leak detected!"
    return train_subjects, val_subjects

That assertion matters. If the split logic ever breaks, the pipeline fails loudly instead of silently producing misleading metrics.

Decoding Masks Correctly

The dataset stores labels as color-coded masks. Each organ corresponds to a different RGB color.

Training requires converting those colors into integer class labels.

A naïve implementation uses exact color matching, but resizing operations can slightly alter colors at mask boundaries.

A more robust approach maps each pixel to its nearest palette color:

import numpy as np

PALETTE = np.array([
    [0, 0, 0],
    [100, 0, 100],
    [255, 255, 255],
    [0, 255, 0],
    [255, 255, 0],
    [0, 0, 255],
    [255, 0, 0],
    [255, 0, 255],
    [0, 255, 255],
], dtype=np.int32)

def decode_mask(mask_rgb):
    h, w = mask_rgb.shape[:2]
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    d2 = (
        (flat[:, None, :] - PALETTE[None, :, :]) ** 2
    ).sum(-1)
    classes = d2.argmin(axis=1).astype(np.uint8)
    return classes.reshape(h, w)

Before training, it’s worth visually checking a few decoded masks against the original images. This catches issues like incorrect palettes, RGB/BGR channel swaps, or resizing artifacts that silently corrupt labels.

These bugs rarely throw errors. Instead, the model simply learns poorly. And “trained on wrong labels” looks exactly like “the model can’t learn the data.”

Verifying masks early removes that uncertainty.

Loss Design and Class Weighting

For training, I usd standard MONAI segmentation losses. The goal wasn’t to aggressively maximize performance, but to establish a stable and trustworthy baseline.

The training curves below show that the model optimized normally: the loss decreased consistently, and the validation dice stabilized rather than diverging. This helped rule out optimization instability as the primary cause of poor final performance.

Three choices were deliberate:

Dice + Cross-Entropy combined: Cross-entropy keeps learning stable early on – Dice directly rewards good region overlap. Together they balance each other.
include_background=False for binary segmentation: In a single-organ task, background can be 85–90% of the pixels. Counting it in the loss drowns out the signal for the organ you actually care about, so it's better left out.
Class weighting for multi-class segmentation: With organs of very different sizes, an unweighted loss lets the model ignore the small, rare ones and still score well. Weighting rare-class mistakes more heavily pushes back against that.

Step 2: The Model Still Struggled

The first experiment focused on liver segmentation — the simplest single-organ task in the dataset.

Test set	Liver Dice
Synthetic test set	~0.68
Real ultrasound test set	~0.48

Dice scores range from 0 (no overlap) to 1 (perfect overlap).

Qualitatively, the predictions often captured rough liver regions but failed at boundaries and consistency across real scans.

Especially important:

the model struggled even on synthetic in-domain data
performance dropped further on real ultrasound images

At this point, two explanations were possible:

the model or pipeline was flawed
the dataset itself was limiting performance

Because the engineering had been carefully validated, the second possibility became worth investigating seriously.

That's where the real lesson began.

Step 3: Interrogating the Dataset

Rather than endlessly tuning the model, the productive move is to turn the diagnostic lens on the dataset.

Three simple checks revealed the real problem. None required retraining or expensive experiments.

Diagnostic 1: What Does the Dataset Actually Contain?

The first step was simply plotting the dataset composition.

926 labeled synthetic images (the bulk of training data)
Only 60 labeled real images — less than 4% of the dataset
557 unlabeled real images — real data exists, but without labels it can't be used for supervised training

This immediately changed the interpretation of the dataset.

Although the dataset contains many real ultrasound scans, almost all labeled training data is synthetic.

The model is effectively trained on synthetic ultrasound and expected to generalize to real ultrasound.

That's a difficult transfer problem from the start.

The limitation is simple: the real images mostly don't have labels, so supervised training has very little real-world data to learn from.

Lesson: Before training anything, chart the dataset composition. A headline image count can be misleading. "1,500 images" sounds large until you discover that only a tiny fraction are labeled examples from the target domain.

Diagnostic 2: Do Synthetic and Real Images Look Similar?

The next question was whether the synthetic and real ultrasound images actually followed similar visual distributions.

Plotting intensity histograms showed a clear mismatch.

synthetic images clustered heavily near darker intensities
real ultrasound images had broader mid-range intensity distributions

The synthetic simulator captured anatomical geometry reasonably well, but it didn't reproduce the texture and noise characteristics of real ultrasound:

speckle patterns
intensity falloff
scanner-specific artifacts

This is the classic synthetic-to-real domain gap.

The model learned features tuned to synthetic images and then encountered a substantially different distribution during evaluation. Poor transfer performance became expected rather than surprising.

Lesson: Whenever training and deployment happen on different domains — synthetic → real, scanner A → scanner B, hospital A → hospital B — measure the distribution shift directly. Simple histogram comparisons can reveal major problems in minutes.

Diagnostic 3: Can the gap be fixed by adding real data?

The obvious next idea was: why not include some real labeled data during training?

But before implementing that approach, it's worth checking how many distinct patients actually had labels.

Labeled real images: 60
Distinct subjects (labeled real): 4

Frames per subject:
  subject h: 26
  subject a: 16
  subject g: 10
  subject b: 8

Only four patients.

That result fundamentally changed the situation.

Proper medical imaging evaluation requires subject-grouped train/test splits. But with only four patients, any evaluation becomes statistically unstable.

Training on two or three patients and testing on one or two patients would produce highly unreliable metrics that depend heavily on which patient happened to be held out.

At that point, the dataset simply couldn't support trustworthy real-world evaluation.

Lesson: In medical imaging, count subjects, not images. The true size of a dataset is bounded by the number of independent patients, not the number of files.

Step 4: Knowing When to Stop

At this point, additional tuning no longer made sense.

The bottleneck was not the architecture, optimizer, or learning rate. The bottleneck was the dataset itself.

The pipeline was still valuable and reusable. But this particular dataset couldn't reliably support the intended segmentation task.

That distinction matters: sometimes a problem is difficult but solvable, and sometimes the data simply can't support the conclusion you want to draw.

Learning to recognize the difference is an important ML skill.

A Practical Dataset Evaluation Checklist

Before committing weeks to model development, these checks are worth running on any dataset:

Chart the dataset composition — labeled vs unlabeled, class distribution, domain distribution
Count subjects, not images — independent patients matter more than frame count
Check class balance — rare classes are often ignored without weighting or sampling strategies
Compare train and deployment distributions — especially for cross-domain problems
Verify labels visually — catch preprocessing or annotation errors early
Look for published baselines — low published performance may indicate dataset limitations

These checks take minutes and can save weeks of unnecessary tuning.

What I Would Try Next

Improving results would likely require better data rather than a larger model. The next steps I'd prioritize:

collecting more labeled real ultrasound scans, from more distinct patients
improving annotation consistency
semi-supervised learning to make use of the unlabeled real images
domain adaptation between synthetic and real ultrasound

All of these target the actual bottleneck: data quality and data diversity.

The Bigger Lesson

In machine learning, it's easy to focus most of our attention on architectures, hyperparameters, optimization tricks, and newer models.

But the dataset quietly defines the ceiling.

A sophisticated model on weak data often disappoints, while a simpler model on strong data performs surprisingly well.

That was the real lesson from this project.

The most valuable skill wasn't building the pipeline. It was diagnosing why the model couldn't succeed and being willing to trust what the data was saying.

The workflow — checking dataset composition, counting subjects, comparing distributions, ruling out engineering bugs, and deciding when to stop — transfers to almost any ML project.

In many projects, better judgment about the data matters more than a better model.

The pipeline code and diagnostic notebooks are available at the MONAI Ultrasound Working Group repository. Questions, corrections, and improvements are always welcome.

Medical Imaging - freeCodeCamp.org

Build Your Own Healthcare AI Assistant with MedGemma, Ollama, and Open WebUI

What We'll Cover:

Who is This Tutorial For?

What is MedGemma?

Why MedGemma?

Why Run Models Locally?

Prerequisites

Architecture Diagram

Step 1: Install Ollama

Step 2: Pull MedGemma

Step 3: Test MedGemma from the Terminal

Step 4: Install Open WebUI

Option A: Docker (recommended)

Option B: pip (no Docker)

Step 5: Connect Open WebUI to Ollama

Step 6: Start Chatting with MedGemma

Step 7: Upload Medical Images

Example Prompts to Try

Running Larger Models

Troubleshooting Guide

Error: registry.ollama.ai/library/medgemma:latest does not support tools

Open WebUI shows no models in the dropdown

ollama pull medgemma says model not found

Responses are extremely slow

Image upload doesn't work or the model ignores the image

Port 3000 is already in use

"Out of memory" errors when loading the 27B model

Conclusion

The Hidden PHI Problem in Medical Images: Building a Synthetic Dataset for AI De-Identification

The Problem

What You'll Learn in This Tutorial

What We'll Cover:

Source Images: OpenPOCUS

The Iceberg Problem: Most PHI Is Hidden

Why Synthetic PHI Matters

Challenge 1: Privacy Regulations

Challenge 2: Annotation at Scale

Challenge 3: Validation

Synthetic PHI Solves All Three Problems

Building a Synthetic PHI Pipeline

Pipeline Architecture

Safety Checks Before Burning

Step 1: Generate Synthetic Patient Identities

Step 2: Burn PHI into Image Pixels

Step 3: Add PHI to DICOM Headers

Step 4: Identity Mapping: The De-Identified PatientID

Step 5: Ground Truth: Structured CSV Output

Three-Tier DICOM Validation

A Surprising Bug: MONAI vs PIL

Final Thoughts

How to Preprocess Medical Images for Machine Learning – A Guide Using Chest X-Rays

What You'll Learn in This Article

What We'll Cover:

Why Preprocessing Data Matters More in Healthcare

The Dataset

Folder Structure

Before Preprocessing: Validate the Dataset

The Six Pillars of Healthcare Imaging Preprocessing

Pillar 1: Scaling — Making the Numbers Play Fair

Pillar 2: Normalization — Centering the Data

Pillar 3: Guiding the Model's Attention

Pillar 4: Handling Missing Data

Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame

Pillar 6: Denoising & Artifact Handling — Cleaning the Window

Putting it All Together: A Complete Pipeline

Try it Yourself

Conclusion

Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging

What We'll Cover:

The Dataset

Step 1: Rule Out the Pipeline Before Blaming the Data

Subject-Grouped Splits

Decoding Masks Correctly

Loss Design and Class Weighting

Step 2: The Model Still Struggled

Step 3: Interrogating the Dataset

Diagnostic 1: What Does the Dataset Actually Contain?

Diagnostic 2: Do Synthetic and Real Images Look Similar?

Diagnostic 3: Can the gap be fixed by adding real data?

Error: `registry.ollama.ai/library/medgemma:latest does not support tools`

`ollama pull medgemma` says model not found