Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different once your data becomes medical images.
In this article, you’ll learn how to prepare a real-world medical imaging dataset for machine learning, from initial data validation to a complete preprocessing pipeline.
We’ll use the Chest X-Ray Pneumonia dataset as our running example, but the lessons apply broadly to healthcare imaging data, including ultrasound, MRI, CT, and dermatology images.
What You'll Learn in This Article
By the end of this article, you'll know how to:
Approach healthcare data preprocessing differently from preprocessing structured data, and recognize where standard techniques fall short
Validate a medical imaging dataset before training to catch corrupted files, mislabels, and data leakage between train and test
Apply six core preprocessing techniques for medical images
Build a complete preprocessing pipeline for chest X-rays using Python with OpenCV
Why Preprocessing Data Matters More in Healthcare
Imagine handing a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed together. The toddler can't solve it, but that isn't really the toddler's fault.
The same thing happens when raw, messy data gets fed into a machine learning model, and in healthcare the stakes are higher: a bad prediction on a clinical image can mean a missed diagnosis.
Healthcare data tends to be messier than what most ML practitioners are used to:
Images come from different machines, hospitals, and acquisition protocols
Labels are inconsistent, sometimes missing, sometimes wrong
Patient data is incomplete
Image sizes, contrast levels, and orientations vary across sources
Poor preprocessing often leads to models that perform well on benchmark datasets but struggle to generalize to data collected from different hospitals or imaging devices.
The Dataset
This guide uses the Chest X-Ray Pneumonia dataset by Paul Mooney on Kaggle. It's a strong choice for learning preprocessing because:
It contains around 5,800 pediatric chest X-rays
It has two clear classes — Normal and Pneumonia
It's already organized into train, validation, and test folders
The images are recognizable without specialized medical training
It exhibits almost every preprocessing challenge worth learning
The dataset is available at Kaggle: Chest X-Ray Pneumonia.
Folder Structure
After downloading, the dataset is organized like this:
chest_xray/
├── train/
│ ├── NORMAL/
│ └── PNEUMONIA/
├── val/
│ ├── NORMAL/
│ └── PNEUMONIA/
└── test/
├── NORMAL/
└── PNEUMONIA/
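Before opening any image, it can help to count how many files sit in each folder, since class imbalance and a very small split are easier to spot as numbers. The sketch below assumes the folder layout shown above; the function name `count_images` and the list of file extensions are illustrative choices, not part of the dataset.

```python
import os

def count_images(data_dir, extensions=(".jpeg", ".jpg", ".png")):
    """Return {split: {class_name: number_of_image_files}} for a dataset folder."""
    counts = {}
    for split in sorted(os.listdir(data_dir)):
        split_path = os.path.join(data_dir, split)
        if not os.path.isdir(split_path):
            continue
        counts[split] = {}
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if not os.path.isdir(class_path):
                continue
            # Count only files whose extension looks like an image (assumed list)
            n = sum(
                1 for fname in os.listdir(class_path)
                if fname.lower().endswith(extensions)
            )
            counts[split][class_name] = n
    return counts

# Only runs if the dataset has been downloaded next to this script
if os.path.isdir("chest_xray"):
    for split, per_class in count_images("chest_xray").items():
        print(split, per_class)
```

If one class is much larger than the other, that is worth knowing before choosing evaluation metrics or sampling strategies.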
[Figure: side-by-side comparison of a Normal and a Pneumonia chest X-ray]
A quick first look at one of the images:
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import cv2
DATA_DIR = "chest_xray"
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Peek at a sample image
sample_path = os.path.join(TRAIN_DIR, "NORMAL", os.listdir(os.path.join(TRAIN_DIR, "NORMAL"))[0])
sample_image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)
print(f"Image shape: {sample_image.shape}")
print(f"Pixel range: {sample_image.min()} to {sample_image.max()}")
print(f"Data type: {sample_image.dtype}")
For this dataset, the output typically shows a large image (often around 1500×2000 pixels) with 8-bit pixel values in the 0–255 range. Repeating the check on a few more files shows that image sizes vary across the dataset. Each of these observations will inform a preprocessing step.
Before Preprocessing: Validate the Dataset
Before applying any transformations, it's worth checking that the data itself is intact. This step alone catches issues that would otherwise cause training to fail silently or produce misleading results.
A simple validation function:
def validate_dataset(data_dir):
    """Scan a dataset folder and flag common data quality issues."""
    corrupted = []
    too_small = []
    nearly_black = []
    total = 0

    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for fname in os.listdir(class_path):
            fpath = os.path.join(class_path, fname)
            total += 1
            try:
                img = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    corrupted.append(fpath)
                    continue
                if img.shape[0] < 100 or img.shape[1] < 100:
                    too_small.append(fpath)
                if img.mean() < 5:
                    nearly_black.append(fpath)
            except Exception:
                corrupted.append(fpath)

    print(f"Total files scanned: {total}")
    print(f"Corrupted: {len(corrupted)}")
    print(f"Too small: {len(too_small)}")
    print(f"Nearly black: {len(nearly_black)}")
    return corrupted, too_small, nearly_black
validate_dataset(TRAIN_DIR)
Common issues to look for (the function above checks the first three; duplicates and mislabels need separate checks):
Corrupted files — files that won't open at all
Empty or nearly-black images — failed acquisitions or saved-as-blank files
Wrong dimensions — thumbnails or partial downloads mixed in
Duplicate images — the same scan appearing in both train and test (this causes data leakage)
Mislabeled images — a normal X-ray placed in the pneumonia folder
⚠️ This step is critical. One corrupted file can crash a training loop hours into a run. One duplicate between train and test can inflate accuracy scores by several percentage points without anyone noticing.
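Duplicates across splits are not something a per-file check can see, so they need a comparison between folders. One minimal approach, sketched below, hashes the raw bytes of every file and reports any hash that appears in both folders. The function names are illustrative, and this only finds byte-identical copies; a re-encoded or resized copy of the same scan would need a perceptual hash or a patient-level check instead.

```python
import hashlib
import os

def file_hashes(root_dir):
    """Map the SHA-256 digest of each file's bytes to the paths with that content."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            hashes.setdefault(digest, []).append(path)
    return hashes

def find_cross_split_duplicates(train_dir, test_dir):
    """Return (train_paths, test_paths) pairs whose file contents are byte-identical."""
    train_hashes = file_hashes(train_dir)
    test_hashes = file_hashes(test_dir)
    shared = sorted(set(train_hashes) & set(test_hashes))
    return [(train_hashes[d], test_hashes[d]) for d in shared]

# Only runs if the dataset has been downloaded next to this script
if os.path.isdir("chest_xray/train") and os.path.isdir("chest_xray/test"):
    duplicates = find_cross_split_duplicates("chest_xray/train", "chest_xray/test")
    print(f"Byte-identical files shared by train and test: {len(duplicates)}")
```

Any pair this reports should be removed from one of the two splits before training.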
The Six Pillars of Healthcare Imaging Preprocessing
Preprocessing for medical images can be organized around six core concerns. Two of them carry over directly from preprocessing structured data. Two need to be adapted because the mechanics change when the input is an image. And two are entirely new: they only exist once the data becomes pictures of human bodies.
Pillar 1: Scaling — Making the Numbers Play Fair
Imagine two children comparing their collections. One has 3 seashells. The other has 3,000 stickers. Asking who has more makes the answer seem obvious, but the scales are completely different. Comparing them meaningfully means putting both collections on the same measuring system.
In medical images, pixels usually range from 0 to 255 in 8-bit images, or 0 to 65,535 in some 16-bit medical DICOM images. Neural networks tend to train faster and more reliably when input values are small numbers close to zero.
The fix: Divide every pixel by its maximum possible value, bringing everything into the 0-to-1 range.
image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)
# Scale to [0, 1]
image_scaled = image.astype(np.float32) / 255.0
print(f"Before scaling: {image.min()} to {image.max()}")
print(f"After scaling: {image_scaled.min():.3f} to {image_scaled.max():.3f}")
Takeaway: Pixel scaling follows the same principle as scaling any numerical feature. The values simply happen to be arranged as an image rather than a column.
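The snippet above hard-codes 255, which is only right for 8-bit images. A small helper can read the maximum from the array's dtype instead, so the same call works for 8-bit and 16-bit data. This is a sketch: `scale_to_unit_range` is an illustrative name, and it assumes the image actually uses the full range of its dtype. Some 16-bit files store fewer bits (for example 12), in which case the true bit depth has to come from the file's metadata.

```python
import numpy as np

def scale_to_unit_range(image):
    """Scale an unsigned-integer image to float32 in [0, 1] using its dtype's maximum."""
    if not np.issubdtype(image.dtype, np.unsignedinteger):
        raise TypeError(f"expected an unsigned integer image, got {image.dtype}")
    max_value = np.iinfo(image.dtype).max  # 255 for uint8, 65535 for uint16
    return image.astype(np.float32) / max_value

# Synthetic stand-ins for an 8-bit and a 16-bit image
image_8bit = np.array([[0, 128, 255]], dtype=np.uint8)
image_16bit = np.array([[0, 32768, 65535]], dtype=np.uint16)

print(scale_to_unit_range(image_8bit))
print(scale_to_unit_range(image_16bit))
```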
Pillar 2: Normalization — Centering the Data
Imagine a teacher asks a class to rate a movie from 1 to 10. One child always gives 9s and 10s. Another spreads ratings evenly from 1 to 10. Comparing their opinions fairly requires adjusting each child's score relative to their own average.
In medical imaging, even after scaling to 0–1, the overall brightness of images can vary. Some X-rays are taken with stronger exposure than others. Normalization shifts and rescales the pixel values, using either dataset-wide or per-image statistics, so they are centered around zero with a standard deviation of about one.
The fix: Subtract the mean, divide by the standard deviation.
# Compute mean and std from the TRAINING set only — never from validation or test
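def compute_train_stats(train_dir, sample_limit=1000):
    """Compute pixel mean and std over a random sample of the training set."""
    # Gather paths from every class first, so the sample is not drawn from one class only
    paths = []
    for class_name in sorted(os.listdir(train_dir)):
        class_path = os.path.join(train_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        paths += [os.path.join(class_path, f) for f in sorted(os.listdir(class_path))]

    # Fixed seed keeps the sample reproducible
    order = np.random.default_rng(0).permutation(len(paths))[:sample_limit]

    # Running sums avoid holding every pixel of every image in memory at once
    total, total_sq, n_pixels = 0.0, 0.0, 0
    for i in order:
        img = cv2.imread(paths[i], cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        pixels = img.astype(np.float64) / 255.0
        total += pixels.sum()
        total_sq += (pixels ** 2).sum()
        n_pixels += pixels.size

    mean = total / n_pixels
    std = np.sqrt(total_sq / n_pixels - mean ** 2)
    return mean, std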
train_mean, train_std = compute_train_stats(TRAIN_DIR)
image_normalized = (image_scaled - train_mean) / train_std
⚠️ Avoid this common mistake: Statistics for normalization should be computed from the training set only, never from validation or test. Including those in the calculation leaks information from the evaluation data into the model. The same statistics should then be applied to validation, test, and any new data at inference time.
Takeaway: Standardizing pixels with the training set's mean and standard deviation is the imaging equivalent of standardizing a feature column. Every image is now expressed on the same zero-centered scale. Exposure differences between individual scans remain, however, unless each image is also normalized by its own statistics or contrast-equalized.
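To make the difference between dataset-level and per-image normalization concrete, the sketch below applies both to two synthetic images that have identical content but different brightness. The function names and the example statistics (`train_mean=0.5`, `train_std=0.25`) are made up for illustration and are not values from the chest X-ray dataset.

```python
import numpy as np

def standardize_with_train_stats(image_scaled, train_mean, train_std):
    """Dataset-level z-score: the same mean and std are applied to every image."""
    return (image_scaled - train_mean) / train_std

def standardize_per_image(image_scaled, eps=1e-8):
    """Per-image z-score: each image is centered and scaled by its own statistics."""
    return (image_scaled - image_scaled.mean()) / (image_scaled.std() + eps)

# Two synthetic "scans": same content, but the second one is uniformly brighter
rng = np.random.default_rng(0)
base = rng.random((64, 64)).astype(np.float32) * 0.5
dim, bright = base, base + 0.4

for name, img in [("dim", dim), ("bright", bright)]:
    ds = standardize_with_train_stats(img, train_mean=0.5, train_std=0.25)
    pi = standardize_per_image(img)
    print(f"{name}: dataset-level mean={ds.mean():+.2f}, per-image mean={pi.mean():+.2f}")
```

With dataset-level statistics the brighter image keeps a higher mean after normalization, while per-image standardization maps both to a mean near zero. Which behavior is preferable depends on whether absolute brightness carries diagnostic information for the task.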
Pillar 3: Guiding the Model's Attention
Imagine a child walking into a crowded pet store. Instead of describing every animal in sight, a parent points to the features that matter: “Look at the soft fur, the fluffy tail, and the nice small size.” The child learns where to focus their attention.
Medical image preprocessing does something similar. It highlights the regions and features most relevant to the diagnostic task.
Region-of-interest (ROI) cropping — focus on the lung field and discard the patient's arms, machine borders, and any imprinted text
Contrast enhancement — use techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) to make subtle lung textures more visible
Channel selection — for images stored as RGB but containing grayscale information, convert to single-channel input to remove the redundant channels
CLAHE applied to an X-ray:
# CLAHE enhances local contrast — useful for X-rays
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
image_enhanced = clahe.apply(image)
# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(image_enhanced, cmap='gray')
axes[1].set_title('After CLAHE')
plt.show()
Takeaway: The goal of teaching the model what to look at hasn't changed. With structured data, the answer is in new columns. With images, the answer is in cropping, enhancement, and emphasizing the regions that carry diagnostic signal.
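ROI cropping is mentioned above without code, so here is a deliberately crude sketch of the idea: trim away border rows and columns that contain only dark pixels. The function name `crop_to_content` and the threshold of 10 are illustrative assumptions. This removes black margins only; isolating the lung field itself would typically need a segmentation model or anatomical landmarks, which this sketch does not attempt.

```python
import numpy as np

def crop_to_content(image, threshold=10):
    """Crop away border rows and columns whose pixels are all at or below `threshold`."""
    mask = image > threshold
    if not mask.any():
        return image  # nothing above the threshold: leave the image untouched
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic example: a bright rectangle surrounded by a black border
demo = np.zeros((10, 10), dtype=np.uint8)
demo[2:6, 3:9] = 200
print(demo.shape, "->", crop_to_content(demo).shape)
```

Burned-in text or markers near the edge are usually bright, so they would survive this crop and need masking instead.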
Pillar 4: Handling Missing Data
Imagine reading a storybook with a few damaged pages. You don’t throw away the entire book. You decide whether to skip the page, infer what might be missing, or mark it for review.
In medical imaging, missing data can mean corrupted files, missing labels, or incomplete studies rather than empty spreadsheet cells.
The same three strategies — drop, impute, flag — still apply, just with different mechanics:
# Strategy 1: Drop. Remove unreadable or empty images.
def is_valid_image(path):
    try:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            return False
        if img.mean() < 5:  # nearly black
            return False
        if img.shape[0] < 50 or img.shape[1] < 50:  # too small
            return False
        return True
    except Exception:
        return False

# Strategy 2: Impute. Rare for images, but possible (e.g., inpainting to fill in
# missing patches). Generally avoided for diagnostic data.

# Strategy 3: Flag. Track which patients are missing which modalities,
# and let the model condition on availability. Common in multi-modal healthcare ML.
Takeaway: "Missing" in imaging data is rarely just a NaN. It can be a broken file, an unlabeled scan, an absent modality, or a black corner inside an image. The same three strategies still apply.
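The "flag" strategy is described above only in comments. As a rough sketch of what it can look like in practice, the code below turns a record of which modalities exist for each patient into explicit 0/1 availability flags that can be stored next to the data. The function name, the patient IDs, and the modality names are all hypothetical example values.

```python
def build_availability_flags(studies, modalities):
    """Return one row per patient with a 0/1 flag for each expected modality."""
    rows = []
    for patient_id, available in sorted(studies.items()):
        row = {"patient_id": patient_id}
        for modality in modalities:
            row[f"has_{modality}"] = int(modality in available)
        rows.append(row)
    return rows

# Hypothetical example data: which modalities exist for each patient
studies = {
    "patient_001": {"xray", "ct"},
    "patient_002": {"xray"},
}
flags = build_availability_flags(studies, modalities=["xray", "ct"])
for row in flags:
    print(row)
```

Keeping the flags explicit means a missing modality is recorded as information rather than silently dropped.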
Pillar 5: Resizing & Resampling — Fitting Everything in the Same Frame
Imagine displaying children’s drawings on a classroom wall. If every drawing is a different size, they won’t fit neatly into the display. You resize them while preserving their proportions.
Medical images must often be resized to a common input size, but anatomical structures should retain their original shape.
The fix: Resize all images to a common shape. For medical data, how the resizing is done matters.
TARGET_SIZE = (224, 224)

# Simple resize (may distort aspect ratio)
image_resized = cv2.resize(image, TARGET_SIZE)

# Better: preserve aspect ratio with padding
def resize_with_padding(image, target_size):
    """Resize to fit inside target_size (height, width), then pad with black."""
    h, w = image.shape[:2]
    target_h, target_w = target_size
    scale = min(target_h / h, target_w / w)
    # round() avoids losing a pixel to floating-point error; clamp to a valid size
    new_h = min(target_h, max(1, round(h * scale)))
    new_w = min(target_w, max(1, round(w * scale)))
    # cv2.resize expects (width, height); INTER_AREA is suited to shrinking images
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)
    pad_h = target_h - new_h
    pad_w = target_w - new_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                cv2.BORDER_CONSTANT, value=0)
    return padded

image_clean_resize = resize_with_padding(image, TARGET_SIZE)
⚠️ Why aspect ratio matters in healthcare: Squishing a chest X-ray horizontally makes the lungs look unnatural. Models trained on distorted anatomy often perform worse on real scans. Preserving aspect ratio is generally the safer choice.
Takeaway: Models need a consistent input size, but the geometry of the anatomy needs to be preserved. Resize, but resize carefully.
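The arithmetic behind aspect-preserving resizing is easy to check by hand. The sketch below computes only the geometry (the resized shape and the padding on each side) for an image of the size mentioned earlier, 1500×2000 pixels, fitted into a 224×224 frame. `letterbox_geometry` is an illustrative helper name, and it uses `round()` rather than truncation so that floating-point error cannot shave a pixel off the result.

```python
def letterbox_geometry(h, w, target_h, target_w):
    """Return the resized (new_h, new_w) and the (top, bottom, left, right) padding."""
    scale = min(target_h / h, target_w / w)
    new_h = min(target_h, max(1, round(h * scale)))
    new_w = min(target_w, max(1, round(w * scale)))
    pad_h, pad_w = target_h - new_h, target_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)

# A landscape image: 1500 rows by 2000 columns, fitted into 224x224
size, padding = letterbox_geometry(1500, 2000, 224, 224)
print(size, padding)  # (168, 224) (28, 28, 0, 0)
```

The wider side fills the frame exactly, and the remaining 56 rows are split evenly into black bars above and below the image.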
Pillar 6: Denoising & Artifact Handling — Cleaning the Window
Imagine looking through a window with dust and smudges on the glass. Cleaning the window makes the view clearer, but scrubbing too aggressively could scratch the glass.
Similarly, medical images often contain noise and acquisition artifacts that should be reduced carefully without removing clinically important details.
For chest X-rays, the most common issues are mild noise and burned-in text or markers. A gentle median or bilateral filter helps with the first, while cropping or masking helps with the second.
# Gentle denoising — careful not to blur away clinical detail
image_denoised = cv2.medianBlur(image, ksize=3)
# Bilateral filter: smooths within regions while keeping edges sharp
image_bilateral = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)
⚠️ A note of caution: Aggressive denoising can erase the features a model needs to detect a disease. For diagnostic ML, gentle filtering is generally preferred. A useful rule of thumb: if a radiologist can no longer see a finding in the cleaned image that was visible in the original, the filtering has gone too far.
Takeaway: Imaging data carries noise that structured data doesn't have. The window can be cleaned, but never so aggressively that the view is wiped away with the smudges.
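One way to put a number on "how much did the filter change the image" is peak signal-to-noise ratio (PSNR) between the original and the filtered version. The sketch below is a minimal implementation for 8-bit images; the function name and the `max_value` default are assumptions for illustration. A higher PSNR means the filtered image is closer to the original, but it is only a numeric proxy and says nothing about whether clinically relevant detail survived.

```python
import numpy as np

def psnr(original, processed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    diff = original.astype(np.float64) - processed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Synthetic example: an image and a copy with small random perturbations
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
noise = rng.integers(-3, 4, size=original.shape)
perturbed = np.clip(original.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR of identical images: {psnr(original, original)}")
print(f"PSNR after small perturbation: {psnr(original, perturbed):.1f} dB")
```

Comparing PSNR across filter settings gives a quick way to rank them by how aggressive they are before anyone inspects the images visually.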
Putting It All Together: A Complete Pipeline
Here's how the six pillars combine into a single preprocessing function for chest X-ray images:
def preprocess_xray(image_path, target_size=(224, 224),
                    train_mean=0.482, train_std=0.236):
    """
    Full preprocessing pipeline for chest X-ray images.
    Applies all six pillars. Returns None for unreadable or unusable files.

    The train_mean and train_std defaults are the example values used in this
    article; pass the output of compute_train_stats for your own data.
    """
    # Pillar 4: Validate first and skip corrupted files
    if not is_valid_image(image_path):
        return None
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Pillar 5: Resize with aspect ratio preserved
    image = resize_with_padding(image, target_size)

    # Pillar 6: Gentle denoising
    image = cv2.medianBlur(image, 3)

    # Pillar 3: Enhance contrast to highlight lung texture
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image = clahe.apply(image)

    # Pillar 1: Scale to [0, 1]
    image = image.astype(np.float32) / 255.0

    # Pillar 2: Normalize using training set statistics
    image = (image - train_mean) / train_std
    return image
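A single-image function still has to be applied to a whole split. The sketch below shows one possible way to do that: it takes any preprocessing callable that returns a 2-D array or `None`, skips the `None` results, and stacks the rest into arrays. The name `build_arrays`, the `(path, label)` input format, and the channels-last output shape are assumptions made for illustration, and the stand-in function `fake_preprocess` exists only so the example runs without the dataset.

```python
import numpy as np

def build_arrays(samples, preprocess_fn):
    """Apply `preprocess_fn` to (path, label) pairs and stack the usable results.

    `preprocess_fn` is expected to return a 2-D float array, or None for a
    file that should be skipped.
    """
    images, labels, skipped = [], [], []
    for path, label in samples:
        image = preprocess_fn(path)
        if image is None:
            skipped.append(path)  # keep a record instead of failing silently
            continue
        images.append(image)
        labels.append(label)
    if not images:
        return np.empty((0,), dtype=np.float32), np.empty((0,), dtype=np.int64), skipped
    X = np.stack(images).astype(np.float32)[..., np.newaxis]  # shape (N, H, W, 1)
    y = np.asarray(labels, dtype=np.int64)
    return X, y, skipped

# Stand-in for a real preprocessing function, so the example runs without data
def fake_preprocess(path):
    return None if "corrupt" in path else np.zeros((224, 224), dtype=np.float32)

samples = [("a.jpeg", 0), ("corrupt.jpeg", 1), ("b.jpeg", 1)]
X, y, skipped = build_arrays(samples, fake_preprocess)
print(X.shape, y, skipped)
```

In real use, `preprocess_fn` would be the pipeline function above, and the list of skipped paths is worth logging so dropped files can be reviewed later.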
Try It Yourself
Every code snippet in this article is bundled into a runnable Kaggle notebook: Chest X-Ray Preprocessing — Kaggle Notebook. Fork it, attach the dataset, and run all the cells to see each preprocessing pillar in action on real chest X-rays.
Conclusion
Here's a summary of what we've discussed in this article:
| Pillar | Purpose | Example |
|---|---|---|
| Scaling | Standardize pixel ranges | 0-255 → 0-1 |
| Normalization | Center brightness distributions | z-score normalization |
| Attention Guidance | Highlight diagnostic regions | CLAHE |
| Missing Data Handling | Remove unusable scans | Corrupted files |
| Resizing | Consistent input size | 224×224 |
| Denoising | Reduce acquisition noise | Median filter |
Preprocessing for structured data is about making numbers play fair so a model can see them clearly.
Preprocessing for healthcare imaging is about respecting the messy reality of how medical data is captured, stored, and labeled. Some standard techniques carry over directly. Some need to be adapted. And a few preprocessing concerns only emerge once the data becomes pictures of human bodies.
Stepping back, whether it's a child learning to organize their toy box, or a model learning to spot pneumonia in a chest X-ray, the quality of learning depends on the quality of data preparation. Get the data right.
If this was useful, you can find a related conceptual primer on preprocessing more broadly here: Data Preprocessing for Machine Learning.