Retinal Disease Classification

Master's Thesis · Monash University

From Prompts to Probes

Zero-shot and few-shot transfer of foundation models for retinal fundus disease classification.

Pugalenthi Magendran · Supervised by Dr Yasmeen George · 2025

4 Models Compared
3 Benchmarks
0.921 Best AUROC
5-Fold Cross-Validation

What Is This?

Imagine you go to the eye doctor. They take a photo of the back of your eye (called a fundus image) and look for signs of diseases like diabetes-related eye damage or glaucoma. This is how millions of people around the world get screened for vision-threatening conditions.

The problem? There aren't enough trained eye doctors, especially in developing countries. Over 43 million people worldwide are blind, and the majority live in regions with limited access to specialists.

What if an AI could look at that same eye photo and flag potential diseases, without ever being explicitly trained on eye images?

That's exactly what this research explores. I took four powerful AI models that were trained on different combinations of medical images and text, and tested how well they can detect retinal diseases in two ways:

  • Zero-shot: The AI has never seen these specific eye images before. I just describe the disease in words, and the AI decides if the image matches the description.
  • Few-shot: I give the AI a tiny fraction of labeled examples (as few as 5% of the dataset) and let it learn a simple classifier on top of its existing knowledge.

The core question:
Can these foundation models diagnose retinal diseases reliably with little or no task-specific training data, and which model works best for which disease?

This thesis presents a systematic benchmark of four vision-language and vision-only foundation models for retinal fundus image classification under zero-shot and few-shot linear probing protocols. The work addresses a critical gap: while foundation models show impressive generalization in natural image domains, their transfer characteristics to ophthalmic imaging remain under-explored, particularly under label-scarce conditions.

Research Contributions
  • First head-to-head comparison of FLAIR, OpenCLIP, BiomedCLIP, and RETFound on three retinal benchmarks under identical evaluation protocols
  • Clinician-validated prompt engineering with three templates of increasing clinical specificity (T1-T3)
  • Label-efficiency analysis at 5%, 10%, and 20% annotation budgets with frozen encoders
  • Calibration assessment via temperature scaling with ECE and AUCE metrics
  • Clinical operating point analysis at Sensitivity @ 90% Specificity for glaucoma screening

Why It Matters

Three eye diseases account for the vast majority of preventable blindness worldwide:

Diabetic Retinopathy

Leading cause of blindness in working-age adults

High blood sugar damages the tiny blood vessels in the retina. If caught early, laser treatment can prevent vision loss. If missed, it leads to irreversible blindness.

Glaucoma

The "silent thief of sight"

Damages the optic nerve gradually with no symptoms until significant vision is already lost. Early detection through screening is the only defense.

Multiple Conditions

Often co-occurring in the same patient

Real-world patients often present with more than one condition. A practical screening tool needs to handle multiple diseases from a single image.

The bottleneck is human experts. Training an ophthalmologist takes 12+ years. AI models that can screen eye images with minimal training data could bridge this gap, especially in underserved communities.

Label scarcity is the primary bottleneck in medical image classification. Annotation requires board-certified ophthalmologists, and inter-grader agreement is often moderate (Cohen's kappa 0.5-0.7 for DR grading). Foundation models pre-trained on large-scale image-text pairs offer a potential bypass: leveraging rich visual representations learned from diverse medical data to perform downstream tasks with minimal or zero task-specific labels.

This is particularly critical for:

  • Low-resource clinical settings where expert annotations are prohibitively expensive
  • Rare disease subtypes where even large hospitals have few confirmed cases
  • Rapid deployment scenarios where a model must generalize without fine-tuning

The Approach

I tested two fundamentally different ways of using these AI models:

Zero-Shot

The AI has never seen these eye images during training. I describe each disease using carefully crafted text prompts (validated by a clinician), and the model compares each image to these descriptions to make a diagnosis.

Analogy: Like showing a radiologist from another country a type of scan they've never seen, but giving them a detailed written description of what to look for.

vs

Few-Shot (Linear Probe)

The AI's core knowledge is frozen (locked in place). I add a simple classifier on top and train it with just 5%, 10%, or 20% of the labeled data.

Analogy: Like giving that same radiologist a quick 30-minute tutorial with a handful of example cases, then testing them on new ones.

The key insight is that these models already understand a lot about medical images from their pre-training. We're testing how far that existing knowledge can stretch when we give it very little (or zero) task-specific guidance.

Two Evaluation Protocols

Protocol 1: Zero-Shot Inference

Input Image ──→ Vision Encoder ──→ Image Embedding
                                        ↓
                                 Cosine Similarity ──→ Prediction
                                        ↑
Text Prompts ──→ Text Encoder ──→ Text Embeddings (T1, T2, T3)

Protocol 2: Few-Shot Linear Probing

Input Image ──→ Vision Encoder ──→ Image Embedding ──→ Linear Classifier ──→ Prediction
                   (frozen)                            (trained on 5/10/20%)

Note: RETFound (self-supervised MAE) has no text encoder, so it participates only in Protocol 2.

For zero-shot, we compute cosine similarity between image embeddings and text embeddings for each class label. For few-shot, we freeze the vision encoder and train a single linear layer (logistic regression) on the extracted features using limited labeled data. Both protocols use stratified 5-fold cross-validation.
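The zero-shot decision rule can be sketched in a few lines. The embeddings below are tiny hypothetical vectors standing in for real encoder outputs (which are typically 512- to 1280-dimensional):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_predict(image_emb, class_text_embs):
    # Assign the class whose text embedding is most similar to the image embedding.
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical 3-d embeddings; in the real protocol each prompt first passes
# through the model's text encoder.
text_embs = {
    "no diabetic retinopathy": [0.9, 0.1, 0.0],
    "referable diabetic retinopathy": [0.1, 0.9, 0.2],
}
label, scores = zero_shot_predict([0.2, 0.8, 0.1], text_embs)
print(label)  # "referable diabetic retinopathy": the closest text embedding wins
```

For few-shot, the cosine step is replaced by a logistic-regression head trained on the same frozen image embeddings.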

The Four Models

Think of each model as having a different "education background":

Four foundation models spanning three pre-training paradigms: retinal-specific contrastive (FLAIR), general biomedical contrastive (BiomedCLIP), web-scale contrastive (OpenCLIP), and self-supervised masked autoencoding (RETFound).

👁 FLAIR

Retinal Specialist

The eye specialist. Trained specifically on 288,000 retinal images paired with clinical descriptions from 38 different datasets. Knows eye diseases inside and out.

ResNet-50 vision encoder + BioClinicalBERT text encoder. Contrastive pre-training on 288K fundus images from 38 curated ophthalmic datasets with clinical text pairs.

288K images 38 datasets ResNet-50

🌐 OpenCLIP

World Traveler

The generalist who's seen everything. Trained on 2 billion images from the entire internet. Knows a little about a lot, including some medical images.

ViT-H/14 vision encoder. CLIP-style contrastive pre-training on LAION-2B (2 billion image-text pairs from the open web). Largest encoder architecture in the comparison.

2B images LAION-2B ViT-H/14

🩺 BiomedCLIP

Medical Generalist

The medical school graduate. Trained on 15 million medical image-text pairs from research papers. Knows medicine broadly, but isn't an eye specialist.

ViT-Base/16 vision encoder + PubMedBERT text encoder. Contrastive pre-training on PMC-15M (15M figure-caption pairs from PubMed Central biomedical literature).

15M pairs PMC-15M ViT-B/16

🧠 RETFound

Self-Taught Eye Expert

The self-taught expert. Learned from 1.6 million retinal images by predicting masked parts of images (no text descriptions needed). Deep visual understanding, but can't match images to text.

ViT-Large encoder. Self-supervised pre-training via masked autoencoding (MAE) on 1.6M retinal images. No text encoder, so excluded from zero-shot; participates in linear probing only.

1.6M images MAE ViT-Large

Architecture Note
All experiments ran on an Apple M-series laptop using PyTorch 2.3.1 with MPS acceleration. No GPU cluster required, demonstrating that meaningful foundation model evaluation is accessible to individual researchers.

The Datasets

I tested the models on three established medical benchmarks, each targeting a different eye disease:

Three public ophthalmic benchmarks selected to span binary, multi-class, and multi-disease classification with varying class distributions and imaging conditions:

MESSIDOR

Diabetic Retinopathy Grading

1,200 eye images graded by severity of diabetic retinopathy (grades 0 to 3). The challenge: distinguishing between "no disease" and early, moderate, or severe stages.

1,200 fundus images with 4-class DR severity grading (R0-R3). Evaluated with Macro-F1 to account for class imbalance. Moderate inter-grader agreement makes this a challenging benchmark.

1,200 images 4 classes Macro-F1

REFUGE

Glaucoma Detection

1,200 eye images classified as glaucoma or healthy. The challenge: glaucoma cases are rare (only ~10%), so the model must be very precise to avoid overwhelming clinics with false alarms.

1,200 fundus images for binary glaucoma classification (~10% positive rate). Evaluated with AUROC for threshold-independent discrimination, plus Sensitivity @ 90% Specificity for clinical operating point analysis.

1,200 images Binary AUROC ~10% prevalence

ODIR-200×3

Multi-Disease Classification

A balanced subset of 600 images covering three categories: normal, diabetic retinopathy, and other diseases. Tests whether a model can handle multiple conditions at once.

Balanced 200-per-class subset from ODIR: Normal, DR, and Other pathologies (600 images total). Evaluated with Macro-F1. The heterogeneous "Other" category tests generalization beyond specific disease patterns.

600 images 3 classes Macro-F1 Balanced

Prompt Design

For zero-shot classification, the AI needs a text description of each disease to compare against the eye image. I designed three levels of prompts, each more detailed than the last, and had them validated by a clinical supervisor:

T1
Simple Label

Just the disease name. Example: "A fundus photograph of diabetic retinopathy."

T2
Clinical Description

Adds specific visual signs. Example: "A fundus photograph showing microaneurysms, hemorrhages, and hard exudates consistent with diabetic retinopathy."

T3
Structured Clinical Template

A full structured description following clinical reporting standards, including lesion types, locations, and severity indicators.

Results are averaged across all three prompt styles to give a fair assessment of each model's robustness to different prompt formulations.

Three prompt templates (T1-T3) of increasing clinical specificity, validated by an ophthalmic clinician (thesis supervisor Dr Yasmeen George). This hierarchy tests sensitivity to prompt engineering depth:

  • T1 (Simple): Minimal disease-name template. Baseline for prompt complexity.
  • T2 (Descriptive): Incorporates pathological signs and clinical features visible in fundus photography.
  • T3 (Structured): Full clinical description template following ophthalmic reporting conventions.

Zero-shot results are reported as mean ± SD across T1-T3, capturing both average performance and prompt sensitivity. This averaging strategy follows the approach established in FLAIR's original evaluation.
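As a concrete illustration of this reporting convention, here is a minimal sketch of the mean ± SD aggregation, using made-up per-template scores (the sketch assumes sample standard deviation; the thesis does not specify which variant is used):

```python
from statistics import mean, stdev

# Hypothetical AUROC for one model on one benchmark under each prompt template.
scores_by_template = {"T1": 0.88, "T2": 0.93, "T3": 0.95}

vals = list(scores_by_template.values())
# The mean summarizes performance; the SD is the prompt-sensitivity signal.
print(f"{mean(vals):.3f} ± {stdev(vals):.3f}")  # prints "0.920 ± 0.036"
```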

Why prompt design matters
The standard deviation across prompts reveals how sensitive each model is to prompt engineering. A clinically useful model should perform consistently across reasonable prompt formulations, not just on one carefully tuned template.

Zero-Shot Results

Remember, zero-shot means the models have never been trained on these specific eye images. They're diagnosing purely from text descriptions. Here's how they performed:

Performance averaged across T1-T3 prompts. RETFound excluded (no text encoder). Statistical significance assessed via paired image-bootstrap (α = 0.05, Holm adjustment).

Model MESSIDOR (Macro-F1) REFUGE (AUROC) ODIR-200×3 (Macro-F1)
FLAIR 0.735 ± 0.049 0.921 ± 0.049 0.366 ± 0.016
BiomedCLIP 0.471 ± 0.042 0.649 ± 0.037 0.709 ± 0.032
OpenCLIP 0.353 ± 0.001 0.530 ± 0.046 0.399 ± 0.131

Higher is better. Best in each column: FLAIR (MESSIDOR, REFUGE), BiomedCLIP (ODIR-200×3).

What does this tell us?

  • FLAIR dominates on diabetic retinopathy (MESSIDOR) and glaucoma (REFUGE). Its specialized retinal training pays off massively: 0.921 AUROC on glaucoma means that, shown a randomly chosen glaucoma eye and a randomly chosen healthy eye, the model ranks the glaucoma eye as more suspicious 92.1% of the time, with zero training on this dataset.
  • BiomedCLIP wins on multi-disease classification (ODIR-200×3), scoring 0.709. Its broad medical knowledge handles the diverse "Other" disease category better than the eye specialist.
  • No single model wins everything. The best model depends on the specific disease and task.
The standout result
FLAIR's 0.921 AUROC on glaucoma detection is remarkable for a model that has never been explicitly trained on REFUGE images. This suggests the model has genuinely learned the visual signatures of glaucomatous optic neuropathy during its retinal pre-training.
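The probabilistic reading of AUROC used above (the chance that a random diseased eye is ranked above a random healthy eye) can be computed directly from scores; the values below are invented for illustration:

```python
def auroc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs in which the positive scores higher;
    # ties count as half. This pairwise-ranking view is exactly AUROC.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical suspicion scores: higher means more glaucoma-like.
glaucoma = [0.91, 0.78, 0.66]
healthy = [0.40, 0.70, 0.62, 0.20]
print(auroc(glaucoma, healthy))  # 11/12 ≈ 0.917: one healthy eye outranks one glaucoma eye
```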

Analysis

  • FLAIR's dominance on MESSIDOR and REFUGE reflects the benefit of domain-specific contrastive pre-training on retinal images with clinical text. The retina-tuned vision-language alignment transfers directly to DR grading and glaucoma discrimination.
  • BiomedCLIP's ODIR advantage stems from its broader biomedical training corpus (PMC-15M), which provides better coverage of the heterogeneous "Other" pathology category that includes conditions outside FLAIR's pre-training distribution.
  • OpenCLIP's low prompt sensitivity (SD of only ±0.001 on MESSIDOR) suggests its web-scale representations are relatively invariant to medical prompt formulation, while its absolute performance lags due to the domain gap.

Pairwise differences between FLAIR and BiomedCLIP on MESSIDOR and REFUGE are statistically significant (paired image-bootstrap, α = 0.05, Holm-adjusted).

Few-Shot Results

Now we give each model a small number of labeled examples (5%, 10%, or 20% of the training data) and add a simple classifier on top. The model's core knowledge stays frozen. This is like giving each doctor a quick tutorial before the test.

Linear probing with frozen encoders. All four models participate (including RETFound). Reported as mean [95% BCa bootstrap CI] across 5-fold stratified CV.

MESSIDOR (Diabetic Retinopathy)

Model 5% Labels 10% Labels 20% Labels
FLAIR 0.647 0.675 0.700
OpenCLIP 0.611 0.624 0.648
BiomedCLIP 0.580 0.596 0.613
RETFound 0.543 0.579 0.619

Metric: Macro-F1. FLAIR is best at every budget.

REFUGE (Glaucoma)

Model 5% Labels 10% Labels 20% Labels
FLAIR 0.718 0.843 0.870
OpenCLIP 0.807 0.853 0.891
BiomedCLIP 0.756 0.808 0.874
RETFound 0.664 0.811 0.836

Metric: AUROC. OpenCLIP is best at every budget.

ODIR-200×3 (Multi-Disease)

Model 5% Labels 10% Labels 20% Labels
FLAIR 0.843 0.873 0.900
OpenCLIP 0.793 0.871 0.878
BiomedCLIP 0.857 0.892 0.870
RETFound 0.650 0.744 0.820

Metric: Macro-F1. BiomedCLIP is best at 5% and 10%; FLAIR is best at 20%.

What does this tell us?

  • A little data goes a long way. With just 5% of labels, most models already perform well. Adding more labels helps, but the biggest jump is from zero-shot to 5%.
  • FLAIR stays king on DR (MESSIDOR), winning at every label budget.
  • Plot twist on glaucoma: OpenCLIP (the generalist) beats FLAIR (the eye specialist) when given even a few labeled examples. Its massive ViT-H/14 encoder extracts better features for a simple classifier to use.
  • BiomedCLIP shines on multi-disease at lower budgets (5%, 10%), but FLAIR catches up and overtakes at 20%.
  • RETFound underperforms despite being trained on retinal images. Its self-supervised training (without text) seems less useful than the vision-language approach.

Analysis

  • FLAIR dominates MESSIDOR across all budgets, consistent with its retinal-specific contrastive training providing superior DR-discriminative features.
  • OpenCLIP's REFUGE advantage is noteworthy: despite yielding lower-dimensional features than FLAIR (1280-d from ViT-H/14 vs 2048-d from ResNet-50), its representations apparently capture structural cues relevant to cup-to-disc ratio assessment that benefit linear probing.
  • BiomedCLIP's ODIR efficiency at low budgets reflects its PMC-15M training capturing diverse pathology patterns, giving the linear head more discriminative features for the heterogeneous "Other" class.
  • RETFound's lagging performance despite 1.6M retinal images suggests that MAE pre-training objectives optimize for reconstruction, not discrimination. The learned features, while rich in spatial detail, may lack the semantic structure that contrastive training provides for classification tasks.

FLAIR achieves the highest overall average rank across all nine dataset-budget cells, followed by OpenCLIP, then BiomedCLIP, with RETFound trailing.

Clinical Operating Point

For a real glaucoma screening program, just having a high AUROC isn't enough. Clinicians need to set a specific threshold: how many healthy people are we willing to incorrectly flag as potential glaucoma cases?

I fixed the specificity at 90% (meaning only 10% of healthy people would be incorrectly flagged) and measured how many actual glaucoma cases each model catches at that threshold:

AUROC is threshold-independent. For deployment, we analyze Sensitivity @ 90% Specificity on REFUGE, a clinically meaningful operating point for population-level screening where false positive burden must be controlled:

Model Sens@90% Spec (5% labels) Sens@90% Spec (20% labels)
OpenCLIP 0.436 0.611
FLAIR 0.289 0.526
BiomedCLIP 0.231 0.471
RETFound 0.200 0.414

OpenCLIP catches 61.1% of glaucoma cases with 20% labels at this strict threshold. While not yet sufficient for standalone screening, it's the best option among the models tested and demonstrates that even with minimal training data, meaningful clinical detection is possible.
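The operating-point metric can be reproduced with a short helper: sweep candidate thresholds, keep those whose specificity meets the target, and report the best achievable sensitivity. The scores here are hypothetical:

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_spec=0.90):
    # A score >= threshold is flagged as disease. Among thresholds that keep
    # specificity >= target, return the highest sensitivity achieved.
    best_sens = 0.0
    for t in sorted(set(pos_scores + neg_scores)):
        specificity = sum(n < t for n in neg_scores) / len(neg_scores)
        sensitivity = sum(p >= t for p in pos_scores) / len(pos_scores)
        if specificity >= target_spec:
            best_sens = max(best_sens, sensitivity)
    return best_sens

# Hypothetical scores: 4 glaucoma eyes, 10 healthy eyes.
glaucoma = [0.9, 0.8, 0.5, 0.3]
healthy = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.85]
print(sensitivity_at_specificity(glaucoma, healthy))  # 0.5: half the cases caught while flagging at most 10% of healthy eyes
```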

Clinical Interpretation
At the 90% specificity operating point, OpenCLIP achieves 61.1% sensitivity with 20% labels. While below the WHO-recommended sensitivity threshold for population screening (≥80%), this represents a meaningful triage capability that could reduce specialist workload by pre-filtering the most likely negative cases. Temperature scaling improves calibration (reduces ECE/AUCE) without affecting AUROC, enabling more reliable probability outputs for clinical decision support.

Key Findings

01

No Single Model Wins Everything

The best model depends on the disease and how much labeled data you have. FLAIR for DR, OpenCLIP for glaucoma (with probe), BiomedCLIP for multi-disease at low budgets.

02

Domain-Specific Training Matters Most for Zero-Shot

FLAIR's retinal pre-training gives it a massive zero-shot advantage on DR and glaucoma. But when you add a simple classifier (linear probe), the playing field levels out.

03

Bigger Isn't Always Better

RETFound (ViT-Large, 1.6M retinal images) underperforms despite having a much larger architecture and domain-specific training. The pre-training objective matters more than scale.

04

5% Labels Already Help Enormously

The jump from zero-shot to 5% labeled data is dramatic for most tasks. Diminishing returns set in after 10%, suggesting that even tiny annotation budgets are worthwhile.

05

Calibration Can Be Fixed Separately

Temperature scaling improves model confidence calibration without changing the ranking or discrimination ability. This means calibrated probability outputs are achievable post-hoc.

06

Runs on a Laptop

All experiments ran on an Apple M-series laptop with MPS acceleration. No GPU cluster needed. Reproducible, accessible research that any researcher can replicate.

Methodology

To make sure the results are trustworthy and not just lucky, I used rigorous statistical methods:

  • 5-fold cross-validation: Split the data into 5 parts, train on 4, test on 1, rotate 5 times. Every image gets tested exactly once.
  • Stratified splits: Each fold maintains the same disease-to-healthy ratio as the full dataset, preventing "easy" or "hard" folds.
  • Bootstrap confidence intervals: Resample the test predictions 2,000 times to estimate how much the results might vary. Reports 95% confidence intervals.
  • Statistical tests: Used the Wilcoxon signed-rank test (a non-parametric test that doesn't assume normal distributions) to confirm that differences between models are real and not due to chance.
  • Calibration analysis: Measured how well the model's confidence scores match actual probabilities using Expected Calibration Error (ECE) and Area Under Calibration Error (AUCE).
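The bootstrap step can be illustrated with a percentile-interval sketch (the thesis reports BCa intervals, a bias-corrected refinement of the same idea); the per-image scores below are hypothetical:

```python
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v), n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample per-image scores with replacement,
    # recompute the statistic each time, and take the central (1 - alpha) span.
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(values) for _ in values]) for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-image correctness indicators (1 = correct prediction).
correct = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
lo, hi = bootstrap_ci(correct)
print(f"accuracy 95% CI: [{lo:.2f}, {hi:.2f}]")
```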

Experimental Design

  • Evaluation: Stratified 5-fold CV with identical fold assignments across all models and budgets.
  • Few-shot protocol: Within each training fold, stratified subsampling at 5%, 10%, 20% label budgets. Linear head (logistic regression) trained on frozen encoder features.
  • Statistical testing: Paired Wilcoxon signed-rank tests across folds for model comparisons. BCa bootstrap CIs (B=2000) for point estimate uncertainty.
  • Calibration: Temperature scaling optimized on validation split (NLL objective). Reported metrics: ECE (15-bin), AUCE (area under calibration error curve). AUROC invariant to monotone transforms.
  • Metrics: Macro-F1 for multi-class tasks (MESSIDOR 4-class, ODIR 3-class); AUROC for binary task (REFUGE). Sensitivity @ 90% Specificity for clinical operating point.
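Temperature scaling itself is a one-parameter fit: divide the validation logits by T and choose the T that minimizes NLL. The thesis optimizes NLL directly; the grid search and toy logits below are simplifications for illustration:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, T):
    # Mean negative log-likelihood of the true class at temperature T.
    return -sum(math.log(softmax(row, T)[y]) for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    # Grid-search T on a validation split. Dividing logits by T > 1 softens
    # overconfident probabilities; the argmax (and hence AUROC) is unchanged.
    grid = [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda T: nll(logit_rows, labels, T))

# Hypothetical validation logits: confident everywhere, but one case is
# confidently wrong, so the raw probabilities are overconfident on average.
logits = [[3.0, 0.0], [3.0, 0.0], [3.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 0, 1, 1]
print(fit_temperature(logits, labels))  # T > 1: probabilities are softened toward 0.5
```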

Implementation Stack

Language: Python 3.9.22
Framework: PyTorch 2.3.1 (MPS acceleration)
Models: Hugging Face Transformers, open_clip, RETFound weights
Statistics: scipy.stats (Wilcoxon), scikit-learn (bootstrap, calibration)
Hardware: Apple M-series laptop (no external GPU)
Reproducibility: Fixed random seeds, deterministic data splits, public benchmark datasets

Ethical Considerations
All datasets are publicly available de-identified benchmarks. No patient-identifiable data was used. The research was approved under Monash University ethics guidelines. Results are not intended for direct clinical use without further validation on local populations.

Practical Guidance

If you're a researcher or engineer looking to build a retinal disease screening tool, here's what this research suggests:

Decision Framework

For Diabetic Retinopathy Screening

Use FLAIR. It wins in both zero-shot and few-shot settings. If you have no labeled data at all, FLAIR's zero-shot performance (0.735 Macro-F1) is already useful for triage.

For Glaucoma Screening

Use OpenCLIP + linear probe. Despite being a generalist model, it produces the best glaucoma features when paired with a simple classifier. Even 5% labeled data (about 60 images) gives meaningful results.

For Multi-Disease Screening

Start with BiomedCLIP if you have very few labels (<10%). Switch to FLAIR once you have 20% or more labeled data.

General Recommendations

  • Always prefer few-shot probing over zero-shot if any labeled data is available
  • Apply temperature scaling for calibrated probability outputs
  • Do not assume RETFound will outperform just because it was trained on retinal images
  • Validate on your local population before deployment

Conclusion

This research shows that modern AI foundation models can detect retinal diseases with surprisingly high accuracy, even with little or no disease-specific training data. The key takeaway is that there is no universal best model. The right choice depends on what disease you're screening for and how much labeled data you can afford to collect.

FLAIR is the clear winner for diabetic retinopathy. OpenCLIP (with a simple classifier) is best for glaucoma. And the amount of data you need is less than you might think: even 5% of labeled examples makes a dramatic difference.

All of this runs on a laptop. No expensive GPU cluster, no massive compute budget. Just careful experimental design, rigorous statistics, and the right combination of model and task.

This thesis provides the first controlled head-to-head evaluation of four foundation model paradigms for retinal fundus classification under zero-shot and label-scarce conditions. The results demonstrate that:

  1. Pre-training domain alignment is the primary determinant of zero-shot transfer quality, with FLAIR's retinal-specific training providing decisive advantages on DR and glaucoma.
  2. Linear probing reshuffles the ranking, with OpenCLIP's high-dimensional ViT-H features proving most effective for glaucoma discrimination when a task-specific head is added.
  3. Self-supervised MAE pre-training (RETFound) underperforms contrastive approaches for classification despite large-scale domain-specific training, suggesting that reconstruction-optimized features are suboptimal for discriminative transfer.
  4. Calibration and clinical operating point analysis reveal practical deployment considerations beyond aggregate metrics, with temperature scaling providing a viable path to calibrated probability outputs.

Future work should explore adapter-based fine-tuning (LoRA), ensemble strategies combining complementary model strengths, and validation on prospective clinical cohorts from diverse populations.

About the Author
This research was conducted by Pugalenthi Magendran as part of the Master of Artificial Intelligence program at Monash University, supervised by Dr Yasmeen George. The full thesis is available upon request.