From Prompts to Probes
Zero-shot and few-shot transfer of foundation models for retinal fundus disease classification.
What Is This?
Imagine you go to the eye doctor. They take a photo of the back of your eye (called a fundus image) and look for signs of diseases like diabetes-related eye damage or glaucoma. This is how millions of people around the world get screened for vision-threatening conditions.
The problem? There aren't enough trained eye doctors, especially in developing countries. Over 43 million people worldwide are blind, and the majority live in regions with limited access to specialists.
What if an AI could look at that same eye photo and flag potential diseases, without ever being explicitly trained on eye images?
That's exactly what this research explores. I took four powerful AI models, each pre-trained on a different mix of images and text (some medical, some general-purpose), and tested how well they can detect retinal diseases in two ways:
- Zero-shot: The AI has never seen these specific eye images before. I just describe the disease in words, and the AI decides if the image matches the description.
- Few-shot: I give the AI a tiny fraction of labeled examples (as few as 5% of the dataset) and let it learn a simple classifier on top of its existing knowledge.
This thesis presents a systematic benchmark of four vision-language and vision-only foundation models for retinal fundus image classification under zero-shot and few-shot linear probing protocols. The work addresses a critical gap: while foundation models show impressive generalization in natural image domains, their transfer characteristics to ophthalmic imaging remain under-explored, particularly under label-scarce conditions.
- First head-to-head comparison of FLAIR, OpenCLIP, BiomedCLIP, and RETFound on three retinal benchmarks under identical evaluation protocols
- Clinician-validated prompt engineering with three templates of increasing clinical specificity (T1-T3)
- Label-efficiency analysis at 5%, 10%, and 20% annotation budgets with frozen encoders
- Calibration assessment via temperature scaling with ECE and AUCE metrics
- Clinical operating point analysis at Sensitivity @ 90% Specificity for glaucoma screening
Why It Matters
Three eye diseases account for the vast majority of preventable blindness worldwide:
Diabetic Retinopathy
High blood sugar damages the tiny blood vessels in the retina. If caught early, laser treatment can prevent vision loss. If missed, it can lead to irreversible blindness.
Glaucoma
Damages the optic nerve gradually with no symptoms until significant vision is already lost. Early detection through screening is the only defense.
Multiple Conditions
Real-world patients often present with more than one condition. A practical screening tool needs to handle multiple diseases from a single image.
The bottleneck is human experts. Training an ophthalmologist takes 12+ years. AI models that can screen eye images with minimal training data could bridge this gap, especially in underserved communities.
Label scarcity is the primary bottleneck in medical image classification. Annotation requires board-certified ophthalmologists, and inter-grader agreement is often moderate (Cohen's kappa 0.5-0.7 for DR grading). Foundation models pre-trained on large-scale image-text pairs offer a potential bypass: leveraging rich visual representations learned from diverse medical data to perform downstream tasks with minimal or zero task-specific labels.
This is particularly critical for:
- Low-resource clinical settings where expert annotations are prohibitively expensive
- Rare disease subtypes where even large hospitals have few confirmed cases
- Rapid deployment scenarios where a model must generalize without fine-tuning
The Approach
I tested two fundamentally different ways of using these AI models:
Zero-Shot
The AI has never seen these eye images during training. I describe each disease using carefully crafted text prompts (validated by a clinician), and the model compares each image to these descriptions to make a diagnosis.
Analogy: Like showing a radiologist from another country a type of scan they've never seen, but giving them a detailed written description of what to look for.
Few-Shot (Linear Probe)
The AI's core knowledge is frozen (locked in place). I add a simple classifier on top and train it with just 5%, 10%, or 20% of the labeled data.
Analogy: Like giving that same radiologist a quick 30-minute tutorial with a handful of example cases, then testing them on new ones.
The key insight is that these models already understand a lot about medical images from their pre-training. We're testing how far that existing knowledge can stretch when we give it very little (or zero) task-specific guidance.
Two Evaluation Protocols
For zero-shot, we compute cosine similarity between image embeddings and text embeddings for each class label. For few-shot, we freeze the vision encoder and train a single linear layer (logistic regression) on the extracted features using limited labeled data. Both protocols use stratified 5-fold cross-validation.
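The zero-shot protocol above can be sketched in a few lines. This is a minimal illustration, not the thesis code: the `zero_shot_classify` helper is hypothetical, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def zero_shot_classify(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Assign each image to the class whose prompt embedding is most similar.

    image_embs: (n_images, d) frozen vision-encoder outputs
    text_embs:  (n_classes, d) one embedding per class prompt
    """
    # L2-normalise so the dot product equals cosine similarity
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T            # (n_images, n_classes) cosine similarities
    return sims.argmax(axis=1)    # predicted class index per image

# Toy example: random features standing in for real embeddings
rng = np.random.default_rng(0)
preds = zero_shot_classify(rng.normal(size=(4, 512)), rng.normal(size=(3, 512)))
print(preds.shape)  # (4,)
```

The few-shot protocol replaces the text-embedding comparison with a logistic-regression head trained on the same frozen image features.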
The Four Models
Think of each model as having a different "education background":
The benchmark covers four foundation models spanning distinct pre-training paradigms: retinal-specific contrastive (FLAIR), general biomedical contrastive (BiomedCLIP), web-scale contrastive (OpenCLIP), and self-supervised masked autoencoding (RETFound).
FLAIR
The eye specialist. Trained specifically on 288,000 retinal images paired with clinical descriptions from 38 different datasets. Knows eye diseases inside and out.
ResNet-50 vision encoder + BioClinicalBERT text encoder. Contrastive pre-training on 288K fundus images from 38 curated ophthalmic datasets with clinical text pairs.
OpenCLIP
The generalist who's seen everything. Trained on 2 billion images from the entire internet. Knows a little about a lot, including some medical images.
ViT-H/14 vision encoder. CLIP-style contrastive pre-training on LAION-2B (2 billion image-text pairs from the open web). Largest encoder architecture in the comparison.
BiomedCLIP
The medical school graduate. Trained on 15 million medical image-text pairs from research papers. Knows medicine broadly, but isn't an eye specialist.
ViT-Base/16 vision encoder + PubMedBERT text encoder. Contrastive pre-training on PMC-15M (15M figure-caption pairs from PubMed Central biomedical literature).
RETFound
The self-taught expert. Learned from 1.6 million retinal images by predicting masked parts of images (no text descriptions needed). Deep visual understanding, but can't match images to text.
ViT-Large encoder. Self-supervised pre-training via masked autoencoding (MAE) on 1.6M retinal images. No text encoder, so excluded from zero-shot; participates in linear probing only.
The Datasets
I tested the models on three established medical benchmarks, each targeting a different eye disease:
Three public ophthalmic benchmarks selected to span binary, multi-class, and multi-disease classification with varying class distributions and imaging conditions:
MESSIDOR
1,200 eye images graded by severity of diabetic retinopathy (grades 0 to 3). The challenge: distinguishing between "no disease" and early, moderate, or severe stages.
1,200 fundus images with 4-class DR severity grading (R0-R3). Evaluated with Macro-F1 to account for class imbalance. Moderate inter-grader agreement makes this a challenging benchmark.
REFUGE
1,200 eye images classified as glaucoma or healthy. The challenge: glaucoma cases are rare (only ~10%), so the model must be very precise to avoid overwhelming clinics with false alarms.
1,200 fundus images for binary glaucoma classification (~10% positive rate). Evaluated with AUROC for threshold-independent discrimination, plus Sensitivity @ 90% Specificity for clinical operating point analysis.
ODIR-200×3
A balanced subset of 600 images covering three categories: normal, diabetic retinopathy, and other diseases. Tests whether a model can handle multiple conditions at once.
Balanced 200-per-class subset from ODIR: Normal, DR, and Other pathologies (600 images total). Evaluated with Macro-F1. The heterogeneous "Other" category tests generalization beyond specific disease patterns.
Prompt Design
For zero-shot classification, the AI needs a text description of each disease to compare against the eye image. I designed three levels of prompts, each more detailed than the last, and had them validated by a clinical supervisor:
- Level 1: Just the disease name. Example: "A fundus photograph of diabetic retinopathy."
- Level 2: Adds specific visual signs. Example: "A fundus photograph showing microaneurysms, hemorrhages, and hard exudates consistent with diabetic retinopathy."
- Level 3: A full structured description following clinical reporting standards, including lesion types, locations, and severity indicators.
Results are averaged across all three prompt styles to give a fair assessment of each model's robustness to different prompt formulations.
Three prompt templates (T1-T3) of increasing clinical specificity, validated by an ophthalmic clinician (thesis supervisor Dr Yasmeen George). This hierarchy tests sensitivity to prompt engineering depth:
- T1 (Simple): Minimal disease-name template. Baseline for prompt complexity.
- T2 (Descriptive): Incorporates pathological signs and clinical features visible in fundus photography.
- T3 (Structured): Full clinical description template following ophthalmic reporting conventions.
Zero-shot results are reported as mean ± SD across T1-T3, capturing both average performance and prompt sensitivity. This averaging strategy follows the approach established in FLAIR's original evaluation.
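To make the template hierarchy concrete, here is an illustrative sketch of how per-class prompts might be rendered at each level. The template wording and the `build_prompts` helper are hypothetical examples, not the clinician-validated prompts used in the thesis.

```python
# Illustrative T1-T3 templates (hypothetical wording, not the thesis's exact prompts)
TEMPLATES = {
    "T1": "A fundus photograph of {name}.",
    "T2": "A fundus photograph showing {signs} consistent with {name}.",
    "T3": ("Fundus examination: {signs}. Impression: {name}. "
           "Severity and lesion locations as per clinical report."),
}

# Each class pairs a disease name with its visible pathological signs
CLASSES = {
    "diabetic retinopathy": "microaneurysms, hemorrhages, and hard exudates",
    "glaucoma": "increased cup-to-disc ratio and neuroretinal rim thinning",
}

def build_prompts(template_id: str) -> dict:
    """Render one prompt per class for the given template level."""
    t = TEMPLATES[template_id]
    return {name: t.format(name=name, signs=signs) for name, signs in CLASSES.items()}

# Zero-shot scores are computed once per template, then averaged across T1-T3
for tid in TEMPLATES:
    print(tid, build_prompts(tid)["glaucoma"])
```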
Zero-Shot Results
Remember, zero-shot means the models have never been trained on these specific eye images. They're diagnosing purely from text descriptions. Here's how they performed:
Performance averaged across T1-T3 prompts. RETFound excluded (no text encoder). Statistical significance assessed via paired image-bootstrap (α = 0.05, Holm adjustment).
| Model | MESSIDOR (Macro-F1) | REFUGE (AUROC) | ODIR-200×3 (Macro-F1) |
|---|---|---|---|
| FLAIR | 0.735 ± 0.049 | 0.921 ± 0.049 | 0.366 ± 0.016 |
| BiomedCLIP | 0.471 ± 0.042 | 0.649 ± 0.037 | 0.709 ± 0.032 |
| OpenCLIP | 0.353 ± 0.001 | 0.530 ± 0.046 | 0.399 ± 0.131 |
Orange = best in column. Higher is better.
What does this tell us?
- FLAIR dominates on diabetic retinopathy (MESSIDOR) and glaucoma (REFUGE). Its specialized retinal training pays off massively: a 0.921 AUROC on glaucoma means that, shown a random glaucoma eye and a random healthy eye, the model ranks the glaucoma eye as more suspicious 92.1% of the time, with zero training on this dataset.
- BiomedCLIP wins on multi-disease classification (ODIR-200×3), scoring 0.709. Its broad medical knowledge handles the diverse "Other" disease category better than the eye specialist.
- No single model wins everything. The best model depends on the specific disease and task.
Analysis
- FLAIR's dominance on MESSIDOR and REFUGE reflects the benefit of domain-specific contrastive pre-training on retinal images with clinical text. The retina-tuned vision-language alignment transfers directly to DR grading and glaucoma discrimination.
- BiomedCLIP's ODIR advantage stems from its broader biomedical training corpus (PMC-15M), which provides better coverage of the heterogeneous "Other" pathology category that includes conditions outside FLAIR's pre-training distribution.
- OpenCLIP's low prompt variance (±0.001 on MESSIDOR) suggests its web-scale representations are relatively invariant to medical prompt formulation, while its absolute performance lags due to domain gap.
Pairwise differences between FLAIR and BiomedCLIP on MESSIDOR and REFUGE are statistically significant (paired image-bootstrap, α = 0.05, Holm-adjusted).
Few-Shot Results
Now we give each model a small number of labeled examples (5%, 10%, or 20% of the training data) and add a simple classifier on top. The model's core knowledge stays frozen. This is like giving each doctor a quick tutorial before the test.
Linear probing with frozen encoders. All four models participate (including RETFound). Reported as mean [95% BCa bootstrap CI] across 5-fold stratified CV.
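A minimal sketch of the probing protocol, under stated assumptions: the `linear_probe` helper is hypothetical, and random arrays stand in for frozen-encoder features of one training fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe(feats, labels, budget, seed=0):
    """Train a linear head on a stratified `budget` fraction of the fold's labels.

    The encoder stays frozen; only this logistic-regression head is trained.
    """
    X_sub, _, y_sub, _ = train_test_split(
        feats, labels, train_size=budget, stratify=labels, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return clf.fit(X_sub, y_sub)

# Toy features standing in for frozen-encoder outputs
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))
y = rng.integers(0, 2, size=400)
probe = linear_probe(X, y, budget=0.05)   # 5% label budget -> 20 training images
print(probe.coef_.shape)
```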
MESSIDOR (Diabetic Retinopathy)
| Model | 5% Labels | 10% Labels | 20% Labels |
|---|---|---|---|
| FLAIR | 0.647 | 0.675 | 0.700 |
| OpenCLIP | 0.611 | 0.624 | 0.648 |
| BiomedCLIP | 0.580 | 0.596 | 0.613 |
| RETFound | 0.543 | 0.579 | 0.619 |
Metric: Macro-F1. Orange = best in column.
REFUGE (Glaucoma)
| Model | 5% Labels | 10% Labels | 20% Labels |
|---|---|---|---|
| FLAIR | 0.718 | 0.843 | 0.870 |
| OpenCLIP | 0.807 | 0.853 | 0.891 |
| BiomedCLIP | 0.756 | 0.808 | 0.874 |
| RETFound | 0.664 | 0.811 | 0.836 |
Metric: AUROC. Orange = best in column.
ODIR-200×3 (Multi-Disease)
| Model | 5% Labels | 10% Labels | 20% Labels |
|---|---|---|---|
| FLAIR | 0.843 | 0.873 | 0.900 |
| OpenCLIP | 0.793 | 0.871 | 0.878 |
| BiomedCLIP | 0.857 | 0.892 | 0.870 |
| RETFound | 0.650 | 0.744 | 0.820 |
Metric: Macro-F1. Orange = best in column.
What does this tell us?
- A little data goes a long way. With just 5% of labels, most models already perform well. Adding more labels helps, but the biggest jump is from zero-shot to 5%.
- FLAIR stays king on DR (MESSIDOR), winning at every label budget.
- Plot twist on glaucoma: OpenCLIP (the generalist) beats FLAIR (the eye specialist) when given even a few labeled examples. Its massive ViT-H/14 encoder extracts better features for a simple classifier to use.
- BiomedCLIP shines on multi-disease at lower budgets (5%, 10%), but FLAIR catches up and overtakes at 20%.
- RETFound underperforms despite being trained on retinal images. Its self-supervised training (without text) seems less useful than the vision-language approach.
Analysis
- FLAIR dominates MESSIDOR across all budgets, consistent with its retinal-specific contrastive training providing superior DR-discriminative features.
- OpenCLIP's REFUGE advantage is noteworthy: despite a lower feature dimensionality (1280-d from ViT-H/14 vs FLAIR's 2048-d from ResNet-50), its larger-capacity encoder apparently captures structural features relevant to optic disc/cup ratio analysis that benefit linear probing.
- BiomedCLIP's ODIR efficiency at low budgets reflects its PMC-15M training capturing diverse pathology patterns, giving the linear head more discriminative features for the heterogeneous "Other" class.
- RETFound's lagging performance despite 1.6M retinal images suggests that MAE pre-training objectives optimize for reconstruction, not discrimination. The learned features, while rich in spatial detail, may lack the semantic structure that contrastive training provides for classification tasks.
FLAIR achieves the highest overall average rank across all nine dataset-budget cells, followed by OpenCLIP, then BiomedCLIP, with RETFound trailing.
Clinical Operating Point
For a real glaucoma screening program, just having a high AUROC isn't enough. Clinicians need to set a specific threshold: how many healthy people are we willing to incorrectly flag as potential glaucoma cases?
I fixed the specificity at 90% (meaning only 10% of healthy people would be incorrectly flagged) and measured how many actual glaucoma cases each model catches at that threshold:
AUROC is threshold-independent. For deployment, we analyze Sensitivity @ 90% Specificity on REFUGE, a clinically meaningful operating point for population-level screening where false positive burden must be controlled:
| Model | Sens@Spec=90% (5%) | Sens@Spec=90% (20%) |
|---|---|---|
| OpenCLIP | 0.436 | 0.611 |
| FLAIR | 0.289 | 0.526 |
| BiomedCLIP | 0.231 | 0.471 |
| RETFound | 0.200 | 0.414 |
OpenCLIP catches 61.1% of glaucoma cases with 20% labels at this strict threshold. While not yet sufficient for standalone screening, it's the best option among the models tested and demonstrates that even with minimal training data, meaningful clinical detection is possible.
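The operating-point metric is straightforward to compute from an ROC curve. A minimal sketch (the `sensitivity_at_specificity` helper is hypothetical; toy scores stand in for real model outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_specificity(y_true, y_score, target_spec=0.90):
    """Highest sensitivity achievable while keeping specificity >= target_spec."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = fpr <= (1.0 - target_spec)     # specificity = 1 - FPR
    return float(tpr[ok].max())

# Toy scores: positives tend to score higher than negatives (~10% prevalence)
y = np.array([0] * 90 + [1] * 10)
s = np.concatenate([np.linspace(0.0, 0.6, 90), np.linspace(0.4, 1.0, 10)])
print(sensitivity_at_specificity(y, s))
```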
Key Findings
No Single Model Wins Everything
The best model depends on the disease and how much labeled data you have. FLAIR for DR, OpenCLIP for glaucoma (with probe), BiomedCLIP for multi-disease at low budgets.
Domain-Specific Training Matters Most for Zero-Shot
FLAIR's retinal pre-training gives it a massive zero-shot advantage on DR and glaucoma. But when you add a simple classifier (linear probe), the playing field levels out.
Bigger Isn't Always Better
RETFound (ViT-Large, 1.6M retinal images) underperforms despite its sizeable architecture and domain-specific training data. The pre-training objective matters more than scale.
5% Labels Already Help Enormously
The jump from zero-shot to 5% labeled data is dramatic for most tasks. Diminishing returns set in after 10%, suggesting that even tiny annotation budgets are worthwhile.
Calibration Can Be Fixed Separately
Temperature scaling improves model confidence calibration without changing the ranking or discrimination ability. This means calibrated probability outputs are achievable post-hoc.
Runs on a Laptop
All experiments ran on an Apple M-series laptop with MPS acceleration. No GPU cluster needed. Reproducible, accessible research that any researcher can replicate.
Methodology
To make sure the results are trustworthy and not just lucky, I used rigorous statistical methods:
- 5-fold cross-validation: Split the data into 5 parts, train on 4, test on 1, rotate 5 times. Every image gets tested exactly once.
- Stratified splits: Each fold maintains the same disease-to-healthy ratio as the full dataset, preventing "easy" or "hard" folds.
- Bootstrap confidence intervals: Resample the test predictions 2,000 times to estimate how much the results might vary, reporting 95% confidence intervals.
- Statistical tests: Used the Wilcoxon signed-rank test (a non-parametric test that doesn't assume normal distributions) to check that differences between models are unlikely to be due to chance.
- Calibration analysis: Measured how well the model's confidence scores match actual probabilities using Expected Calibration Error (ECE) and Area Under Calibration Error (AUCE).
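The interval and test machinery is standard scipy. A minimal sketch with toy numbers (the per-image correctness arrays and per-fold scores are fabricated stand-ins, not thesis results):

```python
import numpy as np
from scipy.stats import bootstrap, wilcoxon

# Per-image correctness (1 = correct) for two hypothetical models on one test fold
rng = np.random.default_rng(0)
model_a = rng.binomial(1, 0.75, size=200).astype(float)

# 95% BCa bootstrap CI for model A's accuracy (B = 2000 resamples)
ci = bootstrap((model_a,), np.mean, n_resamples=2000,
               confidence_level=0.95, method="BCa", random_state=0)
print("accuracy CI:", ci.confidence_interval)

# Paired Wilcoxon signed-rank test on per-fold scores (toy numbers)
fold_scores_a = np.array([0.73, 0.75, 0.71, 0.74, 0.72])
fold_scores_b = np.array([0.69, 0.70, 0.68, 0.71, 0.66])
stat, p = wilcoxon(fold_scores_a, fold_scores_b)
print("Wilcoxon p-value:", p)
```

With only five folds the exact Wilcoxon p-value cannot go below 1/16 two-sided, which is one reason the thesis pairs it with image-level bootstrap tests.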
Experimental Design
- Evaluation: Stratified 5-fold CV with identical fold assignments across all models and budgets.
- Few-shot protocol: Within each training fold, stratified subsampling at 5%, 10%, 20% label budgets. Linear head (logistic regression) trained on frozen encoder features.
- Statistical testing: Paired Wilcoxon signed-rank tests across folds for model comparisons. BCa bootstrap CIs (B=2000) for point estimate uncertainty.
- Calibration: Temperature scaling optimized on validation split (NLL objective). Reported metrics: ECE (15-bin), AUCE (area under calibration error curve). AUROC invariant to monotone transforms.
- Metrics: Macro-F1 for multi-class tasks (MESSIDOR 4-class, ODIR 3-class); AUROC for binary task (REFUGE). Sensitivity @ 90% Specificity for clinical operating point.
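The calibration step can be sketched as follows. This is an illustrative implementation under stated assumptions (hypothetical helpers, toy logits standing in for real validation-split outputs), showing why temperature scaling leaves AUROC and rankings untouched: dividing logits by a positive scalar never changes the argmax.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    """Negative log-likelihood of labels under temperature-scaled logits."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels):
    """Single scalar T optimised on a validation split (NLL objective)."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                          method="bounded")
    return res.x

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error over equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

# Toy overconfident logits standing in for real validation outputs
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 3)) * 4 + np.eye(3)[labels] * 2
T = fit_temperature(logits, labels)
print("fitted T:", round(T, 2), "ECE after:", round(ece(softmax(logits / T), labels), 3))
```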
Implementation Stack
| Component | Technology |
|---|---|
| Language | Python 3.9.22 |
| Framework | PyTorch 2.3.1 (MPS acceleration) |
| Models | Hugging Face Transformers, open_clip, RETFound weights |
| Statistics | scipy.stats (Wilcoxon), scikit-learn (bootstrap, calibration) |
| Hardware | Apple M-series laptop (no external GPU) |
| Reproducibility | Fixed random seeds, deterministic data splits, public benchmark datasets |
Practical Guidance
If you're a researcher or engineer looking to build a retinal disease screening tool, here's what this research suggests:
Decision Framework
For Diabetic Retinopathy Screening
Use FLAIR. It wins in both zero-shot and few-shot settings. If you have no labeled data at all, FLAIR's zero-shot performance (0.735 Macro-F1) is already useful for triage.
For Glaucoma Screening
Use OpenCLIP + linear probe. Despite being a generalist model, it produces the best glaucoma features when paired with a simple classifier. Even 5% labeled data (about 60 images) gives meaningful results.
For Multi-Disease Screening
Start with BiomedCLIP if you have very few labels (<10%). Switch to FLAIR once you have 20% or more labeled data.
General Recommendations
- Always prefer few-shot probing over zero-shot if any labeled data is available
- Apply temperature scaling for calibrated probability outputs
- Do not assume RETFound will outperform just because it was trained on retinal images
- Validate on your local population before deployment
Conclusion
This research shows that modern AI foundation models can detect retinal diseases with surprisingly high accuracy, even with little or no disease-specific training data. The key takeaway is that there is no universal best model. The right choice depends on what disease you're screening for and how much labeled data you can afford to collect.
FLAIR is the clear winner for diabetic retinopathy. OpenCLIP (with a simple classifier) is best for glaucoma. And the amount of data you need is less than you might think: even 5% of labeled examples makes a dramatic difference.
All of this runs on a laptop. No expensive GPU cluster, no massive compute budget. Just careful experimental design, rigorous statistics, and the right combination of model and task.
This thesis provides the first controlled head-to-head evaluation of four foundation model paradigms for retinal fundus classification under zero-shot and label-scarce conditions. The results demonstrate that:
- Pre-training domain alignment is the primary determinant of zero-shot transfer quality, with FLAIR's retinal-specific training providing decisive advantages on DR and glaucoma.
- Linear probing reshuffles the ranking, with OpenCLIP's high-dimensional ViT-H features proving most effective for glaucoma discrimination when a task-specific head is added.
- Self-supervised MAE pre-training (RETFound) underperforms contrastive approaches for classification despite large-scale domain-specific training, suggesting that reconstruction-optimized features are suboptimal for discriminative transfer.
- Calibration and clinical operating point analysis reveal practical deployment considerations beyond aggregate metrics, with temperature scaling providing a viable path to calibrated probability outputs.
Future work should explore adapter-based fine-tuning (LoRA), ensemble strategies combining complementary model strengths, and validation on prospective clinical cohorts from diverse populations.