Small Datasets Are Not Small When They Are Dense
A single patient logging daily produces more data than most clinical trials collect per patient. The problem is not rarity; it is instrumentation.
Traditional biostatistics needs large sample sizes because the methods are blunt. A Phase 3 trial collects a few dozen data points per patient over months. You need thousands of patients to compensate for sparse data with volume.
Flip that model. A single patient generating daily structured data (biomarkers, dietary logs, symptom scores, quality-of-life instruments, medication timing) produces more data points in a year than many small trials collect from their entire cohort. Now do that across 50 patients with the same disorder, using the same schema.
- Traditional Phase 3 trial: ~50 data points per patient per year
- Patient daily logging: 2,000+ data points per patient per year
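To make the instrumentation concrete, here is a minimal sketch of what one patient-day record might look like. The field names are hypothetical, chosen to mirror the list above rather than any standard instrument:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class DailyLog:
    """One patient-day of structured data (hypothetical schema)."""
    patient_id: str
    log_date: date
    biomarkers: dict[str, float]                 # e.g. {"crp_mg_l": 4.2}
    symptom_scores: dict[str, int]               # e.g. {"fatigue": 6}, 0-10 scale
    diet: list[str] = field(default_factory=list)         # food/ingredient tags
    medications: list[str] = field(default_factory=list)  # drug and timing strings
    qol_score: Optional[float] = None            # quality-of-life instrument total
```

Even this modest record yields the arithmetic behind the comparison: roughly six fields per day times 365 days is over 2,000 data points per patient per year.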
Modern causal inference, foundation models trained on biomedical literature, and longitudinal ML methods can extract signal from these datasets that traditional statistics never could. A disease with 200 known patients worldwide is not hopeless. It is under-instrumented.
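As one hedged example of what a longitudinal method can look like here, the sketch below fits a within-patient lagged regression on pooled daily logs: does yesterday's exposure predict today's symptom score, controlling for yesterday's score and each patient's baseline? The file and column names (`daily_logs.csv`, `gluten_exposure`, `fatigue`) are hypothetical, and a real analysis would need far more careful confounder control:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per patient-day.
df = pd.read_csv("daily_logs.csv", parse_dates=["log_date"])
df = df.sort_values(["patient_id", "log_date"])

# Lag the exposure and the outcome within each patient.
df["exposure_lag1"] = df.groupby("patient_id")["gluten_exposure"].shift(1)
df["fatigue_lag1"] = df.groupby("patient_id")["fatigue"].shift(1)

# Patient fixed effects (C(patient_id)) absorb individual baselines;
# robust standard errors guard against misspecification.
model = smf.ols(
    "fatigue ~ exposure_lag1 + fatigue_lag1 + C(patient_id)",
    data=df.dropna(subset=["exposure_lag1", "fatigue_lag1"]),
).fit(cov_type="HC1")
print(model.summary())
```

With 50 data points per patient, this regression is hopeless; with 2,000, it has something to work with.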
When every patient's daily data uses the same instruments and schema, each individual dataset becomes a node in a network. Patterns no single clinician could observe, spanning geographies, genotypes, and years, emerge from the aggregate.
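A sketch of one way that aggregation step might look, assuming per-site files that share the schema above: pool the logs and fit a mixed-effects model with a shared exposure effect and per-patient random intercepts. File and column names remain hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Pool same-schema logs from multiple sites (hypothetical file names).
pooled = pd.concat(
    (pd.read_csv(f, parse_dates=["log_date"]) for f in ["site_a.csv", "site_b.csv"]),
    ignore_index=True,
).sort_values(["patient_id", "log_date"])

pooled["exposure_lag1"] = pooled.groupby("patient_id")["gluten_exposure"].shift(1)

# A random intercept per patient lets individual baselines vary while the
# exposure effect is estimated from the whole network of patients.
d = pooled.dropna(subset=["exposure_lag1"])
mixed = smf.mixedlm("fatigue ~ exposure_lag1", data=d, groups=d["patient_id"]).fit()
print(mixed.summary())
```

The design choice that matters is the schema, not the model: any estimator like this only works because every node logs the same fields.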