The Time Value of Data
Why starting longitudinal data collection now matters more than starting perfectly. The compounding return on the second decade of follow-up that the first decade makes possible.
A person with classic galactosemia who is 55 years old today was diagnosed through newborn screening in 1971. She has followed the galactose-free diet for 55 years. She has had hundreds of clinic visits, dozens of blood draws, multiple cognitive assessments, fertility consultations, bone density scans, and speech evaluations. Her medical history is one of the longest continuous treatment records for any metabolic condition on the newborn screening panel.
That record does not exist in any structured, centralized, accessible form. It is scattered across the medical records of every provider she has ever seen, in formats ranging from handwritten clinical notes to EHR entries in systems that may no longer be operational. The data was generated. It was never collected.
The cost of that failure compounds every year.
What Longitudinal Data Does
A single blood phenylalanine level from one person with PKU at one moment in time tells you that person's phenylalanine is 6 mg/dL. Ten years of weekly blood phenylalanine levels from the same person tell you the trajectory: how stable the control is, how it responds to illness, how it changes with age, how it correlates with dietary adherence, and whether the trend is improving or deteriorating. The trajectory is what generates the insight. The snapshot is almost useless by comparison.
Multiply that by a thousand people contributing the same structured data over the same decade, and the dataset reveals patterns invisible in any individual record. Consider the 40-year-old adults with PKU whose phenylalanine control was excellent in childhood but deteriorated in their twenties: is that pattern universal, or does it cluster with specific genotypes, specific formulas, or specific life events? The answer is in the data, but only if the data exists across enough people and enough time.
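The difference between a snapshot and a trajectory can be sketched in a few lines of Python. This is a minimal illustration, not a clinical tool: the function name `summarize_trajectory`, the field names, and the weekly readings are all hypothetical, and the readings are synthetic values invented for the example. It fits an ordinary least-squares line to one person's series and reports the mean, spread, and trend, none of which a single reading can provide.

```python
from statistics import mean, pstdev


def summarize_trajectory(weeks, levels_mg_dl):
    """Summarize a series of blood phenylalanine readings (hypothetical helper).

    Returns the mean level, the population standard deviation, and the
    ordinary least-squares slope in mg/dL per week and per year.
    """
    if len(weeks) != len(levels_mg_dl) or len(weeks) < 2:
        raise ValueError("need at least two paired (week, level) observations")
    week_mean = mean(weeks)
    level_mean = mean(levels_mg_dl)
    sxx = sum((w - week_mean) ** 2 for w in weeks)
    if sxx == 0:
        raise ValueError("observations must span more than one time point")
    sxy = sum((w - week_mean) * (v - level_mean)
              for w, v in zip(weeks, levels_mg_dl))
    slope = sxy / sxx
    return {
        "mean_mg_dl": level_mean,
        "sd_mg_dl": pstdev(levels_mg_dl),
        "slope_mg_dl_per_week": slope,
        # Approximates a year as 52 weeks.
        "slope_mg_dl_per_year": slope * 52,
    }


# Synthetic, illustrative data only: ten weekly readings drifting up from 6 mg/dL.
weeks = list(range(10))
levels = [6.0 + 0.1 * w for w in weeks]

snapshot = levels[0]  # a single reading: one number, no trend
trajectory = summarize_trajectory(weeks, levels)
```

Applied to every person in a cohort, per-person summaries like these are the kind of structured values that can then be compared across genotypes, formulas, or life events.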
The Nurses' Health Study demonstrated this principle across 50 years and 121,700 participants. Frank Speizer mailed a questionnaire to registered nurses in 1976. Most of them kept filling it out every two years for decades. The study has generated findings on cardiovascular disease, cancer, diabetes, hormone replacement, diet, physical activity, and dozens of other topics. Its power comes from three properties: a large cohort, standardized data collection, and decades of continuous follow-up.
The rare disease equivalent of that dataset does not exist for any condition.
The Irreversibility of Lost Time
A person with galactosemia who begins contributing structured data today and continues for 10 years generates a 10-year longitudinal record by 2036. If the data infrastructure is not built until 2029, and collection begins then, the same person generates a 7-year record by 2036. Those three years are gone permanently. The cognitive changes that happened between ages 38 and 41, the bone density measurement at 39, the medication change at 40: all lost.
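The arithmetic above is simple enough to write down directly. The sketch below uses hypothetical function names and takes the start years from the example in the text (2026 for "today", 2029 for the delayed start).

```python
def record_years(start_year: int, end_year: int) -> int:
    """Years of longitudinal record accumulated by end_year
    when collection begins in start_year (never negative)."""
    return max(0, end_year - start_year)


def years_lost(on_time_start: int, delayed_start: int, end_year: int) -> int:
    """Years of record that a delayed start can never recover."""
    return (record_years(on_time_start, end_year)
            - record_years(delayed_start, end_year))


record_if_started_now = record_years(2026, 2036)
record_if_delayed = record_years(2029, 2036)
lost = years_lost(2026, 2029, 2036)

print(record_if_started_now, record_if_delayed, lost)  # prints: 10 7 3
```

The loss depends only on the delay, not on the horizon: any end year after 2029 gives the same three missing years.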
This arithmetic applies to every rare disease, but it is most urgent for conditions where the affected adults are aging out of the window where their data can be captured. The adults with PKU who were identified through screening in the 1960s are now in their late fifties and sixties. They are the first generation to have grown up with treatment from birth. Their life trajectories, from childhood compliance through adolescent dietary abandonment through adult return to treatment (or not), constitute the most complete treated natural history dataset for any metabolic condition. That dataset is disappearing into fragmented medical records and clinical notes that no one is systematically collecting.
The adults with galactosemia who grew up with dietary treatment in the 1970s and 1980s are the cohort that could answer the question of why long-term outcomes vary so widely despite similar dietary management. Some have near-normal cognition. Others have significant impairment. Some women retained partial ovarian function. Most did not. The factors that separate these outcomes are discoverable, but only with data that spans the full arc of these adults' lives. That arc is ending, and the data is not being captured.
Starting Imperfect Is Better Than Waiting for Perfect
The Nurses' Health Study launched in 1976 with paper questionnaires mailed through the postal service. The instruments were imperfect. The cohort was not representative of the general population. The early surveys asked simpler questions than the sophisticated instruments developed later. The data collection tools improved over time, but the critical decision was the decision to start.
The data collected with imperfect instruments in 1976 turned out to be more valuable than no data at all, because the longitudinal dimension is what generates the insights, and no one can go back in time to collect a trajectory that was never recorded.
The same logic applies to rare disease data collection in 2026. The data standards are not finalized. FHIR was not designed for rare disease use cases. ICD-10 codes for many rare diseases are nonexistent or absurdly broad. The consent frameworks for persistent, multi-use data are still evolving. The interoperability problems between different data systems are real.
None of that is a reason to wait.
A person with Ehlers-Danlos syndrome who starts recording structured symptom data today, using whatever instrument is available now, generates a longitudinal record that will become more valuable over time as the instruments improve, the standards mature, and the analytical tools evolve. The data that matters most is the data that was collected early, because early data captures the period that later data cannot recover.
The Compounding Return
Longitudinal data from one condition accelerates research in related conditions. The natural history data from people with PKU who are experiencing cognitive changes in middle age may reveal patterns shared by people with galactosemia, maple syrup urine disease, or other amino acid and carbohydrate metabolism disorders. The blood phenylalanine trajectories that predict cognitive decline in PKU may share statistical signatures with galactose-1-phosphate trajectories in galactosemia. These cross-condition signals are discoverable only when the data from multiple conditions occupies the same infrastructure and uses compatible formats.
Each year of delay narrows the window for capturing the cross-condition patterns that would emerge from a multi-disease longitudinal dataset. The adults with treated metabolic disease who are in their fifties and sixties today are aging through the period where their data is most informative for understanding adult outcomes. In 10 years, the cohort will be smaller. The survivors will be older. The treatment landscape will have changed. The opportunity to study the first generation of screened and treated adults across multiple metabolic conditions simultaneously will have passed.
What the Data Is Worth
The value of longitudinal rare disease data is not abstract. It is measured in specific, concrete outcomes.
Natural history data from a cohort of adults with galactosemia could determine whether AT-007 (govorestat), an aldose reductase inhibitor that has been studied in late-stage clinical trials, needs to be started in infancy to prevent ovarian damage or whether it can reverse damage when started in adulthood. That determination requires knowing the trajectory of ovarian function across the lifespan, which requires longitudinal data that does not currently exist in aggregated form.
Natural history data from a cohort of adults with PKU could serve as the external control arm for a gene therapy trial, potentially eliminating the need to randomize people to a placebo group when the existing standard of care is well characterized. That use requires the natural history data to be structured, standardized, and available. Data locked in clinic files cannot serve as a control arm for anything.
Longitudinal symptom data from a cohort of people with hypermobile Ehlers-Danlos syndrome (hEDS) could enable phenotypic clustering that identifies genetically distinct subgroups within the current hEDS population, accelerating the search for the gene or genes responsible for the most common form of Ehlers-Danlos syndrome. That clustering requires structured symptom data collected over years, not a one-time survey.
Each of these applications requires the same thing: data that was collected early enough, structured well enough, and maintained long enough to answer questions that could not be specified at the time collection began. The data infrastructure that makes this possible is not a product or a service. It is a commitment to persistence.
The Human Arithmetic
A child born with a rare metabolic disease in 2026 whose structured data enters a longitudinal dataset at birth and continues for 20 years generates a record that, combined with similar records from hundreds of other children, could reveal treatment modifications that improve outcomes for the generation born in 2046. If data collection does not begin until 2031, the first 20-year datasets are not available until 2051. The children born in 2046 who would have benefited from the earlier data will have grown up without it.
The gap between 2046 and 2051 is five years of children growing up with suboptimal treatment because the data that could have improved it was not collected in time. The cost is measured in developmental milestones, cognitive outcomes, fertility, bone density, and quality of life.
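The same calculation can be made explicit. In this sketch the function name is hypothetical; the years and the 20-year follow-up come from the example above.

```python
def first_full_dataset_year(collection_start: int, follow_up_years: int) -> int:
    """First year in which a complete follow-up record of the given
    length exists for people enrolled when collection starts."""
    return collection_start + follow_up_years


FOLLOW_UP_YEARS = 20

available_if_started_2026 = first_full_dataset_year(2026, FOLLOW_UP_YEARS)
available_if_started_2031 = first_full_dataset_year(2031, FOLLOW_UP_YEARS)

# Birth-year cohorts born after the on-time data would have existed
# but before the delayed data becomes available.
affected_birth_years = list(range(available_if_started_2026,
                                  available_if_started_2031))

print(available_if_started_2026, available_if_started_2031)  # prints: 2046 2051
print(affected_birth_years)  # prints: [2046, 2047, 2048, 2049, 2050]
```

A five-year delay in starting collection shifts every downstream milestone by the same five years, which is why the gap cannot be closed later by collecting faster.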
Every year of delay in building longitudinal data infrastructure for rare disease is a year of irreplaceable data lost and a year of preventable harm extended. The instruments will improve. The standards will mature. The analytical tools will become more powerful. The only thing that cannot be recovered is the time.