When Diagnostic Categories Dissolve

The political fights over diagnostic criteria are artifacts of a system that could not see biology clearly. Here is what happens when longitudinal data clusters patients by signal rather than by checklist.

Cancer classification used to be organized by organ. Lung cancer. Breast cancer. Colon cancer. The organ was the category. Treatment followed the category: lung cancer drugs, breast cancer surgery, colon cancer chemotherapy.

Molecular oncology dissolved those categories. A breast tumor with a HER2 amplification and a gastric tumor with the same amplification can respond to the same drug (trastuzumab), despite originating in different organs. A microsatellite-unstable tumor can respond to immunotherapy whether it arises in the colon, the endometrium, or elsewhere. The organ is where the cancer lives. The molecular classification is what the cancer is.

The same dissolution is coming for rare disease. The diagnostic categories that currently define clinical practice, research funding, advocacy organizations, and insurance coverage are artifacts of a system that lacked the data to see biology clearly. As genomic, proteomic, and longitudinal phenotypic data accumulate, the categories themselves become less important than the biological signals they were built to approximate.

The Bucket Problem

The 2017 diagnostic criteria for hypermobile Ehlers-Danlos syndrome created two buckets: hEDS and hypermobility spectrum disorder (HSD). A person who meets the Beighton score threshold, has enough systemic features, and has had other connective tissue disorders ruled out goes in the hEDS bucket. A person who is hypermobile and has chronic pain, POTS, mast cell activation, and gastrointestinal dysfunction, whose life is functionally identical to that of the person in the hEDS bucket but who scores one point too low on the Beighton scale, goes in the HSD bucket.

For many affected individuals, the two buckets are clinically indistinguishable. The symptom trajectories overlap. The comorbidity patterns overlap. The treatment approaches are the same. The difference is a score on a physical exam developed in 1973 from a study of a South African population, applied to a person whose hypermobility may have decreased with age, whose racial background may produce different baseline flexibility, and whose most clinically significant symptoms have nothing to do with how far she can bend her thumb.

The 2017 criteria were designed to narrow the diagnostic boundary and create a more homogeneous research population. The intent was reasonable: genetic studies need clean phenotypic groups to detect gene-disease associations. The consequence was that many people who had previously been diagnosed with EDS no longer met criteria. The HSD label, intended as a clinical classification, was experienced as a demotion. Advocacy organizations, insurance coverage decisions, and disability evaluations treat hEDS and HSD differently, despite the absence of evidence that they represent different biological entities.

What Data Dissolves

If 10,000 people across the hEDS/HSD/fibromyalgia/chronic fatigue/POTS/MCAS spectrum contribute structured longitudinal data, including symptom trajectories, comorbidity patterns, treatment responses, biomarker measurements, and wearable device data, to a shared data infrastructure, the diagnostic labels become secondary to the biological clustering.
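As a rough illustration of what "structured longitudinal data" could mean in practice, here is a minimal sketch of one possible record layout. The class names, field names, and sample values are all hypothetical; nothing here describes an existing registry or standard. The point it tries to show is that the diagnostic label is stored as metadata on the participant, while the dated observations carry the actual signal.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Observation:
    """One dated measurement or self-report from one participant."""
    participant_id: str
    observed_on: date
    domain: str   # e.g. "symptom", "biomarker", "treatment", "wearable"
    name: str     # e.g. "joint_pain", "resting_heart_rate"
    value: float
    unit: Optional[str] = None


@dataclass
class Participant:
    """A participant; diagnostic labels are metadata, not the unit of analysis."""
    participant_id: str
    diagnostic_labels: Set[str] = field(default_factory=set)
    observations: List[Observation] = field(default_factory=list)

    def trajectory(self, name: str) -> List[Tuple[date, float]]:
        """Return the dated values for one measure, oldest first."""
        points = [(o.observed_on, o.value)
                  for o in self.observations if o.name == name]
        return sorted(points)


# Hypothetical participant; every value is invented for illustration.
p = Participant("P001", {"HSD", "POTS"})
p.observations += [
    Observation("P001", date(2024, 3, 1), "symptom", "joint_pain", 6.0, "0-10 scale"),
    Observation("P001", date(2024, 1, 1), "symptom", "joint_pain", 4.0, "0-10 scale"),
    Observation("P001", date(2024, 2, 1), "wearable", "resting_heart_rate", 88.0, "bpm"),
]
print(p.trajectory("joint_pain"))
```

Keeping every observation dated and named the same way across participants is what would let trajectories from people with different labels be compared in one analysis.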

Computational clustering of that data will not produce two clean buckets (hEDS vs. HSD) or five clean buckets (hEDS, HSD, fibromyalgia, chronic fatigue syndrome, POTS). It will produce a cloud with density regions. Some density regions will correspond to recognized diagnoses. Others will not.
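The "cloud with density regions" picture can be made concrete with a density-based method. The sketch below is a minimal pure-Python version of DBSCAN, chosen here only as one familiar example of such a method; the essay does not name a specific algorithm. The two-dimensional points are synthetic, and the eps and min_pts values are arbitrary assumptions. Note that the algorithm never sees a diagnostic label.

```python
import math
import random


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... for density
    regions and -1 for points left in the sparse part of the cloud."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point: keep expanding
    return labels


# Synthetic "phenotype space": three dense groups plus scattered points.
# All coordinates are invented; real features would be far higher-dimensional.
rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in centers for _ in range(40)]
points += [(rng.uniform(-3, 13), rng.uniform(-3, 8)) for _ in range(10)]

labels = dbscan(points, eps=1.0, min_pts=5)
n_regions = len(set(labels) - {-1})
n_sparse = labels.count(-1)
print(n_regions, "density regions;", n_sparse, "points in the sparse cloud")
```

A density-based method fits the argument because it does not require the number of groups to be fixed in advance and it is allowed to leave some people unassigned. This version compares every pair of points, so it is suitable only for small illustrations.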

The density regions that do not correspond to any recognized diagnosis are the undiscovered conditions. They are subgroups of people who share specific symptom trajectories, specific comorbidity patterns, specific biomarker signatures, and eventually specific genetic variants, but who are currently distributed across multiple diagnostic categories because the categories were drawn before the data existed to draw them accurately.

The historical parallel in oncology took decades and billions of dollars in genomic research to accomplish. The rare disease version can happen faster because the analytical tools (machine learning, unsupervised clustering algorithms, large-scale phenotype-genotype association methods) already exist. What is missing is the data.

The hEDS Gene Problem

Hypermobile EDS is the only EDS subtype without an identified causative gene. Every other type (classical, vascular, kyphoscoliotic, arthrochalasia, and the rest) has a known genetic cause, typically in a gene encoding collagen or a collagen-processing enzyme. For those types, a genetic test can confirm the diagnosis. For hEDS, no test exists. The diagnosis is entirely clinical.

The gene-hunting approach for hEDS has been straightforward in theory: take clinically diagnosed individuals, sequence their genomes, and look for shared variants. The problem is that if hEDS is genetically heterogeneous (that is, if multiple different genes produce similar clinical presentations), the signal from any single gene is diluted across a phenotypically defined population that is actually composed of multiple genetically distinct subgroups. Genome-wide association studies find nothing because they are looking for one signal in a population that contains several.

The data-driven approach inverts the method. Instead of starting with a clinical diagnosis and searching for a gene, start with structured phenotypic data and search for subgroups. Cluster thousands of people by their actual symptom trajectories over time: which joints are most affected, what age the symptoms appeared, whether POTS is present, whether MCAS is present, what the cardiac autonomic profile looks like, what the gastrointestinal pattern is, how the symptoms responded to specific treatments. The clustering produces phenotypically homogeneous subgroups. Then sequence each subgroup separately.
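The phenotype-first step described above can be sketched in a few lines. This example assumes a hypothetical feature layout, uses a Gower-style distance (range-scaled differences for numeric features, simple mismatch for categorical ones), and groups people by linking any pair closer than a threshold, which amounts to single-linkage clustering cut at that distance. The participants, feature names, value range, and threshold are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical features: name -> (kind, value range used to scale numerics).
FEATURES = {
    "onset_age": ("numeric", 40.0),     # assumed 40-year range
    "has_pots": ("categorical", None),
    "has_mcas": ("categorical", None),
    "gi_pattern": ("categorical", None),
}


def gower_distance(a, b):
    """Mean per-feature dissimilarity in [0, 1]; the 'label' key is ignored."""
    total = 0.0
    for name, (kind, value_range) in FEATURES.items():
        if kind == "numeric":
            total += abs(a[name] - b[name]) / value_range
        else:
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(FEATURES)


def threshold_clusters(people, max_distance):
    """Link every pair within max_distance, then return connected groups."""
    ids = list(people)
    parent = {pid: pid for pid in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if gower_distance(people[a], people[b]) <= max_distance:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for pid in ids:
        groups[find(pid)].append(pid)
    return sorted(groups.values())


# Invented participants. Their clinical labels differ within each phenotype group.
people = {
    "P1": {"label": "hEDS", "onset_age": 12, "has_pots": True,
           "has_mcas": True, "gi_pattern": "gastroparesis"},
    "P2": {"label": "HSD", "onset_age": 14, "has_pots": True,
           "has_mcas": True, "gi_pattern": "gastroparesis"},
    "P3": {"label": "fibromyalgia", "onset_age": 13, "has_pots": True,
           "has_mcas": True, "gi_pattern": "gastroparesis"},
    "P4": {"label": "hEDS", "onset_age": 35, "has_pots": False,
           "has_mcas": False, "gi_pattern": "none"},
    "P5": {"label": "HSD", "onset_age": 38, "has_pots": False,
           "has_mcas": False, "gi_pattern": "none"},
}

cohorts = threshold_clusters(people, max_distance=0.2)
for cohort in cohorts:
    print(cohort, [people[pid]["label"] for pid in cohort])
```

Each resulting cohort, rather than each clinical label, would then be the unit sent for sequencing. In this toy data the cohorts cut across the labels by construction; whether real data behaves that way is exactly what the essay argues needs to be tested.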

The statistical power for gene discovery can increase substantially when the population is phenotypically clean. A genome-wide association study in 200 people who all share a specific symptom trajectory, a specific comorbidity pattern, and a specific treatment response profile can have more power to detect a shared genetic variant than the same study in 2,000 people whose only shared feature is a clinical label that may encompass five distinct conditions. The reason is that the mixed cohort's signal shrinks in proportion to the affected subgroup's share of the cohort, while its statistical noise shrinks only with the square root of the sample size.
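The 200-versus-2,000 comparison can be checked with a back-of-the-envelope calculation. The sketch below uses a normal approximation to a two-proportion test at the conventional genome-wide threshold of 5e-8. Every frequency in it is an illustrative assumption, not an estimate from any study: a variant carried by 30% of one subgroup and 10% of everyone else, with that subgroup making up one fifth of the clinical label. Equal numbers of cases and controls are also assumed.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(freq_cases, freq_controls, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided two-proportion z-test at level alpha.
    Normal approximation; the negligible opposite tail is ignored."""
    se = sqrt(freq_cases * (1 - freq_cases) / n_cases
              + freq_controls * (1 - freq_controls) / n_controls)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_effect = abs(freq_cases - freq_controls) / se
    return 1 - STD_NORMAL.cdf(z_crit - z_effect)


# Illustrative assumptions only.
BACKGROUND = 0.10       # carrier frequency in controls and in other subgroups
IN_SUBGROUP = 0.30      # carrier frequency in the one affected subgroup
SUBGROUP_SHARE = 0.20   # that subgroup's share of the label (one of five)

# Carrier frequency seen in a cohort defined only by the clinical label.
mixed_freq = SUBGROUP_SHARE * IN_SUBGROUP + (1 - SUBGROUP_SHARE) * BACKGROUND

power_clean = approx_power(IN_SUBGROUP, BACKGROUND, n_cases=200, n_controls=200)
power_mixed = approx_power(mixed_freq, BACKGROUND, n_cases=2000, n_controls=2000)

print(f"carrier frequency in mixed cohort: {mixed_freq:.2f}")
print(f"approx. power, 200 homogeneous cases: {power_clean:.2f}")
print(f"approx. power, 2,000 mixed cases:     {power_mixed:.2f}")
```

Under these particular assumptions the smaller, homogeneous cohort comes out ahead, although neither design is well powered for a variant of this size. With a larger subgroup share or a different effect size the comparison could reverse, so the calculation supports the direction of the argument rather than any specific number.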

The Prediction

Within a decade of sufficient data collection, at least three currently unrecognized conditions will be computationally identified within the population currently labeled hEDS or HSD. They will have distinct symptom trajectories, distinct comorbidity patterns, and eventually distinct genetic signatures. The "hEDS is probably multiple diseases" hypothesis, which clinicians and researchers have discussed informally for years, will be confirmed by data.

When that happens, the clinical management of each newly identified subgroup will improve because the population is now homogeneous enough to study. A clinical trial that enrolls all of "hEDS" is testing a treatment on a mixed population; the signal is diluted by the diversity. A clinical trial that enrolls Subgroup A (characterized by a specific genetic variant, a specific symptom trajectory, and a specific comorbidity pattern) is testing a treatment on a biologically coherent group. The results are interpretable. The treatment can be optimized.

The criteria wars, whether a person is "really" hEDS or "just" HSD, become irrelevant. The question changes from "which bucket do you fit in?" to "what is your biological signal, and what does the data say works for people with your signal?"

Beyond EDS

The same logic applies across rare disease.

The adults with PKU who are experiencing early cognitive decline despite adequate blood phenylalanine control may share a biological subgroup with adults who have galactosemia and similar cognitive trajectories despite different dietary treatments. The underlying mechanism may involve a shared metabolic pathway, a shared vulnerability of specific brain regions to metabolic stress, or a shared genetic modifier that neither community would detect in a single-disease dataset.

The children with different organic acidemias who all respond unexpectedly to the same off-label medication may share a metabolic pathway feature that is invisible when each organic acidemia is studied in isolation. The signal appears only when data from multiple conditions occupies the same analytical space.

The Nurses' Health Study generated findings that no single-disease study would have produced. The association between hormone replacement therapy and cardiovascular risk emerged because the cohort data supported analyses across conditions. The rare disease equivalent requires the same multi-condition architecture: structured longitudinal data from people with different diagnoses, in the same infrastructure, using compatible formats, over years.

The End of the Odyssey

The diagnostic odyssey, the 10 to 22 years that a person with hEDS commonly spends seeking a diagnosis, is a product of the current categorical system. Each specialist looks for conditions within their category. Rheumatology looks for inflammatory joint disease. Gastroenterology looks for IBS. Cardiology looks for cardiac arrhythmias. Psychiatry looks for anxiety. The multi-system pattern is invisible because no category contains it.

Genomic newborn screening, if it becomes standard, could eliminate the diagnostic odyssey for conditions with identified genes. A child born in 2035 whose genome is sequenced at birth could be diagnosed with vascular EDS, classical EDS, or any other genetically defined connective tissue disorder within weeks, not decades.

But hEDS, the most common form, has no identified gene. Genomic screening cannot find what genomics has not yet identified. The diagnostic odyssey for hEDS ends only when the gene or genes are found. The genes are found only when the phenotypic data is clean enough to enable their detection. The phenotypic data gets clean only when enough people contribute structured longitudinal observations to a shared infrastructure.

The diagnostic categories dissolve from both ends simultaneously. From the genomic end, genetic sequencing identifies conditions before symptoms appear. From the phenotypic end, structured data clustering identifies subgroups that the current categories obscure. The categories that remain are the ones that biology confirms: coherent groups of people who share a genetic cause, a disease mechanism, and a response to treatment. Everything else is a historical artifact of a system that did not have the data to see clearly.