Accelerating innovation

Cross-condition signal detection

The signals that hide in single-condition silos: shared treatment response, shared biomarker trajectory, shared environmental trigger. What a Nurses' Health Study for rare disease would catch.

The rare disease research infrastructure is organized by condition. The phenylketonuria (PKU) registry studies PKU patients. The propionic acidemia (PA) working group studies PA patients. The hypermobile Ehlers-Danlos syndrome (hEDS) researchers study hEDS patients. Each condition-specific dataset answers questions about that condition. The boundaries between datasets are administrative and historical, not biological.

The biology rarely stays inside those boundaries. Many rare diseases share metabolic pathways, regulatory networks, and clinical features. The medication that helps one condition often helps another that shares a step in the affected pathway. The environmental trigger that causes a metabolic crisis in one disorder often causes the same crisis in adjacent disorders. The biomarker that tracks progression in one disease often tracks progression in related ones. These shared features are invisible to research that operates within single-condition silos.

The signals that hide in single-condition data

Three categories of signal go undetected when data is fragmented by condition.

The first is shared treatment response. If patients with three different organic acidemias all show unexpected improvement on the same off-label medication, the signal is detectable only if the data from all three conditions is queryable together. Each individual condition's dataset would show a small, possibly noise-level improvement. The combined dataset can show a consistent effect that reaches statistical significance, because pooling conditions that share the effect increases the sample size and narrows the uncertainty around the estimate, even though the conditions are diagnostically distinct.
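The pooling argument can be made concrete with a minimal sketch. It assumes each condition's dataset yields an effect estimate and a standard error, and combines them with a fixed-effect inverse-variance weighting, which is one common pooling method and not necessarily the one a real analysis would choose. The condition names and numbers are hypothetical, invented only to illustrate the arithmetic.

```python
import math

# Hypothetical per-condition results: (effect estimate, standard error).
# These numbers are illustrative only, not real data.
per_condition = {
    "organic_acidemia_a": (0.40, 0.25),
    "organic_acidemia_b": (0.35, 0.22),
    "organic_acidemia_c": (0.45, 0.28),
}

CRITICAL_Z = 1.96  # two-sided 5% threshold under a normal approximation


def z_score(effect, se):
    """Standardized effect: how many standard errors from zero."""
    return effect / se


def pooled_fixed_effect(estimates):
    """Inverse-variance weighted pooled effect and its standard error.

    Assumes the conditions share one true effect (fixed-effect model).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * e for w, (e, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


single_z = {name: z_score(e, se) for name, (e, se) in per_condition.items()}
pooled_effect, pooled_se = pooled_fixed_effect(list(per_condition.values()))
pooled_z = z_score(pooled_effect, pooled_se)

for name, z in single_z.items():
    print(f"{name}: z = {z:.2f}, significant = {abs(z) > CRITICAL_Z}")
print(f"pooled: z = {pooled_z:.2f}, significant = {abs(pooled_z) > CRITICAL_Z}")
```

With these invented inputs, no single condition crosses the threshold on its own, while the pooled estimate does. The fixed-effect assumption is the strong one: if the conditions do not actually share the effect, a random-effects model or a test of heterogeneity would be the more cautious choice.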

The second is shared biomarker trajectory. The inflammatory marker that rises in hEDS patients with autonomic features may also rise, for the same biological reason, in patients with a specific disorder on the newborn screening panel. Detecting the shared trajectory requires data from both populations measured with comparable assays at comparable intervals. The biological inference, that the two conditions share a pathway despite their diagnostic distance, is the kind of finding that condition-specific research cannot generate.

The third is shared environmental trigger. Suppose metabolic crisis rates in three different organic acidemias spike in the same geographic region in the same month. A single-condition dataset interprets its own spike as random variation. The cross-condition view sees three spikes coinciding and points to an environmental trigger, perhaps a viral strain, a temperature pattern, or a food supply change, that affects all three conditions through their common downstream physiology. The hypothesis can be tested with subsequent data; the hypothesis itself is generated by the cross-condition view.
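A rough sketch of how such a coincidence might be surfaced: flag the months in which each condition's crisis count is unusually high for that condition, then keep only the months flagged in several conditions at once. The monthly counts, the condition labels, and the z-score cutoff are all assumptions made for illustration; a real analysis would likely use a proper count model and a baseline that excludes the month under test.

```python
from statistics import mean, pstdev

# Hypothetical monthly crisis counts for one region, twelve months each.
# Month index 0 is the first month of the window. Not real data.
region_counts = {
    "organic_acidemia_a": [2, 1, 2, 3, 2, 1, 2, 5, 2, 1, 3, 2],
    "organic_acidemia_b": [1, 0, 1, 1, 2, 1, 0, 4, 1, 1, 0, 1],
    "organic_acidemia_c": [3, 2, 3, 2, 4, 3, 2, 7, 3, 2, 3, 3],
}


def flagged_months(counts, z_threshold=2.0):
    """Month indices whose count exceeds the series mean by z_threshold SDs."""
    mu = mean(counts)
    sd = pstdev(counts)
    if sd == 0:
        return set()
    return {m for m, c in enumerate(counts) if (c - mu) / sd > z_threshold}


def shared_spikes(counts_by_condition, z_threshold=2.0, min_conditions=2):
    """Months flagged in at least min_conditions conditions, with their names."""
    tally = {}
    for condition, counts in counts_by_condition.items():
        for month in flagged_months(counts, z_threshold):
            tally.setdefault(month, []).append(condition)
    return {
        month: sorted(conditions)
        for month, conditions in sorted(tally.items())
        if len(conditions) >= min_conditions
    }


spikes = shared_spikes(region_counts, min_conditions=3)
for month, conditions in spikes.items():
    print(f"month index {month}: elevated in {', '.join(conditions)}")
```

The output is a candidate for follow-up, not a finding. The co-occurrence across conditions is what turns three unremarkable single-condition blips into a hypothesis worth testing.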

What the Nurses' Health Study did with this approach

The Nurses' Health Study did not generate findings about one disease. It generated findings about the relationships between hormone replacement therapy and cardiovascular risk; between dietary fat and breast cancer; between physical activity and diabetes, depression, and mortality; and between vitamin D and a long list of conditions.

These findings emerged because the cohort was characterized broadly and the data was queryable across conditions. A study that started with one disease in mind, even with the same number of participants and the same length of follow-up, could not have produced the same findings, because the cross-condition queries that produced the most important results would not have been part of the original protocol.

The rare disease parallel requires a dataset structured for cross-condition queries from the outset. The structuring choices include common terminology for symptoms (Human Phenotype Ontology), common units and reference ranges for laboratory values, common definitions for clinical events (metabolic crisis, hospitalization, surgery), and common metadata for medications (RxNorm) and outcomes. A dataset built without these structuring choices captures information richly but cannot be queried across conditions. A dataset built with them is computationally usable in ways that condition-specific datasets are not.
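To illustrate what "structured for cross-condition queries" might look like in practice, here is a minimal sketch of a record type that stores symptoms as Human Phenotype Ontology identifiers and medications as RxNorm identifiers, plus one query that cuts across diagnoses. The class name, field names, patient IDs, and code strings are all hypothetical; the identifiers are placeholders and should not be read as verified code assignments.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """One patient's coded data under a shared vocabulary (hypothetical schema)."""
    patient_id: str
    condition: str            # diagnostic label, e.g. "PKU", "PA", "hEDS"
    hpo_terms: frozenset      # symptom codes in HPO "HP:..." form
    rxnorm_codes: frozenset   # medication codes; placeholder strings here


def patients_by_condition(records, hpo_term):
    """Group patients reporting hpo_term by their diagnostic label."""
    grouped = defaultdict(list)
    for record in records:
        if hpo_term in record.hpo_terms:
            grouped[record.condition].append(record.patient_id)
    return dict(grouped)


# Illustrative records only. Codes are placeholders, not clinical data.
TERM_X = "HP:PLACEHOLDER-X"
TERM_Y = "HP:PLACEHOLDER-Y"

records = [
    PatientRecord("p1", "PKU", frozenset({TERM_X}), frozenset({"RXCUI:PLACEHOLDER-1"})),
    PatientRecord("p2", "PA", frozenset({TERM_X, TERM_Y}), frozenset()),
    PatientRecord("p3", "hEDS", frozenset({TERM_Y}), frozenset()),
]

shared = patients_by_condition(records, TERM_X)
print(shared)
```

Because every record uses the same symptom vocabulary, the query does not need to know which registry a patient came from. The same lookup against free-text symptom notes from separate registries would require harmonization before it could run at all.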

What the rare disease space has not had

There has not been a Nurses' Health Study for rare disease. The closest analogues are condition-specific registries (the PKU Registry, the SMA Care Registry, the BioNews registries for several conditions) and pharma-sponsored natural history studies (Marsi for SMA, the Pompe Registry, the X-ALD Disease Registry). Each captures data within its scope. None is designed for cross-condition query.

The infrastructure cost of building a cross-condition longitudinal dataset is paid once. The questions the dataset can answer compound over time. The first ten years of the dataset support analyses that are within the scope of any well-funded condition-specific registry. The next ten years support analyses that no single-condition registry can produce because the cross-condition signal requires the multi-condition data structure to be detectable at all.

The lived experience makes the case as strongly as the research argument does. The person with hEDS who also has POTS, MCAS, and chronic fatigue is one person. The family with two organic acidemias is one family. The fragmentation of the data into condition-specific silos makes the lived experience invisible to the research infrastructure that purports to study it. Building a dataset that holds the whole person, across the whole presentation, is the engineering problem the research community has not solved because the incentives at the institutional level point away from solving it.

The community incentive points the other way. The data trust governed by the affected community can hold cross-condition data because the governance is on the community side of the institutional boundary. The trust does not have to choose between the PKU registry and the PA registry. It holds both. It supports cross-condition queries by design. The signals that have been hiding in condition-specific silos become detectable when the silos come down.