Methodology
deltatrials monitors ClinicalTrials.gov daily and records every change to every trial. This page explains how we do it.
Data Source
Our data comes exclusively from the ClinicalTrials.gov API, maintained by the U.S. National Library of Medicine. Each daily sync fetches all trials modified since the previous run.
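As a rough illustration of an incremental sync, the sketch below pages through results for trials updated since a given date. The parameter names (`query.term`, `pageSize`, `pageToken`), the `LastUpdatePostDate` filter, and the `studies` / `nextPageToken` response keys are assumptions about the public ClinicalTrials.gov v2 API and should be checked against its documentation. `fetch_page` is a hypothetical stand-in for a real HTTP call, which lets the example run offline.

```python
from datetime import date


def build_params(since, page_token=None, page_size=1000):
    # Assumed parameter names; verify against the API documentation before use.
    params = {
        "query.term": f"AREA[LastUpdatePostDate]RANGE[{since.isoformat()},MAX]",
        "pageSize": page_size,
    }
    if page_token:
        params["pageToken"] = page_token
    return params


def fetch_modified_since(since, fetch_page):
    """Yield every study returned by fetch_page, following page tokens until none remain."""
    token = None
    while True:
        page = fetch_page(build_params(since, token))
        yield from page.get("studies", [])
        token = page.get("nextPageToken")
        if not token:
            break


def fake_fetch_page(params):
    # Hypothetical canned responses standing in for the network.
    if "pageToken" not in params:
        return {"studies": [{"id": "A"}, {"id": "B"}], "nextPageToken": "t2"}
    return {"studies": [{"id": "C"}]}


studies = list(fetch_modified_since(date(2024, 6, 1), fake_fetch_page))
```

Injecting the page fetcher keeps the paging logic testable without network access; a real run would swap in a function that issues the HTTP request.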
We track a broad set of fields including overall status, enrollment count, start and completion dates, sponsor information, study phase, conditions, interventions, eligibility criteria, contact details, and facility locations.
Location display names are enriched with GeoNames data (cities1000 + countryInfo dumps), licensed under CC BY 4.0.
SCD2 Versioning
The core of our approach is SCD2 (Slowly Changing Dimension Type 2). In plain language: we never overwrite old records. Instead, we keep every version of every trial, each with the date range during which it was valid.
When a trial's status changes from Recruiting to Completed, we mark the old record as closed (with a valid_to date) and create a new record (with a fresh valid_from date).
The result is a complete, timestamped audit trail of every change a trial has ever undergone.
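The close-and-create step described above can be sketched as follows. This is a minimal illustration and not our production code: the `TrialVersion` class, the `apply_status_change` function, and the placeholder trial ID are hypothetical, and only `valid_from` and `valid_to` come from the description above.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TrialVersion:
    nct_id: str
    overall_status: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current (open) version


def apply_status_change(history, nct_id, new_status, changed_on):
    """Return a new history with the open version closed and a new version appended."""
    updated = []
    for version in history:
        if version.nct_id == nct_id and version.valid_to is None:
            if version.overall_status == new_status:
                return list(history)  # nothing changed, keep the history as-is
            # Close the old record instead of overwriting it.
            version = replace(version, valid_to=changed_on)
        updated.append(version)
    # Open a new record that is valid from the change date onward.
    updated.append(TrialVersion(nct_id, new_status, valid_from=changed_on))
    return updated


history = [TrialVersion("NCT00000000", "Recruiting", date(2023, 1, 10))]
history = apply_status_change(history, "NCT00000000", "Completed", date(2024, 6, 1))
```

After the call, the history holds two rows: the Recruiting version closed on 2024-06-01 and an open Completed version starting the same day.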
This approach lets us answer questions that a current-snapshot database cannot: When did this trial stop recruiting? How many times has the enrollment count changed? What was the status two years ago?
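To show how versioned rows answer these questions, here is a small sketch over a made-up history for a single trial. The data, the function names, and the half-open interval convention (a version is valid from `valid_from` up to, but not including, `valid_to`) are assumptions for illustration.

```python
from datetime import date

# Hypothetical version history for one trial: (status, valid_from, valid_to).
# valid_to is None for the current version.
versions = [
    ("Not yet recruiting", date(2021, 3, 1), date(2021, 9, 15)),
    ("Recruiting", date(2021, 9, 15), date(2023, 11, 2)),
    ("Completed", date(2023, 11, 2), None),
]


def status_as_of(versions, when):
    """Return the status that was valid on the given date, or None if the trial was not yet tracked."""
    for status, valid_from, valid_to in versions:
        if valid_from <= when and (valid_to is None or when < valid_to):
            return status
    return None


def stopped_recruiting_on(versions):
    """Return the date the first Recruiting version was closed, or None if that has not happened."""
    for status, _valid_from, valid_to in versions:
        if status == "Recruiting" and valid_to is not None:
            return valid_to
    return None


def change_count(versions):
    """Return the number of recorded changes: one fewer than the number of versions."""
    return max(len(versions) - 1, 0)
```

For example, `status_as_of(versions, date(2022, 1, 1))` looks up the version whose validity range contains that day, which a database holding only the latest snapshot could not do.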
Update Frequency
Our automated pipeline runs daily. Each run:
- Fetches all trials modified on ClinicalTrials.gov since the last sync
- Compares the incoming data against our stored records field by field
- Creates new SCD2 versions for any trial where data has changed
- Writes a freshness artifact so the site can display the last-updated date
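The compare and freshness steps above might look roughly like this. The field names, the `diff_fields` and `freshness_artifact` helpers, and the JSON layout of the artifact are all hypothetical; the description above does not specify them.

```python
import json
from datetime import datetime, timezone

# Hypothetical subset of tracked fields.
TRACKED_FIELDS = ("overall_status", "enrollment_count", "completion_date")


def diff_fields(stored, incoming, fields=TRACKED_FIELDS):
    """Return {field: (old, new)} for every tracked field whose value differs."""
    return {
        name: (stored.get(name), incoming.get(name))
        for name in fields
        if stored.get(name) != incoming.get(name)
    }


def freshness_artifact(run_time, changed_count):
    """Serialize a small JSON document the site could read to show the last-updated date."""
    return json.dumps(
        {"last_updated": run_time.date().isoformat(), "trials_changed": changed_count}
    )


stored = {"overall_status": "Recruiting", "enrollment_count": 120, "completion_date": "2025-01"}
incoming = {"overall_status": "Completed", "enrollment_count": 118, "completion_date": "2025-01"}

changes = diff_fields(stored, incoming)
# A non-empty diff is the signal to create a new SCD2 version for the trial.
needs_new_version = bool(changes)
artifact = freshness_artifact(datetime(2024, 6, 1, tzinfo=timezone.utc), int(needs_new_version))
```

Comparing only a fixed list of tracked fields means changes to untracked fields do not create new versions.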
Data Quality
A few important caveats about our data:
- Accuracy depends on submissions. Data quality is only as good as what sponsors report to ClinicalTrials.gov. We do not independently verify trial data.
- Registry lag. Real-world events (enrollment closure, trial termination) may not be reflected in the registry immediately. There can be days or weeks between an event and its appearance in our data.
- As-reported. We preserve data exactly as submitted to ClinicalTrials.gov without editorial changes or corrections.
- Trial status can change. A trial's eligibility criteria, status, and contact information can change at any time. Always verify directly with the trial team.
Pipeline Architecture
Our data pipeline is automated and built for reliability at scale:
- Python pipeline — fetches and diffs trial data from the ClinicalTrials.gov API daily
- DuckDB — high-performance in-process analytics database used for data transformation and SCD2 computation
- Cloudflare R2 — object storage for the processed dataset, enabling low-cost global distribution
- Cloudflare Workers — edge-served API layer that queries data with sub-100ms latency worldwide