Methodology
deltatrials monitors ClinicalTrials.gov daily and records every change to every trial. This page explains how we do it.
Data Source
Our data comes exclusively from the ClinicalTrials.gov API, maintained by the U.S. National Library of Medicine. Each daily sync fetches all trials modified since the previous run.
We track a broad set of fields including overall status, enrollment count, start and completion dates, sponsor information, study phase, conditions, interventions, eligibility criteria, contact details, and facility locations.
Location display names are enriched with GeoNames data (cities1000 + countryInfo dumps), licensed under CC BY 4.0.
SCD2 Versioning
The core of our approach is SCD2 — Slowly Changing Dimension Type 2. In plain language: we never overwrite old records. Instead, we keep every version of every trial, each with the dates it was valid.
When a trial's status changes from Recruiting to Completed, we mark the old record as closed (with a valid_to date) and create a new record (with a fresh valid_from date).
The result is a complete, timestamped audit trail of every change a trial has ever undergone.
This approach lets us answer questions that a current-snapshot database cannot: When did this trial stop recruiting? How many times has the enrollment count changed? What was the status two years ago?
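As a rough illustration, the close-and-open step and a point-in-time lookup might be sketched as follows. This is not our production code: the TrialVersion class, the apply_change and status_on helpers, the placeholder trial ID, and the choice of half-open validity intervals (valid_from inclusive, valid_to exclusive) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class TrialVersion:
    # Hypothetical record shape; only valid_from / valid_to come from the text above.
    nct_id: str
    overall_status: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the currently open version


def apply_change(history: List[TrialVersion], nct_id: str,
                 new_status: str, as_of: date) -> List[TrialVersion]:
    """Close the open version (if any) and append a new one when the status differs."""
    current = next(
        (v for v in history if v.nct_id == nct_id and v.valid_to is None), None
    )
    if current is not None:
        if current.overall_status == new_status:
            return history  # unchanged, so no new version is written
        current.valid_to = as_of  # close the old record; it is never overwritten
    history.append(TrialVersion(nct_id, new_status, valid_from=as_of))
    return history


def status_on(history: List[TrialVersion], nct_id: str, day: date) -> Optional[str]:
    """Return the status valid on `day`, assuming [valid_from, valid_to) intervals."""
    for v in history:
        if v.nct_id != nct_id:
            continue
        if v.valid_from <= day and (v.valid_to is None or day < v.valid_to):
            return v.overall_status
    return None


history: List[TrialVersion] = []
apply_change(history, "NCT00000000", "Recruiting", date(2022, 1, 10))
apply_change(history, "NCT00000000", "Completed", date(2024, 3, 5))
```

With this history, status_on(history, "NCT00000000", date(2023, 1, 1)) returns "Recruiting", which is the kind of "what was the status back then" question a current-snapshot table cannot answer.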
Update Frequency
Our automated pipeline runs daily. Each run:
- Fetches all trials modified on ClinicalTrials.gov since the last sync
- Compares the incoming data against our stored records field-by-field
- Creates new SCD2 versions for any trial where data has changed
- Writes a freshness artifact so the site can display the last-updated date
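The field-by-field comparison in the second step can be pictured with a small sketch. The field names in TRACKED_FIELDS and the diff_fields helper are hypothetical stand-ins; the real pipeline tracks more fields and may compare them differently.

```python
from typing import Any, Dict, Iterable, Tuple

# Hypothetical subset of tracked fields, for illustration only.
TRACKED_FIELDS = ("overall_status", "enrollment_count", "completion_date")


def diff_fields(stored: Dict[str, Any], incoming: Dict[str, Any],
                fields: Iterable[str] = TRACKED_FIELDS) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old, new)} for every tracked field whose value differs."""
    changes = {}
    for field in fields:
        old, new = stored.get(field), incoming.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes


stored = {"overall_status": "Recruiting", "enrollment_count": 120,
          "completion_date": "2025-06"}
incoming = {"overall_status": "Completed", "enrollment_count": 120,
            "completion_date": "2025-06"}
changes = diff_fields(stored, incoming)
```

An empty result means the trial was touched on ClinicalTrials.gov but none of the tracked fields moved, so no new SCD2 version is needed.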
Data Quality
A few important caveats about our data:
- Accuracy depends on submissions. Data quality is only as good as what sponsors report to ClinicalTrials.gov. We do not independently verify trial data.
- Registry lag. Real-world events (enrollment closure, trial termination) may not be reflected in the registry immediately. There can be days or weeks between an event and its appearance in our data.
- As-reported. We preserve data exactly as submitted to ClinicalTrials.gov without editorial changes or corrections.
- Trial status can change. A trial's eligibility criteria, status, and contact information can change at any time. Always verify directly with the trial team.
Pipeline Architecture
Our data pipeline is automated and built for reliability at scale:
- Python pipeline — fetches and diffs trial data from the ClinicalTrials.gov API daily
- DuckDB — high-performance in-process analytics database used for data transformation and SCD2 computation
- Cloudflare R2 — object storage for the processed dataset, enabling low-cost global distribution
- Cloudflare Workers — edge-served API layer that queries data with sub-100ms latency worldwide
Sponsor Identity
Sponsor pages (such as /sponsor/pfizer) aggregate trials by lead sponsor. Deciding which distinct ClinicalTrials.gov name strings correspond to the same organization is not trivial — a single sponsor may appear as “Pfizer”, “Pfizer Inc.”, “Pfizer Inc”, and several other variants in the public registry.
Our approach is deliberately simple:
- Strip common corporate suffixes (Inc, LLC, Ltd, Corp, GmbH, AG, SA, plc, NV, AB, Oy, Pty, Holdings, Group, and the trailing commas or periods that accompany them) before slugifying.
- Apply a curated override list for the top-50 sponsors by trial count, which handles variations that suffix stripping alone cannot (for example, multiple subsidiary descriptions and NIH institute acronym conventions).
- Generate a sponsor hub page only when a sponsor has ten or more lead-sponsored trials. Below that threshold, the trial page still shows the sponsor name as plain text — there is no link to a thin hub page.
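The three rules above could be sketched roughly like this. The suffix list and the ten-trial threshold come from the text; the regular expression, the case-insensitive matching, the repeated stripping, the function names, and the single OVERRIDES entry are assumptions for illustration (the real override list is curated by hand and is not reproduced here).

```python
import re

SUFFIXES = ("Inc", "LLC", "Ltd", "Corp", "GmbH", "AG", "SA", "plc",
            "NV", "AB", "Oy", "Pty", "Holdings", "Group")

# A suffix must follow whitespace or a comma and may carry a trailing period or comma.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")\.?,?\s*$",
    re.IGNORECASE,  # assumption: treat "PLC" and "plc" alike
)

# Hypothetical override entry; maps a post-stripping slug to its canonical slug.
OVERRIDES = {"example-pharma-worldwide": "example-pharma"}

MIN_TRIALS_FOR_HUB = 10


def sponsor_slug(name: str) -> str:
    """Strip corporate suffixes, slugify, then apply any curated override."""
    stripped = name.strip()
    while True:  # repeat so "Foo Holdings, Inc." loses both suffixes
        shorter = _SUFFIX_RE.sub("", stripped)
        if shorter == stripped or not shorter:
            break
        stripped = shorter
    slug = re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")
    return OVERRIDES.get(slug, slug)


def has_hub_page(lead_trial_count: int) -> bool:
    """A hub page exists only at ten or more lead-sponsored trials."""
    return lead_trial_count >= MIN_TRIALS_FOR_HUB
```

Under these assumptions, "Pfizer", "Pfizer Inc." and "Pfizer Inc" all slugify to "pfizer", which is what lets them share one sponsor page.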
Known limitations
Lead sponsor heuristic. The ClinicalTrials.gov export we use lists all sponsors (lead and collaborator) as a single concatenated list, and our current pipeline picks the first entry as the lead. For trials where the lead_or_collaborator flag would place the lead elsewhere in the list (approximately five percent of records), this can surface a collaborator as if it were the lead. We plan to switch to that canonical field in a future pipeline rebuild.
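To make the difference concrete, here is a sketch contrasting the current positional heuristic with a flag-based lookup. The list-of-dicts shape, the "lead" flag value, the sponsor names, and both function names are assumptions for the example; the export format our pipeline actually reads differs.

```python
from typing import Dict, List, Optional


def lead_by_position(sponsors: List[Dict[str, str]]) -> Optional[str]:
    """Current heuristic: treat the first listed sponsor as the lead."""
    return sponsors[0]["name"] if sponsors else None


def lead_by_flag(sponsors: List[Dict[str, str]]) -> Optional[str]:
    """Planned behaviour (sketch): trust the lead_or_collaborator flag when present."""
    for sponsor in sponsors:
        if sponsor.get("lead_or_collaborator") == "lead":
            return sponsor["name"]
    return lead_by_position(sponsors)  # fall back if no entry is flagged


# Hypothetical record where the collaborator happens to be listed first.
sponsors = [
    {"name": "Example University", "lead_or_collaborator": "collaborator"},
    {"name": "Example Pharma", "lead_or_collaborator": "lead"},
]
```

For this record the positional heuristic returns "Example University" while the flag-based lookup returns "Example Pharma", which is the mismatch described above.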
No acquisition rollups. We do not roll up acquired companies into their parent (Genentech remains distinct from Roche; Celgene remains distinct from Bristol-Myers Squibb; Allergan remains distinct from AbbVie). This preserves the historical record of which entity registered each trial, but means related portfolios appear as separate sponsor pages. Combined acquisition rollups may be added in a future phase.
Termination Reason Categories
When a trial’s status is Terminated, Withdrawn, or Suspended, ClinicalTrials.gov typically includes a free-text explanation. We classify each explanation into one of eight categories so sponsor pages can show an aggregate breakdown (for example, “41% of terminations cite recruitment issues”). The categories are:
- Recruitment — slow or insufficient enrollment, accrual failure.
- Efficacy — interim futility findings, primary endpoint miss.
- Safety — adverse events, toxicity signals, serious adverse events.
- Business — sponsor decisions, mergers, acquisitions, strategic reprioritization.
- Funding — grant expiration, budget cuts, financial constraints.
- Regulatory — FDA clinical holds, IRB terminations.
- PI or Site — principal investigator departures, site closures.
- Other — all unclassified or empty explanations (around 25% of terminations).
Classification is rule-based: we look for specific keywords in a fixed priority order (Safety > Efficacy > Regulatory > Business > Funding > PI or Site > Recruitment > Other). When keywords from several categories match, the highest-priority category wins, so a trial whose explanation mentions both a safety signal and an FDA action is recorded as Safety, not Regulatory. We do not use a large language model for this step; the classifier is a short deterministic function in our pipeline and produces the same output for the same input every time.
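A minimal sketch of such a priority-ordered classifier, plus the sponsor-level share calculation, is shown below. Only the category names and their order come from the text; the keyword lists are illustrative guesses rather than our actual rules, and plain substring matching is cruder than what a production classifier would need.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Order encodes priority: the first category with a matching keyword wins.
# Keywords are hypothetical examples, not the pipeline's real lists.
PRIORITY = [
    ("Safety", ("adverse event", "toxicity", "safety")),
    ("Efficacy", ("futility", "lack of efficacy", "endpoint")),
    ("Regulatory", ("fda", "clinical hold", "irb")),
    ("Business", ("business decision", "sponsor decision", "merger",
                  "acquisition", "strategic")),
    ("Funding", ("funding", "grant", "budget", "financial")),
    ("PI or Site", ("investigator", "site clos")),
    ("Recruitment", ("enrollment", "enrolment", "accrual", "recruit")),
]


def classify_termination(why_stopped: Optional[str]) -> str:
    """Deterministically map a free-text explanation to one category."""
    text = (why_stopped or "").lower()
    for category, keywords in PRIORITY:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"  # empty or unmatched explanations


def category_shares(explanations: Iterable[Optional[str]]) -> Dict[str, float]:
    """Fraction of explanations per category, for a sponsor-level breakdown."""
    counts = Counter(classify_termination(e) for e in explanations)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}
```

Because the loop returns on the first match, an explanation such as "Safety signal led to FDA clinical hold" is classified as Safety even though it also contains Regulatory keywords.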
Where categories appear. We show the category breakdown only at the sponsor-aggregate level (the “Termination reasons” line on a sponsor page) and we use the category internally to group cross-trial peer links on terminated trial pages. We do not show a category label next to an individual trial. Classifying a specific trial’s termination carries risk we are not willing to take with data that affects patient decisions — so we keep the raw ClinicalTrials.gov quote on the trial page verbatim.