deltatrials

Methodology

deltatrials monitors ClinicalTrials.gov daily and records every change to every trial. This page explains how we do it.

Data Source

Our data comes exclusively from the ClinicalTrials.gov API, maintained by the U.S. National Library of Medicine. Each daily sync fetches all trials modified since the previous run.
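As a rough illustration of an incremental fetch, the sketch below builds a request URL for trials updated since the last sync. It assumes the v2 REST endpoint and its filter.advanced, pageSize, and pageToken parameters; the text above does not say which API version or query the pipeline actually uses, and build_sync_url is a hypothetical helper name.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Assumption: the v2 studies endpoint. The methodology only says
# "the ClinicalTrials.gov API", so treat this URL as illustrative.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_sync_url(last_sync: date, page_size: int = 100,
                   page_token: Optional[str] = None) -> str:
    """Build a URL asking for studies whose last update was posted on or after last_sync."""
    params = {
        # Range expression on the last-update date, open-ended at the top.
        "filter.advanced": f"AREA[LastUpdatePostDate]RANGE[{last_sync.isoformat()},MAX]",
        "pageSize": page_size,
    }
    if page_token:
        # Token returned by the previous page, used to continue the listing.
        params["pageToken"] = page_token
    return f"{BASE_URL}?{urlencode(params)}"


url = build_sync_url(date(2024, 1, 1))
```

A real sync would request this URL, follow the page token until it is absent, and hand each returned study to the comparison step.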

We track a broad set of fields including overall status, enrollment count, start and completion dates, sponsor information, study phase, conditions, interventions, eligibility criteria, contact details, and facility locations.

Location display names are enriched with GeoNames data (cities1000 + countryInfo dumps), licensed under CC BY 4.0.

SCD2 Versioning

The core of our approach is SCD2 — Slowly Changing Dimension Type 2. In plain language: we never overwrite old records. Instead, we keep every version of every trial, each with the dates it was valid.

When a trial's status changes from Recruiting to Completed, we mark the old record as closed (with a valid_to date) and create a new record (with a fresh valid_from date). The result is a complete, timestamped audit trail of every change a trial has ever undergone.
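The close-and-append step can be sketched in a few lines. This is a minimal illustration, not our production code: the TrialVersion class, its field names other than valid_from and valid_to, and the apply_status_change helper are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TrialVersion:
    """One SCD2 row (hypothetical shape; only one tracked field is shown)."""
    nct_id: str
    overall_status: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current, still-open version


def apply_status_change(history: List[TrialVersion], new_status: str,
                        change_date: date) -> List[TrialVersion]:
    """Close the open version and append a new one; no row is overwritten."""
    current = history[-1]
    if current.valid_to is not None:
        raise ValueError("latest version is already closed")
    if current.overall_status == new_status:
        return history  # nothing changed, so no new version is created
    closed = replace(current, valid_to=change_date)
    opened = TrialVersion(current.nct_id, new_status, valid_from=change_date)
    return history[:-1] + [closed, opened]


history = [TrialVersion("NCT00000000", "Recruiting", date(2022, 1, 10))]
updated = apply_status_change(history, "Completed", date(2024, 3, 1))
```

After the call, updated holds two rows: the Recruiting version closed on 2024-03-01 and an open Completed version starting the same day.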

This approach lets us answer questions that a current-snapshot database cannot: When did this trial stop recruiting? How many times has the enrollment count changed? What was the status two years ago?
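To make those three questions concrete, here is a small sketch over hypothetical version rows for a single trial. The row layout, the sample values, and the half-open interval convention (valid_from inclusive, valid_to exclusive) are assumptions made for the example.

```python
from datetime import date

# Hypothetical SCD2 rows for one trial, oldest first. valid_to=None marks the
# current version; intervals are assumed half-open: [valid_from, valid_to).
VERSIONS = [
    {"overall_status": "Recruiting", "enrollment": 120,
     "valid_from": date(2021, 5, 1), "valid_to": date(2022, 9, 15)},
    {"overall_status": "Recruiting", "enrollment": 90,
     "valid_from": date(2022, 9, 15), "valid_to": date(2024, 2, 1)},
    {"overall_status": "Completed", "enrollment": 90,
     "valid_from": date(2024, 2, 1), "valid_to": None},
]


def status_as_of(versions, day):
    """What was the status on a given day?"""
    for v in versions:
        if v["valid_from"] <= day and (v["valid_to"] is None or day < v["valid_to"]):
            return v["overall_status"]
    return None  # no version covers that day


def stopped_recruiting_on(versions):
    """When did the status first move away from Recruiting?"""
    for prev, curr in zip(versions, versions[1:]):
        if prev["overall_status"] == "Recruiting" and curr["overall_status"] != "Recruiting":
            return curr["valid_from"]
    return None


def enrollment_change_count(versions):
    """How many times has the enrollment count changed between versions?"""
    values = [v["enrollment"] for v in versions]
    return sum(1 for a, b in zip(values, values[1:]) if a != b)
```

A current-snapshot table would hold only the last row, so none of these three functions could be answered from it.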

Update Frequency

Our automated pipeline runs daily. Each run:

  1. Fetches all trials modified on ClinicalTrials.gov since the last sync
  2. Compares the incoming data against our stored records field-by-field
  3. Creates new SCD2 versions for any trial where data has changed
  4. Writes a freshness artifact so the site can display the last-updated date
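Steps 2 to 4 can be sketched as follows. This is an illustrative outline only: the tracked field names, the run_sync and diff_fields helpers, and the shape of the freshness artifact are hypothetical, and the real pipeline compares many more fields.

```python
import json
from datetime import date

# Hypothetical subset of the tracked fields.
TRACKED_FIELDS = ("overall_status", "enrollment_count", "start_date", "completion_date")


def diff_fields(stored, incoming, fields=TRACKED_FIELDS):
    """Return {field: (old, new)} for every tracked field whose value differs."""
    return {
        f: (stored.get(f), incoming.get(f))
        for f in fields
        if stored.get(f) != incoming.get(f)
    }


def run_sync(stored_by_id, incoming_by_id, run_date):
    """Compare incoming trials to stored ones and describe the freshness artifact."""
    changed = {}
    for nct_id, incoming in incoming_by_id.items():
        delta = diff_fields(stored_by_id.get(nct_id, {}), incoming)
        if delta:
            changed[nct_id] = delta  # these trials would get a new SCD2 version
    artifact = json.dumps(
        {"last_updated": run_date.isoformat(), "trials_changed": len(changed)}
    )
    return changed, artifact


stored = {"NCT00000001": {"overall_status": "Recruiting", "enrollment_count": 100}}
incoming = {
    "NCT00000001": {"overall_status": "Completed", "enrollment_count": 100},
    "NCT00000002": {"overall_status": "Recruiting", "enrollment_count": 40},
}
changed, artifact = run_sync(stored, incoming, date(2024, 6, 1))
```

In this example the first trial reports one changed field, and the second is new, so every populated field counts as changed.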

Data Quality

A few important caveats about our data:

Pipeline Architecture

Our data pipeline is automated and built for reliability at scale:

Sponsor Identity

Sponsor pages (such as /sponsor/pfizer) aggregate trials by lead sponsor. Deciding which distinct ClinicalTrials.gov name strings correspond to the same organization is not trivial — a single sponsor may appear as “Pfizer”, “Pfizer Inc.”, “Pfizer Inc”, and several other variants in the public registry.

Our approach is deliberately simple:

Known limitations

Lead sponsor heuristic. The ClinicalTrials.gov export we use lists all sponsors, lead and collaborators alike, in a single concatenated field, and our current pipeline treats the first entry as the lead. For trials where the lead_or_collaborator flag places the lead elsewhere in the list (approximately five percent of records), a collaborator can surface as if it were the lead. We plan to switch to the canonical field in a future pipeline rebuild.
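The difference between the two strategies is easy to see on a single record. The sketch below uses a hypothetical list-of-dicts shape and assumes the flag takes the values "lead" and "collaborator"; the sponsor names are invented.

```python
def lead_by_position(sponsors):
    """Current heuristic: treat the first listed sponsor as the lead."""
    return sponsors[0]["name"] if sponsors else None


def lead_by_flag(sponsors):
    """Planned behaviour: trust the lead_or_collaborator flag instead of position."""
    for sponsor in sponsors:
        if sponsor["lead_or_collaborator"] == "lead":
            return sponsor["name"]
    return None


# A record where the lead is not listed first (values are illustrative).
sponsors = [
    {"name": "Example University", "lead_or_collaborator": "collaborator"},
    {"name": "Example Pharma", "lead_or_collaborator": "lead"},
]
```

Here lead_by_position returns the collaborator while lead_by_flag returns the actual lead, which is the mismatch described above.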

No acquisition rollups. We do not roll up acquired companies into their parent (Genentech remains distinct from Roche; Celgene remains distinct from Bristol-Myers Squibb; Allergan remains distinct from AbbVie). This preserves the historical record of which entity registered each trial, but means related portfolios appear as separate sponsor pages. Combined acquisition rollups may be added in a future phase.

Termination Reason Categories

When a trial’s status is Terminated, Withdrawn, or Suspended, ClinicalTrials.gov typically includes a free-text explanation. We classify each explanation into one of eight categories so sponsor pages can show an aggregate breakdown (for example, “41% of terminations cite recruitment issues”). The categories are Safety, Efficacy, Regulatory, Business, Funding, PI/Site, Recruitment, and Other.

Classification is rule-based: we look for specific keywords in a fixed priority order (Safety > Efficacy > Regulatory > Business > Funding > PI/Site > Recruitment > Other). When multiple keywords match, the higher-priority category wins — so a trial whose explanation mentions both a safety signal and an FDA action is recorded as Safety, not Regulatory. We do not use a large language model for this step; the classifier is a short deterministic function in our pipeline and produces the same output every time.
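A priority-ordered keyword classifier of this kind can be sketched as below. The category order follows the text; the keyword lists are hypothetical placeholders, not the ones our pipeline uses.

```python
# Categories in priority order. The keywords are illustrative assumptions.
RULES = [
    ("Safety", ("safety", "adverse", "toxicity")),
    ("Efficacy", ("efficacy", "futility", "lack of effect")),
    ("Regulatory", ("fda", "regulatory", "clinical hold")),
    ("Business", ("business", "strategic", "portfolio")),
    ("Funding", ("funding", "budget", "financial")),
    ("PI/Site", ("investigator", "site closed")),
    ("Recruitment", ("recruit", "enrollment", "accrual")),
]


def classify(reason: str) -> str:
    """Return the first category, in priority order, whose keywords match."""
    text = reason.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"  # nothing matched, including empty explanations
```

Because the rules are checked in a fixed order, classify("Safety signal observed; FDA placed study on hold") returns "Safety" even though an FDA keyword also matches, and the same input always yields the same category.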

Where categories appear. We show the category breakdown only at the sponsor-aggregate level (the “Termination reasons” line on a sponsor page), and we use the category internally to group cross-trial peer links on terminated trial pages. We do not show a category label next to an individual trial: a keyword classifier can mislabel a single explanation, and that is a risk we are not willing to take with data that affects patient decisions. The trial page instead shows the raw ClinicalTrials.gov quote verbatim.
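For illustration, a sponsor-level breakdown could be computed from per-trial categories as in the sketch below. The termination_breakdown helper and the sample data are hypothetical, and rounding to whole percentages is an assumption.

```python
from collections import Counter


def termination_breakdown(categories):
    """Return {category: whole-number percentage}, most common category first."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total) for cat, n in counts.most_common()}


# Categories for five terminated trials of one hypothetical sponsor.
sample = ["Recruitment", "Recruitment", "Safety", "Business", "Other"]
breakdown = termination_breakdown(sample)
```

Only this aggregate reaches the sponsor page; the per-trial labels that feed it stay internal.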