Comparison and Verification
Use comparison to answer two questions: did my result change, and if so, why?
devqubit compares two runs (or bundles) across: metadata, parameters, metrics, program artifacts (exact + structural matching), results with noise-aware context, and device calibration drift.
Comparing Runs
from devqubit.compare import diff
# By run name (recommended)
result = diff("baseline-v1", "candidate-v2", project="bell-state")
# Or by run ID
result = diff("01KDNZYSNPZFVPZG94DATP1DT6", "01KDNZZ9KBYK1DDCW6KP38DA34")
print(result)
Output:
══════════════════════════════════════════════════════════════════════
RUN COMPARISON
══════════════════════════════════════════════════════════════════════
Baseline: 01KDNZYSNPZFVPZG94DATP1DT6 [vqe-h2]
Candidate: 01KDNZZ9KBYK1DDCW6KP38DA34 [vqe-h2]
──────────────────────────────────────────────────────────────────────
RESULT: ✗ DIFFER
──────────────────────────────────────────────────────────────────────
SUMMARY
───────
✓ Program: structural match
✗ Parameters: 2 changed, 1 added
! Results: TVD = 0.0870 (3.5x noise)
! Device: calibration drift detected
METADATA DIFFERENCES
────────────────────
backend: ibm_brisbane => ibm_kyoto
PARAMETER CHANGES
─────────────────
learning_rate 0.01 => 0.02 (+100.0%)
num_layers 4 => 6 (+50.0%)
METRIC CHANGES
──────────────
energy -1.136 => -1.142 (-0.5%)
fidelity 0.952 => 0.891 (-6.4%)
DISTRIBUTION ANALYSIS
─────────────────────
TVD: 0.087000 !
Expected noise: 0.025000
Noise ratio: 3.48x
Assessment: Moderate divergence
DEVICE CALIBRATION
──────────────────
Baseline cal: 2026-01-15T10:00:00Z
Candidate cal: 2026-01-15T14:30:00Z
! Significant drift in 2 metric(s):
median_t2_us 120.5 => 98.3 (-18.4%)
median_t1_us 95.2 => 82.1 (-13.8%)
CIRCUIT DIFFERENCES
───────────────────
depth 4 => 6 (+50.0%)
2Q gates 8 => 12 (+50.0%)
+ gates rz, rzz
WARNINGS
────────
! Backend changed between runs
! Calibration data may not be comparable across devices
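The percentages and the noise ratio in the report above are consistent with plain relative-change arithmetic. The sketch below reproduces a few of them. The helper name `percent_change` is hypothetical, and the formula (change relative to the absolute baseline value) is an assumption inferred from the sample output, not taken from devqubit's source.

```python
def percent_change(a: float, b: float) -> float:
    """Relative change from baseline a to candidate b, in percent of |a|."""
    if a == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (b - a) / abs(a) * 100.0


# Values copied from the sample report above.
print(f"learning_rate: {percent_change(0.01, 0.02):+.1f}%")    # +100.0%
print(f"energy:        {percent_change(-1.136, -1.142):+.1f}%")  # -0.5%
print(f"median_t2_us:  {percent_change(120.5, 98.3):+.1f}%")   # -18.4%

# Noise ratio = observed TVD / expected noise.
noise_ratio = 0.087 / 0.025
print(f"noise ratio:   {noise_ratio:.2f}x")                     # 3.48x
```

Dividing by `abs(a)` rather than `a` keeps the sign meaningful for negative quantities such as energies: a move from -1.136 to -1.142 is reported as a decrease.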
diff accepts run names (with project), run IDs, or bundle files:
result = diff("baseline", "candidate", project="myproj") # Two run names
result = diff("01JD7X...", "01JD8Y...") # Two run IDs
result = diff("baseline.zip", "candidate.zip") # Two bundles
ComparisonResult
result = diff("baseline", "candidate", project="myproj")
# Overall
result.identical # True if everything matches
result.run_id_a # Baseline run ID
result.run_id_b # Candidate run ID
# Metadata
result.metadata["project_match"]
result.metadata["backend_match"]
# Parameters and metrics
result.params["match"] # True if all params match
result.params["changed"] # {"shots": {"a": 1000, "b": 2000}}
result.metrics["match"]
# Program comparison
result.program.exact_match # Artifact digests identical
result.program.structural_match # Circuit structure matches
result.program.parametric_match # Structure + params match
result.program.matches("either") # Check with specific mode
# Results
result.tvd # Total variation distance
result.counts_a # {"00": 500, "11": 500}
result.noise_context # Bootstrap noise analysis
# Device and circuit
result.device_drift # Calibration drift analysis
result.circuit_diff # Semantic circuit comparison
# Output
result.to_dict()
result.format_json()
result.format_summary()
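As a rough illustration of how these fields might be combined, the sketch below turns a comparison result into a list of reasons for a difference. The `triage` function is hypothetical and not part of devqubit; it only reads attributes listed above, and a `SimpleNamespace` stands in for a real `ComparisonResult` so the example runs without the library.

```python
from types import SimpleNamespace


def triage(result) -> list[str]:
    """Return human-readable reasons why two runs differ (empty if identical)."""
    if result.identical:
        return []
    reasons = []
    if not result.program.structural_match:
        reasons.append("program structure changed")
    if not result.params["match"]:
        reasons.append(f"parameters changed: {sorted(result.params['changed'])}")
    if result.noise_context and result.noise_context.exceeds_noise:
        reasons.append(f"results differ beyond shot noise (TVD={result.tvd:.4f})")
    if result.device_drift and result.device_drift.significant_drift:
        reasons.append("device calibration drifted")
    return reasons


# Stand-in for a ComparisonResult; the values are illustrative only.
fake_result = SimpleNamespace(
    identical=False,
    tvd=0.087,
    program=SimpleNamespace(structural_match=True),
    params={"match": False, "changed": {"shots": {"a": 1000, "b": 2000}}},
    noise_context=SimpleNamespace(exceeds_noise=True),
    device_drift=SimpleNamespace(significant_drift=True),
)

for reason in triage(fake_result):
    print(reason)
```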
TVD and Noise Context
Total variation distance (TVD) measures how far apart two measurement distributions are, on a scale from 0 to 1. As a rough guide: 0.0 = identical, 0.01–0.05 = typical shot noise, >0.15 = significant difference.
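For reference, TVD between two count dictionaries is half the sum of absolute differences between the normalized probabilities. The following is a minimal sketch of that standard definition; the function name is hypothetical, and devqubit's own implementation may differ in details such as input validation.

```python
def total_variation_distance(counts_a: dict, counts_b: dict) -> float:
    """TVD = 0.5 * sum over outcomes of |p_a(x) - p_b(x)|."""
    shots_a = sum(counts_a.values())
    shots_b = sum(counts_b.values())
    if shots_a == 0 or shots_b == 0:
        raise ValueError("both count dictionaries need at least one shot")
    # An outcome missing from one side has probability 0 there.
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(x, 0) / shots_a - counts_b.get(x, 0) / shots_b)
        for x in outcomes
    )


baseline = {"00": 500, "11": 500}
candidate = {"00": 450, "11": 550}  # illustrative counts
print(f"TVD = {total_variation_distance(baseline, candidate):.4f}")
```

Because the counts are normalized first, the two runs do not need the same number of shots.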
The noise_context uses parametric bootstrap to estimate shot noise thresholds:
if result.noise_context:
    ctx = result.noise_context
    print(f"Noise p95: {ctx.noise_p95:.4f}")  # 95th percentile threshold
    print(f"p-value: {ctx.p_value:.4f}")  # Empirical p-value
    print(f"Exceeds noise: {ctx.exceeds_noise}")  # tvd > noise_p95?
    print(f"Interpretation: {ctx.interpretation()}")
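To make the idea concrete, here is one way a parametric bootstrap of the shot-noise threshold could look. This is a sketch under stated assumptions, not devqubit's implementation: it assumes the null hypothesis is modelled by pooling both runs' counts, and the function names and returned dictionary keys are hypothetical (chosen to echo the attributes above). The defaults mirror the bootstrap settings shown for `VerifyPolicy`.

```python
import random
from collections import Counter


def tvd(counts_a, counts_b):
    """Total variation distance between two count dictionaries."""
    shots_a, shots_b = sum(counts_a.values()), sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(x, 0) / shots_a - counts_b.get(x, 0) / shots_b)
        for x in outcomes
    )


def bootstrap_noise(counts_a, counts_b, n_boot=1000, alpha=0.95, seed=12345):
    """Estimate how large TVD gets from shot noise alone (assumed pooled H0)."""
    shots_a, shots_b = sum(counts_a.values()), sum(counts_b.values())
    if shots_a == 0 or shots_b == 0:
        raise ValueError("both runs need at least one shot")
    rng = random.Random(seed)
    outcomes = sorted(set(counts_a) | set(counts_b))
    # Under H0 both runs share one distribution, estimated from pooled counts.
    pooled = [counts_a.get(x, 0) + counts_b.get(x, 0) for x in outcomes]
    observed = tvd(counts_a, counts_b)

    samples = []
    for _ in range(n_boot):
        fake_a = Counter(rng.choices(outcomes, weights=pooled, k=shots_a))
        fake_b = Counter(rng.choices(outcomes, weights=pooled, k=shots_b))
        samples.append(tvd(fake_a, fake_b))
    samples.sort()

    threshold = samples[min(n_boot - 1, int(alpha * n_boot))]
    # Empirical p-value with the usual +1 correction so it is never exactly 0.
    p_value = (1 + sum(s >= observed for s in samples)) / (n_boot + 1)
    return {
        "tvd": observed,
        "noise_p95": threshold,
        "p_value": p_value,
        "exceeds_noise": observed > threshold,
    }


report = bootstrap_noise({"00": 500, "11": 500}, {"00": 450, "11": 550}, n_boot=500)
print(report)
```

Fixing the seed makes the threshold reproducible between CI runs, which matters when the result gates a build.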
| p-value | Interpretation |
|---|---|
| ≥ 0.10 | Consistent with sampling noise |
| 0.05–0.10 | Borderline; consider increasing shots |
| < 0.05 | Likely exceeds sampling noise |
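The p-value bands can be written as a small lookup. The function below is a hypothetical mirror of those bands for use in your own scripts; it is not devqubit's `interpretation()` method, and the handling of the exact boundaries (0.05 counted as borderline, 0.10 as consistent) is an assumption.

```python
def interpret_p_value(p_value: float) -> str:
    """Map an empirical p-value to the interpretation bands listed above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value >= 0.10:
        return "Consistent with sampling noise"
    if p_value >= 0.05:
        return "Borderline; consider increasing shots"
    return "Likely exceeds sampling noise"


for p in (0.42, 0.07, 0.003):
    print(f"p = {p:<5} -> {interpret_p_value(p)}")
```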
Batch mode (item_index="all")
For batch experiments, TVD is the max across all item pairs (worst-case). Items are matched by their item_index, not by list position, so skipped or reordered items are handled correctly.
The noise threshold is calibrated for the max-TVD statistic by bootstrapping max(TVD_i) under the null hypothesis (H0) that both runs sample from the same distributions. This controls the family-wise error rate without a Bonferroni approximation.
result = diff("a", "b", project="batch-exp", item_index="all")
print(result.tvd_aggregation) # "max"
print(result.tvd_batch_size) # Number of item pairs compared
print(result.tvd_item_index) # item_index of the worst pair
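The batch statistic can be sketched as follows: pair items by `item_index`, take the worst per-item TVD, and bootstrap the maximum itself so the threshold already accounts for comparing many items at once. This is an illustration under the same pooled-H0 assumption as a single-pair bootstrap; the function names and the dict-of-counts input shape are hypothetical, not devqubit's internals.

```python
import random
from collections import Counter


def tvd(counts_a, counts_b):
    """Total variation distance between two count dictionaries."""
    shots_a, shots_b = sum(counts_a.values()), sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(x, 0) / shots_a - counts_b.get(x, 0) / shots_b)
        for x in outcomes
    )


def max_tvd(batch_a, batch_b):
    """batch_*: {item_index: counts}. Returns (max TVD, worst index, pairs)."""
    shared = sorted(set(batch_a) & set(batch_b))  # match by index, not position
    if not shared:
        raise ValueError("no item_index is present in both batches")
    per_item = {i: tvd(batch_a[i], batch_b[i]) for i in shared}
    worst = max(per_item, key=per_item.get)
    return per_item[worst], worst, len(shared)


def bootstrap_max_tvd(batch_a, batch_b, n_boot=200, alpha=0.95, seed=12345):
    """alpha-quantile of max(TVD_i) when every pair differs only by shot noise."""
    rng = random.Random(seed)
    shared = sorted(set(batch_a) & set(batch_b))
    maxima = []
    for _ in range(n_boot):
        worst = 0.0
        for i in shared:
            a, b = batch_a[i], batch_b[i]
            outcomes = sorted(set(a) | set(b))
            pooled = [a.get(x, 0) + b.get(x, 0) for x in outcomes]
            fake_a = Counter(rng.choices(outcomes, weights=pooled, k=sum(a.values())))
            fake_b = Counter(rng.choices(outcomes, weights=pooled, k=sum(b.values())))
            worst = max(worst, tvd(fake_a, fake_b))
        maxima.append(worst)
    maxima.sort()
    return maxima[min(n_boot - 1, int(alpha * n_boot))]


# Illustrative batches: item 5 exists only in the candidate and is ignored.
batch_a = {0: {"00": 500, "11": 500}, 2: {"0": 900, "1": 100}}
batch_b = {2: {"0": 800, "1": 200}, 0: {"00": 500, "11": 500}, 5: {"0": 1000}}

value, worst_index, n_pairs = max_tvd(batch_a, batch_b)
threshold = bootstrap_max_tvd(batch_a, batch_b, n_boot=100)
print(f"max TVD {value:.4f} at item_index {worst_index} over {n_pairs} pairs")
print(f"max-TVD noise threshold: {threshold:.4f}")
```

Bootstrapping the maximum directly avoids dividing the significance level by the number of items, which would be conservative when the batch is large.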
Baseline Verification
Verify a candidate run against the project’s baseline:
from devqubit.compare import verify_baseline, VerifyPolicy
policy = VerifyPolicy(
    params_must_match=True,
    program_must_match=True,
    noise_factor=1.0,
)
result = verify_baseline(
    "nightly-run",  # run name or ID
    project="vqe-h2",
    policy=policy,
)
print(f"Passed: {result.ok}")
if not result.ok:
    print(result.failures)
VerifyPolicy
from devqubit.compare import VerifyPolicy, ProgramMatchMode
policy = VerifyPolicy(
    # Structural checks
    params_must_match=True,
    program_must_match=True,
    program_match_mode=ProgramMatchMode.EITHER,  # exact, structural, or either
    fingerprint_must_match=False,
    # TVD checks
    tvd_max=0.1,  # Hard limit
    noise_factor=1.0,  # Dynamic: fail if TVD > N × noise_p95
    # Bootstrap settings
    noise_alpha=0.95,
    noise_n_boot=1000,
    noise_seed=12345,
    # Behavior
    allow_missing_baseline=False,
)
When both tvd_max and noise_factor are set, the stricter (minimum) threshold is used.
Program match modes: EXACT (identical artifact digests), STRUCTURAL (same circuit structure regardless of parameter values, which suits VQE-style workloads), EITHER (default; accepts an exact or a structural match).
Suggested noise_factor values: 1.0 for strict CI, 1.2 for standard CI (recommended default), 1.5 for noisy hardware.
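The rule for combining `tvd_max` and `noise_factor` can be sketched as taking the minimum of the hard limit and the noise-scaled limit. The helper below is hypothetical and only illustrates that rule; how devqubit treats unset values internally is an assumption here.

```python
from typing import Optional


def effective_tvd_threshold(
    tvd_max: Optional[float],
    noise_factor: Optional[float],
    noise_p95: Optional[float],
) -> Optional[float]:
    """Return the stricter of the hard and dynamic limits, or None if neither applies."""
    candidates = []
    if tvd_max is not None:
        candidates.append(tvd_max)
    if noise_factor is not None and noise_p95 is not None:
        candidates.append(noise_factor * noise_p95)  # dynamic, noise-aware limit
    return min(candidates) if candidates else None


# Quiet run: the noise-based limit (0.03) is stricter than the hard limit (0.1).
print(effective_tvd_threshold(tvd_max=0.1, noise_factor=1.0, noise_p95=0.03))
# Noisy run: 1.5 x 0.09 = 0.135 would be too lenient, so the hard limit applies.
print(effective_tvd_threshold(tvd_max=0.1, noise_factor=1.5, noise_p95=0.09))
```

In practice this means `tvd_max` acts as a ceiling: raising `noise_factor` for noisy hardware never loosens the check beyond the hard limit.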
Setting Baselines
from devqubit.runs import get_baseline, set_baseline, clear_baseline
set_baseline("vqe-h2", "production-v1") # by name or ID
baseline = get_baseline("vqe-h2")
clear_baseline("vqe-h2")
Or via CLI:
devqubit baseline set vqe-h2 production-v1
devqubit baseline get vqe-h2
devqubit baseline clear vqe-h2
To promote a passing run to baseline automatically, set promote_on_pass=True:
result = verify_baseline(
    "nightly-run",
    project="vqe-h2",
    policy=policy,
    promote_on_pass=True,
)
Device Drift Detection
Calibration drift is automatically detected during comparison:
if result.device_drift and result.device_drift.significant_drift:
    print("Significant calibration drift detected")
    for metric in result.device_drift.top_drifts[:3]:
        print(f"  {metric.metric}: {metric.percent_change:+.1f}%")
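For intuition, drift ranking can be approximated by computing the relative change of each calibration metric and keeping those above a cutoff, largest first. The sketch below is hypothetical: the function name, the plain-dict inputs, and the 10% cutoff are assumptions for illustration, and devqubit's actual significance criteria are not documented here. The T1 and T2 values come from the sample report earlier on this page; the readout error value is made up to show a metric that stays under the cutoff.

```python
def rank_drifts(cal_a: dict, cal_b: dict, threshold_pct: float = 10.0):
    """Return [(metric, percent_change)] above the cutoff, largest magnitude first."""
    drifts = []
    for name in sorted(set(cal_a) & set(cal_b)):
        a, b = cal_a[name], cal_b[name]
        if a == 0:
            continue  # relative change is undefined for a zero baseline
        pct = (b - a) / abs(a) * 100.0
        if abs(pct) >= threshold_pct:
            drifts.append((name, pct))
    drifts.sort(key=lambda item: abs(item[1]), reverse=True)
    return drifts


baseline_cal = {"median_t1_us": 95.2, "median_t2_us": 120.5, "readout_error": 0.0120}
candidate_cal = {"median_t1_us": 82.1, "median_t2_us": 98.3, "readout_error": 0.0125}

for name, pct in rank_drifts(baseline_cal, candidate_cal):
    print(f"  {name}: {pct:+.1f}%")
```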
CI Integration
# GitHub Actions
- name: Verify against baseline
  run: |
    devqubit verify --project vqe-h2 nightly-run \
      --noise-factor 1.0 \
      --junit results.xml
Or from Python:
from devqubit.compare import verify_baseline
from devqubit.ci import write_junit
result = verify_baseline("nightly-run", project="vqe-h2")
write_junit(result, "results.xml")
CLI
devqubit diff baseline-v1 candidate-v2 --project myproj
devqubit diff baseline-v1 candidate-v2 --project myproj --format json
devqubit verify --project vqe-h2 nightly-run
devqubit verify --project vqe-h2 nightly-run --noise-factor 1.0 --promote