Method-equivalence validation: the double-well on one engine
This page records a reproducibility check that every sampling strategy
and every code path in PyRETIS reproduces the same crossing
probability and rate constant on one simple system, and that the
result matches an absolute analytical truth rather than only agreeing
with itself. The system is a 1D double well sampled on the internal
engine; the cases and the driver script live in
examples/validation/methods/ and are launched by
examples/validation/run_validation.py.
The in-process RETIS loop, the infinite-swapping sampler analysed with WHAM, and the native run routed through the infinite-swapping scheduler (the Stage C route, including the multi-worker pool) all integrate the identical potential. If their rate estimates agree with one another and land on the analytical Kramers rate, the sampling machinery is reproducible across strategies and across code paths. This complements the cross-engine validation, which checks engine equivalence on a shared force field.
The analytical reference – a known truth
The double well is simple enough to have a closed-form escape rate from
Kramers’ theory, so the suite checks the simulations against an
absolute truth, not only against each other.
analytical_double_well_rate() evaluates it in three increasingly
complete forms:
| Estimator | Rate | Meaning |
|---|---|---|
| k_TST | 2.81e-07 | transition-state theory, no recrossing (upper bound) |
| k_Kramers (spatial) | 2.61e-07 | moderate/high-friction Kramers prefactor |
| k_Kramers-MM (truth) | 2.52e-07 | |
With \(\beta\,\Delta V = 14.3\) and reduced energy loss \(\delta = 11.4\) the turnover factor is \(\Upsilon = 0.97\) – the system sits firmly in the spatial-diffusion regime, so the Kramers result is accurate to a few percent. Converged cases should land on 2.5e-07 within their statistical error. A case can agree with every other case yet sit several sigma from the analytical rate – a shared systematic that the self-consistency check alone cannot see.
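The three table entries can be cross-checked against the quoted turnover factor. The sketch below assumes (this is not stated on the page) that the Kramers-MM rate is the spatial Kramers rate scaled by \(\Upsilon\); the variable names are illustrative and not part of analytical_double_well_rate().

```python
# Rates copied from the table above (same units as the table).
k_tst = 2.81e-07      # transition-state theory
k_kramers = 2.61e-07  # spatial-diffusion Kramers
k_mm = 2.52e-07       # Kramers-MM, used as the reference

# Transmission coefficient implied by the spatial Kramers correction.
kappa = k_kramers / k_tst

# Turnover factor implied by the table, ASSUMING k_mm = upsilon * k_kramers.
upsilon_implied = k_mm / k_kramers

print(f"kappa   = {kappa:.3f}")
print(f"upsilon = {upsilon_implied:.3f}")

# Each refinement lowers the rate, so TST remains the upper bound.
assert k_mm < k_kramers < k_tst
```

Under that assumption the implied factor is about 0.966, which rounds to the quoted 0.97, and the three estimators differ by well under 15 percent.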
The methods
The same double well is sampled with every available strategy and through
every code path. The native_* cases run the in-process RETIS loop via
pyretisrun; infswap_wham runs the infinite-swapping sampler and is
analysed with WHAM; the scheduler_* cases run a native config
through the infinite-swapping scheduler (the Stage C native->scheduler
route, PYRETIS_NATIVE_VIA_INFSWAP), which emits native-format output
and is therefore analysed exactly like a native case and must reproduce
its native sibling’s rate.
| Case | Strategy / what it adds |
|---|---|
| native_sh | standard shooting (pyretisrun, in-process loop) |
| native_ss | stone skipping (pyretisrun) |
| native_wt | web throwing (pyretisrun) |
| native_wf | wire fencing (pyretisrun) |
| native_wf_ha | wire fencing with high acceptance (pyretisrun) |
| native_wf_cap_* | wire fencing with a capped window ([tis] interface_cap) |
| native_wt_sour_* | web throwing with a shifted source sub-interface ([tis] interface_sour) |
| native_relshoot | RETIS with relative per-ensemble shoot frequencies ([retis] relative_shoots) |
| native_*_pcg64 | the same native methods driven by the PCG64 generator (rgen = "pcg64") – the A3.4 RNG-migration validation set |
| infswap_wham | infinite swapping (pyretisrun) analysed with WHAM – a cross-check of the second sampler / code path |
| scheduler_sh | standard shooting routed through the infinite-swapping scheduler (Stage C native->scheduler route; native-format output, analysed like native_sh) |
| scheduler_sh_w3 | the same, with a multi-worker pool of three workers (PYRETIS_NATIVE_WORKERS=3) – the parallel-worker route |
| scheduler_wf | wire fencing routed through the scheduler (WHAM Cxy/HA unweighting on the native route) |
| scheduler_ss | stone skipping routed through the scheduler |
| scheduler_wt | web throwing routed through the scheduler |
The native_wf_cap_* / native_wt_sour_* / native_relshoot cases
(group params) vary non-default TIS knobs that change the sampling
but not the physics, so each must still reach the same analytical rate.
The wf_convention group (native_wf/native_wf_ha,
infswap_wham, scheduler_wf) cross-checks wire fencing across the
code paths. On the scheduler’s native-output route the wf occupancy is
HA-weighted (the compute_weight crossing count drives the swap, as in
infretis), so the native writer applies the WHAM Cxy/HA unweighting
(Weight = compute_weight / frac) to recover the per-ensemble crossing
probability; without it the rate is overestimated by roughly two orders of
magnitude. scheduler_ss carries the same treatment for stone skipping. The suite
config’s select key (group tags such as wf_convention or
params) runs a chosen subset without dropping the rest.
The native sh/ss/wt/wf cases (analysed with the standard PyRETIS
crossing probability), the infinite-swapping infswap_wham (analysed
with WHAM), and the scheduler-routed cases (the Stage C path, analysed as
native) are a direct cross-check of the distinct code paths. The
native_*_pcg64 cases are the A3.4 step 2 go/no-go: they run the
identical native methods with the canonical PCG64 generator instead of the
legacy MT19937 and must reproduce the MT19937 rate within statistical
error before the default generator is flipped (see MERGE_TODO.md
A3.4). As the in-process and infinite-swapping code paths unify, the
infswap_* and scheduler_* cases should collapse onto their native
equivalents.
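The go/no-go criterion for the generator swap, different random streams but statistically equivalent estimates, can be illustrated outside PyRETIS. The sketch below uses NumPy's MT19937 and PCG64 bit generators as stand-ins; whether PyRETIS wraps these classes directly is an assumption, and the seeds, sample size and 5-sigma threshold are arbitrary illustrative choices.

```python
import numpy as np


def sample_mean(bit_generator, n=100_000):
    """Mean of n standard-normal draws and its standard error."""
    rng = np.random.Generator(bit_generator)
    x = rng.standard_normal(n)
    return x.mean(), x.std(ddof=1) / np.sqrt(n)


# Same seed, different generators: the streams are unrelated.
mean_mt, err_mt = sample_mean(np.random.MT19937(1))
mean_pcg, err_pcg = sample_mean(np.random.PCG64(1))

# Difference of the two estimates in units of the combined standard error.
n_sigma = abs(mean_mt - mean_pcg) / np.hypot(err_mt, err_pcg)

print(f"MT19937: {mean_mt:+.5f} +/- {err_mt:.5f}")
print(f"PCG64:   {mean_pcg:+.5f} +/- {err_pcg:.5f}")
print(f"difference: {n_sigma:.2f} sigma")
```

The two means differ number by number, yet they should agree within their combined error; the native_*_pcg64 cases apply the same logic to the rate constant.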
Note
scheduler_ss (stone skipping through the scheduler) previously crashed
the internal engine’s streaming dump (FileNotFoundError on
ss_shoot.xyz) because the move re-pointed a persistent path’s
phase point at the transient dump file. That is fixed (stone skipping
now dumps a copy), and the case is enabled like the others.
Running it and reading the output
The suite is neither a unit test nor a tutorial: the GitLab CI
cannot afford runs long enough to drive the statistical error of the
rare-event estimates down, so
it is launched manually, once, on a cluster. It is driven by a
per-machine config, validation.toml: an ordered list of cases with
their target cycle counts, plus a per-machine seed and a
reverse_list toggle. The first positional argument selects the action:

    cd examples/validation

    # usage helper (also printed when no action is given)
    python run_validation.py

    # show the recorded results -- READ-ONLY: runs nothing, writes nothing
    python run_validation.py status
    python run_validation.py status native_sh    # ... for one case

    # (re)analyse the runs already on disk and update the results table
    python run_validation.py analyze
    python run_validation.py analyze native_sh   # ... for one case

    # run every case listed in ./validation.toml, then analyse
    python run_validation.py run
    python run_validation.py run native_sh       # ... for one case

One-off overrides, usable with run:

    python run_validation.py run --config machineB.toml   # different per-machine config
    python run_validation.py run --cycles 20000           # override every case's target
    python run_validation.py run --seed 2                 # this machine's seed
    python run_validation.py run --jobs 8                 # internal cases in parallel

--jobs N runs that many internal-engine (methods/) cases at
once; they are independent single-process runs, so this only uses more
cores and does not change any result (each case has its own directory
and seed). --cycles is a target total, not an increment – a case
with existing output is continued up to that count, otherwise it starts
fresh from seed. python run_validation.py --write-config
validation.toml writes a fresh config template with every case enabled.
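Two behaviours described above can be sketched in a few lines: --cycles as a target total, and --jobs changing only the wall time. This is a toy model, not the driver's code; cycles_to_run, toy_case and run_all are hypothetical names, and the "simulation" is a seeded sum standing in for one independent case.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def cycles_to_run(target, done):
    """Cycles still needed to reach a target *total* (never negative)."""
    return max(0, target - done)


def toy_case(name, seed):
    """Stand-in for one case: the result depends only on its own name and seed."""
    rng = random.Random(f"{name}-{seed}")
    return sum(rng.random() for _ in range(1000))


def run_all(cases, seed, jobs):
    """Run every case with `jobs` workers and collect results by case name."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda name: toy_case(name, seed), cases))
    return dict(zip(cases, results))


cases = ["native_sh", "native_ss", "native_wt", "native_wf"]

# A case with 5000 cycles on disk and a target of 20000 runs 15000 more.
print(cycles_to_run(20000, 5000))
# A case already past the target runs nothing further.
print(cycles_to_run(20000, 25000))
# Serial and parallel execution give identical per-case results.
print(run_all(cases, seed=1, jobs=1) == run_all(cases, seed=1, jobs=4))
```

Because each toy case owns its random stream, the number of workers cannot affect any result, which is the property the text claims for the real cases with their own directory and seed.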
Each case is analysed as soon as it finishes – its rate prints on the
go – and a combined inventory, convergence plot and comparison follow at
the end. The analysis prints a “Rate vs analytical Kramers reference”
table (rate, k/k_ref, |d|/sigma, agree?) next to the suite-mean
table. The persistent summaries land in
validation_results.json (one entry per case: rate, relative error,
cycles, number of independent runs, agreement flag, plus provenance –
when, at which git commit, and whether the tree was dirty) and the
human-readable validation_results.html (the same table with the
convergence rate_vs_cycles.png embedded). A CHECK in the
comparison almost always means the case has not converged yet: raise that
case’s cycles, rerun it, and re-analyse.
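The columns of the reference table (k/k_ref, |d|/sigma, agree?) can be reproduced by hand from a rate and its relative error. The sketch below is a minimal version under stated assumptions: compare_to_reference is a hypothetical helper, the 2-sigma threshold is an assumed criterion rather than the driver's actual one, and the two input rates are made-up illustrations, not suite results.

```python
K_REF = 2.52e-07  # k_Kramers-MM from the analytical table


def compare_to_reference(rate, rel_err, k_ref=K_REF, max_sigma=2.0):
    """Return (k/k_ref, |d|/sigma, agree) for one case.

    rel_err is the relative statistical error of the rate; max_sigma is
    an ASSUMED agreement threshold, not the driver's actual criterion.
    """
    sigma = rate * rel_err                 # absolute error of the rate
    n_sigma = abs(rate - k_ref) / sigma    # deviation in units of sigma
    return rate / k_ref, n_sigma, n_sigma <= max_sigma


# Hypothetical inputs, for illustration only (not results from the suite).
converged = compare_to_reference(2.60e-07, 0.05)
biased = compare_to_reference(3.10e-07, 0.03)

for label, (ratio, n_sigma, agree) in [("converged", converged), ("biased", biased)]:
    print(f"{label:10s} k/k_ref={ratio:.3f}  |d|/sigma={n_sigma:.2f}  agree={agree}")
```

The second input shows why a small error bar alone is not reassuring: a tightly converged rate can still sit several sigma from the analytical value.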
See examples/validation/README.rst for the full driver reference,
including the two-machine combine-for-statistics workflow.