Method-equivalence validation: the double-well on one engine

This page records a reproducibility check: every sampling strategy and every code path in PyRETIS should reproduce the same crossing probability and rate constant on one simple system, and that result should match an analytical reference rather than merely agree with itself. The system is a 1D double well sampled on the internal engine; the cases and the driver script live in examples/validation/methods/ and are launched by examples/validation/run_validation.py.

The point of the test is that all of the following operate on the identical potential: every sampling strategy, both RNG generations (the legacy MT19937 pin and the shipping PCG64 default), the multi-worker pool, and both rate estimators (the matched crossing-probability report and WHAM). If their rate estimates agree with one another and also land on the analytical Kramers rate, the sampling machinery is reproducible across strategies, generators and estimators. This complements the cross-engine validation, which checks engine equivalence on a shared force field.

The shared system

Every method case models the same 1D double well, so they must give the same rate within statistical error:

Potential: the quartic double well \(V = a x^4 - b x^2 + c\) with a = 1, b = 2 and c = 0, so the minima sit at \(x = \pm\sqrt{b / (2 a)} = \pm 1\) and the barrier height is \(\Delta V = b^2 / (4 a) = 1\).
Dynamics: inertial Langevin at temperature T = 0.07 with a time step dt = 0.025 and friction gamma = 0.3 (reduced units), unit mass. With these numbers \(\beta\,\Delta V = 14.3\), so escape over the barrier is a genuine rare event.
Order parameter: the particle position, with eight interfaces spanning \([-0.99 \ldots 1.0]\).
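As a quick arithmetic check of the numbers in the list above, the following sketch recomputes the barrier height and \(\beta\,\Delta V\) from the quoted parameters. It assumes reduced units with k_B = 1 (so \(\beta = 1/T\)); the variable and function names are illustrative and are not taken from PyRETIS.

```python
import math

# Parameters quoted above (reduced units; k_B = 1 is an assumption).
a, b, c = 1.0, 2.0, 0.0
temperature = 0.07


def potential(x):
    """Quartic double well V(x) = a x^4 - b x^2 + c."""
    return a * x**4 - b * x**2 + c


# Minima at x = +/- sqrt(b / (2 a)); the barrier top is at x = 0.
x_min = math.sqrt(b / (2.0 * a))
barrier = potential(0.0) - potential(x_min)   # equals b^2 / (4 a)
beta_dv = barrier / temperature

print(f"x_min = {x_min:.3f}, barrier = {barrier:.3f}, beta*dV = {beta_dv:.1f}")
```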

The system parameters are not hard-coded into the analysis – they are read back out of each case’s input TOML (read_system_params), so the analytical reference below tracks the actual runs, and every method case is cross-checked to confirm it really models the same well.
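The cross-check described above can be pictured as a comparison of per-case parameter sets against one reference. The sketch below is a hypothetical stand-in, not the suite's read_system_params: it assumes the parameters have already been parsed into plain dictionaries, and the key names and the `broken_case` entry are invented for illustration.

```python
import math

# Reference values quoted on this page; the key names are hypothetical.
REFERENCE = {
    "a": 1.0, "b": 2.0, "c": 0.0,
    "temperature": 0.07, "timestep": 0.025, "gamma": 0.3, "mass": 1.0,
}


def mismatched_cases(params_by_case, reference=REFERENCE):
    """Return {case: [keys that are missing or differ from the reference]}."""
    bad = {}
    for case, params in params_by_case.items():
        diff = [
            key for key, ref in reference.items()
            if key not in params
            or not math.isclose(params[key], ref, rel_tol=1e-12, abs_tol=1e-12)
        ]
        if diff:
            bad[case] = diff
    return bad


cases = {
    "sh_mt19937": dict(REFERENCE),
    "broken_case": {**REFERENCE, "temperature": 0.08},  # deliberately wrong
}
print(mismatched_cases(cases))
```

An empty result would mean every case models the same well; any entry names the case and the parameters that drifted.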

The analytical reference – a known truth

The double well is simple enough to have a closed-form escape rate from Kramers’ theory, so the suite checks the simulations against an absolute truth, not only against each other. analytical_double_well_rate() evaluates it in three increasingly complete forms:

Table 61 Analytical double-well escape rate (reduced units).
Estimator	Rate	Meaning
`k_TST`	2.81e-07	transition-state theory, no recrossing (upper bound)
`k_Kramers` (spatial)	2.61e-07	moderate/high-friction Kramers prefactor
`k_Kramers-MM` (truth)	2.52e-07	Mel’nikov–Meshkov turnover factor (headline)

With \(\beta\,\Delta V = 14.3\) and reduced energy loss \(\delta = 11.4\) the turnover factor is \(\Upsilon = 0.97\) – the system sits firmly in the spatial-diffusion regime, so the Kramers result is accurate to a few percent. Converged cases should land on 2.5e-07 within their statistical error. A case can agree with every other case yet sit several sigma from the analytical rate – a shared systematic that the self-consistency check alone cannot see.
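The first two rows of the table can be recomputed from textbook expressions. The sketch below is not the suite's analytical_double_well_rate(); it assumes k_B = 1, the one-well TST rate \(k_\mathrm{TST} = (\omega_0 / 2\pi)\,e^{-\beta \Delta V}\), and the Kramers moderate-to-high-friction prefactor \(\bigl(\sqrt{\gamma^2/4 + \omega_b^2} - \gamma/2\bigr)/\omega_b\). The energy loss is taken as \(\delta = \gamma S / (k_B T)\), with \(S\) the action of the orbit at the barrier energy in one well, which is one common definition and an assumption here. The Mel'nikov–Meshkov factor itself is not evaluated.

```python
import math

# Parameters quoted on this page (reduced units, k_B = 1 assumed).
a, b = 1.0, 2.0
mass, gamma, temperature = 1.0, 0.3, 0.07

barrier = b**2 / (4.0 * a)

# Curvatures of V = a x^4 - b x^2 + c:
#   V''(x_min) = 12 a x_min^2 - 2 b = 4 b   (well, x_min^2 = b / (2 a))
#   |V''(0)|   = 2 b                        (barrier top)
omega_well = math.sqrt(4.0 * b / mass)
omega_barrier = math.sqrt(2.0 * b / mass)

# Transition-state theory: no recrossing, upper bound on the rate.
k_tst = omega_well / (2.0 * math.pi) * math.exp(-barrier / temperature)

# Kramers spatial-diffusion (moderate/high friction) correction.
kramers_factor = (
    math.sqrt(gamma**2 / 4.0 + omega_barrier**2) - gamma / 2.0
) / omega_barrier
k_kramers = kramers_factor * k_tst

# Action of the E = V(0) orbit in one well, integrated in closed form:
#   S = 2 * integral_0^sqrt(b/a) sqrt(2 m (b x^2 - a x^4)) dx
action = 2.0 * math.sqrt(2.0 * mass) * b**1.5 / (3.0 * a)
delta = gamma * action / temperature

print(f"k_TST     = {k_tst:.2e}")
print(f"k_Kramers = {k_kramers:.2e}")
print(f"delta     = {delta:.1f}")
```

With the quoted parameters this gives the `k_TST` and `k_Kramers` rows of the table and \(\delta = 11.4\); the ratio of the tabulated `k_Kramers-MM` and `k_Kramers` values, 2.52e-07 / 2.61e-07, is consistent with the quoted \(\Upsilon = 0.97\).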

The methods

The same double well is sampled with every available strategy and checked through both estimators. Every run goes through the single pyretis run entry point; the cases differ in their RNG pin (*_mt19937 requests the legacy generator explicitly, *_pcg64 pins the shipping default, *_default leaves the config unpinned – byte-identical to its *_pcg64 sibling) and in their analysis kind: matched (the point-matching product of per-ensemble crossing probabilities, reported by pyretis analyse) or wham (the WHAM stitch of the same per-ensemble output, e.g. infswap_wham).
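For orientation, the matched estimator's core arithmetic can be sketched as follows. This is the generic TIS relation (total crossing probability as a product of per-ensemble conditional probabilities, rate as initial flux times that product), not the code behind pyretis analyse; the function names are hypothetical and the input numbers are arbitrary placeholders, not results from any run.

```python
import math


def total_crossing_probability(per_ensemble):
    """Product of the per-ensemble conditional crossing probabilities."""
    return math.prod(per_ensemble)


def rate_constant(initial_flux, per_ensemble):
    """Rate = flux through the first interface x total crossing probability."""
    return initial_flux * total_crossing_probability(per_ensemble)


# Arbitrary placeholder values, chosen only to make the arithmetic visible.
p_local = [0.5, 0.25, 0.1]
flux = 0.2

p_total = total_crossing_probability(p_local)
rate = rate_constant(flux, p_local)
print(f"P_cross = {p_total:.4f}, rate = {rate:.4f}")
```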

Table 62 Method cases on the shared double well.
Case	Strategy / what it adds
`sh_mt19937`	standard shooting, pinned to the legacy MT19937 generator
`ss_mt19937`	stone skipping, legacy MT19937 pin
`wt_mt19937`	web throwing, legacy MT19937 pin
`wf_mt19937`	wire fencing, legacy MT19937 pin
`wf_ha_mt19937`	wire fencing with high acceptance, MT19937 pin
`wf_cap_*`	wire fencing with a capped window (`[tis] interface_cap`)
`wt_sour_*`	web throwing with a shifted source sub-interface (`[tis] interface_sour`)
`relshoot`	RETIS with relative per-ensemble shoot frequencies (`[retis] relative_shoots`)
`*_pcg64`	the same samplers pinned to the PCG64 generator (`rgen = "pcg64"`) – the A3.4 RNG-migration validation set
`infswap_wham`	infinite swapping analysed with WHAM – the cross-check of the second estimator
`*_default`	the same samplers with NO RNG pin (the shipping default): byte-identical twins of their `*_pcg64` siblings, a routing/determinism guard
`sh_w3`	the default config with a multi-worker pool of three workers (`PYRETIS_WORKERS=3`) – the parallel-worker route
`bias_gaussian`	biased shooting-point selection (Gaussian selector toward the barrier) – the move biases the shooting point but preserves the path ensemble, so it must reproduce the same rate (`bias_gaussian_default` is its byte-twin, `bias_uniform` the uniform-selector identity check)

The wf_cap_* / wt_sour_* / relshoot cases (group params) vary non-default TIS knobs that change the sampling but not the physics, so each must still reach the same analytical rate. The wf_convention group (wf_mt19937/wf_ha_mt19937, infswap_wham, wf_default) cross-checks wire fencing across the estimators. On the scheduler's per-ensemble-output route the wf occupancy is HA-weighted (the compute_weight crossing count drives the swap, as in the infinite-swapping sampler), so the output writer applies the WHAM Cxy/HA unweighting (Weight = compute_weight / frac) to recover the per-ensemble crossing probability. Without that unweighting the rate is over-counted by roughly two orders of magnitude. ss_default carries the same treatment for stone skipping. The suite config's select key (group tags such as wf_convention or params) runs a chosen subset without dropping the rest.

The matched-kind rows (analysed with the standard PyRETIS crossing-probability report) and the wham-kind rows (analysed with WHAM over the identical per-ensemble output) are a direct cross-check of the two estimators. The *_pcg64 cases are the A3.4 step 2 go/no-go: they run the identical samplers with the PCG64 generator instead of the legacy MT19937 and must reproduce the MT19937 rate within statistical error before the default generator is flipped (see MERGE_TODO.md A3.4). The *_default twins must stay byte-identical to their *_pcg64 equivalents – a routing/determinism guard.
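A byte-identity guard of the kind described above amounts to comparing file digests. The sketch below is a generic illustration, not the suite's own check: the helper names and the temporary file names are hypothetical, and it assumes the outputs being compared are ordinary files small enough to read into memory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def byte_identical(path_a, path_b):
    """True only if both files contain exactly the same bytes."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demonstration with throwaway files (names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    pinned = Path(tmp) / "pinned_output.txt"
    unpinned = Path(tmp) / "unpinned_output.txt"
    pinned.write_text("1 0.50\n2 0.25\n")
    unpinned.write_text("1 0.50\n2 0.25\n")
    same = byte_identical(pinned, unpinned)

    unpinned.write_text("1 0.50\n2 0.26\n")   # a single changed digit
    different = byte_identical(pinned, unpinned)

print(same, different)
```

Comparing digests rather than parsed numbers is deliberate: a twin that differs in the last printed digit is still a routing or determinism failure, even if the rates agree statistically.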

Note

ss_default (stone skipping under the worker pool) previously crashed the internal engine's streaming dump (FileNotFoundError on ss_shoot.xyz) because the move re-pointed a persistent path's phase point at the transient dump file. That is fixed (stone skipping now dumps a copy), and the case is enabled like the others.

Running it and reading the output

The suite is not a unit test and not a tutorial. Short developer runs can be launched manually, while scheduled GitLab pipelines can enable the method_validation three-seed matrix by setting PYRETIS_RUN_METHOD_VALIDATION=1. Each seed’s JSON, HTML, plot, and run logs are retained as CI artifacts for one year. Longer convergence campaigns can still be launched on a cluster. All modes use a per-run config such as validation.toml – an ordered list of cases with their target cycle counts plus a per-machine seed and a reverse_list toggle. The first positional argument selects the action:

cd examples/validation

# usage helper (also printed when no action is given)
python run_validation.py

# show the recorded results -- READ-ONLY: runs nothing, writes nothing
python run_validation.py status
python run_validation.py status sh_mt19937  # ... for one case

# (re)analyse the runs already on disk and update the results table
python run_validation.py analyze
python run_validation.py analyze sh_mt19937  # ... for one case

# run every case listed in ./validation.toml, then analyse
python run_validation.py run
python run_validation.py run sh_mt19937      # ... for one case

One-off overrides, usable with run:

python run_validation.py run --config machineB.toml  # different per-machine config
python run_validation.py run --cycles 20000          # override every case's target
python run_validation.py run --seed 2                # this machine's seed
python run_validation.py run --jobs 8                # internal cases in parallel

--jobs N runs that many internal-engine (methods/) cases at once; they are independent single-process runs, so this only uses more cores and does not change any result (each case has its own directory and seed). --cycles is a target total, not an increment – a case with existing output is continued up to that count, otherwise it starts fresh from seed. python run_validation.py --write-config validation.toml writes a fresh config template with every case enabled.
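The "target total, not an increment" semantics of --cycles can be stated in one line of arithmetic. The helper below is hypothetical and only illustrates the rule; in particular, returning zero when a case already exceeds the target is an assumption, not documented driver behaviour.

```python
def cycles_still_needed(target_total, cycles_on_disk):
    """Cycles left to run so that the case reaches target_total overall.

    Assumption: a case that already has at least target_total cycles
    needs no further cycles.
    """
    return max(target_total - cycles_on_disk, 0)


print(cycles_still_needed(20000, 0))       # fresh case
print(cycles_still_needed(20000, 15000))   # continued case
print(cycles_still_needed(20000, 25000))   # already past the target
```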

Each case is analysed as soon as it finishes, so its rate is printed immediately; a combined inventory, convergence plot and comparison follow at the end. The analysis prints a “Rate vs analytical Kramers reference” table (rate, k/k_ref, |d|/sigma, agree?) next to the suite-mean table. The persistent summaries land in validation_results.json (one entry per case: rate, relative error, cycles, number of independent runs, agreement flag, plus provenance: when, at which git commit, and whether the tree was dirty) and the human-readable validation_results.html (the same table with the convergence rate_vs_cycles.png embedded). A CHECK in the comparison almost always means the case has not converged yet; raise that case's cycles and re-analyse.
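The columns of the reference table can be read as the following arithmetic. This is a minimal sketch under stated assumptions, not the driver's implementation: sigma is taken as relative error times rate, the two-sigma agreement threshold is an assumed value, and the example inputs are placeholders rather than recorded results.

```python
def compare_to_reference(rate, rel_error, k_ref, n_sigma=2.0):
    """Return (k / k_ref, |d| / sigma, agree?) for one case.

    n_sigma is an assumed threshold, not a value taken from the suite.
    """
    sigma = rel_error * rate
    if sigma <= 0.0:
        raise ValueError("need a positive statistical error")
    ratio = rate / k_ref
    deviation = abs(rate - k_ref) / sigma
    return ratio, deviation, deviation <= n_sigma


K_REF = 2.52e-07   # headline analytical rate from Table 61

# Placeholder inputs: one case near the reference, one far from it.
near = compare_to_reference(rate=2.60e-07, rel_error=0.05, k_ref=K_REF)
far = compare_to_reference(rate=3.20e-07, rel_error=0.05, k_ref=K_REF)

for label, (ratio, dev, agree) in (("near", near), ("far", far)):
    print(f"{label}: k/k_ref = {ratio:.2f}, |d|/sigma = {dev:.2f}, agree = {agree}")
```

Because the deviation is measured in units of the case's own error, a case with few cycles can pass simply by having a large sigma, which is why the cycle count and relative error are recorded alongside the flag.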

See examples/validation/README.rst for the full driver reference, including the two-machine combine-for-statistics workflow.

Scheduled CI contract

The scheduled matrix uses seeds 101, 202, and 303 and defaults to 20,000 target cycles per case. VALIDATION_CYCLES and VALIDATION_JOBS remain CI variables so a dedicated schedule can increase the statistical budget without changing code. A scheduled row is evidence only when its validation_results.json records the analytical-rate agreement and confidence information; job completion alone is not a scientific pass.