The other posts in this series walk around finished models and read the shapes they built — across layers, across sizes, across training. This one opens the instrument that produces those readings. Mechanically it is two things bolted together: a generator that writes time-series signals whose every generative factor is known exactly, and a probe harness that pushes those signals through a frozen model and measures — with a linear readout and with geometry — how faithfully the representation lays the factor out.
Because the signal is synthesised, the ground truth isn't inferred from correlations in messy data; it's the dial we turned. That single fact is what lets the geometry be a measurement rather than a picture. The rest of this page is the build: how a signal is generated, how one factor becomes a 1,024-point ruler, the two ways we read the activations back, and the four lenses that turn a manifold into numbers.
What it is
A signal generator bolted to a probe
Aionoscope has two halves, and keeping them straight makes everything else click into place.
- The generator (aiono). A synthesis engine. A Process samples a latent state (which components are present and their exact parameters), and a View chain renders that state into an observed waveform. The latent state is the answer key.
- The probe harness. It feeds generated signals to a frozen foundation model, taps the hidden activations at every layer, and scores them two ways: a linear probe (is the factor decodable?) and a manifold analysis (is the factor's geometry laid out faithfully?).
Nothing about the model is changed: no fine-tuning, no gradient through the encoder. We only read what the frozen network already computes.
The generator
A signal you understand to the last decimal
The generator is a tiny world with a fixed grammar. Everything observable is built from a library of 14 components across four families:
- Periodic: sine, sawtooth, square (amplitude, frequency, phase, offset, duty cycle).
- Trend: linear, quadratic, log, sigmoid (slope, curvature, centre, sharpness).
- Event: spike, level_change, gaussian bump (time, magnitude, width).
- Noise & baseline: gaussian_noise, uniform_noise, random_walk_noise, constant (scale, offset).
A sample mixes
num_enabled ∈ {1, 2, 3} of them. Every
knob each component exposes is drawn from a sampler,
and — this is the whole point — the
exact drawn values are written into
meta['process']['samples']. Those numbers
are the labels every probe is trained against and
every manifold is scored against.
Reproducible by construction. Splits
are materialised from fixed seeds (train seed
0; ten validation streams with disjoint
seeds), so the same command regenerates the same
65,536 signals every time — no dataset to download, no
drift between runs.
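A minimal sketch of the sample-then-render idea and of seed-based reproducibility, in plain Python. The component and parameter names come from the list above; the `generate` function, the sampler ranges, the rendering rules, and the internal layout of the `samples` dictionary are assumptions made for illustration, not aiono's actual API.

```python
import math
import random

# Hypothetical mini-library: component name -> (parameter sampler, renderer).
# Names follow the post; ranges and rendering rules are illustrative guesses.
COMPONENTS = {
    "sine": (
        lambda rng: {"amplitude": rng.uniform(0.5, 2.0),
                     "frequency": rng.uniform(1.0, 8.0),
                     "phase": rng.uniform(0.0, 2.0 * math.pi)},
        lambda p, u: p["amplitude"] * math.sin(2.0 * math.pi * p["frequency"] * u + p["phase"]),
    ),
    "linear": (
        lambda rng: {"slope": rng.uniform(-1.0, 1.0)},
        lambda p, u: p["slope"] * u,
    ),
    "spike": (
        lambda rng: {"time": rng.uniform(0.1, 0.9),
                     "magnitude": rng.uniform(1.0, 3.0),
                     "width": rng.uniform(0.01, 0.05)},
        lambda p, u: p["magnitude"] if abs(u - p["time"]) <= p["width"] / 2.0 else 0.0,
    ),
    "constant": (
        lambda rng: {"offset": rng.uniform(-1.0, 1.0)},
        lambda p, u: p["offset"],
    ),
}


def generate(seed, length=128):
    """Sample a latent state from a fixed seed, then render it to a waveform."""
    rng = random.Random(seed)                  # fixed seed -> identical draws every run
    num_enabled = rng.choice([1, 2, 3])
    names = rng.sample(sorted(COMPONENTS), num_enabled)
    samples = {}
    signal = [0.0] * length
    for name in names:
        sampler, render = COMPONENTS[name]
        params = sampler(rng)
        samples[name] = params                 # the exact drawn values: the answer key
        for t in range(length):
            signal[t] += render(params, t / length)
    meta = {"process": {"samples": samples}}
    return signal, meta


signal, meta = generate(seed=0)
print(meta["process"]["samples"])
```

Calling `generate(seed=0)` twice returns the same waveform and the same answer key, which is the property that lets a split be regenerated instead of downloaded.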
The sweep
One factor, a 1,024-point ruler
To study geometry we stop mixing and switch to a
controlled slice. Pick one factor — say a
sine's phase — switch every other
component off (num_enabled = 1), and step
that one knob across a fine grid of
1,024 values. Push all 1,024 signals
through the model, average the activations at each grid
point into a centroid, and order the centroids
by the value you swept. They trace a path through
representation space. That path is the object
we score.
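The sweep can be sketched end to end with a stand-in for the model. In the snippet below, `toy_encoder` is a hypothetical substitute for one frozen layer (two fixed quadrature filters, mean-pooled over time), and `render_sine` and `sweep_centroids` are illustrative names; only the idea of stepping one knob, averaging per grid point, and keeping the grid order comes from the text.

```python
import math


def render_sine(phase, length=64, cycles=4):
    """A unit-amplitude sine with `cycles` full periods and the given phase offset."""
    return [math.sin(2.0 * math.pi * cycles * t / length + phase) for t in range(length)]


def toy_encoder(signal, cycles=4):
    """Stand-in for one frozen layer: two fixed quadrature filters, mean-pooled over time."""
    n = len(signal)
    c = sum(x * math.cos(2.0 * math.pi * cycles * t / n) for t, x in enumerate(signal)) / n
    s = sum(x * math.sin(2.0 * math.pi * cycles * t / n) for t, x in enumerate(signal)) / n
    return (c, s)


def sweep_centroids(grid, repeats=1):
    """Average the activations at each grid point, keeping the grid's order."""
    centroids = []
    for phase in grid:
        acts = [toy_encoder(render_sine(phase)) for _ in range(repeats)]
        centroids.append(tuple(sum(a[d] for a in acts) / repeats for d in range(2)))
    return centroids


grid_size = 1024
grid = [2.0 * math.pi * i / (grid_size - 1) for i in range(grid_size)]  # one full cycle
path = sweep_centroids(grid)

# Length of the path traced by the ordered centroids.
path_length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
print(f"{len(path)} centroids, path length {path_length:.4f}")
```

This toy encoder maps phase onto a circle of radius 0.5, so the ordered centroids trace a closed loop whose length is close to π. A real layer gives no such guarantee, which is why the path has to be measured.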
[Interactive figure: Toto-2.0-2.5B at layer 4, sweeping sine phase across a full cycle. Left: the input sine sliding along. Right: a dot riding the manifold the centroids trace.]
Experimental setup (protocol manifold_v0).
- Grid. 1,024 points (grid_size_1d), one sample each (repeats_per_grid_point = 1); only the target factor varies.
- Factor geometry. Each factor is swept on the ruler that matches it: interval (time, amplitude), circle with a known period (phase), or signed-log (slope).
- Nuisance. Everything else is pinned at a canonical, non-degenerate value (fixed_factor_policy = canonical_non_degenerate).
- Validation. The same grid shifted half a step (validation_policy = half_grid_offset), train seed 0, val seed 1: an interleaved hold-out, free of leakage.
- Representation. Activations are mean-pooled per layer and reduced to pca_dim = 64 before any geometry is computed; every layer is measured.
- Geodesics. A k-nearest-neighbour graph over k ∈ {4, 6, 8}, best k chosen per layer.
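The half-step validation grid in the setup above is easy to picture in code. This is a sketch for an interval factor on [0, 1]; the function names are hypothetical, and how the real harness treats the endpoints or wraps a circular factor is not stated in this post, so dropping the point that would leave the range is an assumption.

```python
def inclusive_grid(lo, hi, grid_size):
    """Evenly spaced grid that includes both endpoints."""
    step = (hi - lo) / (grid_size - 1)
    return [lo + i * step for i in range(grid_size)]


def half_offset_grid(lo, hi, grid_size):
    """The same grid shifted by half a step; the point that would leave [lo, hi] is dropped."""
    step = (hi - lo) / (grid_size - 1)
    return [lo + (i + 0.5) * step for i in range(grid_size - 1)]


train = inclusive_grid(0.0, 1.0, 1024)
val = half_offset_grid(0.0, 1.0, 1024)

# Every validation point sits strictly between two adjacent training points.
interleaved = all(train[i] < val[i] < train[i + 1] for i in range(len(val)))
print(len(train), len(val), interleaved)
```

No validation value coincides with a training value, yet each one is bracketed by two training neighbours, so the hold-out tests interpolation along the ruler and not memorised grid points.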
Two readouts
Decodable is not organized
The standard way to interrogate a frozen representation is a linear probe: train a small linear readout and ask "can I recover the factor from these activations?" We run it as the primary benchmark. But a probe only needs one direction where the factor reads out — so it is silent on a deeper question:
Does the representation arrange examples according to the true shape of the process that generated them?
The manifold answers that. And it catches things the other instruments structurally cannot — because they are asking different questions:
| Method | The question it answers | Blind spot |
|---|---|---|
| Linear probe | Is factor X linearly decodable? | One direction suffices — silent on curvature, folding, neighbourhood scramble. |
| Attribution / saliency | Which inputs drove this prediction? | Explains an output in input-space, not how a factor is laid out in the representation. |
| SAE features | Which monosemantic features fire? | A curved manifold is diluted across many features — you see the endpoints, not the path. |
| Aionoscope geometry | Did the model build the right shape for a known factor? | Needs a controllable generative factor; v0 reads frozen, mean-pooled features linearly. |
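The linear probe's blind spot can be reproduced on a synthetic toy, with no model involved. Below, a factor on [0, 1] is embedded as a tightly wound helix: one axis carries the factor linearly and the other two wind around it. The helix, the hand-picked readout direction, and this simple kNN-recall definition are all illustrative assumptions, not the harness's implementation.

```python
import math

n = 256
t = [i / (n - 1) for i in range(n)]        # the swept factor, on [0, 1]
turns = 32                                  # how tightly the toy representation winds

# Toy representation: axis 0 carries the factor linearly, axes 1-2 wind around it.
reps = [(ti, math.cos(2.0 * math.pi * turns * ti), math.sin(2.0 * math.pi * turns * ti))
        for ti in t]


def r_squared(y_true, y_pred):
    """Coefficient of determination of a readout."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def knn(index, points, k, dist):
    """Indices of the k nearest other points under `dist`."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: dist(points[index], points[j]))
    return set(others[:k])


# A linear readout along one hand-picked direction w = (1, 0, 0).
w = (1.0, 0.0, 0.0)
decoded = [sum(wi * xi for wi, xi in zip(w, x)) for x in reps]
probe_r2 = r_squared(t, decoded)

# Fraction of true 2-nearest neighbours that are still 2-nearest in the representation.
k = 2
recall = sum(
    len(knn(i, t, k, lambda a, b: abs(a - b)) & knn(i, reps, k, math.dist))
    for i in range(n)
) / (n * k)

print("probe R^2:", probe_r2)
print("kNN recall@2:", recall)
```

The readout is perfect because one direction is enough, while the neighbour recall collapses: each point's nearest neighbours in the representation sit one full turn away, not next to it on the ruler.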
The cleanest demonstration in the suite is a single spike slid across time. Its position is a plain interval from 0 to 1, and a linear probe recovers it well. Yet at the input-facing layer the raw geometry is folded rather than laid out as a straight interval.
Geometry is also a per-layer reading. The straight-line score climbs with depth as the network promotes coarse position into a clean dominant axis: the folding doesn't vanish, it gets out-voted.
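The difference between a straight-line reading and a walk along the manifold can be sketched on a folded toy curve. The example below uses a 315-degree arc whose ends nearly meet, builds a symmetric k-nearest-neighbour graph with k = 4, and runs Dijkstra over it. The arc, the function names, and the fixed k are assumptions for illustration; the harness chooses k per layer and its exact graph construction is not described here.

```python
import heapq
import math

n = 200
arc = 1.75 * math.pi                        # a 315-degree arc: the ends nearly meet
points = [(math.cos(arc * i / (n - 1)), math.sin(arc * i / (n - 1))) for i in range(n)]


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        for j in order[:k]:
            d = math.dist(p, points[j])
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_from(source, graph):
    """Shortest-path distance from `source` to every reachable node (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                         # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = knn_graph(points, k=4)
geo = geodesic_from(0, graph)

straight = math.dist(points[0], points[-1])  # cuts across the gap
along = geo[n - 1]                           # walks around the arc
print(f"straight-line {straight:.3f}, geodesic {along:.3f}, arc length {arc:.3f}")
```

The two ends of the ruler are far apart in factor-space, and the geodesic agrees with that; the straight-line distance does not, because the curve has bent back toward itself.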
Four lenses
How the geometry is scored
"Faithful geometry" isn't one number. The harness scores
each manifold through four independent lenses, and the
artifact even records the best layer under each
(best_isometry_layer,
best_neighborhood_layer,
best_projection_layer,
best_fiber_layer). A representation can ace
one and fail another — which is exactly what makes them
worth separating.
1 · Isometry
Does distance in the representation track distance in factor-space? Compared both as a straight line and as a walk along the manifold; the gap between them is how badly the geometry is folded.
stress_scaled · spearman_latent_vs_linear / _geodesic · geodesic_gain
2 · Neighbourhood
Are a point's true neighbours still its neighbours in the representation? Local resolution, independent of the global shape.
knn_recall_at_1 / _3 / _5 · trustworthiness · continuity
3 · Projection & dimension
Can a low-dimensional chart hold the factor, and how many independent axes does the path really use? Participation ratio PR = (Σλ)² / Σλ² reads the effective dimension straight off the covariance spectrum.
4 · Fiber
Over each grid point sits a little cloud of repeats: the fiber. Does it stay tight on the curve, or smear off it? For circular factors, does the loop actually close?
mean / median / max_fiber_ratio · between_to_within_snr · cycle_closure_error
They genuinely disagree. Sine phase is a good example: the ring is globally pristine at every depth (isometry ≈ 1.0), yet local resolution sags toward the middle of the network, as the model trades away fine neighbour ordering for whatever the forecasting objective needs.
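The participation ratio from the projection lens is simple enough to compute by hand. The sketch below uses the identity that, for a symmetric covariance matrix C, Σλ equals trace(C) and Σλ² equals the sum of C's squared entries, so no eigendecomposition is required. The function name and the two test shapes are illustrative; the harness's own implementation may differ.

```python
import math


def participation_ratio(points):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance.

    For a symmetric matrix C, sum(lambda) = trace(C) and sum(lambda^2) equals the
    sum of C's squared entries, so the spectrum never has to be computed.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / frob_sq


m = 512
line = [(i / m, 2.0 * i / m, 0.0) for i in range(m)]
ring = [(math.cos(2.0 * math.pi * i / m), math.sin(2.0 * math.pi * i / m), 0.0)
        for i in range(m)]

print(f"line PR = {participation_ratio(line):.3f}")
print(f"ring PR = {participation_ratio(ring):.3f}")
```

A straight path uses one axis (PR = 1) and a flat ring uses two equally (PR = 2), even though both live in a three-dimensional space here. PR is unchanged by rescaling the points, so it reads shape, not size.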
Reproducibility
What you'd need to rerun it
Every reading on the dashboard is one JSON artifact, and it carries its own provenance — the exact grid that was swept, the seeds, the config, and the per-layer scores. Nothing is plotted that isn't stored; this page only reads artifacts. Here is the shape of one (arrays and layers abbreviated):
    // results/manifolds/Toto-2.0-2.5B/sine_phase/metrics.json
    {
      "schema_version": "manifold_result_v0",
      "target": {
        "target_name": "sine_phase",
        "component": "sine",
        "parameter": "phase",
        "geometry": "circle",
        "period": 6.2831853
      },
      "config": {
        "mode": "controlled_factor_slices",
        "grid_size_1d": 1024,
        "num_enabled": 1,
        "pca_dim": 64,
        "geodesic_neighbors": [4, 6, 8],
        "validation_policy": "half_grid_offset"
      },
      "train_slice_manifest": {
        "physical_values": [0.0, 0.0061, …, 6.2832],    // the dial we turned
        "latent_coordinates": [0.0, 0.00098, …, 1.0],   // normalised grid
        "seed": 0,
        "grid_size": 1024
      },
      "by_layer": {
        "4": {
          "stress_scaled": …,
          "spearman_latent_vs_geodesic": …,
          "knn_recall_at_5": …,
          "projection_r2": …,
          "circular_order_score": …,
          "mean_fiber_ratio": …
        }
      },
      "summary": {
        "best_isometry_layer": {…},
        "best_neighborhood_layer": {…},
        "best_projection_layer": {…},
        "best_fiber_layer": {…}
      }
    }
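The abbreviated arrays in the manifest can be checked against the config with a few lines of arithmetic. The sketch assumes an evenly spaced grid that includes both endpoints, which is what the first and last entries shown above are consistent with; the variable names mirror the artifact's keys, but the construction itself is an inference, not the harness's code.

```python
grid_size = 1024            # config.grid_size_1d
period = 6.2831853          # target.period

step = period / (grid_size - 1)   # assumes both endpoints are grid points
physical_values = [i * step for i in range(grid_size)]
latent_coordinates = [i / (grid_size - 1) for i in range(grid_size)]

print(round(physical_values[1], 4), round(physical_values[-1], 4))
print(round(latent_coordinates[1], 5), latent_coordinates[-1])
```

The second entries round to 0.0061 and 0.00098, and the last entries land on 6.2832 and 1.0, matching the manifest. A grid that excluded the upper endpoint would stop one step short of both.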
And the two commands that produce a reading:
    # geometry: sweep one factor for one model
    uv run python scripts/run_manifold_calibration_sequential.py \
      --model Toto-2.0-2.5B --target sine_phase

    # probe benchmark: categorical + dense linear readouts
    uv run python -m aionoscope_benchmarks.run_model \
      --model Toto-2.0-2.5B --num-enabled 1
What this is — and what it isn't.
- Readouts are linear and run on mean-pooled features (v0). Non-linear probes and token-level pooling are future work; pooling can hide structure a probe might still find.
- One dataset seed and a fixed probe schedule. Full seed sweeps multiply runtime by orders of magnitude and aren't run yet.
- Each model is evaluated at its native sequence length — ecological, but lengths differ across models, so absolute layer indices aren't comparable one-to-one.
- Code release. The generator and the benchmark harness aren't public yet — we're preparing them for release and will link them here soon.
Get involved
Point it at your model
If you train or ship a time-series model, the most useful thing you can do is tell us where the geometry should be cleaner than it is — a factor your model ought to nail, a layer that surprises you, a component we don't yet generate. The instrument is only as sharp as the questions put to it.
- Suggest a model or a generative factor to add.
- Tell us which lens matters for your use case.
- Flag a reading that looks wrong — those are the best bug reports.