Langotime Aionoscope · The Instrument · Part 0

The answer key

Every interpretability method shares one handicap: it reads a model's activations on data nobody fully labelled, then guesses what the model understood. Aionoscope removes the guess. It synthesises the signal — a sine of exactly this phase, a spike at exactly this sample, a trend of exactly this slope — so the latent truth is known to the last decimal before the model ever sees it. Then it asks a sharper question than "can you decode it?": did the model build the right shape?

The other posts in this series walk around finished models and read the shapes they built — across layers, across sizes, across training. This one opens the instrument that produces those readings. Mechanically it is two things bolted together: a generator that writes time-series signals whose every generative factor is known exactly, and a probe harness that pushes those signals through a frozen model and measures — with a linear readout and with geometry — how faithfully the representation lays the factor out.

Because the signal is synthesised, the ground truth isn't inferred from correlations in messy data; it's the dial we turned. That single fact is what lets the geometry be a measurement rather than a picture. The rest of this page is the build: how a signal is generated, how one factor becomes a 1,024-point ruler, the two ways we read the activations back, and the four lenses that turn a manifold into numbers.

What it is · A signal generator bolted to a probe

Aionoscope has two halves, and keeping them straight makes everything else click into place.

  • The generator (aiono). A synthesis engine. A Process samples a latent state — which components are present and their exact parameters — and a View chain renders that state into an observed waveform. The latent state is the answer key.
  • The probe harness. It feeds generated signals to a frozen foundation model, taps the hidden activations at every layer, and scores them two ways: a linear probe (is the factor decodable?) and a manifold analysis (is the factor's geometry laid out faithfully?).

Nothing about the model is changed: no fine-tuning, no gradient through the encoder. We only read what the frozen network already computes.

Figure: The Aionoscope pipeline. Generator on the left of the dashed seam, probe harness on the right.
The first four stages are the generator; from the dashed arrow on, the probe harness. Only the generator knows the truth, and it passes that truth to the metrics, which is what makes the score trustworthy.

The generator · A signal you understand to the last decimal

The generator is a tiny world with a fixed grammar. Everything observable is built from a library of 14 components across four families:

  • Periodic: sine, sawtooth, square (amplitude, frequency, phase, offset, duty cycle).
  • Trend: linear, quadratic, log, sigmoid (slope, curvature, centre, sharpness).
  • Event: spike, level_change, gaussian bump (time, magnitude, width).
  • Noise & baseline: gaussian_noise, uniform_noise, random_walk_noise, constant (scale, offset).

A sample mixes num_enabled ∈ {1, 2, 3} of them. Every knob each component exposes is drawn from a sampler, and the exact drawn values are written into meta['process']['samples']; this is the whole point. Those numbers are the labels every probe is trained against and every manifold is scored against.

Figure: Build a signal · read its answer key. Panels: observed signal (1.024 s @ 500 Hz); the answer key the harness records.
On real data you would have to infer these quantities and hope you were right. Here they are the inputs — so when the harness later asks "did the model represent the spike's time?", the correct answer is sitting in the table, not estimated from a downstream metric.
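As a rough illustration of what "the latent state is the answer key" means in practice, here is a minimal standard-library sketch. It is not aiono's API: `sample_signal`, the parameter ranges, and the inner layout of the dictionary are assumptions made for illustration. Only the meta['process']['samples'] nesting and the 1.024 s @ 500 Hz window come from the text.

```python
import math
import random

SAMPLE_RATE_HZ = 500   # from the figure caption: 1.024 s @ 500 Hz
NUM_SAMPLES = 512      # 1.024 s * 500 Hz


def sample_signal(seed):
    """Draw a sine plus a spike; return (waveform, meta). Hypothetical helper."""
    rng = random.Random(seed)
    # Latent state: the exact parameters drawn for each enabled component.
    # The ranges below are illustrative guesses, not aiono's samplers.
    latent = {
        "sine": {
            "amplitude": rng.uniform(0.5, 2.0),
            "frequency": rng.uniform(1.0, 20.0),        # Hz
            "phase": rng.uniform(0.0, 2.0 * math.pi),   # radians
        },
        "spike": {
            "time_frac": rng.random(),                  # position in [0, 1)
            "magnitude": rng.uniform(1.0, 5.0),
        },
    }
    # Render the latent state into the observed waveform.
    sine = latent["sine"]
    wave = [
        sine["amplitude"]
        * math.sin(2.0 * math.pi * sine["frequency"] * n / SAMPLE_RATE_HZ + sine["phase"])
        for n in range(NUM_SAMPLES)
    ]
    spike_index = int(latent["spike"]["time_frac"] * NUM_SAMPLES)
    wave[spike_index] += latent["spike"]["magnitude"]
    # Same outer nesting the text names; the inner layout is a guess.
    meta = {"process": {"samples": latent}}
    return wave, meta


wave, meta = sample_signal(seed=0)
# The label a probe for the spike's time would be trained against:
spike_time_label = meta["process"]["samples"]["spike"]["time_frac"]
```

Calling `sample_signal` again with the same seed returns the same waveform and the same labels, which is the property the fixed-seed splits rely on.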

Reproducible by construction. Splits are materialised from fixed seeds (train seed 0; ten validation streams with disjoint seeds), so the same command regenerates the same 65,536 signals every time — no dataset to download, no drift between runs.

The sweep · One factor, a 1,024-point ruler

To study geometry we stop mixing and switch to a controlled slice. Pick one factor — say a sine's phase — switch every other component off (num_enabled = 1), and step that one knob across a fine grid of 1,024 values. Push all 1,024 signals through the model, average the activations at each grid point into a centroid, and order the centroids by the value you swept. They trace a path through representation space. That path is the object we score.
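The sweep can be sketched end to end with a stand-in for the model. In the sketch below, `toy_encoder` is a hypothetical two-feature linear map, not a real foundation model, and the sine frequency is chosen (as an assumption) so that a whole number of cycles fits the 512-sample window. With one sample per grid point, each centroid is simply that sample's features.

```python
import math

NUM_SAMPLES = 512   # 1.024 s @ 500 Hz
GRID_SIZE = 1024    # grid points along the swept factor
CYCLES = 5          # whole cycles per window (assumption, keeps the toy exact)

# Fixed basis used by the stand-in encoder.
COS_BASIS = [math.cos(2.0 * math.pi * CYCLES * n / NUM_SAMPLES) for n in range(NUM_SAMPLES)]
SIN_BASIS = [math.sin(2.0 * math.pi * CYCLES * n / NUM_SAMPLES) for n in range(NUM_SAMPLES)]


def render_sine(phase):
    """A unit-amplitude sine with every other component switched off."""
    return [math.sin(2.0 * math.pi * CYCLES * n / NUM_SAMPLES + phase) for n in range(NUM_SAMPLES)]


def toy_encoder(signal):
    """Hypothetical stand-in for one frozen layer: two fixed linear features."""
    return (
        sum(x * c for x, c in zip(signal, COS_BASIS)),
        sum(x * s for x, s in zip(signal, SIN_BASIS)),
    )


# Step the one knob across the grid, from 0 to 2*pi inclusive.
phases = [2.0 * math.pi * i / (GRID_SIZE - 1) for i in range(GRID_SIZE)]

# Centroids ordered by the swept value: this ordered list is the path.
path = [toy_encoder(render_sine(phase)) for phase in phases]
```

For this toy encoder the path is a circle of radius 256 that returns to its starting point, because phase 0 and phase 2π render the same signal. A real model gives no such guarantee, which is why the path has to be scored.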

The figure below shows Toto-2.0-2.5B at layer 4 as the phase sweeps a full cycle: on the left the input sine slides along; on the right a dot rides the manifold the centroids trace.

Figure: Input signal ↔ manifold · Toto-2.0-2.5B · sine_phase · layer 4. Panels: input signal (1.024 s @ 500 Hz); representation manifold (PCA).
The phase of a sine is a circular quantity: phase 0 and phase 2π are the same signal. As the dial turns, the centroids close into a near-perfect ring and return to the start. The grid (1,024 points), the half-step-offset validation grid, the 64-dimensional PCA the path lives in — those are the knobs of the sweep, spelled out next.

Experimental setup (protocol manifold_v0).

  • Grid. 1,024 points (grid_size_1d), one sample each (repeats_per_grid_point = 1); only the target factor varies.
  • Factor geometry. Each factor is swept on the ruler that matches it — interval (time, amplitude), circle with a known period (phase), or signed-log (slope).
  • Nuisance. Everything else is pinned at a canonical, non-degenerate value (fixed_factor_policy = canonical_non_degenerate).
  • Validation. The same grid shifted half a step (validation_policy = half_grid_offset), train seed 0, val seed 1 — an interleaved hold-out, free of leakage.
  • Representation. Activations are mean-pooled per layer and reduced to pca_dim = 64 before any geometry is computed; every layer is measured.
  • Geodesics. A k-nearest-neighbour graph over k ∈ {4, 6, 8}, best k chosen per layer.
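The grid and its half-step-offset validation twin can be sketched in a few lines. The function names are hypothetical, and two details are assumptions: that the normalised grid runs from 0 to 1 inclusive (which matches the first values shown in the artifact further down), and that the validation grid consists of the midpoints between consecutive training points.

```python
import math

GRID_SIZE = 1024  # grid_size_1d


def train_grid(n=GRID_SIZE):
    """Normalised latent coordinates, evenly spaced on [0, 1]."""
    return [i / (n - 1) for i in range(n)]


def validation_grid(n=GRID_SIZE):
    """The same spacing shifted by half a step: midpoints between train points."""
    step = 1.0 / (n - 1)
    return [(i + 0.5) * step for i in range(n - 1)]


def to_physical(u, period=2.0 * math.pi):
    """Map a normalised coordinate onto a circular factor with a known period."""
    return u * period


train = train_grid()
val = validation_grid()

# Every validation point sits strictly between two training points,
# so no validation value is ever seen during training.
interleaved = all(train[i] < val[i] < train[i + 1] for i in range(len(val)))
```

Interleaving tests interpolation along the ruler: the held-out values are new, but each one has a trained neighbour half a step away on either side.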

Two readouts · Decodable is not organized

The standard way to interrogate a frozen representation is a linear probe: train a small linear readout and ask "can I recover the factor from these activations?" We run it as the primary benchmark. But a probe only needs one direction where the factor reads out — so it is silent on a deeper question:

Does the representation arrange examples according to the true shape of the process that generated them?

The manifold answers that. And it catches things the other instruments structurally cannot — because they are asking different questions:

Method | The question it answers | Blind spot
Linear probe | Is factor X linearly decodable? | One direction suffices; silent on curvature, folding, and neighbourhood scramble.
Attribution / saliency | Which inputs drove this prediction? | Explains an output in input-space, not how a factor is laid out in the representation.
SAE features | Which monosemantic features fire? | A curved manifold is diluted across many features; you see the endpoints, not the path.
Aionoscope geometry | Did the model build the right shape for a known factor? | Needs a controllable generative factor; v0 reads frozen, mean-pooled features linearly.
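The linear-probe blind spot can be made concrete with a toy representation. The sketch below is an illustration under stated assumptions, not the harness's probe: the two-dimensional "representation" is invented, with a large coordinate that zigzags back and forth 11 times and a small coordinate that tracks the factor directly. The probe is plain ordinary least squares solved with a hand-written 3×3 solver so the example needs no dependencies.

```python
def triangle(x):
    """Triangle wave with period 2 and range [0, 1]."""
    x = x % 2.0
    return x if x <= 1.0 else 2.0 - x


def solve(a, b):
    """Solve a small system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_linear_probe(features, targets):
    """Ordinary least squares with a bias term, via the normal equations."""
    rows = [list(f) + [1.0] for f in features]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(gram, rhs)


def r_squared(features, targets, weights):
    preds = [sum(w * x for w, x in zip(weights, list(f) + [1.0])) for f in features]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


N, FOLDS = 200, 11
latent = [i / (N - 1) for i in range(N)]
# Invented representation: a dominant folded axis plus a faint ordered one.
features = [(triangle(FOLDS * t), 0.05 * t) for t in latent]

weights = fit_linear_probe(features, latent)   # [folded axis, ordered axis, bias]
r2 = r_squared(features, latent, weights)
```

The probe reaches an R² of essentially 1 by putting its weight on the faint ordered coordinate and ignoring the folded one. Its score says nothing about the fact that most of the variance in this representation stacks far-apart factor values on top of each other.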

The cleanest demonstration in the suite is a single spike slid across time. Its position is a plain interval from 0 to 1 — a linear probe recovers it well. Yet look at the raw geometry at the input-facing layer.

  • 0.12: straight-line distance order (Spearman), which looks almost random
  • 0.97: distance-along-the-manifold order (geodesic), nearly perfect
  • +0.85: geodesic gain; straight-line distance is blind to the order
Figure: Input signal ↔ manifold · Toto-2.0-2.5B · spike_time_frac · layer 0. Panels: input signal (a single impulse); representation manifold (PCA). Colour = spike position in time.
The factor is decodable, but the geometry is folded: Toto reads the signal in 32-sample patches, so a one-sample spike mostly reports where inside its patch it landed. The time axis folds into ~11 stacked copies. A probe finds its one good direction and scores fine; the shape is still scrambled.
Figure: Distance scatter · Toto-2.0-2.5B · spike_time_frac · layer 0.
For every pair of grid points: true spike-distance (x) vs. representation distance (y). Blue (straight-line) is a scrambled cloud — folding stacks far-apart spikes on top of each other. Red (geodesic), measured along the path, threads through the folds and snaps into a clean rising band. Same activations, two rulers — only one sees the order.

Geometry is also a per-layer reading. Watch the straight-line score climb with depth as the network promotes coarse position into a clean dominant axis — the folding doesn't vanish, it gets out-voted.

Figure: Metrics across layers · Toto-2.0-2.5B · spike_time_frac. The blue and red lines converge with depth.
Early on, geodesic order is near-perfect while straight-line order is poor — a big positive geodesic gain. Deeper, the straight-line score climbs to ~0.99 and the gain falls to zero. Decodability barely changed across all of this; the shape is the part that moved.

Four lenses · How the geometry is scored

"Faithful geometry" isn't one number. The harness scores each manifold through four independent lenses, and the artifact even records the best layer under each (best_isometry_layer, best_neighborhood_layer, best_projection_layer, best_fiber_layer). A representation can ace one and fail another — which is exactly what makes them worth separating.

1 · Isometry

Does distance in the representation track distance in factor-space? Compared both as a straight line and as a walk along the manifold; the gap between them is how badly the geometry is folded.

stress_scaled · spearman_latent_vs_linear / _geodesic · geodesic_gain
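A minimal sketch of the two Spearman scores and the gain between them, under stated assumptions: the representation is an invented 2-D spiral (a curve that wraps back near itself), the geodesic is a shortest path over a symmetric k-nearest-neighbour graph with k fixed at 2, and all helper names are hypothetical. The harness's own graph construction and choice of k may differ.

```python
import heapq
import math


def ranks(values):
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0
        i = j + 1
    return out


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    adj = [dict() for _ in points]
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            adj[i][j] = d
            adj[j][i] = d
    return adj


def geodesics_from(adj, source):
    """Dijkstra shortest-path lengths from one node to all others."""
    dist = [math.inf] * len(adj)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


N, TURNS = 300, 3
latent = [i / (N - 1) for i in range(N)]
# Invented representation: a spiral whose turns pass close to each other.
points = [((1 + t) * math.cos(2 * math.pi * TURNS * t),
           (1 + t) * math.sin(2 * math.pi * TURNS * t)) for t in latent]

geo = [geodesics_from(knn_graph(points, k=2), 0)] if False else None  # placeholder removed below
adj = knn_graph(points, k=2)
geo = [geodesics_from(adj, i) for i in range(N)]
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]

true_d = [j - i for i, j in pairs]                        # factor distance, in grid steps
line_d = [math.dist(points[i], points[j]) for i, j in pairs]
geo_d = [geo[i][j] for i, j in pairs]

rho_linear = spearman(true_d, line_d)
rho_geodesic = spearman(true_d, geo_d)
geodesic_gain = rho_geodesic - rho_linear
```

On this toy curve the straight-line score is low and the geodesic score is high, so the gain is large and positive. A gain near zero would mean the two rulers agree, which is what the text reports for the deeper layers of the spike sweep.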

2 · Neighbourhood

Are a point's true neighbours still its neighbours in the representation? Local resolution, independent of the global shape.

knn_recall_at_1 / _3 / _5 · trustworthiness · continuity
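One plausible reading of k-NN recall, sketched with the standard library: for each grid point, take its k nearest neighbours in factor-space and in the representation, and report the fraction that overlap. This definition, the helper names, and the toy data are assumptions; the harness's exact formula is not given in the text. The demo uses k = 4 on a ring of distinct points so that neighbour sets are free of ties.

```python
import math
import random


def knn_sets(dist_fn, n, k):
    """For each index, the set of its k nearest other indices under dist_fn."""
    out = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist_fn(i, j))
        out.append(set(order[:k]))
    return out


def knn_recall(latent_dist, rep_dist, n, k):
    """Fraction of true k-nearest neighbours that are also neighbours in the representation."""
    true_nn = knn_sets(latent_dist, n, k)
    rep_nn = knn_sets(rep_dist, n, k)
    return sum(len(a & b) for a, b in zip(true_nn, rep_nn)) / (n * k)


N, K = 64, 4
# N distinct phases; the endpoint 2*pi is left out because it duplicates 0.
phases = [2.0 * math.pi * i / N for i in range(N)]


def circular_distance(i, j):
    """Distance on the circle: the shorter way round."""
    d = abs(phases[i] - phases[j]) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


# A faithful representation (a ring) and a scrambled one (same points, shuffled).
ring = [(math.cos(p), math.sin(p)) for p in phases]
shuffled = ring[:]
random.Random(0).shuffle(shuffled)

recall_ring = knn_recall(circular_distance, lambda i, j: math.dist(ring[i], ring[j]), N, K)
recall_shuffled = knn_recall(circular_distance, lambda i, j: math.dist(shuffled[i], shuffled[j]), N, K)
```

The shuffled case keeps the global shape, since the set of points is still a perfect circle, but destroys which point sits next to which. That is the sense in which this lens is independent of the global shape.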

3 · Projection & dimension

Can a low-dimensional chart hold the factor, and how many independent axes does the path really use? Participation ratio PR = (Σλ)² / Σλ² reads the effective dimension straight off the covariance spectrum.

projection_r2 · pca explained-variance · participation_ratio
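The participation ratio can be computed without an eigendecomposition, because for a symmetric covariance matrix C the sum of eigenvalues is trace(C) and the sum of squared eigenvalues is trace(C·C), which equals the sum of squared entries of C. The sketch below uses that identity; the helper names and the toy point sets are illustrative.

```python
import math


def covariance(points):
    """Population covariance matrix of a list of equal-length tuples."""
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    return [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n for b in range(d)]
        for a in range(d)
    ]


def participation_ratio(cov):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For symmetric cov: sum(lambda) = trace(cov) and
    sum(lambda^2) = sum of squared entries of cov.
    """
    trace = sum(cov[i][i] for i in range(len(cov)))
    sum_sq = sum(x * x for row in cov for x in row)
    return trace * trace / sum_sq


N = 256
# A ring embedded in 3-D uses two axes equally.
ring = [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N), 0.0) for i in range(N)]
# A straight line embedded in 3-D uses one axis, however it is oriented.
line = [(i / N, 2.0 * i / N, -i / N) for i in range(N)]

pr_ring = participation_ratio(covariance(ring))
pr_line = participation_ratio(covariance(line))
```

The ring comes out at 2 and the line at 1, up to floating-point error, matching the intuition that PR counts how many axes the path spreads its variance over.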

4 · Fiber

Over each grid point sits a little cloud of repeats — the fiber. Does it stay tight on the curve, or smear off it? For circular factors, does the loop actually close?

mean / median / max_fiber_ratio · between_to_within_snr · cycle_closure_error
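The text does not define the fiber metrics, so the following is only one plausible reading of two of the names, written as a hedged sketch: a between-to-within ratio (spread of the centroids over the mean spread inside each cloud) and a closure error (gap between the first and last centroid, in units of the mean step along the path). Both formulas, the helper names, and the toy data are assumptions. The toy uses 8 repeats per grid point purely so that a cloud exists to measure.

```python
import math
import random


def centroid(cloud):
    d = len(cloud[0])
    return tuple(sum(p[a] for p in cloud) / len(cloud) for a in range(d))


def between_to_within_snr(fibers):
    """Spread of the centroids divided by the mean spread inside each fiber."""
    centroids = [centroid(f) for f in fibers]
    grand = centroid(centroids)
    between = sum(math.dist(c, grand) ** 2 for c in centroids) / len(centroids)
    within = sum(
        sum(math.dist(p, c) ** 2 for p in f) / len(f)
        for f, c in zip(fibers, centroids)
    ) / len(fibers)
    return between / within


def cycle_closure_error(fibers):
    """Gap between first and last centroid, relative to the mean step along the path."""
    centroids = [centroid(f) for f in fibers]
    steps = [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]
    return math.dist(centroids[0], centroids[-1]) / (sum(steps) / len(steps))


def noisy_arc(total_angle, n=64, repeats=8, sigma=0.01, seed=0):
    """Toy fibers: noisy repeats of each point along a unit-circle arc."""
    rng = random.Random(seed)
    fibers = []
    for i in range(n):
        angle = total_angle * i / (n - 1)
        fibers.append([
            (math.cos(angle) + rng.gauss(0.0, sigma), math.sin(angle) + rng.gauss(0.0, sigma))
            for _ in range(repeats)
        ])
    return fibers


closed = noisy_arc(2.0 * math.pi)   # full cycle: endpoints coincide up to noise
open_arc = noisy_arc(math.pi)       # half cycle: endpoints are far apart

snr_closed = between_to_within_snr(closed)
closure_closed = cycle_closure_error(closed)
closure_open = cycle_closure_error(open_arc)
```

Under this reading, a loop that closes has a closure error well below one step, while an arc that stops halfway is tens of steps short of closing.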

They genuinely disagree. Sine phase is a good example: the ring is globally pristine at every depth (isometry ≈ 1.0), yet local resolution sags toward the middle of the network — the model trades away fine neighbour ordering for whatever the forecasting objective needs.

Figure: Metrics across layers · Toto-2.0-2.5B · sine_phase. Isometry vs. neighbourhood: same manifold, two lenses.
Distance preservation (the isometry lens) stays pinned near 1.0 from the first layer to the last — the circle is robust. But 5-NN recall (the neighbourhood lens) peaks early and sags in the middle: the shape holds while local resolution is spent elsewhere. One lens would have hidden this.

Reproducibility · What you'd need to rerun it

Every reading on the dashboard is one JSON artifact, and it carries its own provenance — the exact grid that was swept, the seeds, the config, and the per-layer scores. Nothing is plotted that isn't stored; this page only reads artifacts. Here is the shape of one (arrays and layers abbreviated):

// results/manifolds/Toto-2.0-2.5B/sine_phase/metrics.json
{
  "schema_version": "manifold_result_v0",
  "target": { "target_name": "sine_phase", "component": "sine",
              "parameter": "phase", "geometry": "circle", "period": 6.2831853 },
  "config": { "mode": "controlled_factor_slices", "grid_size_1d": 1024,
              "num_enabled": 1, "pca_dim": 64, "geodesic_neighbors": [4, 6, 8],
              "validation_policy": "half_grid_offset" },
  "train_slice_manifest": {
      "physical_values":    [0.0, 0.0061, …, 6.2832],   // the dial we turned
      "latent_coordinates": [0.0, 0.00098, …, 1.0],     // normalised grid
      "seed": 0, "grid_size": 1024 },
  "by_layer": {
      "4": { "stress_scaled": …, "spearman_latent_vs_geodesic": …,
             "knn_recall_at_5": …, "projection_r2": …,
             "circular_order_score": …, "mean_fiber_ratio": … } },
  "summary": { "best_isometry_layer": {}, "best_neighborhood_layer": {},
               "best_projection_layer": {}, "best_fiber_layer": {} }
}

And the two commands that produce a reading:

# geometry: sweep one factor for one model
uv run python scripts/run_manifold_calibration_sequential.py \
    --model Toto-2.0-2.5B --target sine_phase

# probe benchmark: categorical + dense linear readouts
uv run python -m aionoscope_benchmarks.run_model \
    --model Toto-2.0-2.5B --num-enabled 1

What this is — and what it isn't.

  • Readouts are linear and run on mean-pooled features (v0). Non-linear probes and token-level pooling are future work; pooling can hide structure a probe might still find.
  • One dataset seed and a fixed probe schedule. Full seed sweeps multiply runtime by orders of magnitude and aren't run yet.
  • Each model is evaluated at its native sequence length — ecological, but lengths differ across models, so absolute layer indices aren't comparable one-to-one.
  • Code release. The generator and the benchmark harness aren't public yet — we're preparing them for release and will link them here soon.

Get involved

Point it at your model

If you train or ship a time-series model, the most useful thing you can do is tell us where the geometry should be cleaner than it is — a factor your model ought to nail, a layer that surprises you, a component we don't yet generate. The instrument is only as sharp as the questions put to it.

  • Suggest a model or a generative factor to add.
  • Tell us which lens matters for your use case.
  • Flag a reading that looks wrong — those are the best bug reports.