What does that look like? When a model truly grasps a factor, the shape of the data shows up as the shape of the activations: the phase of a sine closes into a loop, a spike's position stretches into a line, a trend's slope fans out along an axis. (That framing — world-structure as activation-geometry — comes from Goodfire's Neural Geometry series, made for language models; we highly recommend reading it.)
Time series are the natural place to look. Language interpretability has to hunt for structure in a discrete, tangled space of tokens and concepts; time series hand it to you. The generative factors are continuous — the phase and frequency of a sine, the position of a spike, the slope of a trend — and a perfectly controlled sweep is free to make: step one knob across a fine grid, switch everything else off, and watch, layer by layer, what the representation does. Ground truth isn't inferred from correlations; it's the dial you turned.
That turns geometry into a practical diagnostic — which is what the Langotime Aionoscope manifold suite measures and its dashboard lets you explore. Using Toto-2.0-2.5B (Datadog's 2.5-billion-parameter forecasting model) as the first running example, this post explains the project, how to read the charts, and where the geometry exposes a model's blind spots — like a factor Toto's tokenizer quietly discards.
Why we built it
Decodable is not the same as organized
The standard way to interrogate a frozen representation is a linear probe: train a small linear readout and ask "can I recover the phase / the spike position / the slope from these activations?" It's a great question, and we still run it as our primary benchmark. But it only tells you whether a variable is linearly decodable. It is silent on a deeper question:
Does the representation arrange examples according to the true shape of the process that generated them?
Linear decodability can't settle that. A probe only needs one direction where the factor reads out — so it can score high while the manifold is curved, folded, or has its neighbours scrambled in every other direction, and it can score low on structure that's perfectly real but nonlinear, like a circle. Decodable and geometrically clean are different questions. We wanted to see the shape directly, so for each model we run a controlled experiment:
- Sweep one factor. Pick a generative knob (say, sine_phase) and step it across a fine 1,024-point grid, holding everything else fixed. Every other component is switched off (num_enabled = 1), so the only thing changing is the factor we care about.
- Collect frozen features. Push each signal through the model and read out the hidden activations at every layer. No fine-tuning, no probe training — just the representations the model already has.
- Build the manifold. Each grid point gives one activation vector — its centroid. Ordered by the swept value, those centroids trace a path through representation space. That path is the object we score.
We're using one sample per grid point here, so the "centroid" is actually just a single activation vector. We keep the name "centroid" to be in line with the general protocol, where several samples are collected at each grid point and averaged.
- Measure the geometry. Does the path preserve distances? Local neighbourhoods? The cyclic topology of a phase, or the monotone order of a spike position? Does it close into a loop when it should?
We do this for 45 foundation models (Toto, Moirai, Chronos, TiRex, Mantis, TimesFM, Sundial, LeNEPA and more) across seven generative factors, at every layer.
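The sweep protocol above can be sketched in a few lines. This is an illustration under stated assumptions, not the suite's code: `make_sine`, `toy_encoder` and `sweep_centroids` are hypothetical names, the window length, frequency and sampling rate are placeholders, and the "encoder" is a stand-in for reading hidden states out of a real frozen model. Whether the real grid includes the endpoint of the cycle is also an assumption here.

```python
import math

def make_sine(phase, n_samples=256, freq_hz=5.0, sample_rate_hz=500.0):
    """One sine window; only `phase` changes across the sweep (other values are placeholders)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz + phase)
            for n in range(n_samples)]

def toy_encoder(signal):
    """Hypothetical stand-in for a frozen model: returns {layer: activation vector}.
    A real run would read the model's hidden states at every layer instead."""
    mean = sum(signal) / len(signal)
    energy = sum(x * x for x in signal) / len(signal)
    return {0: [signal[0], signal[1]], 1: [mean, energy]}

def sweep_centroids(grid, make_signal, encoder):
    """Encode one signal per grid point; collect vectors per layer, in grid order."""
    per_layer = {}
    for value in grid:
        for layer, vector in encoder(make_signal(value)).items():
            per_layer.setdefault(layer, []).append(vector)
    return per_layer

n_grid = 1024
# One full cycle with the endpoint excluded (assumption), so phase 0 is not duplicated.
phase_grid = [2 * math.pi * i / n_grid for i in range(n_grid)]
centroids = sweep_centroids(phase_grid, make_sine, toy_encoder)
print(len(centroids[0]), len(centroids[0][0]))
```

With one sample per grid point, each "centroid" is simply the single activation vector, as noted above. In this toy, layer 0 holds the first two samples of the window, so its path is an ellipse traced once as the phase sweeps.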
How to read the dashboard
Three views, one manifold
Pick a model, a target (the swept factor) and a layer, and the dashboard shows the same manifold from three angles. The rest of this post walks through each one with a real example.
- Centroid path — the manifold itself, projected from its 64 PCA dimensions down to 2D or 3D in your browser. Each dot is one grid point, coloured by the value of the swept factor. A clean shape here is the headline result.
- Metrics across layers — line charts of each geometry score against depth, so you can see where in the network the structure lives and how it changes.
- Distance scatter & heatmap — the raw evidence behind the scores: for every pair of grid points, true distance in factor-space vs. distance in the representation, measured both as a straight line and as a walk along the manifold.
The charts below are live: they stream the very same JSON artifacts the dashboard reads, and render as you scroll.
Scope & caveats
What Langotime Aionoscope measures — and what it doesn't
- This is a representation-geometry evaluation, not a steering or causal-intervention result. We read frozen activations; we never replace them or continue generation from an edited state.
- It complements the linear-probe leaderboard — it does not replace it. A manifold can be clean but curved, or decodable but tangled; the two views answer different questions.
- The numbers are honest about geometry, not downstream accuracy. A high projection R² means a coordinate is recoverable by following the manifold — not that the model forecasts well.
- It is deliberately simple: controlled one-factor sweeps with everything else switched off, classical PCA and shortest-path geodesics. No UMAP or t-SNE in the metrics — only as optional eye-candy, never as a score.
How the four scores are computed
Every score starts from the same ingredients. Each grid point gives one centroid — its activation vector, reduced by PCA to 64 dimensions. For each pair of grid points \((i,j)\) we measure three distances:
- True distance \(d^{\text{true}}_{ij}\) — how far apart the two swept values actually are: \(|v_i - v_j|\), or the shorter way round the circle \(\min(|v_i-v_j|,\,P-|v_i-v_j|)\) for a cyclic factor like phase. This is the ground truth the metric names call “latent.”
- Straight-line distance \(d^{\text{lin}}_{ij}=\lVert c_i-c_j\rVert_2\) — Euclidean distance between the two centroids in the 64-D representation.
- Geodesic distance \(d^{\text{geo}}_{ij}\) — shortest path from \(i\) to \(j\) along a \(k\)-nearest-neighbour graph (edges weighted by straight-line distance; \(k\) picked from \(\{4,6,8\}\) to maximise the score). Distance along the manifold rather than straight through it.
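The three distances can be sketched as follows. This is a minimal pure-Python illustration, not the code in manifold_eval.py: the function names are hypothetical, the k-NN graph is symmetrised by adding an edge whenever either endpoint lists the other as a neighbour (an assumption), and \(k\) is fixed rather than picked from \(\{4,6,8\}\).

```python
import math

def true_distance(vi, vj, period=None):
    """|vi - vj|, or the shorter way round the circle when `period` is given."""
    d = abs(vi - vj)
    return min(d, period - d) if period is not None else d

def linear_distances(points):
    """Euclidean distance between every pair of centroids."""
    return [[math.dist(p, q) for q in points] for p in points]

def geodesic_distances(points, k):
    """All-pairs shortest paths over a symmetric k-NN graph (Floyd-Warshall)."""
    n = len(points)
    lin = linear_distances(points)
    geo = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        geo[i][i] = 0.0
        others = [j for j in range(n) if j != i]
        for j in sorted(others, key=lambda j: lin[i][j])[:k]:
            geo[i][j] = geo[j][i] = lin[i][j]  # edge weighted by straight-line distance
    for m in range(n):
        for i in range(n):
            via = geo[i][m]
            for j in range(n):
                if via + geo[m][j] < geo[i][j]:
                    geo[i][j] = via + geo[m][j]
    return geo

# Toy manifold: a unit half-circle sampled at 41 points.
n = 41
arc = [(math.cos(math.pi * i / (n - 1)), math.sin(math.pi * i / (n - 1))) for i in range(n)]
lin = linear_distances(arc)
geo = geodesic_distances(arc, k=4)
print(round(lin[0][-1], 3), round(geo[0][-1], 3))
print(true_distance(0.1, 0.9, period=1.0))
```

On the half-circle, the straight-line distance between the endpoints is the diameter (2), while the graph geodesic comes out close to the arc length \(\pi\): it walks along the manifold rather than cutting through it.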
Then, over all \(N(N-1)/2\) pairs of grid points:
- Spearman vs. linear \(=\rho_s\!\big(d^{\text{true}},\,d^{\text{lin}}\big)\) — rank correlation between the true distances and the straight-line distances. \(1.0\) means straight lines order every pair exactly as the true factor does.
- Spearman vs. geodesic \(=\rho_s\!\big(d^{\text{true}},\,d^{\text{geo}}\big)\) — the same, but with distance measured along the manifold.
- 5-NN recall \(=\dfrac{1}{N}\sum_i \dfrac{\big|\,\mathcal{N}^{\text{true}}_5(i)\cap\mathcal{N}^{\text{lin}}_5(i)\,\big|}{5}\) — for each point, the fraction of its 5 true nearest neighbours that are also among its 5 nearest in the representation, averaged over all points. Pure local faithfulness.
- Geodesic gain \(=(\text{Spearman vs. geodesic})-(\text{Spearman vs. linear})\) — a difference: how much pairwise order you recover by walking along the manifold instead of cutting straight across. Large and positive ⇒ the shape is curved or folded but locally faithful.
All three distance matrices are built from the same centroids; Spearman and recall use the upper-triangular pairs \((i<j)\). Computed in manifold_eval.py.
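Given the distance matrices, the four scores reduce to a rank correlation and a neighbour-set overlap. The sketch below is an assumption-laden illustration rather than the contents of manifold_eval.py: the helper names are hypothetical, ties get average ranks after rounding, and neighbour ties are broken by index. The toy manifold is a 300° arc for a non-cyclic factor, with the analytic arc length standing in for the k-NN graph geodesic.

```python
import math

def average_ranks(values, digits=9):
    """Ranks starting at 1; values equal after rounding share their average rank."""
    keys = [round(v, digits) for v in values]
    order = sorted(range(len(keys)), key=keys.__getitem__)
    ranks = [0.0] * len(keys)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and keys[order[end + 1]] == keys[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def upper_pairs(matrix):
    """Flatten the upper-triangular pairs (i < j)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def knn_recall(d_true, d_rep, k=5):
    """Mean fraction of each point's k true neighbours that are also its k nearest in d_rep."""
    n = len(d_true)
    def neighbours(d, i):
        others = [j for j in range(n) if j != i]
        return set(sorted(others, key=lambda j: (round(d[i][j], 9), j))[:k])
    return sum(len(neighbours(d_true, i) & neighbours(d_rep, i)) / k for i in range(n)) / n

# Toy: 60 grid points laid along 300 degrees of a unit circle.
n = 60
span = math.radians(300)
theta = [span * i / (n - 1) for i in range(n)]
points = [(math.cos(t), math.sin(t)) for t in theta]
d_true = [[abs(i - j) for j in range(n)] for i in range(n)]   # factor = grid index
d_lin = [[math.dist(p, q) for q in points] for p in points]
d_geo = [[abs(a - b) for b in theta] for a in theta]          # arc length, geodesic stand-in

rho_lin = spearman(upper_pairs(d_true), upper_pairs(d_lin))
rho_geo = spearman(upper_pairs(d_true), upper_pairs(d_geo))
recall = knn_recall(d_true, d_lin, k=5)
gain = rho_geo - rho_lin
print(rho_lin, rho_geo, recall, gain)
```

The arc is locally faithful, so recall stays high, but chords start shrinking again past 180°, so the straight-line Spearman falls below the geodesic one and the gain is positive: the "curved but locally faithful" signature described above.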
Example 1 · sine_phase
A phase is a circle
Start with the cleanest case. The phase of a sine wave is a circular quantity: phase 0 and phase 2π are the same signal. If a model truly understands phase, its centroids should trace a closed loop — and points near the start should sit right next to points near the end.
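One rough way to put a number on "closes into a loop" is to compare the gap between the last and first centroid with a typical step along the path. The `closure_ratio` helper below is a hypothetical diagnostic written for illustration; it is not one of the suite's four scores.

```python
import math

def closure_ratio(path):
    """Gap between last and first point, divided by the median step between consecutive points.
    Values near 1 mean the path closes about as tightly as it steps."""
    steps = sorted(math.dist(a, b) for a, b in zip(path, path[1:]))
    median_step = steps[len(steps) // 2]
    return math.dist(path[-1], path[0]) / median_step

n = 128
# Full cycle (endpoint excluded): the last point sits one step before the first.
ring = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)) for i in range(n)]
# Half cycle: the ends sit on opposite sides of the circle.
open_arc = [(math.cos(math.pi * i / n), math.sin(math.pi * i / n)) for i in range(n)]

print(closure_ratio(ring))
print(closure_ratio(open_arc))
```

For the closed ring the ratio is about 1; for the open half-circle the end-to-end gap is dozens of steps wide.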
Here is Toto-2.0-2.5B at layer 4, sweeping phase across a full cycle. Drag the slider (or hit play): on the left, the input sine slides along; on the right, a dot rides around the manifold. Colour runs from low phase (blue) to high (red).
One subtlety the layer view reveals: a shape can be globally right but locally fuzzy. The ring is preserved at every depth, yet the fine-grained ordering of immediate neighbours blurs in the deeper layers.
None of this is automatic. A clean ring is something a model has to build — and not every one does. Here is NuTime-Bias9 on the identical phase sweep, at layer 1. Drag the slider (or rotate the 3-D view) and watch the same dot try to ride the same loop.
Give it a few layers, though, and it recovers. By layer 4 NuTime has found the circle again — just not Toto's circle.
Example 2 · spike_time_frac
When a straight line lies
Now slide a single spike from the start of the window to the end. The factor is just its position in time — a plain interval from 0 to 1, exactly the kind of thing you'd expect to come out as a tidy line. At the input-facing layer 0 it comes out as a tangled triangle instead, with almost every position crushed into one corner. Drag the spike across time and watch the dot leap around it — then read on for why.
It helps to be explicit about how Toto reads a signal, since that's the whole cause — and it's tempting to picture the model taking one number at a time. It doesn't. Just as a language model splits text into tokens, Toto splits the series into fixed-length patches — 32 samples each — and the transformer sees one token per patch, not one per time step. A patch is the model's smallest unit of time, so a single spike carries no timestamp of its own: the model can register only which patch the spike fell in and where inside that patch it sat, the two coordinates this whole example turns on. Everything jumpy you just watched is the geometry being quantised to that patch grid instead of to time.
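The two coordinates are just a quotient and a remainder. A minimal sketch, assuming non-overlapping patches aligned to the start of the window (the text gives only the patch length of 32; `patch_coordinates` is a hypothetical helper):

```python
PATCH_LEN = 32  # patch length stated in the text

def patch_coordinates(sample_index, patch_len=PATCH_LEN):
    """Return (patch index, offset inside the patch) for a sample position."""
    return divmod(sample_index, patch_len)

for t in (30, 31, 32, 33):
    print(t, patch_coordinates(t))
# 30 (0, 30)
# 31 (0, 31)
# 32 (1, 0)
# 33 (1, 1)
```

Samples 31 and 32 are neighbours in time, yet they differ in both coordinates at once: the patch index steps up and the offset wraps from 31 back to 0. That wrap is the kind of discontinuity a patch-quantised geometry can inherit.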
That folding is exactly why a naïve straight-line distance lies — and it's the single most important idea in the suite, so it's worth seeing in the raw pairwise data. The distance scatter plots, for every pair of grid points, true spike-distance on the x-axis against representation distance on the y-axis. A faithful geometry would form a tight rising band.
So is the model stuck with this folded mess? No — with depth it untangles, and in a revealing way. Watch the straight-line score climb.
And here is where it gets pretty. By layer 21 the two things the tokenizer was really measuring — which patch, and where inside it — have pulled apart onto separate, orthogonal axes. The manifold turns into a comb: a row of teeth strung along one dominant axis. Rotate it (it defaults to 3-D).
The spike has a smoother cousin: gaussian_time_frac slides a soft bump across the same window. The bump is wide enough to light up several patches at once, so it never gets folded onto the patch grid the way a one-sample spike does — even at layer 0 its manifold is already a clean line (straight-line order ≈ 0.94 vs. the spike's 0.12). Drag it and compare.
Example 3 · linear_trend_slope
What the tokenizer throws away
The last example is the most subtle, because the geometry doesn't look broken — it looks empty, for a reason that has nothing to do with attention or depth. Before the first transformer layer runs, a time-series model has to turn raw numbers into tokens, and nearly all of them begin the same way: standardize the window — subtract a location, divide by a scale. Toto then squashes the result through asinh, its own signed-log, before patching. That front-end is what lets one model swallow temperatures, prices and megawatts interchangeably. The price is that it erases the absolute scale of the input.
For most factors that is a harmless convenience. Trend slope is the pathological case, because slope is pure scale — exactly what standardization is built to remove. Standardize a clean ramp y = slope·(t − t̄) by its own standard deviation and the slope cancels exactly: every magnitude collapses onto one and the same normalized ramp, and only the sign survives. Sweep from −10⁶ to +10⁶ and the model effectively sees one descending ramp, then one ascending ramp, and — across more than four orders of magnitude — almost nothing in between.
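The cancellation is easy to check numerically. The sketch below assumes a plain mean-and-standard-deviation standardizer with no floor and a hypothetical 64-sample window; it illustrates the argument and is not Toto's actual front-end.

```python
import statistics

def standardized_ramp(slope, n_samples=64):
    """Build y = slope * (t - t_mean) and standardize it by its own mean and std."""
    t_mean = (n_samples - 1) / 2
    ramp = [slope * (t - t_mean) for t in range(n_samples)]
    centre = statistics.fmean(ramp)
    scale = statistics.pstdev(ramp)  # population standard deviation, no floor (assumption)
    return [(y - centre) / scale for y in ramp]

shallow = standardized_ramp(1e-3)
steep = standardized_ramp(1e6)
falling = standardized_ramp(-1e6)

# Same normalized ramp across nine orders of magnitude; only the sign flips it.
print(max(abs(a - b) for a, b in zip(shallow, steep)))
print(max(abs(a + b) for a, b in zip(steep, falling)))
```

Both printed differences are at floating-point noise level: the slope's magnitude is gone after standardization, and a negative slope yields the mirror image of the same ramp.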
You might first blame the measurement. Scored across all 49 layers, the same representation looks nearly geometry-free under a linear ruler yet structured under a signed-log one — so perhaps we were only holding it to the wrong scale.
Before trusting either number, ask a blunter question: how much does the representation move at all as the slope sweeps? And ask it at layer 0 — the tokenizer's own output, before a single transformer block — so whatever we see is the front-end's doing, not something the network pieced together later. Walk the centroids in grid order and measure the jump between neighbours.
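That measurement amounts to a list of consecutive distances and a ratio of sums. A minimal sketch with hypothetical helper names and a toy path that only moves in the middle of the sweep:

```python
import math

def neighbour_jumps(centroids):
    """Distance between each centroid and the next one in grid order."""
    return [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]

def arc_share(jumps, lo, hi):
    """Fraction of the total path length contributed by jumps[lo:hi]."""
    total = sum(jumps)
    return sum(jumps[lo:hi]) / total if total else 0.0

# Toy path: frozen at one point, a short walk through the centre, frozen at another point.
flat_left = [(-1.0, 0.0)] * 40
centre = [(-1.0 + 0.1 * i, 0.0) for i in range(21)]
flat_right = [(1.0, 0.0)] * 40
path = flat_left + centre + flat_right

jumps = neighbour_jumps(path)
print(len(jumps), arc_share(jumps, 40, 60))
```

In this toy, essentially all of the arc length sits in the 20 central jumps, the same pattern of a path pinned at two points with all its motion crammed into the centre.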
10⁻¹⁰ once |slope| exceeds about 10⁻⁴). All the motion lives inside |slope| ≲ 5×10⁻⁵ — about 99% of the manifold's whole arc length, crammed into the centre — and it is the scaler's fingerprint, not slope geometry. A standardizer can't divide by zero, so it pins its divisor at a small floor: while the ramp is steep its own spread sets the scale and the slope cancels perfectly, but once the ramp is shallow enough to slip under that floor the division stops cancelling and a shrinking, slope-proportional residue leaks back in. The two ragged shoulders are that residue — the model's jittery response to a ramp dwindling into the floor — while the lone tall spike is the sign flip at dead centre, where the grid steps from the smallest negative slope straight to the smallest positive one and the residual ramp inverts. Every layer tells the same story.
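The floor mechanism can be reproduced with a toy standardizer. Everything specific here is an assumption made for illustration: the `max(std, floor)` form, the floor value of 1e-5, and the 64-sample window are not taken from Toto's implementation.

```python
import statistics

def floored_standardize(values, floor=1e-5):
    """Standardize, but never divide by less than `floor` (hypothetical form and value)."""
    centre = statistics.fmean(values)
    scale = max(statistics.pstdev(values), floor)
    return [(v - centre) / scale for v in values]

def ramp(slope, n_samples=64):
    t_mean = (n_samples - 1) / 2
    return [slope * (t - t_mean) for t in range(n_samples)]

for slope in (1e-9, 1e-7, 1e-3, 1e3):
    peak = max(floored_standardize(ramp(slope)))
    print(slope, peak)
```

Above the floor the peak of the standardized ramp is identical for every slope, so the slope has cancelled. Below it the divisor is stuck at the floor and the peak shrinks in proportion to the slope, which is the slope-proportional residue described above.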
You can feel it by hand. Drag the sweep below: the raw ramp swings from dead flat to a 10⁶ cliff — an enormous, obvious change — while the representation cursor stays pinned to one of two points and only comes alive as you cross the centre. Your eye tracks the full dynamic range; the model is holding on to a single sign bit.
Because layer 0 is the tokenizer's output with no transformer block behind it, what collapsed above is the tokenizer's doing — which means a model that tokenizes differently should leave a different trace at the very same stage. Toto's front-end divides the whole window by a single global scale, and for a pure ramp that scale is the slope, so it cancels. Our LeNEPA model never takes that global quotient: instead of rescaling the whole series at once, it cuts the signal into short patches and embeds each one with a small convolution, normalizing only locally — so the overall size of the trend is never divided away. Here is the identical measurement, same ±10⁶ sweep, at LeNEPA's layer 0: its own tokenizer output, the very stage where Toto's slope had already vanished.
And you can feel that by hand, the same way. Drag the sweep below: this time the cursor never freezes — as the slope scans from one 10⁶ cliff to the other, it slides along the manifold the whole way, the dead extremes included.
The lesson is about the tokenizer, not the ruler. Any factor that is a scale — trend slope, amplitude, a DC offset — is partly or wholly removed by a standardizing front-end before the transformer sees a thing, whereas a tokenizer that keeps the raw scale, like LeNEPA's, encodes it across the entire range. A flat geometry score under one model can be a genuine signal under another; it depends on what the front-end threw away. Reading these manifolds well means knowing what each model keeps at its door — and treating scale-like factors with suspicion, however clean a logarithmic ruler can make them look.
Example 4 · sine_frequency_hz
Built layer by layer
The examples so far each froze on a single layer. This one is about depth itself. Phase arrived as a clean ring at the very first layer; frequency is something Toto has to build. Sweep a sine's frequency from about 1 Hz up to the edge of what 500 Hz sampling can carry, and watch the same manifold at eight depths at once.
One thing stays stubborn: the highest frequencies (red) never join the tidy arc — they sit in a scattered cloud at every depth. It is tempting to read that as "the model can't tell them apart," and the sampling limit is indeed involved: the sweep deliberately stops at 0.9 × Nyquist (225 Hz at 500 Hz sampling — barely ~2.2 samples per cycle). But the cloud is the opposite of a collapse.
Measured in the full 64-dimensional space, neighbouring high frequencies are the farthest apart of any band — the representation moves about 10× more per step there than among the low frequencies. Near Nyquist a sine is sampled only two-to-four times per cycle, so nudging the frequency throws the sample points onto wildly different parts of each cycle: the input jumps around erratically, and the representation follows. That motion is high-dimensional and disordered — only about 55% of it lands in the flat plane the picture shows, against ~80% for the clean low-frequency arc. So PCA lays the orderly low frequencies flat and crushes the erratic near-Nyquist ones into a corner. The "clump" is a shadow: a real tangle, but one that points out of the page.
Why it folds: the n-th sample of a sampled sine swings with frequency at a rate proportional to n, so near Nyquist the late samples wind almost a full cycle for every extra hertz — faster than our ~1 Hz frequency grid can follow. The representation is still a single, deterministic curve in frequency — a smooth fit recovers about 90% of it — it has just coiled into too many dimensions to lay flat, and the last few percent genuinely outruns the grid.
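The "rate proportional to n" claim can be written out in one line. As a sketch, assume a unit-amplitude sine with phase offset \(\varphi\), sampled at rate \(f_s\) (symbols introduced here for illustration, not taken from the suite):

\[
x_n(f) = \sin\theta_n(f), \qquad \theta_n(f) = 2\pi f\,\frac{n}{f_s} + \varphi, \qquad \frac{\partial \theta_n}{\partial f} = \frac{2\pi n}{f_s}.
\]

So each extra hertz advances sample \(n\) by \(n/f_s\) cycles: almost nothing for the first samples, and a full cycle once \(n \approx f_s\), that is, about one second into the window. A frequency grid with spacing \(\Delta f\) can follow sample \(n\) smoothly only while \(\Delta f \cdot n/f_s \ll 1\), which is why a ~1 Hz grid loses track of the late samples first.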
You could ask the obvious follow-up — what if we just sample finer at the tail? Let's see what happens when we re-encode the same near-Nyquist band on a grid ten times denser — 97 → 225 Hz at Δf = 0.1 Hz, 1,281 frequencies, where neighbouring sines are 0.93-correlated instead of decorrelated.
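The grid arithmetic, and the way neighbour correlation depends on grid spacing, can be checked directly. In this sketch the 512-sample window is a hypothetical choice (the post does not state the window length), the 200 Hz base frequency is arbitrary, and the helper names are illustrative.

```python
import math

def frequency_grid(f_lo, f_hi, step):
    """Evenly spaced frequencies from f_lo to f_hi inclusive."""
    count = round((f_hi - f_lo) / step) + 1
    return [f_lo + i * step for i in range(count)]

def sine_window(freq_hz, n_samples, sample_rate_hz=500.0):
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

dense = frequency_grid(97.0, 225.0, 0.1)
print(len(dense))  # 1281

n_samples = 512  # hypothetical window length
base = sine_window(200.0, n_samples)
corr_dense = correlation(base, sine_window(200.1, n_samples))   # neighbours on the 0.1 Hz grid
corr_coarse = correlation(base, sine_window(201.0, n_samples))  # neighbours on a 1 Hz grid
print(corr_dense, corr_coarse)
```

With this assumed window, neighbours 0.1 Hz apart stay strongly correlated while neighbours 1 Hz apart are close to uncorrelated. That is at least consistent with the figures quoted above, though the exact values depend on the real window length.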
The visual "spray" does resolve into a thread you can now follow, but it never lies flat. Three PCA axes still hold only a quarter to a half of it, and laying it down to 90% would take roughly two dozen axes. Sampling the coil more finely just reveals more of the coiling — at the deepest layer the dense tail is, if anything, less flat than the sparse one (participation ratio 16 → 22). The manifold is real and deterministic; it simply lives in more directions than a flat page can draw.
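The post does not define the participation ratio; a common definition, assumed here, is \((\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\) over the eigenvalues \(\lambda_i\) of the centroid covariance. It can be computed without an eigensolver, since \(\sum_i \lambda_i\) is the trace of the covariance and \(\sum_i \lambda_i^2\) is the sum of its squared entries. A minimal sketch with a hypothetical helper name:

```python
def participation_ratio(points):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance,
    computed as trace(C)^2 / trace(C @ C), so no eigendecomposition is needed."""
    n, d = len(points), len(points[0])
    means = [sum(p[a] for p in points) / n for a in range(d)]
    centred = [[p[a] - means[a] for a in range(d)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centred) / n for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # C is symmetric, so trace(C @ C) equals the sum of its squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq

# A straight line in 3-D uses one direction.
line = [(t, 2.0 * t, -t) for t in range(10)]
# Unit steps along each of 4 axes, in both directions, use all four equally.
axes = [tuple((s if a == b else 0.0) for b in range(4)) for a in range(4) for s in (1.0, -1.0)]

print(participation_ratio(line), participation_ratio(axes))
```

The line scores about 1 and the four-axis cloud scores 4, so the number reads as an effective count of directions the manifold occupies. On that reading, a rise from 16 to 22 means the dense tail spreads over more directions, not fewer.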
Open research · a community effort
This is research in progress — come build it with us
Everything here is early, evolving, and deliberately open. We think representation geometry for time-series models is far too interesting — and too big — to keep to ourselves, so we'd love to turn this into a genuine community effort. Collaboration, questions, pushback, and wild ideas are all very welcome.
- Want your model in the benchmark? We're happy to add more architectures and checkpoints.
- Missing a factor or geometry you care about? Suggest new targets and sweep families.
- Spotted a bug, a misleading metric, or a better way to measure? Tell us — we'll fix it.
- Want to dig in together? We're keen to co-investigate and publish more findings as a group.
If any of that resonates, just send us a message. The open research wiki below has the full idea writeup, current thinking, and how to reach us.