LeNEPA — No-augmentation representation learning for time series

Section 01 · The augmentation tax

Self-supervised learning works because the objective teaches a model which transformations of the input should leave the meaning unchanged. In vision, that design space is comparatively well understood — and even there it is fragile. Removing multi-crop views drops DINO's ImageNet linear-probe top-1 from 76.1% to 72.5%; the dependence is stronger still in SimCLR and BYOL.

Time series make it worse, because what counts as a "safe" transformation depends entirely on the signal. A time-shift is harmless for a periodic vibration sensor and catastrophic for an ECG aligned to the R-peak. One augmentation study found that the choice of augmentation alone moved fault-detection accuracy by up to 32 points. We see the same fragility in our own baselines: take a JEPA recipe tuned for ECG and change only its masking keep-ratio by 2–3×, and the quality of the learned representation degrades sharply.

For a single dataset you can pay this cost once and move on. For a reusable encoder — one you would like to point at server metrics today and physiology tomorrow — the augmentation recipe becomes a per-domain engineering bill that comes due every time the signal family changes.

When a view recipe is reused outside the domain it was designed for, augmentation engineering — not the model — becomes the bottleneck.

LeNEPA's wager is simple: remove the augmentations entirely, and replace the invariances they encoded with an objective and a regularizer that don't assume anything domain-specific in the first place.

Section 02 · What LeNEPA is

LeNEPA — Latent Euclidean Next-Embedding Prediction Architecture — is a lean variant of next-embedding prediction (NEPA). The encoder is ordinary: a strided convolution chops the signal into patch tokens, and a causal transformer reads them left to right. The learning signal is just as plain. At every position the model predicts the next latent token and minimizes the squared error to it. No views, no masks, no negative pairs, no second "target" network to keep in sync.
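As a rough illustration of the objective just described, the sketch below cuts a signal into patch tokens and scores next-token predictions with squared error. The helper names `patchify` and `next_latent_mse` are hypothetical, and raw patches stand in for latent tokens; in LeNEPA the tokens come from a learned strided convolution and the predictions from a causal transformer, neither of which is modeled here.

```python
def patchify(signal, patch_len, stride):
    """Cut a 1-D signal into patches (a learned strided conv would also project each patch)."""
    return [signal[i:i + patch_len]
            for i in range(0, len(signal) - patch_len + 1, stride)]


def next_latent_mse(predictions, tokens):
    """Mean squared error between the prediction made at position t and the token at t+1."""
    # The last position has no successor, so only the first len(tokens) - 1 predictions are scored.
    pairs = list(zip(predictions[:-1], tokens[1:]))
    total = sum((p - z) ** 2 for pred, tok in pairs for p, z in zip(pred, tok))
    count = sum(len(tok) for _, tok in pairs)
    return total / count


signal = [float(i % 4) for i in range(16)]        # toy periodic signal with period 4
tokens = patchify(signal, patch_len=4, stride=4)  # four identical patches
copy_current = tokens                             # naive predictor: "next token equals current token"
loss = next_latent_mse(copy_current, tokens)
```

On this toy signal every patch is identical, so simply copying the current token already gives zero loss. That is a small-scale picture of why a plain next-token objective needs something extra to rule out constant solutions.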

Two ingredients keep that simple objective from collapsing, and they are where "Le" (lean) comes from. Vanilla NEPA leans on an EMA teacher and a stop-gradient to avoid the trivial solution where every token predicts a constant. LeNEPA throws both away and instead regularizes the token distribution toward an isotropic Gaussian with SIGReg, applied along time within each sample. And following Guillotine Regularization, the prediction loss is computed inside a lightweight projector that is discarded at evaluation — you keep the backbone, not the head.

Figure: The architecture, and what survives evaluation (causal next-latent prediction · guillotine projector).

Top row: input signal x → Conv patch embed (tokens z) → Causal ViT (depth 8, RoPE) → Representation (mean-pool · kept & frozen)

Bottom row: Projector (discarded at eval) → predict next latent token (MSE to z_{t+1}) + temporal SIGReg (no EMA · no stop-gradient)

The top row is what you deploy: a frozen backbone whose mean-pooled tokens are the representation. The bottom row exists only during training — the projector absorbs objective-specific structure and is then cut away.

That is the whole method. The configuration is held fixed across datasets — a causal ViT-XS, 20k training steps, temporal SIGReg on the tokenizer and top layer — and the only thing that changes between experiments is the data the model is pretrained on. The rest of this post is what that buys.
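One way to picture "the configuration is held fixed" is a frozen config object where only the dataset field is ever replaced. This is a sketch, not the project's actual config: the class name `LeNEPARecipe` and all field names are assumptions, and only the values (ViT-XS, depth 8, RoPE, 20k steps, SIGReg sites, no EMA, no stop-gradient) come from the post.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class LeNEPARecipe:
    # Values stated in the post; field names are hypothetical.
    backbone: str = "causal ViT-XS"
    depth: int = 8
    positional_encoding: str = "RoPE"
    train_steps: int = 20_000
    sigreg_sites: tuple = ("tokenizer", "top_layer")
    use_ema_teacher: bool = False
    use_stop_gradient: bool = False
    augmentations: tuple = ()
    # The only field meant to vary between experiments.
    pretrain_dataset: str = "PTB-XL"


ecg_run = LeNEPARecipe()
synthetic_run = replace(ecg_run, pretrain_dataset="Aionoscope")

# List every field that differs between the two runs.
changed = [k for k, v in asdict(synthetic_run).items() if asdict(ecg_run)[k] != v]
```

Freezing the dataclass makes accidental per-dataset tweaks fail loudly instead of silently changing the recipe. The SIGReg scale is deliberately left out of this sketch, since Section 07 notes that it was not identical across all runs.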

Section 03 · The headline: a 6.3M-parameter underdog on UCR-128

The cleanest external test for a frozen time-series encoder is the UCR archive: 128 univariate classification datasets, a frozen backbone, a Random-Forest probe on top, and the mean accuracy across all of them. It is the benchmark Mantis was purpose-built to win — with bidirectional attention, a handcrafted first-difference branch, and a contrastive objective with crop augmentations.

We pretrained one LeNEPA encoder on the synthetic CauKer generator (no UCR data, no augmentations, causal attention, no difference branch) and ran it through the same frozen-feature protocol. It lands at 77.65% mean accuracy: above NuTime, and 0.24 points below the 161M-parameter MOMENT.

Figure: UCR-128, frozen features + Random Forest (accuracy · parameters · architecture).

LeNEPA (teal) reaches the neighborhood of models built for this benchmark while being the only causal, augmentation-free entry in the table. MOMENT spends about 25× the parameters for a 0.24-point edge. NuTime and MOMENT figures are the protocol-matched Random-Forest numbers reported by MantisV2; NuTime's pretraining even includes the UCR/UEA training splits.

We treat this as an existence proof, not a leaderboard claim — it is a single pretraining seed, and we report the best checkpoint over the run. But the point stands: every structural choice that separates LeNEPA from Mantis and MOMENT — causal instead of bidirectional, no difference branch, no augmentations — is a handicap for UCR classification, and the no-augmentation recipe still reaches the same band.

Figure: How the UCR score develops, and where it lives (over training · across depth). Left panel: UCR-128 accuracy over pretraining. Right panel: accuracy by backbone layer.

Left: the reported (best-layer) accuracy climbs from 67% at initialization to a 77.65% peak around 17k steps, while the final layer trails by ~2.5 points the whole way. Right: that gap is structural — the most useful representation sits in the middle of the backbone (around layer 4), not at the end. More on why in Section 05.

Section 04 · Why it generalizes: a portability stress test

A single benchmark number can't tell you whether a recipe travels. So we ran a controlled stress test. Take a strong, ECG-tuned JEPA recipe and LeNEPA's fixed no-augmentation recipe. Train each on PTB-XL (12-lead ECG — exactly what the JEPA masking was designed for), then train each again, unchanged, on Aionoscope, a structurally different synthetic signal family. The question is not "which is better after tuning" — it is "what does it cost to reuse the same recipe somewhere new?"

Figure: Same recipes, two signal families (best-layer AUROC over pretraining). Left panel: PTB-XL · ECG (in-domain for JEPA). Right panel: Aionoscope · new signal family (recipe reused).

On its home turf the ECG-tuned JEPA is excellent (0.891 AUROC) and LeNEPA is right behind it (0.881). Reuse the same recipes on a different signal family and they diverge: LeNEPA climbs to 0.920 while JEPA stalls at 0.880. LeNEPA also gets there faster — 80% of its final gain in 2–5k steps, against 5–10k for the JEPA readout.
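The "80% of its final gain" comparison can be computed from any logged probe curve. The post does not spell out the exact definition, so the sketch below uses one plausible reading: the first logged step at which the score has covered the given fraction of the distance from its initial to its final value. The function name and the curve are hypothetical; the numbers are not data from the post.

```python
def steps_to_fraction_of_gain(steps, scores, fraction=0.8):
    """Return the first logged step at which the score has covered `fraction` of the total gain."""
    start, final = scores[0], scores[-1]
    threshold = start + fraction * (final - start)
    for step, score in zip(steps, scores):
        if score >= threshold:
            return step
    return steps[-1]


# Hypothetical probe curve, NOT data from the post.
steps = [0, 1000, 2000, 5000, 10000, 20000]
auroc = [0.70, 0.80, 0.86, 0.90, 0.91, 0.92]
reached = steps_to_fraction_of_gain(steps, auroc)
```

Because the answer can only be a logged step, the resolution of this measure is limited by how often the probe is evaluated, which is one reason to report it as a range.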

The reading we draw is narrow and honest. This does not show that JEPA "can't" do Aionoscope — a JEPA recipe redesigned for that signal family would likely recover. It shows the cost of reuse: the masking schedule encodes an assumption about which temporal regions are meaningful, and that assumption is tied to ECG morphology. Move to a signal where it doesn't hold and the gains shrink. A regularization-based, augmentation-free objective simply has less domain knowledge baked in to be wrong about.

What "fixed recipe" means here

Both methods are retrained from scratch on each dataset — this is not zero-shot checkpoint transfer. What is held fixed is the method-specific configuration: JEPA keeps its ECG-tuned masking, LeNEPA keeps its no-augmentation objective and SIGReg settings. Both were developed on PTB-XL; neither was tuned for Aionoscope.

Section 05 · Three ideas that make it work

1 · Temporal SIGReg instead of an EMA teacher

A next-token objective with no stop-gradient can collapse in two ways: globally (every sample maps to the same point) or, for time series, along time (the tokens within one sample become nearly constant). We found that applying SIGReg across the time axis within each sample — pushing the per-sample token cloud toward isotropy — is the single regularizer that delivers sustained gains on both datasets. Batch-wise, pooled, and innovation variants each fall short somewhere; temporal SIGReg is the one that holds. It also removes the EMA teacher and stop-gradient entirely, which is one fewer moving part to tune.
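The two collapse modes can be told apart with simple variance diagnostics, sketched below in plain Python. The third function is a deliberately simplified stand-in for a temporal isotropy penalty: it only matches the first two moments on random 1-D projections and is not the SIGReg statistic itself. All function names, the number of directions, and the toy data are assumptions for illustration.

```python
import random
from statistics import fmean, pvariance


def temporal_spread(sample_tokens):
    """Mean per-dimension variance along time within ONE sample (near 0: temporal collapse)."""
    return fmean(pvariance(dim) for dim in zip(*sample_tokens))


def global_spread(batch_tokens):
    """Mean per-dimension variance of mean-pooled samples across a batch (near 0: global collapse)."""
    pooled = [[fmean(dim) for dim in zip(*sample)] for sample in batch_tokens]
    return fmean(pvariance(dim) for dim in zip(*pooled))


def temporal_moment_penalty(sample_tokens, num_directions=16, seed=0):
    """Stand-in penalty: on random unit directions, push one sample's projected tokens
    toward zero mean and unit variance along time."""
    rng = random.Random(seed)
    dim = len(sample_tokens[0])
    penalty = 0.0
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(c * c for c in direction) ** 0.5
        direction = [c / norm for c in direction]
        projected = [sum(t * c for t, c in zip(tok, direction)) for tok in sample_tokens]
        penalty += fmean(projected) ** 2 + (pvariance(projected) - 1.0) ** 2
    return penalty / num_directions


rng = random.Random(1)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]  # 256 tokens, 4 dims
collapsed = [[0.5, -0.2, 0.1, 0.0]] * 256                                # constant along time
```

A sample whose tokens are constant along time has zero temporal spread and a large penalty, while roughly isotropic tokens score close to zero. The penalty is computed per sample along time, which is the axis the section singles out.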

2 · A projector you throw away

The prediction loss is not applied to the representation you keep. It is applied inside a small MLP projector that is discarded at evaluation. This "guillotine" trick lets the projector absorb the parts of the objective that are useful for prediction but not for downstream tasks. Turning it on improved 22 of 24 frozen-probe comparisons in our ablations — and crucially, the benefit comes from separating the loss space from the evaluated space, not from adding capacity: bigger projectors didn't help.
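The separation between the loss space and the evaluated space can be made explicit in code. The toy class below is only a structural sketch: `GuillotineModel`, its method names, and the two stand-in functions are hypothetical, and real backbones and projectors would be neural networks rather than list transforms.

```python
class GuillotineModel:
    """Toy wrapper: the loss sees projector outputs, evaluation sees backbone outputs."""

    def __init__(self, backbone, projector):
        self.backbone = backbone      # kept and frozen at evaluation
        self.projector = projector    # used only during training, then discarded

    def loss_space(self, x):
        # Training path: the prediction loss is computed on these values.
        return self.projector(self.backbone(x))

    def eval_space(self, x):
        # Evaluation path: downstream probes only ever see backbone outputs.
        return self.backbone(x)


def toy_backbone(x):
    return [2.0 * v for v in x]       # stand-in for the causal transformer


def toy_projector(h):
    return [v + 1.0 for v in h]       # stand-in for the small MLP projector


model = GuillotineModel(toy_backbone, toy_projector)
features = model.eval_space([1.0, 2.0])
targets = model.loss_space([1.0, 2.0])
```

Keeping the two paths as separate methods makes it hard to probe the projector output by accident.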

3 · The representation lives in the middle

Because the objective is "predict the next token", the upper blocks of the backbone start to specialize into an implicit predictor — they bend high-level features into the regression space the loss wants. That makes the final layer a poor place to read from. The most transferable features sit a few layers earlier.

Figure: Where the useful representation sits (frozen-probe AUROC by layer, step 20k). Left panel: PTB-XL. Right panel: Aionoscope.

JEPA (red) climbs monotonically and peaks at the top layer — the standard pattern. LeNEPA and NEPA peak in the middle (around layers 4–5) and fall off at the end. We therefore probe every layer and read from the best one; a fixed mid-layer readout recovers almost all of that advantage, while the final layer leaves real accuracy on the table.
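The three readout choices compared above (best layer, a fixed mid layer, the final layer) reduce to a small lookup once per-layer probe scores exist. The sketch below assumes scores are stored in a dict keyed by 1-based layer index; the function name and the scores are hypothetical and are not numbers from the post.

```python
def summarize_layer_probe(scores_by_layer, fixed_layer):
    """Compare three readout choices given frozen-probe scores keyed by 1-based layer index."""
    best_layer = max(scores_by_layer, key=scores_by_layer.get)
    final_layer = max(scores_by_layer)
    return {
        "best": (best_layer, scores_by_layer[best_layer]),
        "fixed_mid": (fixed_layer, scores_by_layer[fixed_layer]),
        "final": (final_layer, scores_by_layer[final_layer]),
    }


# Hypothetical per-layer AUROC for an 8-layer backbone, NOT numbers from the post.
scores = {1: 0.80, 2: 0.84, 3: 0.87, 4: 0.90, 5: 0.91, 6: 0.89, 7: 0.87, 8: 0.86}
summary = summarize_layer_probe(scores, fixed_layer=4)
```

If the best layer is picked this way, choosing it on a validation split rather than the test split keeps the reported number from being optimistic.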

This is worth knowing if you adopt the recipe: don't reach for the last hidden state out of habit. For latent-prediction encoders, the deepest layer is partly a predictor, not a representation.

Section 06 · Aionoscope: from crafting to engineering

Everything above leaned on a tool we built alongside the method. On real benchmarks like PTB-XL the labels are high-level and entangled — if a model underperforms, you can't easily tell which underlying factor it failed to encode. So self-supervised research has largely been crafting: try a change, watch one headline metric, guess. Aionoscope is our attempt to make it engineering.

It is a GPU-first synthetic generator with an explicit Process → View factorization: a latent state with known generative factors (noise scale, periodic frequency, trend slope, event timing…) is rendered into a signal. You pretrain on the unlabeled stream; then you freeze the backbone and use the ground-truth factors — which the model never saw — as linear probes. The labels are exact and available for every sample, so "which factors of variation did the encoder keep, and which did it smooth away?" becomes a direct measurement instead of a guess.

Figure: A microscope, not just a benchmark (two streams from one generator). Process (latent factors s) → View (render x = g(s)) → unlabeled stream x → SSL pretrain → freeze backbone → per-factor probes (microscope readout)

Ground-truth labels y = ℓ(s) come straight from the generator and are used only after freezing — never during training. Streams are seeded and reproducible, and effectively infinite.
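A toy version of the Process → View factorization fits in a few lines. This is a CPU-only sketch in plain Python, not Aionoscope's GPU-first implementation: the factor names follow the post, but the value ranges, the rendering function, the signal length, and all function names are assumptions.

```python
import math
import random

LENGTH = 256  # samples per window (assumed)


def sample_process(rng):
    """Process: draw the latent generative factors s (ranges are made up)."""
    return {
        "noise_scale": rng.uniform(0.0, 0.5),
        "frequency": rng.uniform(1.0, 8.0),     # cycles per window
        "trend_slope": rng.uniform(-1.0, 1.0),
        "event_time": rng.randrange(LENGTH),    # index of a single spike
    }


def render_view(factors, rng):
    """View: render the latent state into an observed signal x = g(s)."""
    signal = []
    for t in range(LENGTH):
        u = t / LENGTH
        value = math.sin(2 * math.pi * factors["frequency"] * u) + factors["trend_slope"] * u
        value += rng.gauss(0.0, factors["noise_scale"])
        if t == factors["event_time"]:
            value += 3.0                        # sparse event
        signal.append(value)
    return signal


def stream(seed, n):
    """Seeded, reproducible stream of (signal, factors) pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        factors = sample_process(rng)
        yield render_view(factors, rng), factors


unlabeled = [x for x, _ in stream(seed=0, n=4)]  # what SSL pretraining sees
labels = [s for _, s in stream(seed=0, n=4)]     # held back; used only after freezing
```

Because the same seed reproduces the same stream, the factor labels can be regenerated later for probing without ever being stored next to the training data.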

Pointed at our own encoders, the microscope is already opinionated. NEPA and LeNEPA are strongest on periodic and noise-like components, and they recover scale factors (frequency, amplitude) far more readily than location factors (absolute offset, event timing). Their clearest blind spot is sparse events: the objective encodes that a spike happened much more reliably than exactly when — which is precisely the information an anomaly- or fault-detection system needs. That is the kind of actionable, factor-level verdict a single AUROC can never give you.

The same instrument powers our companion series on representation geometry — watching these manifolds form, scale, and reshape over training. (Links to follow as those posts are published.)

Section 07 · Limitations & what's next

LeNEPA removes one source of time-series SSL engineering — augmentations — and we want to be precise about what it does not yet remove:

The tokenizer still has knobs. The convolutional patch embedding assumes regular sampling and depends on kernel size and stride; principled defaults across signals with very different time scales remain future work.
SIGReg has its own hyperparameters. Its scale and layer placement matter, and our CauKer run used a smaller temporal scale than the PTB-XL/Aionoscope runs. We traded augmentation tuning for regularizer tuning — a better trade, we think, but not a free one.
The UCR result is a single seed. It is an existence proof from one pretraining run and a best-checkpoint readout, not a foundation-model or leaderboard claim. Additional seeds would characterize variance.
Univariate, so far. Everything here is univariate; broader multivariate (UEA-style) evaluation is the obvious next step.

Our read of the evidence is that augmentation-free latent prediction is a strong building block for portable time-series encoders — recipes you can reuse with minimal per-dataset view engineering — and that a microscope like Aionoscope is what lets you improve them deliberately rather than by intuition.

Get involved

Read it, run it, break it

LeNEPA and Aionoscope are part of an ongoing effort at Langotime toward domain-agnostic time-series representations. If you work on time-series SSL, foundation encoders, or diagnostic benchmarks, we'd like to compare notes.

The paper, with full protocol, ablations, and the Aionoscope microscope appendix.
Source code and reproducible artifacts for the recipe and the generator.
The companion representation-geometry series built on the same instrument.

Read the paper → · Code & artifacts · Questions or collaborations: [email protected]