Section 01 The augmentation tax
Self-supervised learning works because the objective teaches a model which transformations of the input should leave the meaning unchanged. In vision, that design space is comparatively well understood — and even there it is fragile. Removing multi-crop views drops DINO's ImageNet linear-probe top-1 from 76.1% to 72.5%; the dependence is stronger still in SimCLR and BYOL.
Time series make it worse, because what counts as a "safe" transformation depends entirely on the signal. A time-shift is harmless for a periodic vibration sensor and catastrophic for an ECG aligned to the R-peak. One augmentation study found that the choice of augmentation alone moved fault-detection accuracy by up to 32 points. We see the same fragility in our own baselines: take a JEPA recipe tuned for ECG and change only its masking keep-ratio by 2–3×, and the quality of the learned representation degrades sharply.
For a single dataset you can pay this cost once and move on. For a reusable encoder — one you would like to point at server metrics today and physiology tomorrow — the augmentation recipe becomes a per-domain engineering bill that comes due every time the signal family changes.
When a view recipe is reused outside the domain it was designed for, augmentation engineering — not the model — becomes the bottleneck.
LeNEPA's wager is simple: remove the augmentations entirely, and replace the invariances they encoded with an objective and a regularizer that don't assume anything domain-specific in the first place.
Section 02 What LeNEPA is
LeNEPA — Latent Euclidean Next-Embedding Prediction Architecture — is a lean variant of next-embedding prediction (NEPA). The encoder is ordinary: a strided convolution chops the signal into patch tokens, and a causal transformer reads them left to right. The learning signal is just as plain. At every position the model predicts the next latent token and minimizes the squared error to it. No views, no masks, no negative pairs, no second "target" network to keep in sync.
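The objective described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the training code: `patchify` plus a matrix multiply plays the role of a bias-free strided convolution, `causal_mean_context` is a toy stand-in for the causal transformer (a running mean that only sees the past), and `w_embed`, `w_pred`, the patch length, and the stride are all hypothetical.

```python
import numpy as np

def patchify(signal, patch_len, stride):
    """Cut a 1-D signal into windows, as a strided convolution would see them."""
    n = (len(signal) - patch_len) // stride + 1
    idx = stride * np.arange(n)[:, None] + np.arange(patch_len)[None, :]
    return signal[idx]                                  # (n_tokens, patch_len)

def causal_mean_context(tokens):
    """Toy causal encoder: row t is the mean of tokens 0..t (never the future)."""
    counts = np.arange(1, len(tokens) + 1)[:, None]
    return np.cumsum(tokens, axis=0) / counts

def next_embedding_loss(signal, w_embed, w_pred, patch_len=16, stride=16):
    """Squared error between the prediction at position t and the latent token t+1."""
    tokens = patchify(signal, patch_len, stride) @ w_embed   # (T, d) latent tokens
    context = causal_mean_context(tokens)                    # (T, d) causal summaries
    pred = context[:-1] @ w_pred                             # predict token t+1 from <= t
    target = tokens[1:]                                      # no views, masks, or negatives
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
loss = next_embedding_loss(x, rng.standard_normal((16, 8)) / 4, np.eye(8))
```

Note that the target here is the model's own latent token rather than the output of a separate network, which is why the objective needs something else to rule out the constant solution.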
Two ingredients keep that simple objective from collapsing, and they are where "Le" (lean) comes from. Vanilla NEPA leans on an EMA teacher and a stop-gradient to avoid the trivial solution where every token predicts a constant. LeNEPA throws both away and instead regularizes the token distribution toward an isotropic Gaussian with SIGReg, applied along time within each sample. And following Guillotine Regularization, the prediction loss is computed inside a lightweight projector that is discarded at evaluation — you keep the backbone, not the head.
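To make "regularize the token distribution toward an isotropic Gaussian, along time within each sample" concrete, here is a hedged NumPy sketch of a sliced characteristic-function penalty in the spirit of SIGReg. It projects each sample's tokens onto random unit directions and compares the empirical characteristic function over time with that of a standard normal. The number of slices, the evaluation grid, and the Gaussian weighting are assumptions made for illustration, and the function name `temporal_sigreg` is hypothetical.

```python
import numpy as np

def temporal_sigreg(tokens, n_slices=64, t_max=3.0, n_t=17, seed=0):
    """tokens: (batch, time, dim). Returns a scalar >= 0; smaller means the
    tokens of each sample look more like N(0, I) along the time axis."""
    rng = np.random.default_rng(seed)
    dim = tokens.shape[-1]
    dirs = rng.standard_normal((dim, n_slices))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)   # random unit directions
    proj = tokens @ dirs                                   # (batch, time, slices)
    t = np.linspace(-t_max, t_max, n_t)                    # assumed evaluation grid
    ecf = np.exp(1j * proj[..., None] * t).mean(axis=1)    # average over time, per sample
    target = np.exp(-0.5 * t ** 2)                         # char. function of N(0, 1)
    weight = np.exp(-0.5 * t ** 2)                         # assumed weighting
    err = np.abs(ecf - target) ** 2 * weight               # (batch, slices, n_t)
    dt = t[1] - t[0]
    return float((err.sum(axis=-1) * dt).mean())           # Riemann sum, then average

rng = np.random.default_rng(1)
gaussian_tokens = rng.standard_normal((4, 512, 8))
constant_tokens = np.tile(rng.standard_normal((4, 1, 8)), (1, 512, 1))
penalty_gaussian = temporal_sigreg(gaussian_tokens)
penalty_constant = temporal_sigreg(constant_tokens)   # much larger: collapsed along time
```

Because the average is taken over the time axis of each sample rather than over the batch, a sample whose tokens are constant in time is penalized even if different samples are far apart.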
The top row is what you deploy: a frozen backbone whose mean-pooled tokens are the representation. The bottom row exists only during training — the projector absorbs objective-specific structure and is then cut away.
That is the whole method. The configuration is held fixed across datasets — a causal ViT-XS, 20k training steps, temporal SIGReg on the tokenizer and top layer — and the only thing that changes between experiments is the data the model is pretrained on. The rest of this post is what that buys.
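One way to picture "the configuration is held fixed" is as a frozen config object where only the dataset field is allowed to differ. The sketch below is hypothetical: the class name `LeNEPARecipe` and its field names are invented for illustration, and only the values quoted in the text (causal ViT-XS, 20k steps, SIGReg on the tokenizer and top layer, no augmentations) come from the post.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class LeNEPARecipe:
    dataset: str                                       # the only field that varies
    backbone: str = "causal ViT-XS"
    train_steps: int = 20_000
    sigreg_sites: tuple = ("tokenizer", "top_layer")   # where temporal SIGReg is applied
    augmentations: tuple = ()                          # deliberately empty

ptbxl = LeNEPARecipe(dataset="PTB-XL")
aionoscope = replace(ptbxl, dataset="Aionoscope")

# Fields that differ between the two runs.
changed = {k for k in asdict(ptbxl) if asdict(ptbxl)[k] != asdict(aionoscope)[k]}
```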
Section 03 The headline: a 6.3M-parameter underdog on UCR-128
The cleanest external test for a frozen time-series encoder is the UCR archive: 128 univariate classification datasets, a frozen backbone, a Random-Forest probe on top, and the mean accuracy across all of them. It is the benchmark Mantis was purpose-built to win — with bidirectional attention, a handcrafted first-difference branch, and a contrastive objective with crop augmentations.
We pretrained one LeNEPA encoder on the synthetic CauKer generator — no UCR data, no augmentations, causal attention, no difference branch — and ran it through the same frozen-feature protocol. It lands at 77.65%: above NuTime, and within 0.24 points of the 161M-parameter MOMENT.
We treat this as an existence proof, not a leaderboard claim — it is a single pretraining seed, and we report the best checkpoint over the run. But the point stands: every structural choice that separates LeNEPA from Mantis and MOMENT — causal instead of bidirectional, no difference branch, no augmentations — is a handicap for UCR classification, and the no-augmentation recipe still reaches the same band.
Section 04 Why it generalizes: a portability stress test
A single benchmark number can't tell you whether a recipe travels. So we ran a controlled stress test. Take a strong, ECG-tuned JEPA recipe and LeNEPA's fixed no-augmentation recipe. Train each on PTB-XL (12-lead ECG — exactly what the JEPA masking was designed for), then train each again, unchanged, on Aionoscope, a structurally different synthetic signal family. The question is not "which is better after tuning" — it is "what does it cost to reuse the same recipe somewhere new?"
The reading we draw is deliberately narrow. This does not show that JEPA "can't" do Aionoscope: a JEPA recipe redesigned for that signal family would likely recover. It shows the cost of reuse. The masking schedule encodes an assumption about which temporal regions are meaningful, and that assumption is tied to ECG morphology. Move to a signal where it doesn't hold and the gains shrink. A regularization-based, augmentation-free objective simply has less domain knowledge baked in to be wrong about.
What "fixed recipe" means here
Both methods are retrained from scratch on each dataset — this is not zero-shot checkpoint transfer. What is held fixed is the method-specific configuration: JEPA keeps its ECG-tuned masking, LeNEPA keeps its no-augmentation objective and SIGReg settings. Both were developed on PTB-XL; neither was tuned for Aionoscope.
Section 05 Three ideas that make it work
1 · Temporal SIGReg instead of an EMA teacher
A next-token objective with no stop-gradient can collapse in two ways: globally (every sample maps to the same point) or, for time series, along time (the tokens within one sample become nearly constant). We found that applying SIGReg across the time axis within each sample, which pushes the per-sample token cloud toward isotropy, is the single regularizer that delivers sustained gains on both datasets. Batch-wise, pooled, and innovation variants each fall short somewhere; temporal SIGReg is the one that holds. It also removes the EMA teacher and stop-gradient entirely, which leaves two fewer moving parts to tune.
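The two collapse modes are easy to tell apart with simple spread statistics. The sketch below is a diagnostic written for illustration, not something taken from the training code; the function name and the choice of standard deviation as the spread measure are assumptions.

```python
import numpy as np

def collapse_diagnostics(tokens):
    """tokens: (batch, time, dim). Returns (global_spread, temporal_spread).

    global_spread   near 0: every sample's pooled representation is the same point.
    temporal_spread near 0: tokens within a sample are nearly constant over time.
    """
    pooled = tokens.mean(axis=1)                          # (batch, dim)
    global_spread = float(pooled.std(axis=0).mean())      # variation across samples
    temporal_spread = float(tokens.std(axis=1).mean())    # variation along time
    return global_spread, temporal_spread

rng = np.random.default_rng(0)
healthy = rng.standard_normal((16, 32, 4))
time_collapsed = np.tile(rng.standard_normal((16, 1, 4)), (1, 32, 1))
globally_collapsed = np.tile(rng.standard_normal((1, 32, 4)), (16, 1, 1))
```

A batch-wise regularizer only constrains the first number, which is one way to see why a time-axis variant is needed for the second.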
2 · A projector you throw away
The prediction loss is not applied to the representation you keep. It is applied inside a small MLP projector that is discarded at evaluation. This "guillotine" trick lets the projector absorb the parts of the objective that are useful for prediction but not for downstream tasks. Turning it on improved 22 of 24 frozen-probe comparisons in our ablations — and crucially, the benefit comes from separating the loss space from the evaluated space, not from adding capacity: bigger projectors didn't help.
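A minimal sketch of the separation between loss space and evaluated space follows. The wiring is an assumption: here both the prediction and its target are taken in projector space, which is one plausible reading of "computed inside the projector", and the `Projector` class, its sizes, and its ReLU nonlinearity are hypothetical.

```python
import numpy as np

class Projector:
    """Small MLP used only during training (sizes are illustrative)."""
    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((dim, hidden)) / np.sqrt(dim)
        self.w2 = rng.standard_normal((hidden, dim)) / np.sqrt(hidden)

    def __call__(self, h):
        return np.maximum(h @ self.w1, 0.0) @ self.w2    # ReLU MLP

def training_loss(hidden, projector):
    """hidden: (batch, time, dim) backbone tokens. Loss lives in projector space."""
    z = projector(hidden)
    return float(np.mean((z[:, :-1] - z[:, 1:]) ** 2))   # position t vs. token t+1

def deployed_representation(hidden):
    """What is kept at evaluation: mean-pooled backbone tokens, no projector."""
    return hidden.mean(axis=1)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 20, 16))
loss = training_loss(hidden, Projector(dim=16, hidden=32))
features = deployed_representation(hidden)               # (8, 16)
```

The point of the sketch is structural: `deployed_representation` never touches the projector, so whatever the projector learns to satisfy the loss is cut away with it.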
3 · The representation lives in the middle
Because the objective is "predict the next token", the upper blocks of the backbone start to specialize into an implicit predictor — they bend high-level features into the regression space the loss wants. That makes the final layer a poor place to read from. The most transferable features sit a few layers earlier.
This is worth knowing if you adopt the recipe: don't reach for the last hidden state out of habit. For latent-prediction encoders, the deepest layer is partly a predictor, not a representation.
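In practice this means sweeping the readout layer with a frozen probe instead of defaulting to the last one. The sketch below assumes you can collect per-block hidden states as a list of arrays; `score_fn` is a hypothetical placeholder for whatever held-out probe metric you use.

```python
import numpy as np

def pooled_features(hidden_states, layer):
    """hidden_states: list of (batch, time, dim) arrays, one per backbone block."""
    return hidden_states[layer].mean(axis=1)              # mean-pool tokens over time

def pick_probe_layer(hidden_states, score_fn):
    """Score every layer with a frozen-probe metric and return the best index."""
    scores = [score_fn(pooled_features(hidden_states, i))
              for i in range(len(hidden_states))]
    return int(np.argmax(scores)), scores

# Toy usage: a made-up score that happens to peak at block 2 of 5.
toy_states = [np.full((4, 10, 3), float(i)) for i in range(5)]
best, scores = pick_probe_layer(toy_states, lambda f: -abs(float(f.mean()) - 2.0))
```

The score should be computed on validation data, not on the test split, so that the layer choice does not leak into the reported number.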
Section 06 Aionoscope: from crafting to engineering
Everything above leaned on a tool we built alongside the method. On real benchmarks like PTB-XL the labels are high-level and entangled — if a model underperforms, you can't easily tell which underlying factor it failed to encode. So self-supervised research has largely been crafting: try a change, watch one headline metric, guess. Aionoscope is our attempt to make it engineering.
It is a GPU-first synthetic generator with an explicit Process → View factorization: a latent state with known generative factors (noise scale, periodic frequency, trend slope, event timing…) is rendered into a signal. You pretrain on the unlabeled stream; then you freeze the backbone and use the ground-truth factors — which the model never saw — as linear probes. The labels are exact and available for every sample, so "which factors of variation did the encoder keep, and which did it smooth away?" becomes a direct measurement instead of a guess.
Ground-truth labels y = ℓ(s) come straight from the generator and are used only after freezing — never during training. Streams are seeded and reproducible, and effectively infinite.
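The Process → View idea can be illustrated with a toy generator. This is not Aionoscope's code: the factor ranges, the spike amplitude, and every function name below are hypothetical. What it shares with the description above is the shape of the workflow: sample a latent state with known factors, render it into a signal, and later regress those factors from frozen features.

```python
import numpy as np

def sample_process(rng, n, length):
    """Latent state s: one value per generative factor, per sample."""
    return {
        "noise_scale": rng.uniform(0.05, 0.5, n),
        "frequency": rng.uniform(1.0, 8.0, n),
        "trend_slope": rng.uniform(-1.0, 1.0, n),
        "event_time": rng.integers(0, length, n),
    }

def render_view(rng, s, length):
    """Render the latent state into an observed signal of shape (n, length)."""
    n = len(s["frequency"])
    t = np.linspace(0.0, 1.0, length)[None, :]
    x = np.sin(2 * np.pi * s["frequency"][:, None] * t)      # periodic component
    x = x + s["trend_slope"][:, None] * t                    # linear trend
    x = x + s["noise_scale"][:, None] * rng.standard_normal((n, length))
    x[np.arange(n), s["event_time"]] += 5.0                  # one sparse spike per sample
    return x

def make_batch(seed, n, length=256):
    """Seeded, so the same seed always yields the same factors and signals."""
    rng = np.random.default_rng(seed)
    s = sample_process(rng, n, length)
    return s, render_view(rng, s, length)

def linear_probe_r2(feats, y):
    """Fit least squares on the first half, report R^2 on the held-out second half."""
    half = len(y) // 2
    a = np.c_[feats, np.ones(len(feats))]                    # add a bias column
    coef, *_ = np.linalg.lstsq(a[:half], y[:half], rcond=None)
    resid = y[half:] - a[half:] @ coef
    return float(1.0 - np.mean(resid ** 2) / np.var(y[half:]))
```

With a real encoder, `feats` would be the frozen, mean-pooled backbone tokens for each signal, and you would call `linear_probe_r2` once per factor to get a per-factor readout rather than a single headline metric.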
Pointed at our own encoders, the microscope is already opinionated. NEPA and LeNEPA are strongest on periodic and noise-like components, and they recover scale factors (frequency, amplitude) far more readily than location factors (absolute offset, event timing). Their clearest blind spot is sparse events: the objective encodes that a spike happened much more reliably than exactly when — which is precisely the information an anomaly- or fault-detection system needs. That is the kind of actionable, factor-level verdict a single AUROC can never give you.
The same instrument powers our companion series on representation geometry — watching these manifolds form, scale, and reshape over training. (Links to follow as those posts are published.)
Section 07 Limitations & what's next
LeNEPA removes one source of time-series SSL engineering — augmentations — and we want to be precise about what it does not yet remove:
- The tokenizer still has knobs. The convolutional patch embedding assumes regular sampling and depends on kernel size and stride; principled defaults across signals with very different time scales remain future work.
- SIGReg has its own hyperparameters. Its scale and layer placement matter, and our CauKer run used a smaller temporal scale than the PTB-XL/Aionoscope runs. We traded augmentation tuning for regularizer tuning — a better trade, we think, but not a free one.
- The UCR result is a single seed. It is an existence proof from one pretraining run and a best-checkpoint readout, not a foundation-model or leaderboard claim. Additional seeds would characterize variance.
- Univariate, so far. Everything here is univariate; broader multivariate (UEA-style) evaluation is the obvious next step.
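The tokenizer limitation in the first bullet is easy to see with arithmetic. For an unpadded strided convolution, the token count is `(length - kernel) // stride + 1`, and the time covered by one token depends on the sampling rate. The kernel size, stride, and sampling rates below are illustrative choices, not the settings used in our runs.

```python
def n_tokens(signal_len, kernel, stride):
    """Number of patch tokens from an unpadded strided convolution."""
    if signal_len < kernel:
        return 0
    return (signal_len - kernel) // stride + 1

def token_span_seconds(kernel, sample_rate_hz):
    """Wall-clock time covered by a single token."""
    return kernel / sample_rate_hz

# A 10 s recording at 500 Hz versus a metric sampled once per minute,
# both tokenized with the same (hypothetical) kernel and stride of 25.
ecg_tokens = n_tokens(5000, kernel=25, stride=25)
ecg_span = token_span_seconds(25, 500.0)          # tens of milliseconds per token
metric_span = token_span_seconds(25, 1.0 / 60.0)  # tens of minutes per token
```

The same kernel that resolves sub-beat structure in one signal spans tens of minutes in the other, which is why a single default is hard to justify across signal families.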
Our read of the evidence is that augmentation-free latent prediction is a strong building block for portable time-series encoders — recipes you can reuse with minimal per-dataset view engineering — and that a microscope like Aionoscope is what lets you improve them deliberately rather than by intuition.
Get involved
Read it, run it, break it
LeNEPA and Aionoscope are part of an ongoing effort at Langotime toward domain-agnostic time-series representations. If you work on time-series SSL, foundation encoders, or diagnostic benchmarks, we'd like to compare notes.
- The paper, with full protocol, ablations, and the Aionoscope microscope appendix.
- Source code and reproducible artifacts for the recipe and the generator.
- The companion representation-geometry series built on the same instrument.