Google Research reports that a continued pre-training approach, "in-context fine-tuning" (ICF), turns its TimesFM time-series foundation model into a few-shot learner at inference time. On a diverse suite of unseen datasets, ICF improves accuracy over the base model and matches a fully supervised fine-tuned variant. The work was presented at ICML 2025 as "In-Context Fine-Tuning for Time-Series Foundation Models" and summarized on the Google Research blog.
Executive snapshot
- 6.8% average accuracy gain vs. the base TimesFM across 23 unseen datasets, measured via the geometric mean of mean absolute scaled errors (MASE) normalized to a naive repeat of the last seasonal pattern.
- Matches TimesFM with supervised fine-tuning (per-dataset fine-tune, then test) without requiring user-side fine-tuning workflows.
- Introduces a learnable separator token and continued pre-training so the model can attend to multiple related in-context series without conflating them.
- Architecture context: TimesFM tokenizes 32-point patches as input tokens and decodes output tokens back to 128 timepoints via a shared multilayer perceptron.
- More in-context examples improve accuracy at the cost of longer inference time; ICF uses context more effectively than a purely long-context model without in-context learning.
Implication for marketers: If you already run time-series forecasting, a few relevant examples from similar products, regions, or time windows can deliver fine-tune-level accuracy without spinning up new training jobs.
Method and source notes
- What was measured: Forecast accuracy improvements when the base TimesFM model is continued-pre-trained to learn from in-context examples at inference, compared with (a) base TimesFM and (b) a supervised fine-tuned TimesFM per dataset. Metric: geometric mean of MASE normalized by a seasonal naive baseline.
- Who/when: Google Research; presented at ICML 2025 (In-Context Fine-Tuning for Time-Series Foundation Models); public summary via the Google Research blog. Original TimesFM details are in the 2023 TimesFM announcement.
- Sample: 23 datasets not seen during any training phase; each dataset contains multiple time series. In-context examples were sampled from histories of the target series and peer series in the same dataset to avoid leakage.
- Approach: Continue pre-training TimesFM with sequences that include the task's own history plus relevant in-context series, separated by learnable common separator tokens; training remains standard decoder-only next-token prediction with causal self-attention in a transformer backbone (a minimal assembly sketch follows this list).
- Key caveats: The public write-up does not enumerate the 23 datasets, per-dataset scores, compute or latency deltas, domain breakdowns, or operational costs. Results are aggregated (GM-MASE) and reported as relative improvements.
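The training code is not public, but the sequence construction described above can be sketched at a high level. The snippet below is a minimal illustration, assuming 32-point input patches and using a random vector as a stand-in for the learned separator embedding; names such as `build_icf_sequence` are illustrative, not from the paper.

```python
import numpy as np

PATCH_LEN = 32  # TimesFM input patch length, per the paper/blog


def to_patches(series: np.ndarray, patch_len: int = PATCH_LEN) -> np.ndarray:
    """Split a 1-D series into fixed-length patches, dropping the oldest remainder."""
    usable = (len(series) // patch_len) * patch_len
    return series[-usable:].reshape(-1, patch_len)


def build_icf_sequence(target_history: np.ndarray,
                       context_series: list[np.ndarray],
                       sep_embedding: np.ndarray) -> list[np.ndarray]:
    """Assemble [example_1, SEP, example_2, SEP, ..., target_history] as a list of
    patch 'tokens', with a shared separator between series so the model does not
    read the concatenation as one jagged stream."""
    tokens: list[np.ndarray] = []
    for series in context_series:
        tokens.extend(to_patches(series))
        tokens.append(sep_embedding)           # learnable separator (random placeholder here)
    tokens.extend(to_patches(target_history))  # the forecast continues from the target's history
    return tokens


# Toy usage: two related series as in-context examples plus the target's own history.
rng = np.random.default_rng(0)
sep = rng.normal(size=PATCH_LEN)               # stands in for the learned separator embedding
examples = [rng.normal(size=256), rng.normal(size=192)]
target = rng.normal(size=320)
sequence = build_icf_sequence(target, examples, sep)
print(f"{len(sequence)} tokens of length {PATCH_LEN}")
```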
Findings
- Few-shot adaptation via in-context examples yields a 6.8% average improvement over the base model and attains parity with supervised fine-tuning, removing the need for separate per-dataset training to reach top accuracy in this setup.
- Separator tokens are central for disambiguating multiple series in the prompt; without them, concatenated series can appear as a single jagged stream and degrade learning. Continued pre-training teaches the model to attend to separators and exploit related examples productively.
- Scaling in-context examples increases accuracy while adding inference latency, reflecting an accuracy-latency trade-off. The ICF model exploits context more effectively than simply extending context length without in-context learning capability.
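The write-up does not quantify the latency cost, but the direction follows from the architecture: every in-context example adds its own patch tokens plus a separator, and self-attention cost grows roughly quadratically with sequence length. A back-of-the-envelope sketch, assuming 32-point input patches and purely illustrative series lengths:

```python
PATCH_LEN = 32  # input patch length reported for TimesFM


def token_count(history_len: int, n_examples: int, example_len: int) -> int:
    """Prompt tokens: target-history patches plus, per example, its patches and one separator."""
    target_tokens = history_len // PATCH_LEN
    example_tokens = n_examples * (example_len // PATCH_LEN + 1)
    return target_tokens + example_tokens


# Illustrative only: a 512-point target history with 0, 5, 20, 50 peer examples of 256 points each.
n0 = token_count(512, 0, 256)
for k in (0, 5, 20, 50):
    n = token_count(512, k, 256)
    # Self-attention cost scales roughly with n^2, so relative cost is about (n / n0)^2.
    print(f"{k:>2} examples -> {n:>4} tokens, ~{(n / n0) ** 2:.1f}x attention cost vs. no examples")
```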
Performance and behavior details
- Metric and aggregation: Results use the geometric mean of MASE normalized to a seasonal naive repeat across the 23 unseen datasets, reducing sensitivity to outliers across heterogeneous series (see the aggregation sketch after this list).
- Baselines: (1) TimesFM Base (zero-shot) and (2) TimesFM-FT (supervised fine-tuned per dataset). ICF outperforms Base by 6.8% and equals FT on the aggregated metric.
- Architecture specifics: TimesFM encodes 32 timepoints per input token; outputs are mapped back via a shared MLP to 128 timepoints per token. ICF keeps the decoder-only backbone and adds separator tokens plus continued pre-training for in-context learning.
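The evaluation code is not public; the sketch below illustrates the aggregation as described, assuming each dataset contributes one error score scaled by a "repeat the last seasonal pattern" baseline and that scores are combined by geometric mean. Function names are illustrative.

```python
import numpy as np


def seasonal_naive_scaled_mae(actual: np.ndarray, forecast: np.ndarray,
                              history: np.ndarray, season: int) -> float:
    """MAE of the forecast divided by the MAE of a 'repeat last seasonal pattern' baseline."""
    naive = np.resize(history[-season:], len(actual))  # tile the last season over the horizon
    model_mae = np.mean(np.abs(forecast - actual))
    naive_mae = np.mean(np.abs(naive - actual))
    return model_mae / naive_mae


def geometric_mean(scores: list[float]) -> float:
    """Aggregate per-dataset scores; the geometric mean dampens the effect of outlier datasets."""
    return float(np.exp(np.mean(np.log(scores))))


# Toy usage with two fake datasets (values < 1 mean the model beats the seasonal naive baseline).
rng = np.random.default_rng(1)
scores = []
for _ in range(2):
    history = rng.normal(size=200)
    actual = rng.normal(size=24)
    forecast = actual + rng.normal(scale=0.3, size=24)  # pretend forecast
    scores.append(seasonal_naive_scaled_mae(actual, forecast, history, season=24))
print(f"GM of scaled errors: {geometric_mean(scores):.3f}")
```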
Interpretation and implications
- Likely: For common marketing forecasting tasks (category or product demand, site traffic, store visits, promo-lift baselines), prompting with a small set of recent, relevant series (e.g., similar SKUs, adjacent regions, prior periods) can close most of the gap to supervised fine-tuning while avoiding fine-tune pipelines and retraining delays.
- Likely: Teams can run a single general model and adapt per task at inference with curated exemplars, simplifying MLOps and reducing per-project engineering overhead compared with maintaining many fine-tuned variants.
- Tentative: Because more in-context examples improve accuracy at the cost of latency, batch-planning forecasts (weekly budget pacing, inventory buys) can afford larger exemplar sets, while time-critical use cases (in-day bidding, intraday replenishment) may need tighter prompts or caching to meet latency targets.
- Tentative: The separator-token design implies prompt construction matters; grouping exemplars by similarity (seasonality, trend, category, locality) should increase relevance and reduce noise during attention (a simple selection sketch follows this list).
- Speculative: If parity with supervised fine-tuning holds across more domains, teams may shift spend from periodic retraining to data curation and retrieval that select the right exemplars for each forecast call.
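The authors reportedly plan to automate exemplar selection but have not published a method; the sketch below shows one simple, hypothetical strategy, ranking candidate series by correlation of their recent windows with the target's as a crude proxy for shared seasonality and trend.

```python
import numpy as np


def rank_exemplars(target: np.ndarray, candidates: dict[str, np.ndarray],
                   window: int = 128, top_k: int = 5) -> list[str]:
    """Rank candidate series by Pearson correlation of their most recent `window`
    points with the target's recent window (hypothetical selection heuristic)."""
    t = target[-window:]
    scored = []
    for name, series in candidates.items():
        c = series[-window:]
        if len(c) < window or np.std(c) == 0 or np.std(t) == 0:
            continue  # skip series that are too short or constant
        scored.append((np.corrcoef(t, c)[0, 1], name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy usage: pick the SKUs whose recent demand pattern best matches the target SKU's.
rng = np.random.default_rng(2)
base = np.sin(np.linspace(0, 12 * np.pi, 256))
target = base + rng.normal(scale=0.1, size=256)
candidates = {f"sku_{i}": base * rng.uniform(0.5, 2.0) + rng.normal(scale=s, size=256)
              for i, s in enumerate((0.1, 0.5, 2.0, 5.0))}
print(rank_exemplars(target, candidates, top_k=2))
```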
Contradictions and gaps
- Unreported details: The public summary does not list the 23 datasets, domain mix, horizon lengths, or per-dataset wins and losses, limiting domain-specific conclusions (e.g., retail vs. mobility vs. energy).
- Cost and latency not quantified: The accuracy-latency trade-off is reported only qualitatively; there are no latency, compute, or throughput figures, making production capacity planning uncertain.
- Prompt construction: The authors propose selecting relevant in-context examples and note plans to automate selection, but the study does not compare selection strategies, which could materially impact accuracy and latency.
- Generalization scope: Results are aggregated via GM-MASE; sensitivity to other metrics (e.g., MAPE, sMAPE, quantile loss) and to extreme seasonality or shock events is not covered.
- Reproducibility: The blog references a conference paper; until artifacts or detailed appendices are public, independent replication and apples-to-apples evaluations remain limited.
Sources
- Google Research. “Time-series foundation models can be few-shot learners.” Sept 23, 2025. https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
- Google Research. “A decoder-only foundation model for time-series forecasting (TimesFM).” 2023. https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/