Google Research reports that a continued pre-training approach, "in-context fine-tuning" (ICF), turns its TimesFM time-series foundation model into a few-shot learner at inference time. On a diverse suite of unseen datasets, ICF improves accuracy over the base model and matches a fully supervised fine-tuned variant. The work was presented at ICML 2025 as "In-Context Fine-Tuning for Time-Series Foundation Models" and summarized on the Google Research blog.
Executive snapshot
- 6.8% average accuracy gain vs. the base TimesFM across 23 unseen datasets, measured via the geometric mean of mean absolute scaled errors (MASE) normalized to a naive repeat of the last seasonal pattern.
- Matches TimesFM with supervised fine-tuning (per-dataset fine-tune, then test) without requiring user-side fine-tuning workflows.
- Introduces a learnable separator token and continued pre-training so the model can attend to multiple related in-context series without conflating them.
- Architecture context: TimesFM tokenizes 32-point patches as input tokens and decodes output tokens back to 128 timepoints via a shared multilayer perceptron.
- More in-context examples improve accuracy at the cost of longer inference time; ICF uses context more effectively than a purely long-context model without in-context learning.
Implication for marketers: If you already run time-series forecasting, a few relevant examples from similar products, regions, or time windows can deliver fine-tune-level accuracy without spinning up new training jobs.
Method and source notes
- What was measured: Forecast accuracy improvements when the base TimesFM model is continued-pre-trained to learn from in-context examples at inference, compared with (a) base TimesFM and (b) a supervised fine-tuned TimesFM per dataset. Metric: geometric mean of MASE normalized by a seasonal naive baseline.
- Who/when: Google Research; presented at ICML 2025 (In-Context Fine-Tuning for Time-Series Foundation Models); public summary via the Google Research blog. Original TimesFM details are in the 2023 TimesFM announcement.
- Sample: 23 datasets not seen during any training phase; each dataset contains multiple time series. In-context examples were sampled from histories of the target series and peer series in the same dataset to avoid leakage.
- Approach: Continue pre-training TimesFM with sequences that include the task's own history plus relevant in-context series, separated by learnable common separator tokens; training remains standard decoder-only next-token prediction with causal self-attention in a transformer backbone (a minimal assembly sketch follows this list).
- Key caveats: The public write-up does not enumerate the 23 datasets, per-dataset scores, compute or latency deltas, domain breakdowns, or operational costs. Results are aggregated (GM-MASE) and reported as relative improvements.
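The training code is not public, but the sequence construction described above can be sketched at a high level. The snippet below is a minimal illustration, assuming 32-point input patches and using a random vector as a stand-in for the learned separator embedding; names such as `build_icf_sequence` are illustrative, not from the paper.

```python
import numpy as np

PATCH_LEN = 32  # TimesFM input patch length, per the paper/blog


def to_patches(series: np.ndarray, patch_len: int = PATCH_LEN) -> np.ndarray:
    """Split a 1-D series into fixed-length patches, dropping the oldest remainder."""
    usable = (len(series) // patch_len) * patch_len
    return series[-usable:].reshape(-1, patch_len)


def build_icf_sequence(target_history: np.ndarray,
                       context_series: list[np.ndarray],
                       sep_embedding: np.ndarray) -> list[np.ndarray]:
    """Assemble [example_1, SEP, example_2, SEP, ..., target_history] as a list of
    patch 'tokens', with a shared separator between series so the model does not
    read the concatenation as one jagged stream."""
    tokens: list[np.ndarray] = []
    for series in context_series:
        tokens.extend(to_patches(series))
        tokens.append(sep_embedding)           # learnable separator (random placeholder here)
    tokens.extend(to_patches(target_history))  # the forecast continues from the target's history
    return tokens


# Toy usage: two related series as in-context examples plus the target's own history.
rng = np.random.default_rng(0)
sep = rng.normal(size=PATCH_LEN)               # stands in for the learned separator embedding
examples = [rng.normal(size=256), rng.normal(size=192)]
target = rng.normal(size=320)
sequence = build_icf_sequence(target, examples, sep)
print(f"{len(sequence)} tokens of length {PATCH_LEN}")
```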
Findings
- Few-shot adaptation via in-context examples yields a 6.8% average improvement over the base model and attains parity with supervised fine-tuning, removing the need for separate per-dataset training to reach top accuracy in this setup.
- Separator tokens are central for disambiguating multiple series in the prompt; without them, concatenated series can appear as a single jagged stream and degrade learning. Continued pre-training teaches the model to attend to separators and exploit related examples productively.
- Scaling in-context examples increases accuracy while adding inference latency, reflecting an accuracy-latency trade-off. The ICF model exploits context more effectively than simply extending context length without in-context learning capability.
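The write-up does not quantify the latency cost, but the direction follows from the architecture: every in-context example adds its own patch tokens plus a separator, and self-attention cost grows roughly quadratically with sequence length. A back-of-the-envelope sketch, assuming 32-point input patches and purely illustrative series lengths:

```python
PATCH_LEN = 32  # input patch length reported for TimesFM


def token_count(history_len: int, n_examples: int, example_len: int) -> int:
    """Prompt tokens: target-history patches plus, per example, its patches and one separator."""
    target_tokens = history_len // PATCH_LEN
    example_tokens = n_examples * (example_len // PATCH_LEN + 1)
    return target_tokens + example_tokens


# Illustrative only: a 512-point target history with 0, 5, 20, 50 peer examples of 256 points each.
n0 = token_count(512, 0, 256)
for k in (0, 5, 20, 50):
    n = token_count(512, k, 256)
    # Self-attention cost scales roughly with n^2, so relative cost is about (n / n0)^2.
    print(f"{k:>2} examples -> {n:>4} tokens, ~{(n / n0) ** 2:.1f}x attention cost vs. no examples")
```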
Performance and behavior details
- Metric and aggregation: Results use the geometric mean of MASE normalized to a seasonal naive repeat across the 23 unseen datasets, reducing sensitivity to outliers across heterogeneous series (see the aggregation sketch after this list).
- Baselines: (1) TimesFM Base (zero-shot) and (2) TimesFM-FT (supervised fine-tuned per dataset). ICF outperforms Base by 6.8% and equals FT on the aggregated metric.
- Architecture specifics: TimesFM encodes 32 timepoints per input token; outputs are mapped back via a shared MLP to 128 timepoints per token. ICF keeps the decoder-only backbone and adds separator tokens plus continued pre-training for in-context learning.
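The evaluation code is not public; the sketch below illustrates the aggregation as described, assuming each dataset contributes one error score scaled by a "repeat the last seasonal pattern" baseline and that scores are combined by geometric mean. Function names are illustrative.

```python
import numpy as np


def seasonal_naive_scaled_mae(actual: np.ndarray, forecast: np.ndarray,
                              history: np.ndarray, season: int) -> float:
    """MAE of the forecast divided by the MAE of a 'repeat last seasonal pattern' baseline."""
    naive = np.resize(history[-season:], len(actual))  # tile the last season over the horizon
    model_mae = np.mean(np.abs(forecast - actual))
    naive_mae = np.mean(np.abs(naive - actual))
    return model_mae / naive_mae


def geometric_mean(scores: list[float]) -> float:
    """Aggregate per-dataset scores; the geometric mean dampens the effect of outlier datasets."""
    return float(np.exp(np.mean(np.log(scores))))


# Toy usage with two fake datasets (values < 1 mean the model beats the seasonal naive baseline).
rng = np.random.default_rng(1)
scores = []
for _ in range(2):
    history = rng.normal(size=200)
    actual = rng.normal(size=24)
    forecast = actual + rng.normal(scale=0.3, size=24)  # pretend forecast
    scores.append(seasonal_naive_scaled_mae(actual, forecast, history, season=24))
print(f"GM of scaled errors: {geometric_mean(scores):.3f}")
```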
Interpretation and implications
- Likely: For common marketing forecasting tasks (category or product demand, site traffic, store visits, promo-lift baselines), prompting with a small set of recent, relevant series (e.g., similar SKUs, adjacent regions, prior periods) can close most of the gap to supervised fine-tuning while avoiding fine-tune pipelines and retraining delays.
- Likely: Teams can run a single general model and adapt per task at inference with curated exemplars, simplifying MLOps and reducing per-project engineering overhead compared with maintaining many fine-tuned variants.
- Tentative: Because more in-context examples improve accuracy at the cost of latency, batch-planning forecasts (weekly budget pacing, inventory buys) can afford larger exemplar sets, while time-critical use cases (in-day bidding, intraday replenishment) may need tighter prompts or caching to meet latency targets.
- Tentative: The separator-token design implies prompt construction matters; grouping exemplars by similarity (seasonality, trend, category, locality) should increase relevance and reduce noise during attention (a simple selection sketch follows this list).
- Speculative: If parity with supervised fine-tuning holds across more domains, teams may shift spend from periodic retraining to data curation and retrieval that select the right exemplars for each forecast call.
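The authors reportedly plan to automate exemplar selection but have not published a method; the sketch below shows one simple, hypothetical strategy, ranking candidate series by correlation of their recent windows with the target's as a crude proxy for shared seasonality and trend.

```python
import numpy as np


def rank_exemplars(target: np.ndarray, candidates: dict[str, np.ndarray],
                   window: int = 128, top_k: int = 5) -> list[str]:
    """Rank candidate series by Pearson correlation of their most recent `window`
    points with the target's recent window (hypothetical selection heuristic)."""
    t = target[-window:]
    scored = []
    for name, series in candidates.items():
        c = series[-window:]
        if len(c) < window or np.std(c) == 0 or np.std(t) == 0:
            continue  # skip series that are too short or constant
        scored.append((np.corrcoef(t, c)[0, 1], name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy usage: pick the SKUs whose recent demand pattern best matches the target SKU's.
rng = np.random.default_rng(2)
base = np.sin(np.linspace(0, 12 * np.pi, 256))
target = base + rng.normal(scale=0.1, size=256)
candidates = {f"sku_{i}": base * rng.uniform(0.5, 2.0) + rng.normal(scale=s, size=256)
              for i, s in enumerate((0.1, 0.5, 2.0, 5.0))}
print(rank_exemplars(target, candidates, top_k=2))
```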
Contradictions and gaps
- Unreported details: The public summary does not list the 23 datasets, domain mix, horizon lengths, or per-dataset wins and losses, limiting domain-specific conclusions (e.g., retail vs. mobility vs. energy).
- Cost and latency not quantified: The accuracy-latency trade-off is reported only qualitatively; there are no latency, compute, or throughput figures, making production capacity planning uncertain.
- Prompt construction: The authors propose selecting relevant in-context examples and note plans to automate selection, but the study does not compare selection strategies, which could materially impact accuracy and latency.
- Generalization scope: Results are aggregated via GM-MASE; sensitivity to other metrics (e.g., MAPE, sMAPE, quantile loss) and to extreme seasonality or shock events is not covered.
- Reproducibility: The blog references a conference paper; until artifacts or detailed appendices are public, independent replication and apples-to-apples evaluations remain limited.
Sources
- Google Research. “Time-series foundation models can be few-shot learners.” Sept 23, 2025. https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/
- Google Research. “A decoder-only foundation model for time-series forecasting (TimesFM).” 2023. https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/