Google Research has introduced GIST, a smart sampling algorithm for machine learning subset selection. Announced January 23, 2026 on the Google Research blog and presented at NeurIPS 2025, GIST targets large-scale training datasets where efficient, diverse and informative subsets are needed.
Key Details on the GIST Algorithm
GIST stands for Greedy Independent Set Thresholding. It was developed by Google Research scientists Morteza Zadimoghaddam and Matthew Fahrbach. Their work is documented in an arXiv paper and an official Google Research article.
The algorithm selects a fixed-size subset that balances max-min diversity and task-specific utility. Utility is modeled using monotone submodular functions, capturing how much information the subset covers. The method targets single-shot downsampling, where the subset is chosen once before training.
According to the paper, GIST is guaranteed to return a subset whose objective value is at least half that of the optimal solution. The authors also show it is NP-hard to achieve more than a 0.56 fraction of the optimal value, so the guarantee is close to the best possible. The guarantee is bicriteria with respect to diversity: if the optimal solution attains minimum pairwise distance d, GIST achieves comparable utility while enforcing the relaxed distance threshold d/2.
- GIST produces diversity by enforcing a minimum distance between any two selected data points.
- It evaluates multiple distance thresholds and applies a greedy selection rule under each threshold.
- The final subset is chosen from the best-performing threshold according to the utility function.
- In reported experiments, the selection step ran much faster than subsequent model training.
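The steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes points live in a Euclidean embedding, uses summed per-point utility as a simple stand-in for the monotone submodular objective, and all function names (`greedy_under_threshold`, `gist_sketch`) are hypothetical.

```python
import numpy as np

def greedy_under_threshold(points, utility, k, d):
    """Greedily take up to k points in decreasing utility order,
    skipping any point within distance d of an already-selected one."""
    chosen = []
    for i in np.argsort(-utility):  # highest utility first
        if len(chosen) == k:
            break
        if all(np.linalg.norm(points[i] - points[j]) >= d for j in chosen):
            chosen.append(i)
    return chosen

def gist_sketch(points, utility, k, thresholds):
    """Run the greedy rule under several distance thresholds and keep
    the subset scoring best (here: largest total utility)."""
    best, best_val = [], float("-inf")
    for d in thresholds:
        subset = greedy_under_threshold(points, utility, k, d)
        val = utility[subset].sum()
        if val > best_val:
            best, best_val = subset, val
    return best
```

A looser threshold admits near-duplicates with high utility; a tighter one trades utility for spread. Sweeping thresholds and keeping the best-scoring subset mirrors the multi-threshold evaluation described above.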
Background Context
Large language models and computer vision systems often rely on datasets containing millions or billions of examples. Processing every example can be costly in time and computation. Subset selection addresses this by choosing a smaller representative training set.
Diversity is measured as max-min diversity, maximizing the smallest pairwise distance among selected items. This avoids near-duplicate or highly similar data points. Utility measures the overall relevance of the subset for the target prediction task.
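As a concrete illustration of the max-min measure, the value being maximized is simply the smallest pairwise distance within the chosen subset (Euclidean distance assumed here; the helper name is hypothetical):

```python
import numpy as np

def max_min_diversity(points):
    """Smallest pairwise Euclidean distance among the selected points.
    Max-min diversity selection tries to make this value as large as possible."""
    n = len(points)
    dmin = float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            dmin = min(dmin, np.linalg.norm(points[i] - points[j]))
    return dmin
```

A subset containing two near-duplicates scores poorly no matter how spread out the rest of it is, which is exactly why this measure penalizes redundant points.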
Maximizing diversity alone can select uninformative points, while maximizing utility alone can yield redundant clusters. Combining both goals leads to an NP-hard optimization problem, making exact solutions infeasible for large datasets. Approximation algorithms therefore play a central role in this setting.
GIST tackles this problem by thresholding the diversity requirement and building a graph over the dataset. Two data points are connected by an edge if their distance falls below the chosen minimum, so they cannot both enter the subset. The algorithm then approximates a maximum independent set in this graph, with node weights given by utility scores, using a bicriteria greedy algorithm.
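The graph view can be made explicit with a short sketch. Assumptions to note: Euclidean distances, a plain highest-weight-first heuristic standing in for the paper's bicriteria greedy procedure, and hypothetical function names.

```python
import numpy as np

def conflict_graph(points, d):
    """Adjacency sets: two points closer than d conflict,
    so they cannot both appear in the selected subset."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < d:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def greedy_weighted_independent_set(adj, weights, k):
    """Pick up to k mutually non-adjacent nodes in decreasing weight order,
    a simple heuristic for the weighted maximum independent set."""
    blocked, chosen = set(), []
    for i in sorted(range(len(adj)), key=lambda i: -weights[i]):
        if i not in blocked:
            chosen.append(i)
            blocked |= adj[i] | {i}
            if len(chosen) == k:
                break
    return chosen
```

Every independent set in this graph automatically satisfies the minimum-distance constraint, which is what lets the diversity requirement be folded into the graph structure rather than checked separately.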
The authors also experimented with hybrid methods such as GIST-margin and GIST-submod. These variants combine GIST's diversity mechanism with existing margin and submodular-based selection strategies. Reported results indicate improved performance compared with the original standalone methods.
According to Google, a related max-min diversity strategy has been used by the YouTube Home ranking team. The team applied similar principles to increase diversity in video recommendations. Google reports that this change improved long-term user value.
Source Citations
The following official sources provide detailed technical and experimental information about GIST. They originate from Google and the research authors.
- Google Research blog announcement - January 23, 2026 article introducing GIST and summarizing results.
- GIST arXiv paper - research paper detailing the algorithm, proofs and experimental evaluation.