On February 17, 2026, Google researchers introduced MapTrace, a synthetic route-tracing task and dataset designed to teach AI systems to read maps. The work, led by student researcher Artemis Panagopoulou and senior software engineer Mohit Goyal, was published on the Google Research blog. The project focuses on improving the ability of multimodal large language models to follow valid paths on visually represented maps.
MapTrace synthetic route-tracing task and dataset
MapTrace defines a task in which models receive a map with marked start and end points and must output a valid, traversable path. The team built an automatic data-generation pipeline that creates annotated maps and corresponding route traces. They are releasing 2 million route-related question-answer pairs built from these maps for public research use, available as a HuggingFace dataset.
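The blog post does not document the released record schema, so the layout below is purely hypothetical: a guess at what one route question-answer pair could look like, with every field name an assumption rather than the actual google/MapTrace format.

```python
# Hypothetical layout of one MapTrace-style record; field names and
# types are illustrative assumptions, not the published schema.
example_record = {
    "image": "map_000123.png",  # rendered map image (assumed filename)
    "question": "Trace a walkable path from the entrance to the aquarium.",
    "start": (412, 880),        # start point in pixel coordinates (x, y)
    "end": (1290, 240),         # end point in pixel coordinates (x, y)
    "answer_path": [(412, 880), (430, 861), (455, 840), (1290, 240)],  # abridged
}
```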
The pipeline runs in four main stages. First, a large language model generates diverse textual descriptions of locations such as zoos, shopping malls, and theme parks. Second, Imagen-4 converts these prompts into map images. Third, the images are processed to identify walkable regions, which are converted into pixel graphs. Finally, optimal routes through those graphs are computed with Dijkstra's algorithm.
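To make the last stage concrete, here is a minimal sketch of shortest-path search over a pixel graph: each walkable pixel is a node connected to its eight neighbors, and a heap-based Dijkstra recovers an optimal route. The grid encoding and cost model are assumptions for illustration; the blog does not detail the pipeline's actual graph construction.

```python
import heapq
import math

def shortest_pixel_path(walkable, start, end):
    """Dijkstra over a boolean pixel grid (illustrative sketch).

    walkable: 2D list of bools, True where a pixel is traversable.
    start, end: (row, col) tuples. Returns a pixel path or None.
    """
    rows, cols = len(walkable), len(walkable[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == end:
            path = [node]                 # walk predecessors backwards
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get(node, math.inf):
            continue                      # stale heap entry
        r, c = node
        for dr, dc in neighbors:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and walkable[nr][nc]:
                nd = d + (math.sqrt(2) if dr and dc else 1.0)
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None  # the endpoints are not connected by walkable pixels
```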
Key details
- The Mask Critic model evaluates candidate walkable-region masks and filters out low-quality ones.
- The Path Critic model checks whether generated paths stay within traversable areas and follow reasonable routes on each map; a simplified rule-based contrast is sketched after this list.
- Manual review found that Path Critic reached 76% accuracy with an 8% false-positive rate on 120 decisions across 56 maps.
- The Mask Critic reached 83% accuracy with a 9% false-positive rate based on 200 judgments across 20 maps.
- The released dataset contains 2 million route-related question-answer pairs, hosted on HuggingFace as "google/MapTrace".
- The authors note that generated maps sometimes render text incorrectly and emphasize that their analysis focuses on path quality.
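Both critics are described as models. For contrast, the crudest rule-based stand-in for the Path Critic's traversability check would interpolate between consecutive waypoints and test each pixel against the walkable mask. The sketch below is that hypothetical baseline, not Google's critic, and it checks only mask membership, not route reasonableness.

```python
def path_stays_walkable(waypoints, walkable):
    """Rule-based traversability check (illustrative, not the Path Critic).

    waypoints: list of (row, col) path points.
    walkable: 2D list of bools, True where a pixel is traversable.
    Returns True only if every interpolated point lies on a walkable pixel.
    """
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        n = max(abs(r1 - r0), abs(c1 - c0), 1)  # samples along the segment
        for i in range(n + 1):
            r = round(r0 + (r1 - r0) * i / n)
            c = round(c0 + (c1 - c0) * i / n)
            if not walkable[r][c]:
                return False
    return True
```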
Background and evaluation results
The research targets known limitations of multimodal large language models in fine-grained spatial reasoning. Existing models can recognize objects in images but often fail to respect walls and connectivity when tracing paths. Collecting large numbers of human-annotated, pixel-accurate paths on real maps is time-consuming and difficult, especially for proprietary indoor maps.
To evaluate MapTrace, the team fine-tuned several models, including Gemma 3 27B and Gemini 2.5 Flash, on a subset of 23,000 generated paths. Performance was measured on MapBench, a separate dataset of real-world maps that were not part of training.
The main metric was normalized dynamic time warping (NDTW), which compares predicted and reference coordinate sequences while accounting for path length. Lower scores indicate routes that track the reference more closely. According to the blog, the fine-tuned Gemini 2.5 Flash model reduced its NDTW score on MapBench from 1.29 to 0.87. The fine-tuned Gemma 3 27B model reduced its score from 1.29 to 1.13. The authors also report higher success rates, defined as the share of cases where the model produced a valid, machine-parsable path. Gemma 3 27B, for example, saw a 6.4-point increase in this success metric after training on MapTrace data.
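The blog does not spell out the normalization, so the following is a sketch under one common distance-style formulation: the dynamic-time-warping cost between predicted and reference coordinate sequences, divided by the reference length, which makes lower values better and is consistent with the reported drops.

```python
import math

def ndtw(pred, ref):
    """Distance-style normalized DTW (assumed formulation).

    pred, ref: lists of (x, y) points. Returns the DTW alignment cost
    divided by the reference length; the blog's exact normalization and
    coordinate scaling are assumptions here.
    """
    n, m = len(pred), len(ref)
    # dp[i][j] = minimal cost aligning pred[:i] with ref[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(pred[i - 1], ref[j - 1])
            dp[i][j] = step + min(dp[i - 1][j],      # advance prediction only
                                  dp[i][j - 1],      # advance reference only
                                  dp[i - 1][j - 1])  # advance both
    return dp[n][m] / m
```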
Source citations and official resources
- MapTrace project page
- Google Research blog: Teaching AI to read a map
- MapTrace paper on arXiv: MapTrace: Scalable Data Generation for Route Tracing on Maps
- MapTrace dataset on HuggingFace: google/MapTrace
- MapBench evaluation suite on arXiv: MapBench
- Gemini 2.5 Pro model documentation: Gemini 2.5 Pro
- Gemini 2.5 Flash model documentation: Gemini 2.5 Flash
- Imagen-4 model information: Imagen-4
- Gemma 3 27B model card: Gemma 3 27B