# GeoBench

> Source: https://aiwiki.ai/wiki/geobench
> Updated: 2026-06-09
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| GeoBench | |
| --- | --- |
| Overview |
| Full name | Geospatial Benchmarks Collection |
| Description | A family of benchmarks for AI models on geospatial reasoning, earth monitoring, and geographic localization |
| First release | June 2023 (GEO-Bench v1) |
| Latest variant | GEO-Bench-2 (preview, October 2025) |
| Lead authors | Alexandre Lacoste et al. (GEO-Bench), Muhammad Sohail Danish et al. (GEOBench-VLM), CCMDI (GeoGuessr GeoBench) |
| Organizations | [ServiceNow Research](/wiki/servicenow_research), [MBZUAI](/wiki/mbzuai), IBM Research, [TU Munich](/wiki/tu_munich), [The AI Alliance](/wiki/the_ai_alliance), CCMDI |
| Technical details |
| Modality | Multispectral satellite, RGB, [SAR](/wiki/synthetic_aperture_radar), text, multi-temporal |
| Task formats | Classification, segmentation, [object detection](/wiki/object_detection), counting, captioning, MCQs, lat/long regression |
| Number of tasks | 12 (GEO-Bench v1), 19 datasets / 9 subsets (GEO-Bench-2), 31 sub-tasks (GEOBench-VLM), 5 maps (CCMDI) |
| Total examples | ~10,000 manually verified MCQs (GEOBench-VLM); ~65 GB imagery (GEO-Bench v1) |
| Evaluation metrics | Accuracy, F1, mean IoU ([mIoU](/wiki/iou)), bootstrap interquartile mean, BERTScore, haversine distance |
| Domains | [Geography](/wiki/geography), [remote sensing](/wiki/remote_sensing), agriculture, disaster response, urban planning |
| Performance |
| Top GEOBench-VLM score | 41.72% (LLaVA-OneVision, 2024) |
| Top CCMDI GeoBench score | 2,268.97 (Claude 3.5 Sonnet, 2024) |
| Human GeoGuessr expert | 4,579.4 average |
| Saturated | No |
| Resources |
| GEO-Bench paper | [arXiv:2306.03831](https://arxiv.org/abs/2306.03831) |
| GEOBench-VLM paper | [arXiv:2411.19325](https://arxiv.org/abs/2411.19325) |
| GitHub | [ServiceNow/geo-bench](https://github.com/ServiceNow/geo-bench), [The-AI-Alliance/GEO-Bench-VLM](https://github.com/The-AI-Alliance/GEO-Bench-VLM), [ccmdi/geobench](https://github.com/ccmdi/geobench) |
| Dataset | [HuggingFace](https://huggingface.co/datasets/servicenow/geo-bench) |
| Licenses | Apache 2.0 (ServiceNow, AI Alliance), MIT (CCMDI) |

**GeoBench** is the umbrella name for a family of [artificial intelligence](/wiki/artificial_intelligence) benchmarks evaluating models on geospatial reasoning, earth monitoring, and geographic localization. The name has been used independently by several groups, so it now covers three distinct projects: ServiceNow Research's GEO-Bench for earth observation [foundation models](/wiki/foundation_models)[1] and its 2025 successor GEO-Bench-2 from IBM, ServiceNow, and the [AI Alliance](/wiki/the_ai_alliance)[2]; GEOBench-VLM for [vision-language models](/wiki/vision_language_model) on remote sensing[3]; and the CCMDI GeoBench, a community suite that asks models to play [GeoGuessr](/wiki/geoguessr) on street-level photographs[4]. Modern AI is very good at general image and language tasks and noticeably worse at reading the physical world the way a remote sensing analyst would; the GeoBench projects exist to put numbers on that gap.

## Overview

ServiceNow's original GEO-Bench, released in June 2023 and presented at NeurIPS 2023 Datasets and Benchmarks, focuses on classification and segmentation of satellite imagery[1]. GEOBench-VLM, released in late 2024 and accepted to ICCV 2025, extends that focus to vision-language models (VLMs) with roughly 10,000 multiple-choice questions (MCQs) about remote sensing scenes[3]. The CCMDI variant takes a different angle and tests whether multimodal models can geolocate ground-level photographs in the spirit of GeoGuessr[4]. GEO-Bench-2, previewed in October 2025, expands v1 to 19 datasets across 9 capability subsets and shifts the framing from raw performance to capability profiling[2]. Adjacent projects like GeoBenchX (LLM agents on multi-step GIS tasks) reuse the brand[5].

## Variants

### ServiceNow GEO-Bench (v1)

GEO-Bench v1 was introduced by Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Hannah Kerner, Bjorn Lutjens, Jeremy Irvin, [Yoshua Bengio](/wiki/yoshua_bengio), Stefano Ermon, Xiao Xiang Zhu, and others (arXiv:2306.03831, June 2023)[1]. ServiceNow Research led the work with partners at Stanford, MIT, ASU, TU Munich, and Clark University; the paper appeared at NeurIPS 2023 Datasets and Benchmarks[6]. The package is distributed under Apache 2.0 as a Python library (`pip install geobench`); the full download is around 65 GB[7]. It contains six classification and six segmentation tasks drawn from earlier remote sensing datasets, with fixed splits plus reduced training splits for sample efficiency studies.

#### Classification tasks (GEO-Bench v1)

| Dataset | Sensor | Train / test | Classes |
| --- | --- | --- | --- |
| m-bigearthnet | Sentinel-2 | 20,000 / 1,000 | 43 |
| m-so2sat | Sentinel-1 + Sentinel-2 | 19,992 / 986 | 17 |
| m-brick-kiln | Sentinel-2 | 15,063 / 999 | 2 |
| m-forestnet | Landsat-8 | 6,464 / 993 | 12 |
| m-eurosat | Sentinel-2 | 2,000 / 1,000 | 10 |
| m-pv4ger | RGB aerial | 11,814 / 999 | 2 |

#### Segmentation tasks (GEO-Bench v1)

| Dataset | Sensor | Train / test | Classes |
| --- | --- | --- | --- |
| m-pv4ger-seg | RGB aerial | 3,000 / 403 | 2 |
| m-chesapeake-landcover | RGB + NIR | 3,000 / 1,000 | 7 |
| m-cashew-plantation | Sentinel-2 | 1,350 / 50 | 7 |
| m-SA-crop-type | Sentinel-2 | 3,000 / 1,000 | 10 |
| m-nz-cattle | RGB aerial | 524 / 65 | 2 |
| m-NeonTree | RGB + hyperspectral + LiDAR | 270 / 93 | 2 |

The protocol aggregates results across tasks using a normalized score with bootstrap confidence intervals, intended to make the benchmark less sensitive to which task happens to be easiest in a given year[1]. The paper reports 20 baseline configurations covering ResNet18, ResNet50, ConvNeXt-Base, ViT-Tiny, ViT-Small, and SwinV2-Tiny, trained from scratch, from a [timm](/wiki/timm) ImageNet checkpoint, or from a remote-sensing pretraining scheme such as MoCo or [SeCo](/wiki/seasonal_contrast). SwinV2-Tiny is the strongest aggregated model on RGB satellite tasks; ConvNeXt-Base overtakes it in the low-data regime, evidence that convolutions remain more sample-efficient on small remote sensing datasets[1].
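A minimal sketch of how a bootstrap confidence interval around an aggregated normalized score can be computed. This is an illustration under stated assumptions, not the GEO-Bench implementation: the function name `bootstrap_ci`, the per-task resampling scheme, the use of a plain mean, and the example scores are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores_by_task, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task normalized scores.

    scores_by_task maps a task name to normalized scores from repeated runs
    (for example, different random seeds). Runs are resampled with
    replacement within each task, so every task keeps equal weight.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        task_means = [
            mean(rng.choices(runs, k=len(runs)))
            for runs in scores_by_task.values()
        ]
        estimates.append(mean(task_means))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    point = mean(mean(runs) for runs in scores_by_task.values())
    return point, lo, hi


# Hypothetical normalized scores, NOT taken from the paper.
example_scores = {
    "m-eurosat": [0.91, 0.93, 0.90, 0.92],
    "m-forestnet": [0.48, 0.52, 0.50, 0.47],
    "m-brick-kiln": [0.97, 0.98, 0.96, 0.98],
}
point, lo, hi = bootstrap_ci(example_scores)
print(f"aggregate = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling within each task keeps a single task with many runs from dominating the interval.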

### GEO-Bench-2 (2025 successor)

GEO-Bench-2, developed by IBM, ServiceNow, MBZUAI, NASA, ESA Phi-lab, TU Munich, ASU, and Clark University under the AI Alliance Climate and Sustainability Group, was previewed on October 15, 2025 with a public leaderboard on Hugging Face[2]. It expands v1 to 19 datasets across 9 capability subsets and tightens licensing so every dataset can be redistributed cleanly; new datasets include `biomassters`, `cloudsen12`, `dynamic_earthnet`, `flair2`, `kuro_siwo`, `pastis`, `spacenet2`, `spacenet7`, `substation`, and `treesatai`[8]. The new protocol reports over 15,000 runs orchestrated through TerraTorch, uses dataset-wise min-max normalization, and aggregates with bootstrapped interquartile means; the leaderboard separates full fine-tuning from frozen-encoder evaluation. TerraMind and Prithvi-EO-2.0 dominate multispectral subsets, while DINOv3 and ConvNeXt remain competitive on RGB and high-resolution scenarios[2].
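The normalization and aggregation steps can be illustrated with a short sketch. It assumes that min-max normalization rescales each raw score against per-dataset reference bounds and that the interquartile mean averages the middle half of the values; the helper names, the bounds, and the scores are hypothetical, and the bootstrap resampling that the protocol wraps around the aggregate is left out for brevity.

```python
def min_max_normalize(score, lowest, highest):
    """Rescale a raw score to [0, 1] using per-dataset reference bounds."""
    if highest == lowest:
        raise ValueError("reference bounds must differ")
    return (score - lowest) / (highest - lowest)


def interquartile_mean(values):
    """Mean of the middle 50% of values (drops the lowest and highest quarters).

    Simplified: when len(values) is not a multiple of 4, the cut is rounded
    down instead of weighting the boundary values fractionally.
    """
    ordered = sorted(values)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical raw scores for one model, and hypothetical (lowest, highest)
# reference bounds per dataset. None of these numbers come from the leaderboard.
raw = {"pastis": 0.62, "flair2": 0.55, "treesatai": 0.71, "spacenet7": 0.40}
bounds = {
    "pastis": (0.30, 0.70),
    "flair2": (0.35, 0.65),
    "treesatai": (0.50, 0.80),
    "spacenet7": (0.20, 0.60),
}
normalized = [min_max_normalize(raw[name], *bounds[name]) for name in raw]
iqm = interquartile_mean(normalized)
print(f"IQM of normalized scores: {iqm:.3f}")
```

Trimming the extremes before averaging means one unusually easy or hard dataset moves the aggregate less than it would move a plain mean.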

### GEOBench-VLM

GEOBench-VLM, introduced by Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan (arXiv:2411.19325, November 2024), evaluates [vision-language models](/wiki/vision_language_model) on remote sensing[3]. Authors come from MBZUAI, UCL, Linkoping University, IBM Research Europe, ServiceNow Research, and ANU. The paper was accepted to [ICCV 2025](/wiki/iccv) in Honolulu and released under Apache 2.0[9]. It is built around 31 sub-tasks across 8 broad categories with more than 10,000 manually verified instructions.

#### Categories

| Category | Sub-tasks (selected) |
| --- | --- |
| Scene understanding | Scene classification, land-use classification, crop classification |
| Object classification | Ship type, aircraft type |
| Localization and counting | Referring expression detection, spatial relationships, vehicle/aircraft/building/tree/marine debris counting |
| Event detection | Fire risk assessment, disaster type classification |
| Caption generation | Scene and object-aware image captioning |
| Semantic segmentation | Referring expression segmentation, urban vs. non-urban masks |
| Temporal understanding | Change detection, damaged-building counting, farm-pond change |
| Non-optical | [SAR](/wiki/synthetic_aperture_radar) ship detection, SAR flood detection, earthquake magnitude estimation |

Multiple-choice questions are generated by [GPT-4o](/wiki/gpt_4o) with five options (one correct, one closely related distractor verified by humans, and three plausible alternatives), then manually reviewed. For counting questions, the incorrect options deviate from the true count by plus or minus 20 and 40 percent, so a model cannot score well on a rough estimate alone[3].
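One way to read the counting rule is that the four distractors sit at -40, -20, +20, and +40 percent of the true count. The sketch below follows that reading; the function name `counting_options`, the rounding, and the shuffling are assumptions for illustration and are not taken from the paper's code.

```python
import random


def counting_options(true_count, seed=0):
    """Build five answer options for a counting question.

    Assumed scheme: distractors at -40%, -20%, +20%, and +40% of the true
    count, rounded to whole objects and floored at 0. For very small counts
    the rounded options can collide, which a real pipeline would need to handle.
    """
    deviations = (-0.40, -0.20, 0.20, 0.40)
    options = [true_count]
    options += [max(0, round(true_count * (1 + d))) for d in deviations]
    random.Random(seed).shuffle(options)  # hide the position of the answer
    return options


print(sorted(counting_options(50)))
```

With options spaced this way, an estimate that is off by more than about 10 percent already lands closer to a wrong option than to the right one.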

#### Models and headline results

The paper benchmarks 13 VLMs. Generic models include [LLaVA-1.5](/wiki/llava), LLaVA-NeXT, LLaVA-OneVision, Sphinx, Ferret, [InternVL2](/wiki/internvl), [Qwen2-VL](/wiki/qwen2_vl), and GPT-4o. Geospatial-specific models include [GeoChat](/wiki/geochat), RS-LLaVA, SkySenseGPT, EarthDial, and LHRS-Bot-Nova[3].

| Model | Average MCQ accuracy |
| --- | --- |
| LLaVA-OneVision | 41.72% |
| GPT-4o | 41.14% |
| Qwen2-VL | 40.25% |
| EarthDial | 37.70% |
| Random baseline (five options) | 20.00% |

LLaVA-OneVision leads counting tasks (buildings, vehicles, marine debris, tree health). GPT-4o is the best fine-grained object classifier (ships, aircraft) and wins disaster classification and damaged-building counting. EarthDial, the only geospatial-specific model among the leaders, is best on land-use classification and event detection. Sphinx wins image captioning by BERTScore. Qwen2-VL leads earthquake magnitude estimation from SAR imagery, where GPT-4o is worst[3]. No model exceeds 42 percent average accuracy, roughly double the 20 percent expected from random guessing among five options, and multispectral inputs degrade performance sharply because the VLMs are trained on RGB[3].

### CCMDI GeoBench (GeoGuessr benchmark)

The CCMDI GeoBench puts language and vision models in front of GeoGuessr-style street-level photographs and asks them to predict the country and exact latitude/longitude. The benchmark, the [geobench.org](https://geobench.org) leaderboard, and a writeup were built by ccmdi, with code released under the MIT license[4][10]; results are tracked by Epoch AI[10]. Images come from five real GeoGuessr community maps, with "A Community World" as the headline test set. Models receive the photograph and a short prompt and return a JSON object with country and coordinates. Sampling temperature is set to 0.4 for models whose API exposes it. Each prediction is scored with the GeoGuessr formula: a maximum of 5,000 points per image for a guess within roughly 25 meters, decaying exponentially with haversine distance[10].
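The scoring rule can be sketched as a haversine distance followed by an exponential decay. The haversine formula is standard; the decay constant `scale_km` is a placeholder assumption (the source states only that the decay is exponential), and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def geoguessr_style_score(distance_km, scale_km=1500.0, perfect_km=0.025):
    """Score a guess on a 0 to 5,000 scale.

    scale_km is an ASSUMED decay constant, not the benchmark's value.
    Guesses within perfect_km (about 25 meters) get full marks.
    """
    if distance_km <= perfect_km:
        return 5000.0
    return 5000.0 * math.exp(-distance_km / scale_km)


# Example: the truth is central Paris, the guess is central London.
d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(f"{d:.0f} km off -> {geoguessr_style_score(d):.0f} points")
```

Under any such exponential rule, points fall off fastest for the first few hundred kilometers, so getting the right country still leaves a lot of score on the table.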

Key reported results (CCMDI's 2024 leaderboard, A Community World map):

| Rank | Model | Average score | Notes |
| --- | --- | --- | --- |
| 1 | Claude 3.5 Sonnet | 2,268.97 | Strongest overall on subtle scene cues |
| 2 | GPT-4V | 2,145.32 | Best on text-heavy scenes (signage, scripts) |
| 3 | Gemini 1.5 Pro | 2,087.54 | Strong on terrain and vegetation cues |
| - | Human (average) | 2,109.40 | Casual GeoGuessr players |
| - | Human (expert) | 4,579.40 | Top-ranked GeoGuessr players |

The top model still trails the strongest human players by roughly 2,300 points on a 5,000-point scale. Epoch AI folded GeoBench into its broader multimodal evaluation, where it produced suggestive evidence that Gemini is more heavily optimized for vision and Claude leans toward code[10].

## Methodology and scoring

| Variant | Primary metric | Aggregation |
| --- | --- | --- |
| GEO-Bench v1 (classification) | Top-1 accuracy, F1 | Normalized score with bootstrap CIs |
| GEO-Bench v1 (segmentation) | Mean IoU | Normalized aggregation |
| GEO-Bench-2 | Min-max normalized per task | Bootstrapped interquartile mean |
| GEOBench-VLM (MCQ) | Five-option MCQ accuracy | Per-category and overall mean |
| GEOBench-VLM (detection) | Precision @ IoU 0.25 and 0.50 | Reported separately |
| GEOBench-VLM (captions) | BERTScore vs. references | Per-model average |
| CCMDI GeoBench | Distance, country accuracy | Mean GeoGuessr score (max 5,000) |

GEO-Bench v1 uses normalized scores and bootstrap intervals so improving one easy dataset cannot silently dominate the headline number[1]. GEO-Bench-2 extends this with interquartile means that trim outliers[2]. GEOBench-VLM keeps mean accuracy but invests heavily in question quality through manual verification.
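For the detection rows in the table above, precision at an IoU threshold can be sketched as follows. The greedy one-to-one matching and the function names are illustrative assumptions; the GEOBench-VLM evaluation code may match boxes differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predictions, references, threshold):
    """Fraction of predicted boxes that match a not-yet-used reference box.

    Assumed matching rule: each prediction greedily takes the unused
    reference with the highest IoU and counts as a hit if that IoU
    reaches the threshold.
    """
    unused = list(references)
    hits = 0
    for pred in predictions:
        best = max(unused, key=lambda ref: box_iou(pred, ref), default=None)
        if best is not None and box_iou(pred, best) >= threshold:
            hits += 1
            unused.remove(best)
    return hits / len(predictions) if predictions else 0.0


# Hypothetical boxes: the first prediction overlaps well, the second loosely.
preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
refs = [(0, 0, 10, 12), (24, 20, 34, 30)]
print(precision_at_iou(preds, refs, 0.25), precision_at_iou(preds, refs, 0.50))
```

Reporting both thresholds separates models that find roughly the right region from models that localize it tightly.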

## Key findings

Large gaps remain between human and machine performance. On CCMDI GeoBench, expert humans score roughly double the strongest model, and even average GeoGuessr players are competitive with frontier VLMs[4][10]. On GEOBench-VLM no model exceeds 42 percent average accuracy across 31 tasks[3].

Domain-specific pretraining helps, but not uniformly. ConvNeXt-Base and SwinV2-Tiny remain the most consistent generic architectures on GEO-Bench v1, and Sentinel-2 self-supervised pretraining reliably improves ResNet50 on multispectral tasks[1]. In GEO-Bench-2, TerraMind and Prithvi-EO-2.0 dominate multispectral subsets while DINOv3 and ConvNeXt lead on RGB and high-resolution scenarios[2].

Counting, multispectral, and temporal reasoning remain weak across the board. LLaVA-OneVision counts well on GEOBench-VLM, but its accuracy collapses past 50 objects per scene; reducing multispectral inputs to pseudo-RGB sharply degrades crop classification; and VLMs do not yet exploit temporal context for change detection[3].

## Applications

| Domain | Example application | Relevant variant |
| --- | --- | --- |
| Disaster response | xBD damage assessment, flood mapping | GEOBench-VLM, GEO-Bench-2 |
| Agriculture | [Crop type](/wiki/crop_classification) classification, field boundary segmentation | GEO-Bench v1, GEO-Bench-2 |
| Urban planning | Land-use classification, building footprint extraction | GEO-Bench v1, GEOBench-VLM |
| Environmental monitoring | Deforestation, brick kilns, biomass estimation | GEO-Bench v1, GEO-Bench-2 |
| Energy | Solar PV detection, substation identification | GEO-Bench v1, GEO-Bench-2 |
| Travel and OSINT | GeoGuessr-style image geolocation | CCMDI GeoBench |

GEO-Bench v1 and GEO-Bench-2 sit closest to operational earth observation, drawing on real datasets for agriculture (cashew plantations, South African crops), deforestation (ForestNet), and carbon-relevant infrastructure (PV4Ger, substations). GEOBench-VLM extends those tasks into language-conditioned use cases. CCMDI GeoBench is more consumer-facing but exposes the same core skill of integrating visual cues into a geographic guess.

## Related work and limitations

GeoBenchX (Krechetova and Kochedykov, arXiv:2503.18129, March 2025) tests tool-calling [LLM agents](/wiki/llm_agent) on multi-step geospatial tasks with 23 GIS functions[5]. Related benchmarks like VRSBench[11], MMBench, and MM-Vet evaluate VLMs more broadly, while domain suites such as PASTIS and xBD feed into GEO-Bench-2 and GEOBench-VLM. The "GeoBench" name has also been used for unrelated work on monocular geometry and image editing[12].

Each variant has caveats. GEO-Bench v1 is geographically biased toward Europe and Africa, and most of its tasks are small enough to fit on a single GPU. GEO-Bench-2 is still in preview[2]. GEOBench-VLM uses GPT-4o to generate questions, which partly shapes the test distribution, though manual verification mitigates this[3]. CCMDI GeoBench depends on community GeoGuessr maps that change over time, so reproducing scores requires pinning the map and prompt[4][10]. None of the variants evaluates 3D spatial reasoning in depth, and long-horizon temporal reasoning is mostly limited to bi-temporal change detection.

## See also

- [CharXiv](/wiki/charxiv)
- [Remote sensing](/wiki/remote_sensing)
- [Earth observation](/wiki/earth_observation)
- [Foundation models](/wiki/foundation_models)
- [Vision-language models](/wiki/vision_language_model)
- [GeoChat](/wiki/geochat), [GeoGuessr](/wiki/geoguessr), [TerraTorch](/wiki/terratorch)
- [Synthetic aperture radar](/wiki/synthetic_aperture_radar), [spatial reasoning](/wiki/spatial_reasoning)

## References

1. Lacoste, A., et al. (2023). *GEO-Bench: Toward Foundation Models for Earth Monitoring*. arXiv:2306.03831. https://arxiv.org/abs/2306.03831
2. The AI Alliance Climate and Sustainability Group (2025). *GEO-Bench-2: From Performance to Capability*. https://thealliance.ai/blog/geo-bench-2-from-performance-to-capability-rethinking-evaluation-in-geospatial-ai
3. Danish, M. S., Munir, M. A., Shah, S. R. A., Kuckreja, K., Khan, F. S., Fraccaro, P., Lacoste, A., and Khan, S. (2024). *GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks*. arXiv:2411.19325. https://arxiv.org/abs/2411.19325
4. CCMDI (2024). *GeoBench: Can LLMs play GeoGuessr?* https://ccmdi.com/blog/GeoBench
5. Krechetova, V., and Kochedykov, D. (2025). *GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks*. arXiv:2503.18129. https://arxiv.org/abs/2503.18129
6. NeurIPS 2023 Datasets and Benchmarks Track. *GEO-Bench* poster page. https://neurips.cc/virtual/2023/poster/73632
7. ServiceNow Research. *geo-bench* GitHub repository. https://github.com/ServiceNow/geo-bench
8. The AI Alliance. *GEO-Bench-2* GitHub repository. https://github.com/The-AI-Alliance/GEO-Bench-2
9. The AI Alliance. *GEO-Bench-VLM* GitHub and project page. https://github.com/The-AI-Alliance/GEO-Bench-VLM and https://the-ai-alliance.github.io/GEO-Bench-VLM/
10. Epoch AI. *GeoBench* benchmark page. https://epoch.ai/benchmarks/geobench
11. Li, X., Ding, J., and Elhoseiny, M. (2024). *VRSBench*. arXiv:2406.12384. https://arxiv.org/abs/2406.12384
12. Emergent Mind. *GeoBench: Unified Geospatial AI Benchmarks*. https://www.emergentmind.com/topics/geobench

