GeoBench
Last reviewed: May 10, 2026
Sources: 12 citations
Review status: Source-backed
Revision: v2 · 2,394 words
| GeoBench | |
|---|---|
| Overview | |
| Full name | Geospatial Benchmarks Collection |
| Description | A family of benchmarks for AI models on geospatial reasoning, earth monitoring, and geographic localization |
| First release | June 2023 (GEO-Bench v1) |
| Latest variant | GEO-Bench-2 (preview, October 2025) |
| Lead authors | Alexandre Lacoste et al. (GEO-Bench), Muhammad Sohail Danish et al. (GEOBench-VLM), CCMDI (GeoGuessr GeoBench) |
| Organizations | ServiceNow Research, MBZUAI, IBM Research, TU Munich, The AI Alliance, CCMDI |
| Technical details | |
| Modality | Multispectral satellite, RGB, SAR, text, multi-temporal |
| Task formats | Classification, segmentation, object detection, counting, captioning, MCQs, lat/long regression |
| Number of tasks | 12 (GEO-Bench v1), 19 datasets / 9 subsets (GEO-Bench-2), 31 sub-tasks (GEOBench-VLM), 5 maps (CCMDI) |
| Total examples | ~10,000 manually verified MCQs (GEOBench-VLM); ~65 GB imagery (GEO-Bench v1) |
| Evaluation metrics | Accuracy, F1, mean IoU (mIoU), bootstrap interquartile mean, BERTScore, haversine distance |
| Domains | Geography, remote sensing, agriculture, disaster response, urban planning |
| Performance | |
| Top GEOBench-VLM score | 41.72% (LLaVA-OneVision, 2024) |
| Top CCMDI GeoBench score | 2,268.97 (Claude 3.5 Sonnet, 2024) |
| Human GeoGuessr expert | 4,579.4 average |
| Saturated | No |
| Resources | |
| GEO-Bench paper | arXiv:2306.03831 |
| GEOBench-VLM paper | arXiv:2411.19325 |
| GitHub | ServiceNow/geo-bench, The-AI-Alliance/GEO-Bench-VLM, ccmdi/geobench |
| Dataset | Hugging Face |
| Licenses | Apache 2.0 (ServiceNow, AI Alliance), MIT (CCMDI) |
GeoBench is the umbrella name for a family of artificial intelligence benchmarks evaluating models on geospatial reasoning, earth monitoring, and geographic localization. The name has been used independently by several groups, so it now covers three distinct projects: ServiceNow Research's GEO-Bench for earth observation foundation models[1] and its 2025 successor GEO-Bench-2 from IBM, ServiceNow, and the AI Alliance[2]; GEOBench-VLM for vision-language models on remote sensing[3]; and the CCMDI GeoBench, a community suite that asks models to play GeoGuessr on street-level photographs[4]. Modern AI is very good at general image and language tasks and noticeably worse at reading the physical world the way a remote sensing analyst would; the GeoBench projects exist to put numbers on that gap.
ServiceNow's original GEO-Bench, released in June 2023 and presented at NeurIPS 2023 Datasets and Benchmarks, focuses on classification and segmentation of satellite imagery[1]. GEOBench-VLM, released in late 2024 and accepted to ICCV 2025, extends that focus to vision-language models (VLMs) with thousands of multiple-choice questions (MCQs) about remote sensing scenes[3]. The CCMDI variant takes a different angle and tests whether multimodal models can geolocate ground-level photographs in the spirit of GeoGuessr[4]. GEO-Bench-2, previewed in October 2025, expands v1 to 19 datasets across 9 capability subsets and shifts the framing from raw performance to capability profiling[2]. Adjacent projects such as GeoBenchX (LLM agents on multi-step GIS tasks) reuse the brand[5].
GEO-Bench v1 was introduced by Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Hannah Kerner, Bjorn Lutjens, Jeremy Irvin, Yoshua Bengio, Stefano Ermon, Xiao Xiang Zhu, and others (arXiv:2306.03831, June 2023)[1]. ServiceNow Research led the work with partners at Stanford, MIT, ASU, TU Munich, and Clark University; the paper appeared at NeurIPS 2023 Datasets and Benchmarks[6]. The package is distributed under Apache 2.0 as a Python library (pip install geobench); the full download is around 65 GB[7]. It contains six classification and six segmentation tasks drawn from earlier remote sensing datasets, with fixed splits plus reduced training splits for sample efficiency studies.
Classification tasks:
| Dataset | Sensor | Train / test | Classes |
|---|---|---|---|
| m-bigearthnet | Sentinel-2 | 20,000 / 1,000 | 43 |
| m-so2sat | Sentinel-1 + Sentinel-2 | 19,992 / 986 | 17 |
| m-brick-kiln | Sentinel-2 | 15,063 / 999 | 2 |
| m-forestnet | Landsat-8 | 6,464 / 993 | 12 |
| m-eurosat | Sentinel-2 | 2,000 / 1,000 | 10 |
| m-pv4ger | RGB aerial | 11,814 / 999 | 2 |
Segmentation tasks:
| Dataset | Sensor | Train / test | Classes |
|---|---|---|---|
| m-pv4ger-seg | RGB aerial | 3,000 / 403 | 2 |
| m-chesapeake-landcover | RGB + NIR | 3,000 / 1,000 | 7 |
| m-cashew-plantation | Sentinel-2 | 1,350 / 50 | 7 |
| m-SA-crop-type | Sentinel-2 | 3,000 / 1,000 | 10 |
| m-nz-cattle | RGB aerial | 524 / 65 | 2 |
| m-NeonTree | RGB + hyperspectral + LiDAR | 270 / 93 | 2 |
The protocol aggregates results across tasks using a normalized score with bootstrap confidence intervals, intended to make the benchmark less sensitive to which task happens to be easiest in a given year[1]. The paper reports 20 baseline configurations covering ResNet18, ResNet50, ConvNeXt-Base, ViT-Tiny, ViT-Small, and SwinV2-Tiny, trained from scratch, from a timm ImageNet checkpoint, or from a remote-sensing pretraining scheme such as MoCo or SeCo. SwinV2-Tiny is the strongest aggregated model on RGB satellite tasks; ConvNeXt-Base overtakes it in the low-data regime, evidence that convolutions remain more sample-efficient on small remote sensing datasets[1].
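To make the aggregation idea concrete, here is a minimal sketch of a normalized cross-task score with a percentile bootstrap confidence interval. It is not the paper's exact procedure: the task names, the raw scores, the per-task reference bounds, and the choice to resample seeds within each task are all assumptions made for illustration.

```python
import random
import statistics


def normalize(score, low, high):
    """Rescale a raw task score so `low` maps to 0 and `high` maps to 1."""
    return (score - low) / (high - low)


def aggregate(norm_scores_by_task):
    """Average over tasks of the per-task mean normalized score."""
    return statistics.mean(
        statistics.mean(scores) for scores in norm_scores_by_task.values()
    )


def bootstrap_ci(norm_scores_by_task, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling seeds inside each task."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = {
            task: rng.choices(scores, k=len(scores))
            for task, scores in norm_scores_by_task.items()
        }
        estimates.append(aggregate(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical accuracies from three seeds on two made-up tasks.
raw = {"task_a": [0.91, 0.93, 0.92], "task_b": [0.55, 0.60, 0.58]}
# Hypothetical reference bounds used for normalization (assumed, not from the paper).
bounds = {"task_a": (0.80, 0.95), "task_b": (0.40, 0.70)}

norm = {
    task: [normalize(s, *bounds[task]) for s in scores]
    for task, scores in raw.items()
}
point = aggregate(norm)
low, high = bootstrap_ci(norm)
print(f"aggregate={point:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Because every task is rescaled before averaging, a task with naturally high raw accuracy contributes no more to the headline number than a hard one, and the interval shows how much of a gap between two models could be seed noise.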
GEO-Bench-2, developed by IBM, ServiceNow, MBZUAI, NASA, ESA Phi-lab, TU Munich, ASU, and Clark University under the AI Alliance Climate and Sustainability Group, was previewed on October 15, 2025 with a public leaderboard on Hugging Face[2]. It expands v1 to 19 datasets across 9 capability subsets and tightens licensing so every dataset can be redistributed cleanly; new datasets include biomassters, cloudsen12, dynamic_earthnet, flair2, kuro_siwo, pastis, spacenet2, spacenet7, substation, and treesatai[8]. The preview reports over 15,000 runs orchestrated through TerraTorch. The new protocol uses dataset-wise min-max normalization and aggregates with bootstrapped interquartile means, and the leaderboard separates full fine-tuning from frozen-encoder evaluation. TerraMind and Prithvi-EO-2.0 dominate multispectral subsets, while DINOv3 and ConvNeXt remain competitive on RGB and high-resolution scenarios[2].
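The two aggregation steps named above, dataset-wise min-max normalization and a bootstrapped interquartile mean, can be sketched as follows. This is an illustration under assumptions rather than the leaderboard's code: the dataset and model names and all scores are hypothetical, the minimum and maximum are taken over the models being compared, and quartile cut points are simply truncated when the sample size is not a multiple of four.

```python
import random


def min_max_normalize(scores_by_model):
    """Rescale one dataset's scores so the worst model is 0 and the best is 1."""
    lo = min(scores_by_model.values())
    hi = max(scores_by_model.values())
    if hi == lo:  # all models tie, so there is no spread to normalize
        return {model: 0.0 for model in scores_by_model}
    return {model: (s - lo) / (hi - lo) for model, s in scores_by_model.items()}


def interquartile_mean(values):
    """Mean of the middle half: drop the lowest and highest quarter."""
    ordered = sorted(values)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrapped_iqm(values, n_resamples=1000, seed=0):
    """Average the interquartile mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = [
        interquartile_mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    return sum(estimates) / len(estimates)


# Hypothetical raw scores: dataset -> model -> metric value.
raw = {
    "dataset_1": {"model_x": 0.70, "model_y": 0.80, "model_z": 0.60},
    "dataset_2": {"model_x": 0.40, "model_y": 0.35, "model_z": 0.45},
    "dataset_3": {"model_x": 0.90, "model_y": 0.95, "model_z": 0.85},
    "dataset_4": {"model_x": 0.55, "model_y": 0.65, "model_z": 0.50},
}
normalized = {name: min_max_normalize(scores) for name, scores in raw.items()}
for model in ("model_x", "model_y", "model_z"):
    per_dataset = [normalized[name][model] for name in raw]
    print(model, round(bootstrapped_iqm(per_dataset), 3))
```

Trimming the top and bottom quarters means a single unusually good or bad run cannot move the aggregate much, which is the reason to prefer the interquartile mean over a plain mean.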
GEOBench-VLM, introduced by Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan (arXiv:2411.19325, November 2024), evaluates vision-language models on remote sensing[3]. Authors come from MBZUAI, UCL, Linkoping University, IBM Research Europe, ServiceNow Research, and ANU. The paper was accepted to ICCV 2025 in Honolulu and released under Apache 2.0[9]. It is built around 31 sub-tasks across 8 broad categories with more than 10,000 manually verified instructions.
| Category | Sub-tasks (selected) |
|---|---|
| Scene understanding | Scene classification, land-use classification, crop classification |
| Object classification | Ship type, aircraft type |
| Localization and counting | Referring expression detection, spatial relationships, vehicle/aircraft/building/tree/marine debris counting |
| Event detection | Fire risk assessment, disaster type classification |
| Caption generation | Scene and object-aware image captioning |
| Semantic segmentation | Referring expression segmentation, urban vs. non-urban masks |
| Temporal understanding | Change detection, damaged-building counting, farm-pond change |
| Non-optical | SAR ship detection, SAR flood detection, earthquake magnitude estimation |
Multiple-choice questions are generated by GPT-4o with five options (one correct answer, one closely related distractor verified by humans, and three plausible alternatives) and are then manually reviewed. For counting questions, the incorrect options deviate from the true count by plus or minus 20 and 40 percent, so a rough estimate is not enough to pick the right answer[3].
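One way to read the counting rule is that the four wrong options sit at fixed relative offsets from the true count. The sketch below builds such an option set; the function name and the rounding behavior are assumptions, and the benchmark's own generation code may differ.

```python
import random


def counting_options(true_count, deviations=(0.2, 0.4)):
    """Return the sorted, de-duplicated answer options for a counting question.

    Each deviation d contributes two distractors, true_count * (1 - d) and
    true_count * (1 + d), rounded to whole objects. Small counts can round
    onto each other, in which case fewer than five options are returned.
    """
    options = {true_count}
    for d in deviations:
        options.add(max(0, round(true_count * (1 - d))))
        options.add(round(true_count * (1 + d)))
    return sorted(options)


# Hypothetical question: the scene contains 50 vehicles.
options = counting_options(50)
rng = random.Random(0)
shuffled = options[:]
rng.shuffle(shuffled)  # present the options in random order
print(shuffled, "correct index:", shuffled.index(50))
```

With offsets this wide, a model that is within about ten percent of the true count still lands on the correct option, while one that only gets the order of magnitude right does not.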
The paper benchmarks 13 VLMs. Generic models include LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision, Sphinx, Ferret, InternVL2, Qwen2-VL, and GPT-4o. Geospatial-specific models include GeoChat, RS-LLaVA, SkySenseGPT, EarthDial, and LHRS-Bot-Nova[3].
| Model | Average MCQ accuracy |
|---|---|
| LLaVA-OneVision | 41.72% |
| GPT-4o | 41.14% |
| Qwen2-VL | 40.25% |
| EarthDial | 37.70% |
| Random baseline (five options) | 20.00% |
LLaVA-OneVision leads counting tasks (buildings, vehicles, marine debris, tree health). GPT-4o is the best fine-grained object classifier (ships, aircraft) and wins disaster classification and damaged-building counting. EarthDial, the only geospatial-specific model among the leaders, is best on land-use classification and event detection. Sphinx wins image captioning by BERTScore. Qwen2-VL leads earthquake magnitude estimation from SAR imagery, where GPT-4o is worst[3]. No model exceeds 42 percent average accuracy, roughly double random guessing, and multispectral inputs degrade performance sharply because the VLMs are trained on RGB[3].
The CCMDI GeoBench puts language and vision models in front of GeoGuessr-style street-level photographs and asks them to predict the country and the exact latitude and longitude. The benchmark, the geobench.org leaderboard, and a writeup were built by ccmdi, with code released under the MIT license[4][10]; results are tracked by Epoch AI[10]. Images come from five real GeoGuessr community maps, with "A Community World" as the headline test set. Models receive the photograph and a short prompt and return a JSON object with country and coordinates. The sampling temperature is set to 0.4 for models that expose the setting. Each prediction is scored with the GeoGuessr formula: a maximum of 5,000 points per image for a guess within roughly 25 meters, decaying exponentially with haversine distance[10].
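The scoring rule can be sketched in a few lines. The haversine formula is standard, but the rest is assumed for illustration: the decay constant is a community-derived approximation for world-scale maps rather than an official value, the JSON keys are hypothetical, and the benchmark's own implementation may round or clamp differently.

```python
import json
import math

EARTH_RADIUS_KM = 6371.0
MAX_SCORE = 5000.0
PERFECT_RADIUS_KM = 0.025  # "roughly 25 meters", per the description above
DECAY_KM = 1492.7          # assumed decay constant, not an official value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def geoguessr_style_score(distance_km):
    """Full marks inside the perfect radius, exponential decay beyond it."""
    if distance_km <= PERFECT_RADIUS_KM:
        return MAX_SCORE
    return MAX_SCORE * math.exp(-distance_km / DECAY_KM)


# Hypothetical model reply; the key names are assumptions.
reply = json.loads('{"country": "France", "lat": 48.8566, "lng": 2.3522}')
truth = {"country": "United Kingdom", "lat": 51.5074, "lng": -0.1278}

distance = haversine_km(reply["lat"], reply["lng"], truth["lat"], truth["lng"])
score = geoguessr_style_score(distance)
country_correct = reply["country"] == truth["country"]
print(f"distance={distance:.1f} km  score={score:.0f}  country_correct={country_correct}")
```

Under this shape of curve, a guess in the wrong but neighboring country still earns most of the points, which is why the leaderboard also reports country accuracy separately.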
Key reported results (CCMDI's 2024 leaderboard, A Community World map):
| Rank | Model | Average score | Notes |
|---|---|---|---|
| 1 | Claude 3.5 Sonnet | 2,268.97 | Strongest overall on subtle scene cues |
| 2 | GPT-4V | 2,145.32 | Best on text-heavy scenes (signage, scripts) |
| 3 | Gemini 1.5 Pro | 2,087.54 | Strong on terrain and vegetation cues |
| - | Human (average) | 2,109.40 | Casual GeoGuessr players |
| - | Human (expert) | 4,579.40 | Top-ranked GeoGuessr players |
The top model still trails the strongest human players by roughly 2,300 points on a 5,000-point scale. Epoch AI folded GeoBench into its broader multimodal evaluation, where it produced suggestive evidence that Gemini is more heavily optimized for vision and Claude leans toward code[10].
| Variant | Primary metric | Aggregation |
|---|---|---|
| GEO-Bench v1 (classification) | Top-1 accuracy, F1 | Normalized score with bootstrap CIs |
| GEO-Bench v1 (segmentation) | Mean IoU | Normalized aggregation |
| GEO-Bench-2 | Min-max normalized per task | Bootstrapped interquartile mean |
| GEOBench-VLM (MCQ) | Five-option MCQ accuracy | Per-category and overall mean |
| GEOBench-VLM (detection) | Precision @ IoU 0.25 and 0.50 | Reported separately |
| GEOBench-VLM (captions) | BERTScore vs. references | Per-model average |
| CCMDI GeoBench | Distance, country accuracy | Mean GeoGuessr score (max 5,000) |
GEO-Bench v1 uses normalized scores and bootstrap intervals so improving one easy dataset cannot silently dominate the headline number[1]. GEO-Bench-2 extends this with interquartile means that trim outliers[2]. GEOBench-VLM keeps mean accuracy but invests heavily in question quality through manual verification.
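For the detection row in the table above, precision at an IoU threshold can be sketched as below. The sketch assumes one predicted box per referring expression, paired with one reference box, and axis-aligned boxes in (x_min, y_min, x_max, y_max) form; the benchmark's own matching rules and box format may differ, and the example boxes are made up.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predictions, references, threshold):
    """Share of predictions whose IoU with the paired reference meets the threshold."""
    hits = sum(
        box_iou(pred, ref) >= threshold
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


# Hypothetical boxes: an exact match, a partial overlap, and a near miss.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
references = [(0, 0, 10, 10), (5, 0, 15, 10), (8, 8, 18, 18)]

for threshold in (0.25, 0.50):
    value = precision_at_iou(predictions, references, threshold)
    print(f"precision@{threshold:.2f} = {value:.3f}")
```

Reporting both thresholds separates models that find roughly the right region (0.25) from those that localize the object tightly (0.50).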
Large gaps remain between human and machine performance. On CCMDI GeoBench, expert humans score roughly double the strongest model, and even average GeoGuessr players are competitive with frontier VLMs[4][10]. On GEOBench-VLM, no model exceeds 42 percent average accuracy across 31 sub-tasks[3].
Domain-specific pretraining helps, but only sometimes. ConvNeXt-Base and SwinV2-Tiny remain the most consistent generic architectures on GEO-Bench v1, and Sentinel-2 self-supervised pretraining reliably improves ResNet50 on multispectral tasks[1]. In GEO-Bench-2, TerraMind and Prithvi-EO-2.0 dominate multispectral subsets, while DINOv3 and ConvNeXt lead on RGB and high-resolution scenarios[2].
Counting, multispectral, and temporal reasoning remain weak across the board. LLaVA-OneVision counts well on GEOBench-VLM, but its accuracy collapses past 50 objects per scene; reducing multispectral inputs to pseudo-RGB sharply degrades crop classification; and VLMs do not yet exploit temporal context for change detection[3].
| Domain | Example application | Relevant variant |
|---|---|---|
| Disaster response | xBD damage assessment, flood mapping | GEOBench-VLM, GEO-Bench-2 |
| Agriculture | Crop type classification, field boundary segmentation | GEO-Bench v1, GEO-Bench-2 |
| Urban planning | Land-use classification, building footprint extraction | GEO-Bench v1, GEOBench-VLM |
| Environmental monitoring | Deforestation, brick kilns, biomass estimation | GEO-Bench v1, GEO-Bench-2 |
| Energy | Solar PV detection, substation identification | GEO-Bench v1, GEO-Bench-2 |
| Travel and OSINT | GeoGuessr-style image geolocation | CCMDI GeoBench |
GEO-Bench v1 and GEO-Bench-2 sit closest to operational earth observation, drawing on real datasets for agriculture (cashew plantations, South African crops), deforestation (ForestNet), and carbon-relevant infrastructure (PV4Ger, substations). GEOBench-VLM extends those tasks into language-conditioned use cases. CCMDI GeoBench is more consumer-facing but exposes the same core skill of integrating visual cues into a geographic guess.
GeoBenchX (Krechetova and Kochedykov, arXiv:2503.18129, March 2025) tests tool-calling LLM agents on multi-step geospatial tasks with 23 GIS functions[5]. Related benchmarks such as VRSBench[11], MMBench, and MM-Vet evaluate VLMs more broadly, while domain datasets such as PASTIS and xBD feed into GEO-Bench-2 and GEOBench-VLM. The "GeoBench" name has also been used for unrelated work on monocular geometry and image editing[12].
Each variant has caveats. GEO-Bench v1 is geographically biased toward Europe and Africa, and most of its tasks are small enough to fit on a single GPU; GEO-Bench-2 is still in preview[2]; GEOBench-VLM uses GPT-4o to generate its questions, which partly shapes the test distribution even though manual verification mitigates this[3]; and CCMDI GeoBench depends on community GeoGuessr maps that change over time, so reproducing scores requires pinning the map and prompt[4][10]. None of the variants evaluates 3D spatial reasoning in depth, and long-horizon temporal reasoning is mostly limited to bi-temporal change detection.