GeoBench

Updated Jun 9, 2026

GeoBench
Overview
Full name	Geospatial Benchmarks Collection
Description	A family of benchmarks for AI models on geospatial reasoning, earth monitoring, and geographic localization
First release	June 2023 (GEO-Bench v1)
Latest variant	GEO-Bench-2 (preview, October 2025)
Lead authors	Alexandre Lacoste et al. (GEO-Bench), Muhammad Sohail Danish et al. (GEOBench-VLM), CCMDI (GeoGuessr GeoBench)
Organizations	ServiceNow Research, MBZUAI, IBM Research, TU Munich, The AI Alliance, CCMDI
Technical details
Modality	Multispectral satellite, RGB, SAR, text, multi-temporal
Task formats	Classification, segmentation, object detection, counting, captioning, MCQs, lat/long regression
Number of tasks	12 (GEO-Bench v1), 19 datasets / 9 subsets (GEO-Bench-2), 31 sub-tasks (GEOBench-VLM), 5 maps (CCMDI)
Total examples	~10,000 manually verified MCQs (GEOBench-VLM); ~65 GB imagery (GEO-Bench v1)
Evaluation metrics	Accuracy, F1, mean IoU (mIoU), bootstrap interquartile mean, BERTScore, haversine distance
Domains	Geography, remote sensing, agriculture, disaster response, urban planning
Performance
Top GEOBench-VLM score	41.72% (LLaVA-OneVision, 2024)
Top CCMDI GeoBench score	2,268.97 (Claude 3.5 Sonnet, 2024)
Human GeoGuessr expert	4,579.4 average
Saturated	No
Resources
GEO-Bench paper	arXiv:2306.03831
GEOBench-VLM paper	arXiv:2411.19325
GitHub	ServiceNow/geo-bench, The-AI-Alliance/GEO-Bench-VLM, ccmdi/geobench
Dataset	HuggingFace
Licenses	Apache 2.0 (ServiceNow, AI Alliance), MIT (CCMDI)

GeoBench is the umbrella name for a family of artificial intelligence benchmarks that evaluate models on geospatial reasoning, earth monitoring, and geographic localization. The name has been used independently by several groups, so it now covers three distinct lines of work: ServiceNow Research's GEO-Bench for earth observation foundation models^[1], together with its 2025 successor GEO-Bench-2 from IBM, ServiceNow, and the AI Alliance^[2]; GEOBench-VLM for vision-language models on remote sensing^[3]; and the CCMDI GeoBench, a community suite that asks models to play GeoGuessr on street-level photographs^[4]. Current models are strong on general image and language tasks but noticeably weaker at reading the physical world the way a remote sensing analyst would; the GeoBench projects exist to quantify that gap.

Overview

ServiceNow's original GEO-Bench, released in June 2023 and presented at NeurIPS 2023 Datasets and Benchmarks, focuses on classification and segmentation of satellite imagery^[1]. GEOBench-VLM, released in late 2024 and accepted to ICCV 2025, extends that focus to vision-language models (VLMs) with thousands of multiple-choice questions (MCQs) about remote sensing scenes^[3]. The CCMDI variant takes a different angle and tests whether multimodal models can geolocate ground-level photographs in the spirit of GeoGuessr^[4]. GEO-Bench-2, previewed in October 2025, expands v1 to 19 datasets across 9 capability subsets and shifts the framing from raw performance to capability profiling^[2]. Adjacent projects such as GeoBenchX (LLM agents on multi-step GIS tasks) reuse the brand^[5].

Variants

ServiceNow GEO-Bench (v1)

GEO-Bench v1 was introduced by Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Hannah Kerner, Bjorn Lutjens, Jeremy Irvin, Yoshua Bengio, Stefano Ermon, Xiao Xiang Zhu, and others (arXiv:2306.03831, June 2023)^[1]. ServiceNow Research led the work with partners at Stanford, MIT, ASU, TU Munich, and Clark University; the paper appeared at NeurIPS 2023 Datasets and Benchmarks^[6]. The package is distributed under Apache 2.0 as a Python library (pip install geobench); the full download is around 65 GB^[7]. It contains six classification and six segmentation tasks drawn from earlier remote sensing datasets, with fixed splits plus reduced training splits for sample efficiency studies.

Classification tasks (GEO-Bench v1)

Dataset	Sensor	Train / test	Classes
m-bigearthnet	Sentinel-2	20,000 / 1,000	43
m-so2sat	Sentinel-1 + Sentinel-2	19,992 / 986	17
m-brick-kiln	Sentinel-2	15,063 / 999	2
m-forestnet	Landsat-8	6,464 / 993	12
m-eurosat	Sentinel-2	2,000 / 1,000	10
m-pv4ger	RGB aerial	11,814 / 999	2

Segmentation tasks (GEO-Bench v1)

Dataset	Sensor	Train / test	Classes
m-pv4ger-seg	RGB aerial	3,000 / 403	2
m-chesapeake-landcover	RGB + NIR	3,000 / 1,000	7
m-cashew-plantation	Sentinel-2	1,350 / 50	7
m-SA-crop-type	Sentinel-2	3,000 / 1,000	10
m-nz-cattle	RGB aerial	524 / 65	2
m-NeonTree	RGB + hyperspectral + LiDAR	270 / 93	2

The protocol aggregates results across tasks using a normalized score with bootstrap confidence intervals, intended to make the benchmark less sensitive to which task happens to be easiest in a given year^[1]. The paper reports 20 baseline configurations covering ResNet18, ResNet50, ConvNeXt-Base, ViT-Tiny, ViT-Small, and SwinV2-Tiny, trained from scratch, from a timm ImageNet checkpoint, or from a remote-sensing pretraining scheme such as MoCo or SeCo. SwinV2-Tiny is the strongest aggregated model on RGB satellite tasks; ConvNeXt-Base overtakes it in the low-data regime, evidence that convolutions remain more sample-efficient on small remote sensing datasets^[1].

GEO-Bench-2 (2025 successor)

GEO-Bench-2, developed by IBM, ServiceNow, MBZUAI, NASA, ESA Phi-lab, TU Munich, ASU, and Clark University under the AI Alliance Climate and Sustainability Group, was previewed on October 15, 2025 with a public leaderboard on Hugging Face^[2]. It expands v1 to 19 datasets across 9 capability subsets and tightens licensing so every dataset can be redistributed cleanly; new datasets include biomassters, cloudsen12, dynamic_earthnet, flair2, kuro_siwo, pastis, spacenet2, spacenet7, substation, and treesatai^[8]. The new protocol reports over 15,000 runs orchestrated through TerraTorch, uses dataset-wise min-max normalization, and aggregates with bootstrapped interquartile means; the leaderboard separates full fine-tuning from frozen-encoder evaluation. TerraMind and Prithvi-EO-2.0 dominate multispectral subsets, while DINOv3 and ConvNeXt remain competitive on RGB and high-resolution scenarios^[2].
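The normalization and aggregation steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the GEO-Bench-2 or TerraTorch implementation: the function names, the model and dataset names, the scores, and the simple floor-based trimming in `interquartile_mean` are all assumptions made for the example.

```python
import random
from statistics import mean


def min_max_normalize(scores_by_model):
    """Rescale one dataset's scores so the worst model maps to 0 and the best to 1."""
    lo, hi = min(scores_by_model.values()), max(scores_by_model.values())
    span = hi - lo
    return {m: (s - lo) / span if span else 0.0 for m, s in scores_by_model.items()}


def interquartile_mean(values):
    """Mean of the middle half of the values (simplified: trims floor(n/4) per side)."""
    ordered = sorted(values)
    cut = len(ordered) // 4
    return mean(ordered[cut:len(ordered) - cut])


def bootstrap_iqm(values, n_resamples=1000, seed=0):
    """Return the IQM and a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    estimates = sorted(
        interquartile_mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return interquartile_mean(values), (low, high)


# Hypothetical raw scores (dataset -> model -> metric); not real leaderboard values.
raw = {
    "dataset_a": {"model_x": 0.60, "model_y": 0.80, "model_z": 0.70},
    "dataset_b": {"model_x": 0.30, "model_y": 0.35, "model_z": 0.50},
    "dataset_c": {"model_x": 0.90, "model_y": 0.92, "model_z": 0.88},
    "dataset_d": {"model_x": 0.55, "model_y": 0.75, "model_z": 0.65},
}
normalized = {name: min_max_normalize(scores) for name, scores in raw.items()}
for model in ("model_x", "model_y", "model_z"):
    per_dataset = [normalized[name][model] for name in raw]
    point, (low, high) = bootstrap_iqm(per_dataset)
    print(f"{model}: IQM={point:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Normalizing per dataset keeps a dataset with a naturally wide score range from dominating the aggregate, and the interquartile mean discards the extremes before averaging.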

GEOBench-VLM

GEOBench-VLM, introduced by Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan (arXiv:2411.19325, November 2024), evaluates vision-language models on remote sensing^[3]. Authors come from MBZUAI, UCL, Linkoping University, IBM Research Europe, ServiceNow Research, and ANU. The paper was accepted to ICCV 2025 in Honolulu and released under Apache 2.0^[9]. It is built around 31 sub-tasks across 8 broad categories with more than 10,000 manually verified instructions.

Models and headline results

The paper benchmarks 13 VLMs. Generic models include LLaVA-1.5, LLaVA-NeXT, LLaVA-OneVision, Sphinx, Ferret, InternVL2, Qwen2-VL, and GPT-4o. Geospatial-specific models include GeoChat, RS-LLaVA, SkySenseGPT, EarthDial, and LHRS-Bot-Nova^[3].

Model	Average MCQ accuracy
LLaVA-OneVision	41.72%
GPT-4o	41.14%
Qwen2-VL	40.25%
EarthDial	37.70%
Random baseline	20.00%

LLaVA-OneVision leads counting tasks (buildings, vehicles, marine debris, tree health). GPT-4o is the best fine-grained object classifier (ships, aircraft) and wins disaster classification and damaged-building counting. EarthDial, the only geospatial-specific model among the leaders, is best on land-use classification and event detection. Sphinx wins image captioning by BERTScore. Qwen2-VL leads earthquake magnitude estimation from SAR imagery, where GPT-4o is worst^[3]. No model exceeds 42 percent average accuracy, roughly double the 20 percent expected from random guessing on five-option questions, and multispectral inputs degrade performance sharply because the VLMs are trained on RGB^[3].
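As a rough illustration of how MCQ accuracy figures like these are aggregated, the sketch below computes per-category accuracy plus two overall means. The source does not say whether the overall figure is weighted by question or by category, so the code returns both; the function name, record layout, and sample answers are hypothetical, and the chance level assumes five options per question as listed in the methodology table.

```python
from collections import defaultdict


def mcq_accuracy(records):
    """records: iterable of (category, predicted_option, correct_option) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        hits[category] += int(predicted == correct)
    per_category = {c: hits[c] / totals[c] for c in totals}
    # Question-weighted mean: every question counts equally.
    overall = sum(hits.values()) / sum(totals.values())
    # Category-weighted mean: every category counts equally.
    macro = sum(per_category.values()) / len(per_category)
    return per_category, overall, macro


# Hypothetical model answers; not taken from the benchmark.
records = [
    ("counting", "A", "A"),
    ("counting", "B", "C"),
    ("scene", "D", "D"),
    ("scene", "E", "E"),
    ("scene", "A", "B"),
    ("scene", "C", "C"),
]
per_category, overall, macro = mcq_accuracy(records)
chance = 1 / 5  # uniform guessing over five options
print(per_category, round(overall, 3), round(macro, 3), chance)
```

The two means diverge whenever categories have different numbers of questions, which is worth checking before comparing scores across papers.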

CCMDI GeoBench (GeoGuessr benchmark)

The CCMDI GeoBench puts language and vision models in front of GeoGuessr-style street-level photographs and asks them to predict the country and exact latitude/longitude. The benchmark, the geobench.org leaderboard, and a writeup were built by ccmdi, with code released under the MIT license^[4]^[10]; results are tracked by Epoch AI^[10]. Images come from five real GeoGuessr community maps, with "A Community World" as the headline test set. Models receive the photograph and a short prompt and return a JSON object with country and coordinates. Sampling temperature is set to 0.4 for models whose API exposes it. Each prediction is scored with the GeoGuessr formula: a maximum of 5,000 points per image for a guess within roughly 25 meters, decaying exponentially with haversine distance^[10].
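The scoring rule can be sketched as follows. The haversine formula is standard, but the decay constant is an assumption: `decay_km=1492.7` is a community-derived approximation for world-sized maps, not a value given in the source, and the function names are hypothetical. The coordinates are approximate city centers used only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geoguessr_style_score(distance_km, decay_km=1492.7, perfect_radius_km=0.025):
    """Full marks within ~25 m, then exponential decay with distance.

    decay_km is an assumed constant, not taken from the benchmark.
    """
    if distance_km <= perfect_radius_km:
        return 5000.0
    return 5000.0 * math.exp(-distance_km / decay_km)


# Example: the true location is central Paris, the model guesses central London.
d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(f"distance = {d:.1f} km, score = {geoguessr_style_score(d):.0f}")
```

Because the decay is exponential, errors of a few hundred kilometers still earn most of the points, while guessing the wrong continent earns almost none.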

Key reported results (CCMDI's 2024 leaderboard, A Community World map):

Rank	Model	Average score	Notes
1	Claude 3.5 Sonnet	2,268.97	Strongest overall on subtle scene cues
2	GPT-4V	2,145.32	Best on text-heavy scenes (signage, scripts)
3	Gemini 1.5 Pro	2,087.54	Strong on terrain and vegetation cues
-	Human (average)	2,109.40	Casual GeoGuessr players
-	Human (expert)	4,579.40	Top-ranked GeoGuessr players

The top model still trails the strongest human players by roughly 2,300 points on a 5,000-point scale. Epoch AI folded GeoBench into its broader multimodal evaluation, where it produced suggestive evidence that Gemini is more heavily optimized for vision and Claude leans toward code^[10].

Methodology and scoring

Variant	Primary metric	Aggregation
GEO-Bench v1 (classification)	Top-1 accuracy, F1	Normalized score with bootstrap CIs
GEO-Bench v1 (segmentation)	Mean IoU	Normalized aggregation
GEO-Bench-2	Min-max normalized per task	Bootstrapped interquartile mean
GEOBench-VLM (MCQ)	Five-option MCQ accuracy	Per-category and overall mean
GEOBench-VLM (detection)	Precision @ IoU 0.25 and 0.50	Reported separately
GEOBench-VLM (captions)	BERTScore vs. references	Per-model average
CCMDI GeoBench	Distance, country accuracy	Mean GeoGuessr score (max 5,000)

GEO-Bench v1 uses normalized scores and bootstrap intervals so improving one easy dataset cannot silently dominate the headline number^[1]. GEO-Bench-2 extends this with interquartile means that trim outliers^[2]. GEOBench-VLM keeps mean accuracy but invests heavily in question quality through manual verification.
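For the segmentation row in the table above, mean IoU can be sketched as below. This is a generic definition over flattened label arrays; whether GEO-Bench accumulates IoU per image or over the whole test set, and how it treats classes absent from both prediction and ground truth, is not stated in the source, so those choices (and the function name) are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat lists of class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither prediction nor ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 4-pixel example with two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(round(mean_iou(pred, target, num_classes=2), 4))
```

Averaging over classes rather than pixels means a rare class counts as much as a common one, which matters for tasks where the foreground covers a small share of each image.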

Key findings

Large gaps remain between human and machine performance. On CCMDI GeoBench, expert humans score roughly twice as high as the strongest model, and even average GeoGuessr players are competitive with frontier VLMs^[4]^[10]. On GEOBench-VLM no model exceeds 42 percent average accuracy across 31 tasks^[3].

Domain-specific pretraining helps, but not consistently. ConvNeXt-Base and SwinV2-Tiny remain the most consistent generic architectures on GEO-Bench v1, and Sentinel-2 self-supervised pretraining reliably improves ResNet50 on multispectral tasks^[1]; in GEO-Bench-2, TerraMind and Prithvi-EO-2.0 dominate multispectral subsets while DINOv3 and ConvNeXt lead on RGB and high-resolution scenarios^[2].

Counting, multispectral, and temporal reasoning remain weak across the board. LLaVA-OneVision counts well on GEOBench-VLM but its accuracy collapses past 50 objects per scene, reducing multispectral inputs to pseudo-RGB sharply degrades crop classification, and VLMs do not yet exploit temporal context for change detection^[3].

Applications

Domain	Example application	Relevant variant
Disaster response	xBD damage assessment, flood mapping	GEOBench-VLM, GEO-Bench-2
Agriculture	Crop type classification, field boundary segmentation	GEO-Bench v1, GEO-Bench-2
Urban planning	Land-use classification, building footprint extraction	GEO-Bench v1, GEOBench-VLM
Environmental monitoring	Deforestation, brick kilns, biomass estimation	GEO-Bench v1, GEO-Bench-2
Energy	Solar PV detection, substation identification	GEO-Bench v1, GEO-Bench-2
Travel and OSINT	GeoGuessr-style image geolocation	CCMDI GeoBench

GEO-Bench v1 and GEO-Bench-2 sit closest to operational earth observation, drawing on real datasets for agriculture (cashew plantations, South African crops), deforestation (ForestNet), and carbon-relevant infrastructure (PV4Ger, substations). GEOBench-VLM extends those tasks into language-conditioned use cases. CCMDI GeoBench is more consumer-facing but exposes the same core skill of integrating visual cues into a geographic guess.

Related work and limitations

GeoBenchX (Krechetova and Kochedykov, arXiv:2503.18129, March 2025) tests tool-calling LLM agents on multi-step geospatial tasks with 23 GIS functions^[5]. Related benchmarks like VRSBench^[11], MMBench, and MM-Vet evaluate VLMs more broadly, while domain suites such as PASTIS and xBD feed into GEO-Bench-2 and GEOBench-VLM. The "GeoBench" name has also been used for unrelated work on monocular geometry and image editing^[12].

Each variant has caveats: GEO-Bench v1 is biased toward Europe and Africa and most tasks fit on a single GPU; GEO-Bench-2 is still early at preview^[2]; GEOBench-VLM uses GPT-4o to generate questions, partly shaping the test distribution though manual verification helps^[3]; CCMDI GeoBench depends on community GeoGuessr maps that change over time, so reproducing scores requires pinning the map and prompt^[4]^[10]. None of the variants evaluate 3D spatial reasoning in depth, and long-horizon temporal reasoning is mostly limited to bi-temporal change detection.

References

1. Lacoste, A., et al. (2023). *GEO-Bench: Toward Foundation Models for Earth Monitoring*. arXiv:2306.03831. https://arxiv.org/abs/2306.03831
2. The AI Alliance Climate and Sustainability Group (2025). *GEO-Bench-2: From Performance to Capability*. https://thealliance.ai/blog/geo-bench-2-from-performance-to-capability-rethinking-evaluation-in-geospatial-ai
3. Danish, M. S., Munir, M. A., Shah, S. R. A., Kuckreja, K., Khan, F. S., Fraccaro, P., Lacoste, A., and Khan, S. (2024). *GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks*. arXiv:2411.19325. https://arxiv.org/abs/2411.19325
4. CCMDI (2024). *GeoBench: Can LLMs play GeoGuessr?* https://ccmdi.com/blog/GeoBench
5. Krechetova, V., and Kochedykov, D. (2025). *GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks*. arXiv:2503.18129. https://arxiv.org/abs/2503.18129
6. NeurIPS 2023 Datasets and Benchmarks Track. *GEO-Bench* poster page. https://neurips.cc/virtual/2023/poster/73632
7. ServiceNow Research. *geo-bench* GitHub repository. https://github.com/ServiceNow/geo-bench
8. The AI Alliance. *GEO-Bench-2* GitHub repository. https://github.com/The-AI-Alliance/GEO-Bench-2
9. The AI Alliance. *GEO-Bench-VLM* GitHub and project page. https://github.com/The-AI-Alliance/GEO-Bench-VLM and https://the-ai-alliance.github.io/GEO-Bench-VLM/
10. Epoch AI. *GeoBench* benchmark page. https://epoch.ai/benchmarks/geobench
11. Li, X., Ding, J., and Elhoseiny, M. (2024). *VRSBench*. arXiv:2406.12384. https://arxiv.org/abs/2406.12384
12. Emergent Mind. *GeoBench: Unified Geospatial AI Benchmarks*. https://www.emergentmind.com/topics/geobench

Categories (GEOBench-VLM)

Category	Sub-tasks (selected)
Scene understanding	Scene classification, land-use classification, crop classification
Object classification	Ship type, aircraft type
Localization and counting	Referring expression detection, spatial relationships, vehicle/aircraft/building/tree/marine debris counting
Event detection	Fire risk assessment, disaster type classification
Caption generation	Scene and object-aware image captioning
Semantic segmentation	Referring expression segmentation, urban vs. non-urban masks
Temporal understanding	Change detection, damaged-building counting, farm-pond change
Non-optical	SAR ship detection, SAR flood detection, earthquake magnitude estimation
