Fox is an evaluation suite for fine-grained, multi-page document understanding by large vision-language models. It was released in May 2024 alongside a model and training pipeline of the same name in the paper Focus Anywhere for Fine-grained Multi-page Document Understanding by Chenglong Liu, Haoran Wei, and collaborators from the University of Chinese Academy of Sciences and MEGVII Technology. The benchmark targets a gap that the authors identify in earlier work: many document understanding systems can transcribe a whole page but cannot answer questions about a specific column, paragraph, line, or colored span, and they break down further once a query spans several pages.
The "Fox" name doubles as the project's tagline, focus anywhere, and refers to the model's ability to be steered with positional prompts (click points, bounding boxes, or color cues) toward any region of any page. The benchmark is designed to test that capability against LVLMs of any architecture, not just the authors' own model.
By 2024, OCR and document VQA were dominated by single-page benchmarks such as DocVQA, ChartQA, and InfoVQA. These cover full-page transcription or short visual questions but rarely require a model to operate on a region defined by the user, and almost none of them stretch beyond a single page. The Fox authors argue that real reading is interactive: a user points at a paragraph, asks for a translation of a specific column, or compares a figure caption on page 3 with a table on page 7. Existing benchmarks did not test any of this.
Fox was introduced as a public probe for those skills. The paper, posted to arXiv as 2405.14295 on 23 May 2024, is also the first publication to describe what the authors call the Fox model, a vision-language model that combines two frozen vision encoders (a CLIP-style natural-image vocabulary and a Vary-style document vocabulary) and is fine-tuned on a small amount of synthetic, position-prompted data.
The Fox benchmark is bilingual and built from PDF pages collected from the open web. The paper reports 112 English pages and 100 Chinese pages, each with roughly a thousand or more characters and a mix of single- and multi-column layouts. A separate multi-page split groups pages into eight-page documents for the cross-page tasks. The OCRBench v2 survey from January 2025 summarises Fox as covering 2 scenarios (English and Chinese), 9 tasks, around 0.7k images, and around 2.2k instructions, which gives a sense of the overall annotation budget.
The benchmark's nine sub-tasks are summarised below. Names follow the paper.
| # | Sub-task | What it tests |
|---|---|---|
| 1 | Page-level (foreground) OCR | Full transcription of a dense, multi-column page in English or Chinese |
| 2 | Region-level OCR | Transcription of an arbitrary user-drawn box, often run as multi-turn dialogue over several boxes on one page |
| 3 | Line-level OCR | Transcription of a single line selected by a click point |
| 4 | Color-guided OCR | Transcription of text overlaid by a colored highlight (red, blue, or green) |
| 5 | Region-level translation | English-to-Chinese translation of the text inside a user-selected box |
| 6 | Region-level summary | Short summary of a user-selected text region |
| 7 | In-document figure caption | Caption for a natural image embedded in a PDF page |
| 8 | Multi-page multi-region OCR | OCR of several user-specified boxes spread across an eight-page document, all in a single query |
| 9 | Cross-page VQA | Visual question answering that requires comparing or aggregating content from more than one page |
Tasks 1 through 7 evaluate single-page focus, while tasks 8 and 9 evaluate the multi-page, format-free behaviour that gives the paper its title. The paper also introduces an in-document figure chat task that lets the user converse about an embedded image, but the public benchmark scripts focus on the nine items listed above.
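To make the instruction format concrete, the sketch below shows what a region-prompted and a color-prompted item conceptually contain. The field names and file names are hypothetical placeholders for illustration, not the schema used by the released evaluation scripts.

```python
# Illustrative only: field and file names are hypothetical, not the schema
# of the ucaslcl/Fox scripts. A region-level OCR item conceptually bundles
# a page image, a user-drawn box, the instruction, and the reference text.
region_ocr_item = {
    "image": "en_page_0042.png",          # hypothetical file name
    "task": "region_ocr",
    "box": [0.12, 0.30, 0.48, 0.55],      # x1, y1, x2, y2, normalized to [0, 1]
    "instruction": "Give the OCR result of the region in the box.",
    "reference": "Ground-truth transcription of that region goes here.",
}

# Color-guided OCR replaces the box with a color cue painted on the page.
color_ocr_item = {
    "image": "cn_page_0007.png",
    "task": "color_ocr",
    "color": "red",                       # red / blue / green highlight
    "instruction": "OCR the text covered by the red highlight.",
    "reference": "Ground-truth transcription of the highlighted span.",
}
```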
Fox reports different metrics depending on the task type. The eval scripts published in the GitHub repository compute the following:
| Task family | Metrics |
|---|---|
| OCR (page, region, line, color, multi-page) | F1, precision, recall, BLEU, METEOR, normalized edit distance |
| Region-level summary and figure caption | ROUGE |
| Cross-page VQA | Accuracy |
The OCR metric set is closer to traditional text-recognition evaluation than to the loose string match used by some VQA benchmarks. Edit distance in particular makes the score sensitive to small character errors, which matters for dense Chinese pages. The original paper reports a normalized edit distance of about 0.046 on English page OCR and 0.061 on Chinese page OCR for the Fox model itself, with corresponding F1 scores above 0.95. For the multi-page setting it reports an F1 of roughly 0.95 on multi-region OCR over eight-page documents and an accuracy near 0.83 on cross-page VQA. These numbers are useful as a yardstick rather than a leaderboard, since the benchmark is small enough that a single evaluation run is cheap and reproducible.
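As a rough illustration of how this metric family behaves (not the repository's exact scoring code, which may normalise and tokenise differently), a normalized edit distance and a character-level F1 can be computed as follows:

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer string's length (0.0 = exact match)."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1 from bag-of-character overlap between prediction and reference."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A single wrong character on a short string already moves the edit distance noticeably.
print(normalized_edit_distance("Focus anywhere", "Focus anywnere"))  # ~0.07
print(char_f1("Focus anywhere", "Focus anywnere"))
```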
The Fox paper is unusual in that the benchmark and the proposed model arrive together. The model side is a multimodal pipeline that wires two frozen vision encoders into one language model. A CLIP-ViT branch covers natural images, and a Vary-style ViT (close to the SAM image encoder, trained for document text) compresses a 1024 by 1024 page into 256 image tokens. Both encoders stay frozen during fine-tuning; only the projection layers and the language model receive gradient updates. The choice keeps the document vocabulary's strong text-recognition prior intact while letting the language side learn to switch between visual modes.
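A minimal sketch of that wiring, assuming placeholder encoder modules and illustrative dimensions (the released code defines the actual CLIP and Vary/SAM-style encoders and projector shapes):

```python
import torch
import torch.nn as nn

class DualVocabularyConnector(nn.Module):
    """Illustrative sketch: two frozen vision encoders feeding one language model.

    `clip_encoder` and `doc_encoder` stand in for the CLIP-style and
    Vary/SAM-style ViTs; both stay frozen, and only the linear projectors
    (and the language model, not shown here) would receive gradients.
    """

    def __init__(self, clip_encoder: nn.Module, doc_encoder: nn.Module,
                 clip_dim: int = 1024, doc_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.clip_encoder = clip_encoder.eval()
        self.doc_encoder = doc_encoder.eval()
        for p in self.clip_encoder.parameters():
            p.requires_grad_(False)          # frozen natural-image vocabulary
        for p in self.doc_encoder.parameters():
            p.requires_grad_(False)          # frozen document vocabulary
        self.clip_proj = nn.Linear(clip_dim, llm_dim)   # trainable projection
        self.doc_proj = nn.Linear(doc_dim, llm_dim)     # trainable projection

    def forward(self, natural_images: torch.Tensor, page_images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            clip_tokens = self.clip_encoder(natural_images)   # (B, N_clip, clip_dim)
            doc_tokens = self.doc_encoder(page_images)        # (B, 256 per page, doc_dim)
        # Project both token streams into the LLM embedding space and concatenate;
        # an eight-page document contributes 8 * 256 = 2048 document tokens.
        return torch.cat([self.clip_proj(clip_tokens), self.doc_proj(doc_tokens)], dim=1)
```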
The key trick in the paper is what the authors call cross-vocabulary data. Instead of training on either pure document images or pure natural images, they synthesise hybrid pages by rendering natural images directly onto PDF pages and removing any underlying text that would overlap. They also paint random text spans in red, blue, and green to teach the model how to follow color cues. With only this hybrid corpus and position-aware prompts (point coordinates, drawn boxes, colored regions), the model learns to localise its attention to a chosen part of a page without the encoders ever being retrained.
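A hedged sketch of the synthesis idea using Pillow; the authors' actual rendering pipeline is not published in this form, so the snippet below only illustrates pasting a natural image onto a rendered page, blanking the text it covers, and painting a colored cue over a text span. File names and coordinates are placeholders.

```python
from PIL import Image, ImageDraw

# Hypothetical file names; the real corpus is built from the authors' own PDF renders.
page = Image.open("rendered_pdf_page.png").convert("RGB")
photo = Image.open("natural_image.jpg").convert("RGB")

# 1) Paste a natural image onto the page; the paper first clears any text underneath.
paste_box = (600, 200, 984, 488)                       # x1, y1, x2, y2 in pixels
draw = ImageDraw.Draw(page)
draw.rectangle(paste_box, fill="white")                # blank out overlapped text
page.paste(photo.resize((paste_box[2] - paste_box[0],
                         paste_box[3] - paste_box[1])), paste_box[:2])

# 2) Paint a translucent colored highlight over a text span to create a color cue.
overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))
ImageDraw.Draw(overlay).rectangle((80, 700, 520, 730), fill=(255, 0, 0, 96))  # red span
page = Image.alpha_composite(page.convert("RGBA"), overlay).convert("RGB")

page.save("cross_vocabulary_sample.png")
```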
This architecture is the reason the multi-page tasks are tractable: a single eight-page document can be fed as 8 × 256 = 2048 image tokens, which still fits comfortably in a long-context language model.
The code, evaluation scripts, and pre-trained Fox model weights are published on GitHub under ucaslcl/Fox with an Apache 2.0 license for code and CC BY-NC 4.0 for data. The benchmark images are distributed as a single archive (focus_benchmark_test.zip, about 329 MB) on Hugging Face under ucaslcl/Fox_benchmark_data, with the dataset itself released under CC BY-NC-SA 4.0. Because the test set is small and the metrics are deterministic, running the full benchmark on a new model takes minutes once inference is set up, which has helped Fox become a quick smoke test for new document-understanding models.
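For reference, the archive can be fetched with huggingface_hub; the repository ID and file name below are the ones stated above, while the extraction path is arbitrary.

```python
import zipfile
from huggingface_hub import hf_hub_download

# Repository and file name as published on Hugging Face; destination directory is arbitrary.
archive = hf_hub_download(
    repo_id="ucaslcl/Fox_benchmark_data",
    filename="focus_benchmark_test.zip",
    repo_type="dataset",
)
with zipfile.ZipFile(archive) as zf:
    zf.extractall("fox_benchmark")   # ~329 MB of test pages and annotations
```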
Fox has been picked up by several follow-up systems and surveys.
| Project | How Fox is used |
|---|---|
| DeepSeek-OCR (October 2025) | Uses the English document portion of Fox as the testbed for its optical context compression study, selecting the 100 pages that fall in the 600 to 1300 token range |
| OCRBench v2 (January 2025) | Cites Fox in its overview of existing text-centric benchmarks (2 scenarios, 9 tasks, around 0.7k images) for comparison with its own broader suite |
| General OCR Theory / GOT-OCR (September 2024) | References Fox in its discussion of dense, position-aware OCR evaluation |
| ECLAIR (February 2025) | Cites Fox among the prior benchmarks that motivate layout-aware reading-order extraction |
The DeepSeek-OCR usage is the most quantitative. In their compression study the authors fix a model size and vary the number of vision tokens used to encode each page, then ask how accurately the original text can be decoded. They report roughly 97% OCR precision when text tokens are within ten times the number of vision tokens (a compression ratio under 10x), about 90% at 10 to 12x compression, and around 60% at 20x compression, all measured on the Fox English subset. These numbers became one of the most cited results from the DeepSeek-OCR paper and are largely responsible for the renewed interest in Fox during 2025 and 2026.
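The compression ratio in that study is simply the number of text tokens on a page divided by the number of vision tokens used to encode it, as the small helper below restates; the precision figures in the comment are the approximate values quoted above, not an exact reproduction of the paper's table.

```python
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """Text-to-vision token ratio as used in the DeepSeek-OCR compression study."""
    return n_text_tokens / n_vision_tokens

# A Fox page with ~1000 text tokens encoded into 100 vision tokens sits at 10x compression;
# the study reports roughly 97% precision below 10x, ~90% at 10-12x, and ~60% around 20x.
print(compression_ratio(1000, 100))   # 10.0
```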
Fox sits in a corner of the document-understanding landscape that few earlier benchmarks covered. The table below compares it on a few rough axes; the row for Fox is sourced from the original paper and the OCRBench v2 summary table.
| Benchmark | Region-level prompts | Multi-page | Bilingual | Notes |
|---|---|---|---|---|
| DocVQA | No | No | English | Question answering over single document images |
| ChartQA | No | No | English | Question answering over charts |
| InfoVQA | No | No | English | Question answering over infographics |
| OCRBench (v1) | Limited | No | English plus some Chinese | Aggregated short OCR and KIE tasks |
| MMDocBench | Limited | Some | English | Multi-task document understanding for LVLMs |
| Fox | Yes (points, boxes, colors) | Yes (eight-page) | English and Chinese | Nine fine-grained tasks, region- and color-guided OCR, cross-page VQA |
The fine-grained focus tasks and the cross-page split are what set Fox apart. It does not try to replace large transcription benchmarks like DocVQA. It tries to ask whether a model can read what the user points at and link information across pages, and most pre-2024 benchmarks simply did not test that.
Fox is small by the standards of modern multimodal benchmarks. With a few hundred pages and a few thousand instructions it cannot, on its own, prove that a model is good at document understanding. What it does well is isolate a specific behaviour that bigger suites tend to average away: the ability to focus on a chosen region rather than a whole page, and to do so reliably across multiple pages. That made it a natural fit for the wave of compression-oriented work in 2025, where researchers wanted a clean, dense, position-aware testbed for asking how few vision tokens are needed to recover the underlying text.
The paper's other contribution, the cross-vocabulary training recipe, has been copied less directly, but it has influenced how later systems combine document-specialised encoders with general image encoders. The benchmark is the part of the project that has had the longer life so far, partly because it is cheap to run and partly because no obvious replacement for region- and color-prompted OCR evaluation has appeared since.