Fox (benchmark)
Last reviewed
May 16, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 ยท 4,458 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 ยท 4,458 words
Add missing citations, update stale details, or suggest a clearer explanation.
Fox is an evaluation suite for fine-grained, multi-page document understanding by large vision-language models. It was released in May 2024 alongside a model and training pipeline of the same name in the paper Focus Anywhere for Fine-grained Multi-page Document Understanding by Chenglong Liu, Haoran Wei, and collaborators from the University of Chinese Academy of Sciences (UCAS) and MEGVII Technology. The benchmark targets a gap that the authors identify in earlier work: many document understanding systems can transcribe a whole page but cannot answer questions about a specific column, paragraph, line, or colored span, and they break down further once a query spans several pages.
The "Fox" name doubles as the project's tagline, focus anywhere, and refers to the model's ability to be steered with positional prompts (click points, bounding boxes, or color cues) toward any region of any page. The benchmark is designed to test that capability against vision-language models of any architecture, not just the authors' own model. It has since become a frequently cited probe for document-focused multimodal systems, partly because it is small enough to run in minutes and partly because no comparable suite covers region-prompted, color-guided, and cross-page tasks in one place.
By 2024, OCR and document visual question answering were dominated by single-page benchmarks such as DocVQA, ChartQA, and InfoVQA. These cover full-page transcription or short visual questions but rarely require a model to operate on a region defined by the user, and almost none of them stretch beyond a single page. The Fox authors argue that real reading is interactive: a user points at a paragraph, asks for a translation of a specific column, or compares a figure caption on page 3 with a table on page 7. Existing benchmarks did not test any of this.
The paper's introduction frames the problem in terms of two architectural traditions. Patch-based multimodal models, including UReader, TextMonkey, LLaVA-NeXT, and InternVL-V1.5, employ CLIP-style vision vocabularies at low input resolution and decompose a page into many crops. The authors argue that this approach produces thousands of image tokens, makes multi-page extension difficult, and "prevents these models from losslessly recovering the content of the original document." In parallel, the Vary line of work uses a SAM-style vocabulary that can read a 1024 by 1024 page in 256 tokens but, in its single-vocabulary form, lacks "full collaboration across multiple vision vocabularies" and is sensitive to document format. The Fox paper poses the question: "Can we devise an effective and efficient pipeline for LVLMs to achieve the fine-grained multi-page document understanding?"
Fox was introduced as the public probe for the resulting skills. The paper, posted to arXiv as 2405.14295 on 23 May 2024, is also the first publication to describe what the authors call the Fox model, a vision-language model that combines two frozen vision encoders (a CLIP-style natural-image vocabulary and a Vary-style document vocabulary) and is fine-tuned on a small amount of synthetic, position-prompted data. The benchmark data was released to Hugging Face three days later, on 26 May 2024.
The paper lists ten authors: Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. The first author is affiliated with UCAS, while the other authors are associated with MEGVII Technology, a Beijing-based computer vision company. Haoran Wei is the same researcher who would later lead General OCR Theory (GOT-OCR2.0) and is widely credited as the architect of the Vary line of vision vocabularies, which gives Fox a direct lineage to those earlier systems.
The Fox benchmark is bilingual and built from PDF pages collected from the open web. The paper reports 112 English pages and 100 Chinese pages, each with roughly a thousand or more characters and a mix of single- and multi-column layouts. The pages were sourced from e-books, the CC-MAIN corpus, and arXiv, then parsed with Python tooling to extract per-paragraph and per-line bounding boxes. A separate multi-page split groups pages into eight-page documents for the cross-page tasks. The benchmark also includes 200 rendered natural-image pages, produced by compositing Laion-COCO images onto PDF pages, that are used to evaluate the in-document figure caption task.
The OCRBench v2 survey from January 2025 summarises Fox as covering 2 scenarios (English and Chinese), 9 tasks, around 0.7k images, and around 2.2k instructions, which gives a sense of the overall annotation budget. The full Hugging Face release reports 944 image rows in a single test split and a total archive size of roughly 329 MB.
The benchmark's nine sub-tasks are summarised below. Names follow the paper.
| # | Sub-task | What it tests |
|---|---|---|
| 1 | Page-level (foreground) OCR | Full transcription of a dense, multi-column page in English or Chinese, framed as foreground focus on the page's content box |
| 2 | Region-level OCR | Transcription of an arbitrary user-drawn box, often run as multi-turn dialogue over several boxes on one page |
| 3 | Line-level OCR | Transcription of a single line selected by a click point near the left side of that line |
| 4 | Color-guided OCR | Transcription of text overlaid by a colored highlight (red, blue, or green) |
| 5 | Region-level translation | English-to-Chinese translation of the text inside a user-selected box, generated with GPT-3.5 |
| 6 | Region-level summary | Short summary of a user-selected text region (boxes filtered to text length over 400) |
| 7 | In-document figure caption | Caption for a natural image embedded in a PDF page, rendered from Laion-COCO |
| 8 | Multi-page multi-region OCR | OCR of several user-specified boxes spread across an eight-page document, all in a single query |
| 9 | Cross-page VQA | Visual question answering that requires comparing or aggregating content from more than one page |
Tasks 1 through 7 evaluate single-page focus, while tasks 8 and 9 evaluate the multi-page, format-free behaviour that gives the paper its title. The paper also introduces an in-document figure chat task that lets the user converse about an embedded image, but the public benchmark scripts focus on the nine items listed above.
The download archive on Hugging Face, focus_benchmark_test.zip, decompresses into a folder of bilingual PDF page images and a collection of JSON annotation files, one per sub-task. The file names are revealing: en_page_ocr.json, cn_page_ocr.json, en_box_ocr.json, cn_box_ocr.json, en_line_ocr.json, cn_line_ocr.json, en_onbox_ocr.json, cn_onbox_ocr.json, en_box_translation.json, en_box_summary.json, en_page_indoc_caption.json, encn-multi-8page-box-ocr.json, and encn-multi-8page-cross-vqa.json. Each record holds an image filename and a conversation with one question and one ground-truth answer.
The defining feature of Fox is that almost every task is conditioned on a position prompt that the user supplies in natural language. The paper formalises these prompts as bounding box coordinates inserted directly into the instruction string, and the format is shared across single-page and multi-page settings.
For foreground OCR the prompt looks like "Give the OCR results of the box (x_1, y_1, x_2, y_2)", where the box covers the full text area of the page. For region-level OCR the same template is used but with an arbitrary box drawn by the annotator. Line-level OCR uses a click point: "OCR the line (x, y)". Color-guided OCR replaces the explicit coordinates with a color cue: "OCR red box", "OCR blue box", or "OCR green box", and the model has to find the relevant highlight on its own.
The multi-page tasks extend this scheme by chaining page-indexed boxes into a single query: "OCR boxes on multiple pages. Page 1: (...), Page 2: (...), Page N: (...)". Cross-page VQA uses the same indexing but turns the prompt into a question, for example asking which page's box contains more characters. The result is a uniform instruction format across all nine sub-tasks, which makes the benchmark relatively easy to run on any vision-language model that can accept arbitrary text instructions, even if that model has never been trained on position prompts.
Fox reports different metrics depending on the task type. The evaluation scripts published in the GitHub repository compute the following:
| Task family | Metrics |
|---|---|
| OCR (page, region, line, color, multi-page) | F1, precision, recall, BLEU, METEOR, normalized edit distance |
| Region-level summary and figure caption | ROUGE-L F, METEOR |
| Cross-page VQA | Accuracy |
The OCR metric set is closer to traditional text-recognition evaluation than to the loose string match used by some visual question answering benchmarks. Edit distance in particular makes the score sensitive to small character errors, which matters for dense Chinese pages. The three OCR evaluation scripts in the repository are eval_ocr_test.py, eval_summary_test.py, and eval_qa_test.py, each running deterministically on the JSON output of an inference script that the user provides.
Because the test set is small and the metrics are deterministic, running the full benchmark on a new model takes minutes once inference is set up. That property is partly what has made Fox a popular smoke test for document-understanding research in 2025 and 2026.
The paper reports performance for the Fox model itself across all nine sub-tasks, and for a small set of contemporary baselines on the most directly comparable task, dense single-page OCR. The numbers below are the headline ones from the paper's tables.
All models are evaluated on the 112-page English split. Edit distance is normalized, where lower is better.
| Model | Parameters | Edit distance | F1 |
|---|---|---|---|
| LLaVA-NeXT | 34B | 0.430 | 0.647 |
| InternVL-Chat-V1.5 | 26B | 0.393 | 0.751 |
| Nougat | 250M | 0.255 | 0.745 |
| Vary | 7B | 0.092 | 0.918 |
| Vary-toy | 1.8B | 0.082 | 0.924 |
| Qwen-VL-Plus | > 100B (API) | 0.096 | 0.931 |
| Qwen-VL-Max | > 100B (API) | 0.057 | 0.964 |
| Fox | 1.8B | 0.046 | 0.952 |
Fox at 1.8B parameters matches or beats every open baseline tested and competes with the larger Qwen-VL-Max API model. The paper attributes part of the improvement to redefining the page-level task as foreground focus, which prompts the model to operate on the dense content box rather than the full image; the authors report that this reframing alone lifts the English F1 by 2.8 percentage points over Vary-toy.
On the 100-page Chinese split, Fox is again the best-performing model on both metrics, while the larger Qwen-VL-Max remains the strongest baseline.
| Model | Parameters | Edit distance | F1 |
|---|---|---|---|
| InternVL-Chat-V1.5 | 26B | 0.265 | 0.816 |
| Vary-toy | 1.8B | 0.142 | 0.914 |
| Qwen-VL-Plus | > 100B (API) | 0.121 | 0.895 |
| Vary | 7B | 0.113 | 0.952 |
| Qwen-VL-Max | > 100B (API) | 0.091 | 0.931 |
| Fox | 1.8B | 0.061 | 0.954 |
The gap on Chinese pages is larger because most patch-based models, including LLaVA-NeXT and InternVL-V1.5, were not trained on Chinese-dense documents at the time of the paper. The Vary-style document vocabulary, which Fox inherits, is the main reason that the 1.8B model can read multi-column Chinese pages at all.
For the region, line, and color-guided OCR tasks, the paper only reports the Fox model's own numbers. The same metrics are used as for page OCR.
| Setting | Edit distance | F1 | BLEU |
|---|---|---|---|
| English color-guided | 0.064 | 0.940 | 0.868 |
| English region | 0.059 | 0.957 | 0.914 |
| English line | 0.116 | 0.879 | 0.845 |
| Chinese color-guided | 0.114 | 0.884 | 0.778 |
| Chinese region | 0.042 | 0.955 | 0.885 |
| Chinese line | 0.084 | 0.918 | 0.825 |
Line-level OCR is the hardest of the three because the model has to localise a single line from a click point with no bounding box, while region OCR (where the box is given) is the easiest. Color-guided OCR sits in between because the model has to identify the colored span before it can transcribe it.
For the non-OCR tasks the paper reports a single number per metric, again only for the Fox model.
| Task | Metric | Score |
|---|---|---|
| Region-level translation | BLEU | 0.138 |
| Region-level translation | METEOR | 0.366 |
| Region-level summary | ROUGE-L F | 0.282 |
| In-document figure caption | METEOR | 0.359 |
| In-document figure caption | ROUGE-L F | 0.396 |
These numbers should be read as anchors rather than ceilings. The reference translations and summaries were themselves generated by GPT-3.5 on long in-box text, so the gold standard is itself a model output, and the absolute scale of the scores is therefore harder to interpret than the OCR numbers.
For the eight-page setting the paper reports a single column for each task.
| Task | Edit distance | F1 | BLEU | METEOR | Accuracy |
|---|---|---|---|---|---|
| Multi-region OCR | 0.084 | 0.946 | 0.836 | 0.805 | not applicable |
| Cross-page VQA | not applicable | not applicable | not applicable | not applicable | 0.827 |
The multi-region OCR score is close to single-page region OCR, which suggests that the multi-page extension does not severely degrade Fox's region-level fidelity. The cross-page VQA accuracy of 0.827 is best interpreted with the task design in mind: most questions ask the model to compare or aggregate properties across pages, and a baseline that always answers "page 1" would do poorly on those.
The Fox paper is unusual in that the benchmark and the proposed model arrive together. The model side is a multimodal pipeline that wires two frozen vision encoders into one language model. A CLIP-ViT branch covers natural images at 224 by 224 input resolution, and a Vary-style ViT (close to the SAM image encoder, trained for document text) compresses a 1024 by 1024 page into 256 image tokens. Both branches emit 256 tokens of dimension 1024; two linear projections, denoted W^C and W^S, lift them into the language space and concatenate to give 256 tokens of dimension 2048 per page. Both vision encoders stay frozen during all training; only the projection layers and the language model receive gradient updates. The choice keeps the document vocabulary's strong text-recognition prior intact while letting the language side learn to switch between visual modes.
The language model is Qwen-1.8B (see Qwen), chosen for what the paper calls its "rich linguistic vocabulary" in both Chinese and English. A single eight-page document therefore feeds 8 by 256, or 2048, image tokens into the language model, which still fits comfortably in a long-context decoder.
The key trick in the paper is what the authors call cross-vocabulary data. Instead of training on either pure document images or pure natural images, they synthesise hybrid pages by rendering natural images directly onto PDF pages and removing any underlying text that would overlap. The rendering scaffolding is specific: natural images are scaled to between roughly 0.3 and 0.9 of the page width or 0.4 and 0.9 of the page height (depending on aspect ratio), placed at a random location, and then any vanilla text box whose intersection over union with the natural image is non-zero is painted white before the image is drawn. They also paint random text spans in red, blue, and green to teach the model how to follow color cues. The claim is that the CLIP branch already knows how to recognise colors but cannot read dense text, while the Vary-tiny branch can read dense text but is colour-blind, so only by training on hybrid examples can both branches contribute at the same time.
The pre-training corpus, assembled from this recipe, totals roughly 9.8 million samples. The reported composition is: 4.6 million region-level document understanding examples (with point, box, and color prompts), 558 thousand in-document caption pairs from BLIP558K rendered onto PDFs, 22 thousand in-document chat conversations from RegionChat, around 1 million layout analysis examples mixing PubLayNet ground truth with pseudo annotations generated by PaddleOCR, 800 thousand multi-page documents, and around 1 million natural-image and pure-text examples to maintain general conversational ability. For the supervised fine-tuning phase the authors sample 10 thousand image-text pairs per task type, rewrite each prompt ten times using GPT-3.5 to diversify the surface form, and add LLaVA80K to round out instruction following.
The training run uses AdamW with cosine annealing, a learning rate of 1e-4 during pre-training and 2e-5 during fine-tuning, on 48 NVIDIA A800 GPUs with per-device batch size 4. The whole dataset is run for a single epoch in each phase. The architecture is the reason the multi-page tasks are tractable at all on a 1.8B language model, since the 2048-token budget for an eight-page document is roughly the size of a long single-page context for many patch-based systems.
The code, evaluation scripts, and pre-trained Fox model weights are published on GitHub under ucaslcl/Fox with an Apache 2.0 license for code and CC BY-NC 4.0 for the data. The licensing note in the README explicitly says that the data terms are inherited from the Vary and Opt projects, and that the dataset is intended for research use only. The benchmark images are distributed as a single archive (focus_benchmark_test.zip, about 329 MB) on Hugging Face under ucaslcl/Fox_benchmark_data, with the dataset itself released under CC BY-NC-SA 4.0.
The repository's README provides a template Python inference script and three evaluation scripts, one per metric family, that consume the JSON outputs of any model and produce the headline numbers used in the paper. Because the test set is small and the metrics are deterministic, running the full benchmark on a new model takes minutes once inference is set up, which has helped Fox become a quick smoke test for new document-understanding models.
Fox has been picked up by several follow-up systems and surveys. Some treat it as a leaderboard target, others as a fixed dense-text testbed for ablation studies, and a few cite it mainly as evidence that region- and color-prompted OCR is a meaningful task in its own right.
| Project | How Fox is used |
|---|---|
| DeepSeek-OCR (October 2025) | Uses the English document portion of Fox as the testbed for its optical context compression study, selecting the 100 pages that fall in the 600 to 1300 token range |
| OCRBench v2 (January 2025) | Cites Fox in its overview of existing text-centric benchmarks (2 scenarios, 9 tasks, around 0.7k images) for comparison with its own broader 31-scenario suite |
| General OCR Theory / GOT-OCR (September 2024) | References Fox in its discussion of dense, position-aware OCR evaluation; GOT shares first author Haoran Wei with Fox |
| ECLAIR (February 2025) | Cites Fox among the prior benchmarks that motivate layout-aware reading-order extraction |
The DeepSeek-OCR usage is the most quantitative, and is largely responsible for the renewed attention that Fox received during 2025 and 2026. In their compression study the authors fix a model size and vary the number of vision tokens used to encode each page, then ask how accurately the original text can be decoded as a function of the compression ratio (text tokens divided by vision tokens). They evaluate this on the 100-page English subset of Fox where the text length falls between 600 and 1300 tokens, which makes the compression ratio cleanly tunable.
The DeepSeek-OCR paper reports two operating points on this subset. With its Tiny mode at 64 vision tokens per page, OCR precision is 96.5 percent for pages of 600 to 700 text tokens (a 10.5x compression ratio), 93.8 percent at 700 to 800 tokens (11.8x), and 59.1 percent at 1200 to 1300 tokens (19.7x). With its Small mode at 100 vision tokens per page, precision is 98.5 percent at 6.7x, 96.8 percent at 9.7x, and 87.1 percent at 12.6x. The summary headline from the paper, quoted widely in subsequent coverage, is that decoding precision stays around 97 percent at compression ratios below 10x, drops to roughly 90 percent in the 10x to 12x range, and falls to about 60 percent at 20x compression. These numbers became one of the most cited results from the DeepSeek-OCR paper and are largely responsible for the renewed interest in Fox during 2025 and 2026.
OCRBench v2 takes a different angle on Fox. Its main contribution is to position Fox alongside more general text-centric suites and to summarise it in a comparison table: 2 scenarios, 9 tasks, 0.7k images, 2.2k instructions. The OCRBench v2 authors then argue that their own benchmark, with 31 scenarios, 23 tasks, 9,500 images, and 10,000 instructions, fills a different niche by going broader at the cost of being less focused on dense bilingual pages.
Fox sits in a corner of the document-understanding landscape that few earlier benchmarks covered. The table below compares it on a few rough axes; the row for Fox is sourced from the original paper and the OCRBench v2 summary table.
| Benchmark | Region-level prompts | Multi-page | Bilingual | Notes |
|---|---|---|---|---|
| DocVQA | No | No | English | Question answering over single document images |
| ChartQA | No | No | English | Question answering over charts |
| InfoVQA | No | No | English | Question answering over infographics |
| OCRBench (v1) | Limited | No | English plus some Chinese | Aggregated short OCR and KIE tasks |
| OCRBench v2 | Yes (referring, grounding) | No | English and Chinese | 31 scenarios, 23 tasks, 9.5k images |
| MMDocBench | Limited | Some | English | Multi-task document understanding for LVLMs |
| MMLongBench-Doc | No | Yes | English | Long-document QA with figures and charts |
| Fox | Yes (points, boxes, colors) | Yes (eight-page) | English and Chinese | Nine fine-grained tasks, region- and color-guided OCR, cross-page VQA |
The fine-grained focus tasks and the cross-page split are what set Fox apart. It does not try to replace large transcription benchmarks like DocVQA. It tries to ask whether a model can read what the user points at, follow a color cue rather than a coordinate, and link information across pages, and most pre-2024 benchmarks simply did not test that. OCRBench v2, released in January 2025, is the closest competitor in spirit; it copies the bilingual emphasis and adds explicit visual text localization tasks, but it does not include a multi-page split comparable to Fox's eight-page documents.
The Fox authors note several limitations in their concluding section. The most explicit one is the low resolution of the CLIP branch, which is fed at 224 by 224 pixels and therefore cannot resolve fine text on its own. The benchmark is also small (a few hundred pages, a few thousand instructions), so it can reveal which models can or cannot perform region-prompted reading but cannot, on its own, support a broad capability claim about document understanding. Both authors and outside reviewers have noted that the GPT-3.5 generated translation and summary references mean those particular tasks have a synthetic gold standard, so absolute scores in BLEU, METEOR, or ROUGE for those tasks should not be compared too tightly across systems with different normalisation behaviour.
A further structural limitation is bilinguality without truly multilingual coverage. The benchmark contains only English and Chinese pages, so it does not exercise scripts such as Arabic, Devanagari, or Cyrillic. This is mentioned in passing in OCRBench v2 as one of the motivations for that suite's broader scenario set. Finally, the multi-page split fixes the page count at eight, which is convenient for token-budget reasons on a 1.8B model but does not test much longer documents like the 50-page reports that drive industrial document AI.
Fox is small by the standards of modern multimodal benchmarks. With a few hundred pages and a few thousand instructions it cannot, on its own, prove that a model is good at document understanding. What it does well is isolate a specific behaviour that bigger suites tend to average away: the ability to focus on a chosen region rather than a whole page, and to do so reliably across multiple pages. That made it a natural fit for the wave of compression-oriented work in 2025, where researchers wanted a clean, dense, position-aware testbed for asking how few vision tokens are needed to recover the underlying text.
The paper's other contribution, the cross-vocabulary training recipe based on hybrid synthetic data, has been less directly copied but it has influenced how later systems combine document-specialised encoders with general image encoders. The benchmark is the part of the project that has had the longer life so far, partly because it is cheap to run and partly because there has been no obvious replacement for region- and color-prompted OCR evaluation since.
A secondary effect of Fox has been to make the foreground focus framing more common in subsequent dense-OCR papers. The reframing, in which a page OCR query is delivered as a bounding box prompt over the page's content area rather than as a global "read this image" instruction, gave the authors a measurable 2.8 percentage point improvement in F1 on the English split and has since been picked up implicitly by several follow-ups that train on box-prompted OCR data.