Humanities Data Benchmark
Welcome to the Humanities Data Benchmark report. This page provides an overview of all benchmark tests, results, and comparisons.
Leaderboard
The table below shows the global average performance, cost efficiency, and time efficiency of each model across the four core benchmarks: bibliographic_data, fraktur, metadata_extraction, and zettelkatalog.
The Model and Provider columns identify each AI system. Global Average is the mean performance score across all four benchmarks (higher is better). Cost/Point and Time/Point are normalized efficiency metrics: cost (or time) per performance point is calculated per test, averaged within each benchmark, and then averaged across benchmarks. This multi-level normalization accounts for differing numbers of items, test configurations, and benchmark scales. For both efficiency metrics, lower values are better, indicating less cost or time needed per performance point achieved. The four benchmark-specific columns show the average performance on each individual benchmark. Only models with results in all four benchmarks are included.
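The multi-level averaging described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code; the field names and toy numbers are invented.

```python
# Sketch of the multi-level normalization: per test -> per benchmark -> global.
# Records and numbers below are illustrative only.
from statistics import mean

# Per-test records for one model: (benchmark, score, cost_usd, time_s)
tests = [
    ("bibliographic_data", 0.70, 0.030, 120.0),
    ("bibliographic_data", 0.60, 0.020, 100.0),
    ("fraktur", 0.80, 0.100, 240.0),
]

def benchmark_means(per_test_value):
    """Compute a value per test, then average it within each benchmark."""
    per_bench = {}
    for bench, score, cost, time_s in tests:
        per_bench.setdefault(bench, []).append(per_test_value(score, cost, time_s))
    return [mean(values) for values in per_bench.values()]

# Global figures are the mean of the per-benchmark means.
global_average = mean(benchmark_means(lambda s, c, t: s))
cost_per_point = mean(benchmark_means(lambda s, c, t: c / s))  # lower is better
time_per_point = mean(benchmark_means(lambda s, c, t: t / s))  # lower is better
```

Because each benchmark contributes one value to the final mean regardless of how many tests it contains, a benchmark with many cheap tests cannot dominate the global efficiency figures.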
Model | Provider | Global Average | Cost/Point | Time/Point | bibliographic_data | fraktur | metadata_extraction | zettelkatalog |
---|---|---|---|---|---|---|---|---|
gemini-2.5-pro | Google | 0.765 | $0.3015 | 37.65s | | | | |
gemini-2.5-flash | Google | 0.752 | $0.0718 | 34.45s | | | | |
gemini-2.5-flash-preview-09-2025 | Google | 0.746 | $0.1444 | 22.32s | | | | |
gemini-2.0-flash | Google | 0.678 | $0.0310 | 11.24s | | | | |
claude-3-7-sonnet-20250219 | Anthropic | 0.650 | $1.1037 | 29.21s | | | | |
gemini-2.0-flash-lite | Google | 0.649 | $0.0260 | 16.70s | | | | |
gpt-4.1 | OpenAI | 0.646 | $0.9358 | 58.83s | | | | |
gpt-5-mini | OpenAI | 0.621 | $0.5016 | 91.83s | | | | |
gpt-4o | OpenAI | 0.617 | $0.5720 | 88.90s | | | | |
gpt-5 | OpenAI | 0.607 | $2.8010 | 245.62s | | | | |
claude-3-5-sonnet-20241022 | Anthropic | 0.607 | $1.0678 | 26.77s | | | | |
mistral-large-latest | Mistral AI | 0.602 | $0.5405 | 34.47s | | | | |
claude-opus-4-1-20250805 | Anthropic | 0.598 | $5.7946 | 46.39s | | | | |
mistral-medium-2505 | Mistral AI | 0.591 | $0.1271 | 33.54s | | | | |
gemini-2.5-flash-lite | Google | 0.585 | $0.0435 | 6.09s | | | | |
o3 | OpenAI | 0.574 | $1.2242 | 174.26s | | | | |
gemini-2.5-flash-lite-preview-09-2025 | Google | 0.570 | $0.1258 | 6.99s | | | | |
mistral-medium-2508 | Mistral AI | 0.560 | $0.1311 | 84.27s | | | | |
claude-opus-4-20250514 | Anthropic | 0.548 | $6.1998 | 64.52s | | | | |
claude-sonnet-4-20250514 | Anthropic | 0.544 | $1.2451 | 55.68s | | | | |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | 0.516 | $0.2231 | 59.23s | | | | |
pixtral-large-latest | Mistral AI | 0.507 | $1.1231 | 40.36s | | | | |
gpt-4.1-mini | OpenAI | 0.500 | $0.1127 | 131.83s | | | | |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | 0.487 | $0.0712 | 1633.78s | | | | |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | 0.472 | $0.1840 | 176.98s | | | | |
gpt-5-nano | OpenAI | 0.466 | $0.3687 | 636.97s | | | | |
claude-3-opus-20240229 | Anthropic | 0.398 | $7.5173 | 89.44s | | | | |
gpt-4.1-nano | OpenAI | 0.371 | $0.0218 | 24.73s | | | | |
gpt-4o-mini | OpenAI | 0.328 | $0.4949 | 4210.65s | | | | |
pixtral-12b | Mistral AI | 0.323 | $0.0952 | 99.43s | | | | |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | 0.314 | $0.0000 | 406.20s | | | | |
claude-sonnet-4-5-20250929 | Anthropic | 0.307 | $2.1789 | 8.28s | | | | |
x-ai/grok-4 | xAI (via OpenRouter) | 0.263 | $25.0504 | 1464.25s | | | | |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | 0.219 | $0.7975 | 127.24s | | | | |
The radar chart on this page shows the performance distribution of top models across the four core benchmarks.

Latest Benchmark Results
The tables below show detailed results for each benchmark; each row represents a single test configuration run on its most recent date. The Model and Provider columns identify the AI system used. Each test has a unique Test ID and a Date of its most recent execution. The Prompt and Rules columns indicate the configuration used. Results is the performance score (fuzzy match for bibliographic_data and fraktur, F1-micro for metadata_extraction and zettelkatalog; higher is better). Cost (USD) is the total cost of processing all items in the test. Cost/Point shows cost efficiency in dollars per performance point (lower is better). Test Time (s) is the total execution time for all items, and Time/Point shows time efficiency in seconds per performance point (lower is better).
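The two per-test efficiency columns are plain ratios of a row's totals to its score. A minimal sketch, using invented numbers rather than values from any actual row:

```python
# Illustrative per-test efficiency calculation; all values are made up.
results_score = 0.80   # Results: fuzzy-match or F1-micro score in [0, 1]
cost_usd = 0.40        # Cost (USD): total cost for all items in the test
test_time_s = 120.0    # Test Time (s): total execution time for all items

cost_per_point = cost_usd / results_score      # $/performance point, lower is better
time_per_point = test_time_s / results_score   # s/performance point, lower is better
print(cost_per_point, time_per_point)  # 0.5 150.0
```

This also explains the N/A entries below: when a test produced no usable score, neither ratio can be computed.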
bibliographic_data
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.5-flash-preview-09-2025 | Google | T0219 | 2025-10-01 | prompt.txt | None | | $0.0307 | $0.0437 | 116.65 | 166.10 |
gpt-5 | OpenAI | T0129 | 2025-10-01 | prompt.txt | None | | $0.3421 | $0.4992 | 591.90 | 863.65 |
gpt-5-mini | OpenAI | T0130 | 2025-10-01 | prompt.txt | None | | $0.0582 | $0.0860 | 411.12 | 607.56 |
gemini-2.5-flash | Google | T0195 | 2025-09-30 | prompt.txt | None | | $0.0252 | $0.0376 | 195.82 | 292.59 |
claude-sonnet-4-20250514 | Anthropic | T0107 | 2025-09-30 | prompt.txt | None | | $0.1692 | $0.2531 | 127.79 | 191.16 |
o3 | OpenAI | T0133 | 2025-10-01 | prompt.txt | None | | $0.1885 | $0.2827 | 391.04 | 586.48 |
gemini-2.5-pro | Google | T0128 | 2025-09-30 | prompt.txt | None | | $0.1032 | $0.1554 | 227.18 | 342.25 |
mistral-medium-2505 | Mistral AI | T0170 | 2025-10-01 | prompt.txt | None | | $0.0222 | $0.0336 | 128.32 | 194.01 |
gpt-4.1 | OpenAI | T0139 | 2025-10-01 | prompt.txt | None | | $0.0952 | $0.1449 | 298.94 | 455.07 |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0233 | 2025-10-17 | prompt.txt | None | | $0.1268 | $0.1931 | 923.12 | 1405.84 |
mistral-medium-2508 | Mistral AI | T0169 | 2025-10-01 | prompt.txt | None | | $0.0220 | $0.0336 | 112.70 | 172.31 |
claude-3-5-sonnet-20241022 | Anthropic | T0009 | 2025-09-30 | prompt.txt | None | | $0.1682 | $0.2576 | 124.19 | 190.17 |
gpt-4o | OpenAI | T0007 | 2025-09-30 | prompt.txt | None | | $0.1136 | $0.1748 | 350.22 | 538.95 |
claude-3-7-sonnet-20250219 | Anthropic | T0031 | 2025-09-30 | prompt.txt | None | | $0.1765 | $0.2720 | 136.48 | 210.38 |
gpt-4.1-mini | OpenAI | T0140 | 2025-10-01 | prompt.txt | None | | $0.0199 | $0.0307 | 164.93 | 254.41 |
mistral-large-latest | Mistral AI | T0187 | 2025-10-01 | prompt.txt | None | | $0.0805 | $0.1259 | 136.28 | 213.12 |
claude-opus-4-1-20250805 | Anthropic | T0127 | 2025-09-30 | prompt.txt | None | | $0.9735 | $1.5435 | 203.32 | 322.38 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0234 | 2025-10-17 | prompt.txt | None | | $0.0062 | $0.0099 | 151.02 | 241.35 |
gemini-2.0-flash | Google | T0008 | 2025-09-30 | prompt.txt | None | | $0.0052 | $0.0087 | 69.66 | 115.32 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0259 | 2025-10-20 | prompt.txt | None | | $0.0050 | $0.0084 | 115.49 | 193.02 |
gpt-5-nano | OpenAI | T0131 | 2025-10-01 | prompt.txt | None | | $0.0281 | $0.0476 | 401.62 | 681.07 |
claude-opus-4-20250514 | Anthropic | T0106 | 2025-09-30 | prompt.txt | None | | $0.8992 | $1.5413 | 193.49 | 331.67 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0211 | 2025-10-01 | prompt.txt | None | | $0.0048 | $0.0083 | 18.69 | 32.28 |
gemini-2.5-flash-lite | Google | T0203 | 2025-10-01 | prompt.txt | None | | $0.0039 | $0.0072 | 19.33 | 35.50 |
pixtral-large-latest | Mistral AI | T0035 | 2025-09-30 | prompt.txt | None | | $0.1079 | $0.2123 | 199.57 | 392.57 |
gpt-4o-mini | OpenAI | T0027 | 2025-09-30 | prompt.txt | None | | $0.0261 | $0.0526 | 233.18 | 468.66 |
gemini-2.0-flash-lite | Google | T0033 | 2025-09-30 | prompt.txt | None | | $0.0055 | $0.0140 | 90.14 | 230.26 |
claude-3-opus-20240229 | Anthropic | T0138 | 2025-10-01 | prompt.txt | None | | $0.6830 | $1.8417 | 237.58 | 640.62 |
gpt-4.1-nano | OpenAI | T0141 | 2025-10-01 | prompt.txt | None | | $0.0044 | $0.0139 | 105.83 | 330.56 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0253 | 2025-10-20 | prompt.txt | None | | $0.0160 | $0.0518 | 516.04 | 1674.49 |
x-ai/grok-4 | xAI (via OpenRouter) | T0265 | 2025-10-20 | prompt.txt | None | | $1.2925 | $4.8391 | 2024.03 | 7577.84 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0237 | 2025-10-17 | prompt.txt | None | | $0.0000 | $0.0000 | 361.78 | 1560.25 |
gpt-4.5-preview | OpenAI | T0026 | 2025-04-08 | prompt.txt | None | | N/A | N/A | 480.16 | 2367.38 |
pixtral-12b | Mistral AI | T0181 | 2025-10-01 | prompt.txt | None | | $0.0036 | $0.0197 | 38.06 | 207.04 |
gemini-1.5-pro | Google | T0030 | 2025-04-08 | prompt.txt | None | | N/A | N/A | 88.55 | 763.13 |
gemini-1.5-flash | Google | T0029 | 2025-04-08 | prompt.txt | None | | N/A | N/A | 62.39 | 586.44 |
claude-sonnet-4-5-20250929 | Anthropic | T0225 | 2025-10-01 | prompt.txt | None | | $0.1636 | N/A | 121.63 | N/A |
blacklist
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
company_lists
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
fraktur
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
gemini-exp-1206 | Google | T0087 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 875.69 | 908.39 |
gemini-2.5-pro | Google | T0132 | 2025-10-01 | prompt_optimized.txt | None | | $0.1068 | $0.1112 | 244.01 | 254.18 |
gemini-2.5-pro-exp-03-25 | Google | T0022 | 2025-05-09 | prompt.txt | None | | N/A | N/A | 807.86 | 855.79 |
gemini-2.5-pro-exp-03-25 | Google | T0080 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 799.79 | 847.24 |
gemini-2.0-pro-exp-02-05 | Google | T0091 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 855.41 | 906.15 |
gemini-2.5-pro-preview-05-06 | Google | T0097 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 891.10 | 964.40 |
gemini-2.5-flash | Google | T0199 | 2025-09-30 | prompt_optimized.txt | None | | $0.0253 | $0.0276 | 243.38 | 265.12 |
gemini-2.5-flash-preview-09-2025 | Google | T0223 | 2025-10-01 | prompt_optimized.txt | None | | $0.0287 | $0.0329 | 157.75 | 180.91 |
gemini-2.0-flash-lite | Google | T0090 | 2025-09-30 | prompt_optimized.txt | None | | $0.0034 | $0.0041 | 60.48 | 73.40 |
gemini-2.0-flash | Google | T0086 | 2025-09-30 | prompt_optimized.txt | None | | $0.0044 | $0.0060 | 55.03 | 75.39 |
claude-3-7-sonnet-20250219 | Anthropic | T0092 | 2025-09-30 | prompt_optimized.txt | None | | $0.1777 | $0.2605 | 196.58 | 288.25 |
gemini-1.5-pro | Google | T0089 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 123.51 | 192.99 |
gemini-2.5-flash-lite | Google | T0207 | 2025-10-01 | prompt_optimized.txt | None | | $0.0045 | $0.0072 | 37.23 | 59.10 |
gemini-2.5-flash-preview-04-17 | Google | T0096 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 223.59 | 370.18 |
gpt-4.1 | OpenAI | T0083 | 2025-09-30 | prompt_optimized.txt | None | | $0.0776 | $0.1362 | 224.89 | 394.54 |
claude-opus-4-1-20250805 | Anthropic | T0123 | 2025-09-30 | prompt_optimized.txt | None | | $0.9371 | $1.6615 | 289.28 | 512.90 |
gemini-1.5-flash | Google | T0088 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 81.93 | 148.43 |
mistral-large-latest | Mistral AI | T0191 | 2025-10-01 | prompt_optimized.txt | None | | $0.0683 | $0.1329 | 217.88 | 423.88 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0215 | 2025-10-01 | prompt_optimized.txt | None | | $0.0058 | $0.0114 | 33.93 | 66.79 |
claude-3-5-sonnet-20241022 | Anthropic | T0093 | 2025-09-30 | prompt_optimized.txt | None | | $0.1305 | $0.2632 | 133.05 | 268.26 |
gpt-4o | OpenAI | T0079 | 2025-09-30 | prompt_optimized.txt | None | | $0.0790 | $0.1659 | 516.44 | 1084.96 |
gpt-5-mini | OpenAI | T0121 | 2025-09-30 | prompt_optimized.txt | None | | $0.0353 | $0.0746 | 243.61 | 513.94 |
claude-opus-4-20250514 | Anthropic | T0098 | 2025-09-30 | prompt_optimized.txt | None | | $0.9807 | $2.1135 | 392.86 | 846.67 |
mistral-medium-2505 | Mistral AI | T0178 | 2025-10-01 | prompt_optimized.txt | None | | $0.0211 | $0.0490 | 183.12 | 425.87 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0257 | 2025-10-20 | prompt_optimized.txt | None | | $0.0174 | $0.0413 | 593.01 | 1405.23 |
pixtral-large-latest | Mistral AI | T0095 | 2025-09-30 | prompt_optimized.txt | None | | $0.0861 | $0.2121 | 137.85 | 339.53 |
claude-sonnet-4-20250514 | Anthropic | T0099 | 2025-09-30 | prompt_optimized.txt | None | | $0.2083 | $0.5820 | 298.19 | 832.94 |
mistral-medium-2508 | Mistral AI | T0177 | 2025-10-01 | prompt_optimized.txt | None | | $0.0213 | $0.0642 | 484.57 | 1459.55 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0251 | 2025-10-17 | prompt_optimized.txt | None | | $0.0059 | $0.0196 | 112.89 | 376.30 |
gpt-4o-mini | OpenAI | T0082 | 2025-09-30 | prompt_optimized.txt | None | | $0.0223 | $0.0834 | 110.91 | 413.85 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0241 | 2025-10-17 | prompt_optimized.txt | None | | $0.0000 | $0.0000 | 675.50 | 2659.44 |
claude-3-opus-20240229 | Anthropic | T0094 | 2025-09-30 | prompt_optimized.txt | None | | $0.6288 | $2.8326 | 214.63 | 966.82 |
pixtral-12b | Mistral AI | T0185 | 2025-10-01 | prompt_optimized.txt | None | | $0.0037 | $0.0171 | 376.97 | 1729.20 |
gpt-4.5-preview | OpenAI | T0081 | 2025-05-09 | prompt_optimized.txt | None | | N/A | N/A | 224.02 | 1349.51 |
gpt-5 | OpenAI | T0120 | 2025-09-30 | prompt_optimized.txt | None | | $0.2036 | $1.3394 | 493.97 | 3249.77 |
o3 | OpenAI | T0137 | 2025-10-01 | prompt_optimized.txt | None | | $0.1485 | $1.0606 | 357.90 | 2556.41 |
gpt-4.1-mini | OpenAI | T0084 | 2025-09-30 | prompt_optimized.txt | None | | $0.0106 | $0.2306 | 74.24 | 1613.84 |
gpt-5-nano | OpenAI | T0122 | 2025-09-30 | prompt_optimized.txt | None | | $0.0054 | $0.8930 | 70.61 | 11768.32 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0263 | 2025-10-20 | prompt_optimized.txt | None | | $0.0012 | $0.1938 | 194.63 | 32438.90 |
gpt-4.1-nano | OpenAI | T0085 | 2025-09-30 | prompt_optimized.txt | None | | $0.0013 | N/A | 11.57 | N/A |
claude-sonnet-4-5-20250929 | Anthropic | T0229 | 2025-10-01 | prompt_optimized.txt | None | | $0.2178 | N/A | 301.59 | N/A |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0246 | 2025-10-17 | prompt_optimized.txt | None | | $0.0048 | N/A | 47.09 | N/A |
x-ai/grok-4 | xAI (via OpenRouter) | T0269 | 2025-10-20 | prompt_optimized.txt | None | | $0.6505 | N/A | 2896.21 | N/A |
medieval_manuscripts
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
claude-3-7-sonnet-20250219 | Anthropic | T0274 | 2025-10-21 | prompt.txt | None | | $0.0573 | $0.0797 | 43.50 | 60.50 |
gpt-4.1 | OpenAI | T0273 | 2025-10-21 | prompt.txt | None | | $0.0251 | $0.0364 | 54.42 | 78.98 |
gemini-2.5-flash | Google | T0271 | 2025-10-21 | prompt.txt | None | | $0.0044 | $0.0065 | 48.13 | 71.30 |
gemini-2.5-pro | Google | T0272 | 2025-10-21 | prompt.txt | None | | $0.0181 | $0.0268 | 112.99 | 167.39 |
metadata_extraction
Model | Provider | Test ID | Date | Prompt | Rules | Results | Cost (USD) | Cost/Point | Test Time (s) | Time/Point |
---|---|---|---|---|---|---|---|---|---|---|
gpt-5 | OpenAI | T0109 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.4823 | $0.6183 | 1030.30 | 1320.90 |
o3 | OpenAI | T0135 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.1855 | $0.2541 | 394.51 | 540.43 |
gpt-5 | OpenAI | T0108 | 2025-09-30 | prompt.txt | None | | $1.2982 | $1.8285 | 2922.29 | 4115.90 |
gpt-5 | OpenAI | T0110 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.7651 | $1.1419 | 1711.52 | 2554.51 |
o3 | OpenAI | T0134 | 2025-10-01 | prompt.txt | None | | $0.5053 | $0.8021 | 1038.49 | 1648.39 |
gemini-2.0-flash-lite | Google | T0056 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0042 | $0.0066 | 45.77 | 72.65 |
gemini-2.5-flash | Google | T0197 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0087 | $0.0139 | 195.57 | 310.43 |
gemini-2.5-pro | Google | T0125 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0426 | $0.0687 | 203.80 | 328.71 |
gpt-4.5-preview | OpenAI | T0011 | 2025-04-11 | prompt.txt | None | | N/A | N/A | 960.76 | 1575.02 |
gemini-2.0-flash | Google | T0044 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0055 | $0.0090 | 51.72 | 84.78 |
gpt-4.5-preview | OpenAI | T0040 | 2025-04-11 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | N/A | N/A | 373.44 | 622.41 |
gemini-2.5-flash-preview-09-2025 | Google | T0221 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0108 | $0.0180 | 112.91 | 188.19 |
gpt-4.5-preview | OpenAI | T0041 | 2025-04-11 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | N/A | N/A | 526.13 | 876.89 |
o3 | OpenAI | T0136 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.3052 | $0.5087 | 624.76 | 1041.27 |
mistral-medium-2505 | Mistral AI | T0174 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0231 | $0.0392 | 69.80 | 118.30 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0261 | 2025-10-20 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0046 | $0.0077 | 62.40 | 105.76 |
gemini-exp-1206 | Google | T0014 | 2025-04-11 | prompt.txt | None | | N/A | N/A | 894.64 | 1542.48 |
gemini-2.5-pro-exp-03-25 | Google | T0019 | 2025-04-01 | prompt.txt | None | | N/A | N/A | 877.02 | 1512.10 |
gpt-5-mini | OpenAI | T0112 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0622 | $0.1072 | 413.46 | 712.86 |
gemini-2.0-pro-exp-02-05 | Google | T0021 | 2025-04-01 | prompt.txt | None | | N/A | N/A | 856.03 | 1501.80 |
gpt-5-nano | OpenAI | T0115 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0299 | $0.0524 | 429.53 | 753.56 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0239 | 2025-10-17 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0000 | $0.0000 | 22.11 | 38.79 |
gpt-4.1-mini | OpenAI | T0070 | 2025-09-30 | prompt.txt | None | | $0.0630 | $0.1124 | 152.75 | 272.78 |
gpt-4o-mini | OpenAI | T0012 | 2025-09-30 | prompt.txt | None | | $0.3814 | $0.6935 | 211.44 | 384.43 |
gemini-2.5-pro | Google | T0124 | 2025-09-30 | prompt.txt | None | | $0.1057 | $0.1921 | 493.26 | 896.84 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0213 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0026 | $0.0048 | 20.70 | 37.64 |
gemini-2.0-flash-lite | Google | T0020 | 2025-09-30 | prompt.txt | None | | $0.0089 | $0.0165 | 105.24 | 194.88 |
gpt-4o-mini | OpenAI | T0076 | 2025-09-30 | prompt.txt | None | | $0.3815 | $0.7064 | 217.37 | 402.54 |
gpt-5-mini | OpenAI | T0111 | 2025-09-30 | prompt.txt | None | | $0.1486 | $0.2752 | 1069.71 | 1980.95 |
gemini-2.5-flash | Google | T0196 | 2025-09-30 | prompt.txt | None | | $0.0217 | $0.0403 | 439.55 | 813.98 |
gemini-2.5-flash-preview-09-2025 | Google | T0220 | 2025-10-01 | prompt.txt | None | | $0.0427 | $0.0791 | 307.46 | 569.36 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0260 | 2025-10-20 | prompt.txt | None | | $0.0117 | $0.0216 | 131.71 | 243.91 |
gpt-4.1-mini | OpenAI | T0071 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0238 | $0.0440 | 75.20 | 139.26 |
mistral-large-latest | Mistral AI | T0189 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.1082 | $0.2004 | 64.28 | 119.03 |
gpt-4.1-mini | OpenAI | T0072 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0393 | $0.0727 | 86.89 | 160.91 |
gpt-4o | OpenAI | T0010 | 2025-09-30 | prompt.txt | None | | $0.2844 | $0.5367 | 432.47 | 815.98 |
gpt-4o-mini | OpenAI | T0042 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.1368 | $0.2580 | 86.90 | 163.97 |
gpt-4o-mini | OpenAI | T0077 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.1368 | $0.2580 | 84.87 | 160.14 |
gemini-2.0-flash-lite | Google | T0057 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0047 | $0.0089 | 55.63 | 104.95 |
gemini-2.5-pro | Google | T0126 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0617 | $0.1164 | 275.31 | 519.45 |
gemini-2.0-flash | Google | T0013 | 2025-09-30 | prompt.txt | None | | $0.0118 | $0.0227 | 121.45 | 233.56 |
gpt-4o | OpenAI | T0039 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.1756 | $0.3376 | 358.87 | 690.14 |
gpt-4o-mini | OpenAI | T0043 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.2447 | $0.4706 | 120.99 | 232.67 |
gemini-2.5-flash | Google | T0198 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0130 | $0.0250 | 196.67 | 378.22 |
gpt-4.1 | OpenAI | T0067 | 2025-09-30 | prompt.txt | None | | $0.2347 | $0.4603 | 229.72 | 450.42 |
mistral-medium-2508 | Mistral AI | T0173 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0232 | $0.0454 | 66.29 | 129.99 |
gemini-2.5-flash-lite | Google | T0205 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0028 | $0.0055 | 24.16 | 47.38 |
gpt-4o-mini | OpenAI | T0078 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.2447 | $0.4798 | 128.56 | 252.08 |
gpt-5-mini | OpenAI | T0113 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0868 | $0.1702 | 628.66 | 1232.66 |
gemini-2.5-flash-preview-09-2025 | Google | T0222 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0157 | $0.0307 | 167.65 | 328.72 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0212 | 2025-10-01 | prompt.txt | None | | $0.0066 | $0.0132 | 50.73 | 101.46 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0238 | 2025-10-17 | prompt.txt | None | | $0.0000 | $0.0000 | 71.75 | 143.50 |
gpt-4o | OpenAI | T0038 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.1094 | $0.2188 | 136.28 | 272.55 |
gpt-4.1 | OpenAI | T0068 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0905 | $0.1811 | 87.29 | 174.58 |
gpt-4.1-nano | OpenAI | T0074 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0081 | $0.0161 | 70.52 | 141.04 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0255 | 2025-10-20 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0092 | $0.0184 | 56.40 | 112.80 |
gpt-4.1 | OpenAI | T0069 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.1447 | $0.2893 | 139.77 | 279.53 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0262 | 2025-10-20 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0066 | $0.0131 | 96.53 | 193.06 |
gpt-4.1-nano | OpenAI | T0073 | 2025-09-30 | prompt.txt | None | | $0.0217 | $0.0444 | 134.41 | 274.30 |
mistral-medium-2508 | Mistral AI | T0171 | 2025-10-01 | prompt.txt | None | | $0.0601 | $0.1252 | 167.64 | 349.25 |
gemini-2.0-flash | Google | T0045 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0063 | $0.0131 | 69.17 | 144.11 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0214 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0040 | $0.0082 | 31.45 | 65.52 |
mistral-medium-2505 | Mistral AI | T0172 | 2025-10-01 | prompt.txt | None | | $0.0602 | $0.1282 | 163.77 | 348.45 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0248 | 2025-10-17 | prompt.txt | None | | $0.0327 | $0.0696 | 183.79 | 391.03 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0249 | 2025-10-17 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0120 | $0.0255 | 54.29 | 115.50 |
gpt-4.1-nano | OpenAI | T0075 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0137 | $0.0291 | 88.44 | 188.18 |
gpt-5-nano | OpenAI | T0114 | 2025-09-30 | prompt.txt | None | | $0.0652 | $0.1417 | 1002.37 | 2179.06 |
gemini-2.5-flash-lite | Google | T0204 | 2025-10-01 | prompt.txt | None | | $0.0068 | $0.0149 | 61.57 | 133.85 |
gpt-5-nano | OpenAI | T0116 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0341 | $0.0741 | 465.75 | 1012.51 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0240 | 2025-10-17 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0000 | $0.0000 | 33.01 | 71.76 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0256 | 2025-10-20 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0145 | $0.0315 | 87.88 | 191.05 |
gemini-1.5-pro | Google | T0016 | 2025-04-11 | prompt.txt | None | | N/A | N/A | 325.48 | 723.29 |
mistral-large-latest | Mistral AI | T0188 | 2025-10-01 | prompt.txt | None | | $0.2842 | $0.6316 | 168.93 | 375.40 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0254 | 2025-10-20 | prompt.txt | None | | $0.0227 | $0.0505 | 147.26 | 327.23 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0250 | 2025-10-17 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0205 | $0.0456 | 70.88 | 157.50 |
claude-sonnet-4-5-20250929 | Anthropic | T0226 | 2025-10-01 | prompt.txt | None | | $0.5785 | $1.3147 | 243.73 | 553.93 |
claude-opus-4-1-20250805 | Anthropic | T0118 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $1.0480 | $2.3818 | 101.89 | 231.56 |
claude-sonnet-4-5-20250929 | Anthropic | T0227 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.2292 | $0.5209 | 102.12 | 232.08 |
gemini-2.5-flash-lite | Google | T0206 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0040 | $0.0092 | 34.99 | 79.53 |
mistral-medium-2505 | Mistral AI | T0176 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0371 | $0.0863 | 108.79 | 252.99 |
mistral-large-latest | Mistral AI | T0190 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.1758 | $0.4088 | 108.78 | 252.98 |
claude-3-7-sonnet-20250219 | Anthropic | T0017 | 2025-09-30 | prompt.txt | None | | $0.5398 | $1.2853 | 235.51 | 560.74 |
gemini-1.5-flash | Google | T0048 | 2025-04-11 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | N/A | N/A | 113.85 | 271.08 |
claude-3-7-sonnet-20250219 | Anthropic | T0025 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.3273 | $0.7794 | 140.90 | 335.48 |
mistral-medium-2508 | Mistral AI | T0175 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0371 | $0.0883 | 100.16 | 238.48 |
claude-3-5-sonnet-20241022 | Anthropic | T0018 | 2025-09-30 | prompt.txt | None | | $0.5403 | $1.3177 | 171.85 | 419.14 |
claude-3-7-sonnet-20250219 | Anthropic | T0024 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.2127 | $0.5188 | 109.17 | 266.27 |
claude-opus-4-20250514 | Anthropic | T0101 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $1.0565 | $2.5769 | 120.33 | 293.48 |
claude-3-5-sonnet-20241022 | Anthropic | T0052 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.2131 | $0.5326 | 77.08 | 192.71 |
claude-3-5-sonnet-20241022 | Anthropic | T0053 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.3281 | $0.8202 | 135.74 | 339.35 |
gemini-1.5-flash | Google | T0015 | 2025-04-11 | prompt.txt | None | | N/A | N/A | 291.55 | 747.57 |
claude-sonnet-4-5-20250929 | Anthropic | T0228 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.3493 | $0.9193 | 128.77 | 338.86 |
claude-opus-4-20250514 | Anthropic | T0100 | 2025-09-30 | prompt.txt | None | | $2.6761 | $7.2328 | 282.63 | 763.87 |
pixtral-large-latest | Mistral AI | T0023 | 2025-09-30 | prompt.txt | None | | $0.6883 | $1.9121 | 164.98 | 458.28 |
claude-opus-4-1-20250805 | Anthropic | T0117 | 2025-09-30 | prompt.txt | None | | $2.6621 | $7.3947 | 238.52 | 662.57 |
pixtral-large-latest | Mistral AI | T0061 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.4348 | $1.2789 | 102.15 | 300.45 |
claude-sonnet-4-20250514 | Anthropic | T0105 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.3226 | $0.9487 | 113.88 | 334.95 |
gemini-1.5-flash | Google | T0049 | 2025-04-11 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | N/A | N/A | 184.94 | 560.41 |
claude-opus-4-1-20250805 | Anthropic | T0119 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $1.6173 | $5.0542 | 137.86 | 430.81 |
claude-sonnet-4-20250514 | Anthropic | T0103 | 2025-09-30 | prompt.txt | None | | $0.5322 | $1.7169 | 211.51 | 682.28 |
pixtral-large-latest | Mistral AI | T0060 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.2545 | $0.8208 | 66.61 | 214.87 |
pixtral-12b | Mistral AI | T0184 | 2025-10-01 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0314 | $0.1014 | 65.18 | 210.27 |
claude-opus-4-20250514 | Anthropic | T0102 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $1.6251 | $5.4172 | 167.21 | 557.37 |
pixtral-12b | Mistral AI | T0182 | 2025-10-01 | prompt.txt | None | | $0.0496 | $0.1710 | 109.70 | 378.27 |
claude-sonnet-4-20250514 | Anthropic | T0104 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.2107 | $0.7804 | 78.40 | 290.37 |
claude-3-opus-20240229 | Anthropic | T0036 | 2025-09-30 | prompt.txt | None | | $2.6852 | $10.7406 | 304.65 | 1218.61 |
pixtral-12b | Mistral AI | T0183 | 2025-10-01 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0182 | $0.0726 | 43.94 | 175.76 |
claude-3-opus-20240229 | Anthropic | T0063 | 2025-09-30 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $1.6204 | $6.4818 | 172.46 | 689.82 |
claude-3-opus-20240229 | Anthropic | T0062 | 2025-09-30 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $1.0614 | $5.0543 | 128.92 | 613.89 |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0244 | 2025-10-17 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $0.0337 | $0.1775 | 184.67 | 971.93 |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0245 | 2025-10-17 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $0.0583 | $0.5834 | 312.08 | 3120.79 |
x-ai/grok-4 | xAI (via OpenRouter) | T0268 | 2025-10-20 | prompt.txt | {"skip_signatures": false, "skip_non_signatures": true} | | $1.6650 | $23.7858 | 2638.80 | 37697.19 |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0243 | 2025-10-17 | prompt.txt | None | | $0.0581 | $0.9691 | 267.65 | 4460.85 |
x-ai/grok-4 | xAI (via OpenRouter) | T0266 | 2025-10-20 | prompt.txt | None | | $3.0152 | $100.5078 | 4888.20 | 162940.09 |
x-ai/grok-4 | xAI (via OpenRouter) | T0267 | 2025-10-20 | prompt.txt | {"skip_signatures": true, "skip_non_signatures": false} | | $1.4549 | $72.7445 | 4039.29 | 201964.64 |
test_benchmark¶
Model ↕ | Provider ↕ | Test ID ↕ | Date ↕ | Prompt ↕ | Rules ↕ | Results ↕ | Cost (USD) ↕ | Cost/Point ↕ | Test Time (s) ↕ | Time/Point ↕ |
---|---|---|---|---|---|---|---|---|---|---|
gpt-4o | OpenAI | T0001 | 2025-09-30 | prompt.txt | None | N/A | $0.0045 | N/A | 7.66 | N/A |
gemini-2.0-flash | Google | T0002 | 2025-09-30 | prompt.txt | None | N/A | $0.0002 | N/A | 3.05 | N/A |
claude-3-5-sonnet-20241022 | Anthropic | T0003 | 2025-09-30 | prompt.txt | None | N/A | $0.0080 | N/A | 8.60 | N/A |
gemini-2.5-flash | Google | T0193 | 2025-09-30 | prompt.txt | None | N/A | $0.0012 | N/A | 19.04 | N/A |
gemini-2.5-flash-lite | Google | T0201 | 2025-10-01 | prompt.txt | None | N/A | $0.0263 | N/A | 99.68 | N/A |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0209 | 2025-10-01 | prompt.txt | None | N/A | $0.0001 | N/A | 1.07 | N/A |
gemini-2.5-flash-preview-09-2025 | Google | T0217 | 2025-10-01 | prompt.txt | None | N/A | $0.0008 | N/A | 7.19 | N/A |
test_benchmark2¶
Model ↕ | Provider ↕ | Test ID ↕ | Date ↕ | Prompt ↕ | Rules ↕ | Results ↕ | Cost (USD) ↕ | Cost/Point ↕ | Test Time (s) ↕ | Time/Point ↕ |
---|---|---|---|---|---|---|---|---|---|---|
gpt-4o | OpenAI | T0004 | 2025-09-30 | a_prompt.txt | None | N/A | $0.0039 | N/A | 10.09 | N/A |
gemini-2.0-flash | Google | T0005 | 2025-09-30 | a_prompt.txt | None | N/A | $0.0002 | N/A | 2.58 | N/A |
claude-3-5-sonnet-20241022 | Anthropic | T0006 | 2025-09-30 | a_prompt.txt | None | N/A | $0.0069 | N/A | 8.17 | N/A |
gemini-2.5-flash | Google | T0194 | 2025-09-30 | a_prompt.txt | None | N/A | $0.0003 | N/A | 10.83 | N/A |
gemini-2.5-flash-lite | Google | T0202 | 2025-10-01 | a_prompt.txt | None | N/A | $0.0001 | N/A | 1.54 | N/A |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0210 | 2025-10-01 | a_prompt.txt | None | N/A | $0.0001 | N/A | 0.98 | N/A |
gemini-2.5-flash-preview-09-2025 | Google | T0218 | 2025-10-01 | a_prompt.txt | None | N/A | $0.0006 | N/A | 7.22 | N/A |
zettelkatalog¶
Model ↕ | Provider ↕ | Test ID ↕ | Date ↕ | Prompt ↕ | Rules ↕ | Results ↕ | Cost (USD) ↕ | Cost/Point ↕ | Test Time (s) ↕ | Time/Point ↕ |
---|---|---|---|---|---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | Anthropic | T0143 | 2025-10-01 | prompt.txt | None | | $2.5000 | $2.8568 | 1577.11 | 1802.20 |
gpt-5 | OpenAI | T0165 | 2025-10-01 | prompt.txt | None | | $7.1243 | $8.1870 | 21108.28 | 24256.89 |
gemini-2.5-pro | Google | T0155 | 2025-10-01 | prompt.txt | None | | $0.7095 | $0.8157 | 3744.41 | 4304.93 |
gemini-2.5-flash-preview-09-2025 | Google | T0224 | 2025-10-01 | prompt.txt | None | | $0.3949 | $0.4591 | 2424.33 | 2818.30 |
gemini-2.5-flash | Google | T0200 | 2025-09-30 | prompt.txt | None | | $0.1684 | $0.1965 | 3027.50 | 3531.85 |
gpt-4.1 | OpenAI | T0160 | 2025-10-01 | prompt.txt | None | | $2.6952 | $3.1509 | 12936.40 | 15123.99 |
claude-3-7-sonnet-20250219 | Anthropic | T0144 | 2025-10-01 | prompt.txt | None | | $2.5778 | $3.0183 | 1529.23 | 1790.50 |
claude-sonnet-4-20250514 | Anthropic | T0148 | 2025-10-01 | prompt.txt | None | | $2.5141 | $2.9870 | 1456.90 | 1730.95 |
gemini-2.0-flash | Google | T0151 | 2025-10-01 | prompt.txt | None | | $0.0795 | $0.0945 | 632.49 | 751.61 |
o3 | OpenAI | T0168 | 2025-10-01 | prompt.txt | None | | $2.5396 | $3.0453 | 9024.04 | 10821.21 |
claude-opus-4-1-20250805 | Anthropic | T0146 | 2025-10-01 | prompt.txt | None | | $12.5676 | $15.2166 | 1581.36 | 1914.67 |
gpt-4o | OpenAI | T0066 | 2025-09-30 | prompt.txt | None | | $1.3031 | $1.5799 | 3291.08 | 3989.93 |
gemini-2.0-flash-lite | Google | T0152 | 2025-10-01 | prompt.txt | None | | $0.0615 | $0.0757 | 618.19 | 761.16 |
claude-sonnet-4-5-20250929 | Anthropic | T0230 | 2025-09-30 | prompt.txt | None | | $2.7730 | $3.4395 | 1428.77 | 1772.17 |
qwen/qwen3-vl-8b-instruct | Alibaba (via OpenRouter) | T0264 | 2025-10-20 | prompt.txt | None | | $0.0549 | $0.0687 | 829.98 | 1039.52 |
gpt-5-mini | OpenAI | T0166 | 2025-10-01 | prompt.txt | None | | $1.3158 | $1.6634 | 22744.78 | 28753.17 |
claude-opus-4-20250514 | Anthropic | T0147 | 2025-10-01 | prompt.txt | None | | $12.7064 | $16.1835 | 1760.17 | 2241.84 |
mistral-medium-2508 | Mistral AI | T0179 | 2025-10-01 | prompt.txt | None | | $0.2675 | $0.3413 | 934.75 | 1192.46 |
mistral-large-latest | Mistral AI | T0192 | 2025-10-01 | prompt.txt | None | | $1.1730 | $1.5031 | 862.11 | 1104.69 |
pixtral-large-latest | Mistral AI | T0159 | 2025-10-01 | prompt.txt | None | | $2.1044 | $2.7040 | 1298.95 | 1669.01 |
mistral-medium-2505 | Mistral AI | T0180 | 2025-10-01 | prompt.txt | None | | $0.2675 | $0.3448 | 835.72 | 1077.34 |
gpt-5-nano | OpenAI | T0167 | 2025-10-01 | prompt.txt | None | | $0.3448 | $0.4475 | 5049.08 | 6552.84 |
claude-3-opus-20240229 | Anthropic | T0145 | 2025-10-01 | prompt.txt | None | | $13.6030 | $17.8355 | 2789.97 | 3658.04 |
gpt-4.1-mini | OpenAI | T0161 | 2025-10-02 | prompt.txt | None | | N/A | N/A | 29647.30 | 39063.63 |
x-ai/grok-4 | xAI (via OpenRouter) | T0270 | 2025-10-20 | prompt.txt | None | | $14.3280 | $19.1860 | 25318.45 | 33902.87 |
gemini-2.5-flash-lite | Google | T0208 | 2025-10-01 | prompt.txt | None | | $0.1044 | $0.1497 | 587.82 | 842.84 |
qwen/qwen3-vl-30b-a3b-instruct | Alibaba (via OpenRouter) | T0258 | 2025-10-20 | prompt.txt | None | | $0.4187 | $0.6101 | 15627.93 | 22775.35 |
gemini-2.5-flash-lite-preview-09-2025 | Google | T0216 | 2025-10-01 | prompt.txt | None | | $0.3252 | $0.4750 | 1152.22 | 1682.91 |
gpt-4.1-nano | OpenAI | T0162 | 2025-10-02 | prompt.txt | None | | N/A | N/A | 471.26 | 696.91 |
meta-llama/llama-4-maverick | Meta (via OpenRouter) | T0252 | 2025-10-17 | prompt.txt | None | | $0.5506 | $0.8159 | 19155.46 | 28385.64 |
pixtral-12b | Mistral AI | T0186 | 2025-10-01 | prompt.txt | None | | $0.1384 | $0.2274 | 596.77 | 980.26 |
GLM-4.5V-FP8 | Z.ai (via sciCORE) | T0242 | 2025-10-17 | prompt.txt | None | | $0.0000 | $0.0000 | 44963.40 | 173666.98 |
qwen/qwen3-vl-8b-thinking | Alibaba (via OpenRouter) | T0247 | 2025-10-17 | prompt.txt | None | | $0.1827 | $1.7703 | 1035.96 | 10037.63 |
gpt-4o-mini | OpenAI | T0164 | 2025-10-03 | prompt.txt | None | | $0.0201 | $1.3640 | 1229.45 | 83295.51 |
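If Cost/Point and Time/Point are, per the leaderboard notes, a run's cost and wall-clock time divided by its performance score, then the score a row implies can be recovered by inverting that division. A minimal sketch under that assumption; both helper functions are illustrative, not part of the benchmark code:

```python
def efficiency(cost_usd: float, time_s: float, score: float):
    """Normalize cost and time by the performance score (0-1 scale)."""
    return cost_usd / score, time_s / score

def implied_score(cost_usd: float, cost_per_point: float) -> float:
    """Recover the score a row implies from its Cost and Cost/Point."""
    return cost_usd / cost_per_point

# claude-3-5-sonnet-20241022 row above: $2.5000 total at $2.8568 per point
score = implied_score(2.5000, 2.8568)
print(round(score, 3))  # -> 0.875
```

As a consistency check, the same row's Time/Point (1802.20 s) matches its Test Time (1577.11 s) divided by this implied score.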
About This Page¶
This benchmark suite tests AI models on humanities data tasks. The tests run monthly, and the results are updated automatically.
For more details, visit the GitHub repository.