TiānshūBench (天书Bench) 0.0.1-mini Results Are Here: GPT-5, Claude Opus 4.1!

TiānshūBench 0.0.1 Mini Results

Note that, as mentioned in the previous blog post, the results for the mini version of the test suite are not as accurate as those of the full TiānshūBench suite; they're designed to give a timely sense of which models perform best on fluid intelligence and coding tasks. The right way to read these results is that the models in the top third all perform within the same order of magnitude of one another. Check out the earlier blog post for more details on this version of the test.

Models listed with “thinking/” in their names on the chart above were run with a high thinking-budget setting, whatever that means for the individual model.

Noteworthy is that, despite the bad early reviews, GPT-5 hangs in near the top of the heap, and it's certainly more cost-effective than the Claude models.

Keep an eye out for the next version of TiānshūBench, with more and better tests, and the latest models.

As always, if you have LLM compute credits or hardware to donate to help TiānshūBench grow, please let me know!


Benchmark Fury: TiānshūBench 0.0.1-mini vs Claude Opus 4.1, GPT-5, Kimi K2, and More

It's been an exciting couple of weeks, with new models released by big players like Anthropic and OpenAI, as well as open weight models from the scrappy challengers Moonshot AI and Alibaba.

With so many new models in the pipeline, TiānshūBench faced a few problems:

  • The complete problem set took too long to run on each new model. The combination of testing different programming languages, number of shots (retries) allowed, and individual test cases means that each model would have to run over 1000 tests, many of which would require multiple queries.

    This lag time meant that I couldn't get good feedback on improving the tests themselves. This is still a part-time project, so being able to improve things quickly is critical.

    More importantly, the results wouldn't be timely, which would make them less useful to me and to anyone else following TiānshūBench.

  • Providers would remove models without warning or would otherwise be unreliable.

  • With this explosion in test cases came an explosion in costs. For example, it cost around $150 just to do a complete test on Claude Sonnet 3.7, an older, cheaper model.

The solution is here, though: introducing TiānshūBench version 0.0.1-mini! The mini version of the test will restrict the test cases thusly:

  • Only one generated programming language is used. Previous runs showed that the choice of generated language didn't change the outcome much, even though testing several languages did provide more samples, and therefore more accuracy.
  • Only the 8-shot version of the tests is run. This provides more of an agentic environment and gives the models a chance to correct errors due to underspecification in the language description.
  • The easiest tests are skipped. These are nearly saturated, with top-end models scoring near 100% on “Hello World”-type tasks.

Tradeoffs

While the mini suite is much faster and cheaper, that speed comes at the cost of accuracy. Some models end up tied due to the low number of tests, and model configurations that “should” come out ahead will often flub a couple of tests, dragging their scores down.

Costs

With the advent of TiānshūBench 0.0.1-mini, costs and time came way down. For example, testing Claude 3.7 Sonnet went down from $150 to about $3.50. It still costs around $17 just to test Claude Opus 4.1. OpenAI seems much less expensive at this point, at around $11 to test 6 different LLM configurations.

Updates

With this release of TiānshūBench, there are several updates:

  • Updated the language description, filling in the block and statement sections and fixing typos.
  • Updated the system prompt.
  • Added new model providers, including OpenAI, Gemini, and Anthropic. This was tricky, as they all take slightly different advanced parameters: max_completion_tokens instead of max_tokens, for example (see the sketch below).
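
As a rough illustration of the kind of parameter shuffling this involves, here's a minimal sketch of a hypothetical request builder (not the actual TiānshūBench code):

# Hypothetical sketch: newer OpenAI chat models want max_completion_tokens,
# while most other providers still take max_tokens.
def build_request_params(provider, prompt, token_limit):
    params = {"messages": [{"role": "user", "content": prompt}]}
    if provider == "openai":
        params["max_completion_tokens"] = token_limit
    else:
        params["max_tokens"] = token_limit
    return params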

(I don't have any hard numbers, but these updates seemed to help Kimi K2 quite a bit.)

Full list of models tested:

  • anthropic/claude-3-5-haiku-20241022
  • anthropic/claude-3-7-sonnet-20250219
  • anthropic/claude-opus-4-1-20250805
  • anthropic/claude-opus-4-20250514
  • anthropic/thinking/claude-3-7-sonnet-20250219
  • anthropic/thinking/claude-opus-4-1-20250805
  • chutes/deepseek-ai/DeepSeek-R1-0528
  • chutes/zai-org/GLM-4.5-Air
  • gemini/gemini-2.5-flash
  • gemini/gemini-2.5-pro
  • gemini/thinking/gemini-2.5-flash
  • gemini/thinking/gemini-2.5-pro
  • nvidia/moonshotai/kimi-k2-instruct
  • nvidia/openai/gpt-oss-120b
  • nvidia/openai/gpt-oss-20b
  • openai/gpt-5
  • openai/gpt-5-mini
  • openai/gpt-5-nano
  • openai/thinking/gpt-5
  • openai/thinking/gpt-5-mini
  • openai/thinking/gpt-5-nano

Future Updates

I found that many of the models “cheated” by coding to the specific test case. For example, haiku-3-5, one of the weaker models, passed test_case23, a JSON parsing task, by making some big assumptions about the file format, which happened to work for that test case. In the future, we'll need to run each generated program against multiple sets of input and output to guarantee that it actually works. The re-prompt will probably be something like “this program did not work for input XXXX” instead of giving away the correct output!
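
Here's a rough sketch of what that check could look like (run_program here is a stand-in for the real interpreter call, not the actual TiānshūBench code):

# Sketch: run a generated program against several input/output pairs and, on the
# first mismatch, build a re-prompt that names the failing input without giving
# away the expected output.
def check_program(run_program, cases):
    for stdin_text, expected in cases:
        actual = run_program(stdin_text)
        if actual != expected:
            return False, f"This program did not work for input {stdin_text!r}."
    return True, None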

OpenAI's Responses API may be the way of the future for this type of task.

The heavier thinking models often take a long time, time out, and fail the test. In the future, we need to allow some flexibility in this timeout for multi-shot and thinking models.
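
One possible shape for that, purely as an illustration (the numbers here are made up):

# Illustrative only: give thinking models and later shots more time, since later
# shots carry a longer conversation and thinking models reason for longer.
def request_timeout(base_seconds, shot, thinking):
    multiplier = 3.0 if thinking else 1.0
    return base_seconds * multiplier * (1 + 0.5 * (shot - 1))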

Models On the Short List for Testing in the Future

  • chutes/Qwen/Qwen3-235B-A22B-Thinking-2507
  • Qwen3-Coder-480B-A35B-Instruct
  • Qwen3-235B-A22B-Instruct-2507
  • Grok

That's all for now, and keep an eye out for the results of TiānshūBench 0.0.1-mini to be posted soon!


But Can It Think? A Quick Look at Kimi K2

Heavenly Robot

The Internet's been abuzz with talk of Kimi K2, the latest open-weight LLM out of Moonshot AI. Word on the street is that it can go head-to-head with the latest frontier models. But how does it stack up on the TiānshūBench test of fluid intelligence, coding, and reasoning?

Result Graph

Not as good as we expected! Kimi K2 scores a 43.1 vs. DeepSeek R1 0528's score of 53.3 in the latest experimental version of TiānshūBench.

TiānshūBench Progress: Updates to the Benchmark

In the latest test runs, some of the easier tests are well into the saturated range, and will probably be excluded from future scores. We'll also be adding a few more tests in the “hard” range.

=== Statistics by Test Case For All Models ===
test_case0: 83.64% passed
test_case1: 90.18% passed
test_case2: 86.61% passed
test_case3: 97.5% passed
test_case4: 95.65% passed
test_case5: 32.71% passed
test_case6: 34.23% passed
test_case7: 49.52% passed
test_case8: 17.05% passed
test_case9: 21.84% passed
test_case10: 17.07% passed
test_case11: 17.44% passed
test_case12: 22.09% passed
test_case13: 28.05% passed
test_case14: 17.98% passed
test_case15: 27.71% passed
=== Statistics by Problem For All Models ===
001: Extract Odd Digits: 86.83%
002: Print String: 97.5%
003: Echo String: 95.65%
004: Test string: 38.7%
005: Integer counting: 18.37%
006: Delimiter matching: 23.82%

Also of note on this most recent run: the free Nvidia API is not very reliable. It kept timing out when making calls to the DeepSeek-R1 endpoint. Chutes is better but still unreliable, and they've started charging for more models. I may seek out faster, more dependable LLM providers to tighten the development loop.


Challenging Nomad: TiānshūBench Experimental Release 0.0.Y

Arguing With AI

Greetings cosmic calligraphers, and welcome to the TiānshūBench version 0.0.Y benchmark, for July 2025.

Since the last update, I've added a few new features:

  • Support for Nvidia NIM in addition to Chutes.ai, which means results for more models
  • Cleaned-up tests with better error handling, etc.

Nvidia NIM was a great find, because it's yet another free service that allows testing against various open-weight models. For instance, it let me test Llama-4 Maverick, which never worked right for me on Chutes. It also let me check multiple implementations of the same model to verify that the results I was getting were correct.


=== Statistics by Model ===
chutes/deepseek-ai/DeepSeek-R1: 306/627 (48.8%) passed, 321 failed, 13 skipped, 640 total 
chutes/deepseek-ai/DeepSeek-R1-0528: 335/635 (52.76%) passed, 300 failed, 5 skipped, 640 total 
chutes/deepseek-ai/DeepSeek-V3-0324: 244/633 (38.55%) passed, 389 failed, 7 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1: 268/593 (45.19%) passed, 325 failed, 47 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1-0528: 316/585 (54.02%) passed, 269 failed, 55 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1-distill-qwen-32b: 127/505 (25.15%) passed, 378 failed, 135 skipped, 640 total 
nvidia/meta/llama-4-maverick-17b-128e-instruct: 256/631 (40.57%) passed, 375 failed, 9 skipped, 640 total 
nvidia/meta/llama-4-scout-17b-16e-instruct: 68/638 (10.66%) passed, 570 failed, 2 skipped, 640 total 
nvidia/microsoft/phi-4-mini-instruct: 0/291 (0.0%) passed, 291 failed, 349 skipped, 640 total 
nvidia/qwen/qwen3-235b-a22b: 290/634 (45.74%) passed, 344 failed, 6 skipped, 640 total 

I am worried about the “skipped” results. Digging into the logs, they mostly come from what look like read timeouts or network errors. This is tough to get around: the code already retries on network failures, so these skips mean we weren't getting results even after backing off and retrying several times over the course of 15 minutes.

I'm also a bit worried about the results from Microsoft Phi. 0% passed! How can that be?

Looking at the logs, some of Phi's responses are genuine garbage:

 Sure, 1, and then, a, a, a, and, and, and, and, and, and, and, and, and, and, and, and, and, and,
  *. 1,
  * 1, and, and, and, and, and, 1, are, and, and. 1, 1, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and,,1,1,, 1,1,1, and, and,1,1. 1. 1. 1,, 1, and,1,,1, and, and, and,1, and,1,1, and, and, and, and, and, and, and,1, and,1, and,,,,,,,1,1, and,1,1,1,1,1,1,1,1
  1,1,1,,,,,,,,1,,1,1, 1,1,1,1,1,1,1,1,1,1,1.

This reminded me of the time Captain Kirk talked the AI-powered space probe Nomad into blowing itself up.

Phi also gets confused by the TiānshūBench language requirement that each statement end with a semicolon. I noticed that the language documentation doesn't explicitly call this out, and, in fact, there's a “Hello World”-type example in the documentation that skips the trailing semicolon. I'll tighten this up on the next go-around.

Of course, increasing test-time compute with a more agentic approach and allowing retries lets the models succeed much more often:

=== Statistics by Number of Shots ===
1 shots: 329/1449 (22.71%) passed, 1120 failed, 151 skipped, 1600 total 
2 shots: 431/1510 (28.54%) passed, 1079 failed, 90 skipped, 1600 total 
4 shots: 634/1403 (45.19%) passed, 769 failed, 197 skipped, 1600 total 
8 shots: 816/1410 (57.87%) passed, 594 failed, 190 skipped, 1600 total 

That's all for now. I think next up is to test some more models, perhaps even some of the paid ones!


Open The Pod Bay Doors: TiānshūBench Intermediate Release 0.0.X

Mysterious AI Compute

Greetings, celestial scribes! When we last left off, we were looking at how TiānshūBench (天书Bench) could test reasoning, fluid intelligence, and coding ability in LLMs by dynamically generating programming languages and challenging the LLM system to solve simple problems using those new programming languages.

Since that time, I've added:

  • New tests
  • The ability to test against LLM providers Chutes and Sambanova
  • The ability to run the tests in parallel
  • Retries in case of network errors
  • Enhanced reporting and logging
  • Multi-shot testing

Much of the code for the latest version of TiānshūBench was created with the help of aider, backed by Google Gemini and Anthropic Claude.

Trials and Tribulations

One big shift in this benchmark release is the change to using Chutes as the model inference provider instead of a local Ollama instance as before. The big challenge with Ollama, as I found out, is that when you demand a model's full context length, it becomes EXTREMELY SLOW, to the point of complete unusability. Long context is critical for the multi-shot tests, because the context has to include previous failures in addition to the somewhat lengthy language description.

In the meantime, I've switched to Chutes as an inference provider because:

  1. It provides a nice selection of models, including some of the top open-weight models.
  2. It's free to use once you sign up.

Free to use is extremely important, because, as we'll see, a lot of the tests didn't work right the first time through (my fault). I would hate to spend hundreds or thousands of dollars in inference costs, only to find, for example, that I had mislabeled one of the tests, causing it to fail every time because the poor LLM had no chance to get it right from the beginning.

For a free service, it's shockingly reliable, and they only gripe about my usage when I'm hitting their API really hard, at which point their servers respond with a 429 (Too Many Requests). TiānshūBench now backs off for a random length of time and tries again in case of this or other network errors.
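
The retry behavior is conceptually simple; a minimal sketch (not the exact TiānshūBench code) looks something like this:

import random
import time

# Sketch: on a 429 or a network error, sleep for a random, growing interval and
# try the request again, up to a handful of attempts.
def call_with_retries(make_request, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception:  # in practice, catch 429s and network errors specifically
            if attempt == max_attempts:
                raise
            time.sleep(random.uniform(1, 15) * attempt)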

Like all stuff that's good and free, I don't expect it to last forever.

If you've got some free LLM usage credits to throw at an independent researcher, perhaps to prove that YOUR model is the world leader in TiānshūBench, hit me up.

I'm also contemplating switching back to local models with vLLM or a similar system, which I understand handles long context better.

Model Madness

In addition to the DeepSeek-R1 and -V3 models, Chutes also offers free API access to Qwen, GLM, Llama, and the hybrid DeepSeek-R1-0528-Qwen3-8B models. Unfortunately, these models are not included in this report, because they either:

  1. Came back with blank responses once the instructions got too long.
  2. Started taking way too long to respond on multi-shot tests.

I've got some idea of what needs to happen in some of these cases. For one thing, we need to separate out scoring by number of shots. I also need to mess with the temperature and other request parameters to find a combination that each model likes.

Problematic Parameters

While testing, I noticed that certain test cases were failing 99+% of the time. Normally this would be good, because it would mean we'd discovered a task that the LLMs aren't really good at yet.

However, upon further inspection, I discovered that my test data just wasn't set up correctly. For example:

  • One set of input and output values was marked with the wrong problem definition ID.
  • Another set of expected output values had a trailing space, causing all output string comparisons to fail.

The latter case led to 99% failures for that test. There was a bit of cleverness from DeepSeek-R1-0528 on one run, though: a couple of shots into the run, it managed to pass the test by figuring out that the test itself was wrong.

    So the test should now expect "mayd" without space?
    But the error message said it expected 'mayd ' (with a space). 
    This discrepancy might be because the test case has a trailing space 
    in the expected output? 
    But the problem says no delimiters. 
    We are outputting without any space.
    Therefore, if the test expects a space, then the test is flawed. 

It then went on to produce a program that passed the test. This is definitely one case where the LLM was smarter than the experimenter.
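
A cheap guard against this class of mistake is a sanity check over the test fixtures before spending any inference money. A sketch, assuming a simple list-of-dicts fixture format (not the real fixture layout):

# Sketch: flag expected outputs that carry trailing whitespace, which would make
# every exact string comparison fail.
def lint_cases(cases):
    warnings = []
    for case in cases:
        expected = case["expected_output"]
        if expected != expected.rstrip():
            warnings.append(f"{case['id']}: expected output has trailing whitespace")
    return warnings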

Precious Parallelism

Rather than coming up with my own testing framework, I decided to build TiānshūBench around pytest, a unit testing package for Python code. This has paid off in spades, as pytest is quite mature and features a number of great plugins for working with tests.
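
In pytest terms, the whole test matrix can fall out of stacked parametrize decorators. Roughly (identifiers simplified; this is not the real suite):

import pytest

MODELS = ["chutes/deepseek-ai/DeepSeek-R1", "chutes/deepseek-ai/DeepSeek-V3-0324"]
LANGUAGE_SEEDS = range(1, 11)   # 10 generated languages
SHOT_LEVELS = [1, 2, 4, 8]
TEST_CASES = range(16)

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("seed", LANGUAGE_SEEDS)
@pytest.mark.parametrize("shots", SHOT_LEVELS)
@pytest.mark.parametrize("case_id", TEST_CASES)
def test_execute_generated_multi_shot(model, seed, shots, case_id):
    ...  # generate the language from the seed, prompt the model, run its program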

For instance: for this benchmark report, we're running 3 LLMs × 10 synthetic test languages × 16 test cases × 4 multi-shot levels, for a total of 1920 tests. If I had to run these tests one at a time, it would take over two and a half days to complete the whole suite. This is where the pytest-xdist plugin comes to the rescue, as it lets you easily run any number of tests in parallel. The command I used to run the test suite is

python -m pytest -svv -n 200 --dist worksteal   --report-log=results/report-log-(date --iso=minutes).json   --alluredir=results/allure  tianshu_bench/benchmarks/test_llm_ability.py::test_execute_generated_multi_shot   -k "chutes/ and DeepSeek" 

Run this way, the tests run in 200 simultaneous processes, which means that the suite completes in about 2 hours.

Mystery Tests

Some tests still don't complete when run this way. I suspect it has something to do with pytest-xdist. Anyway, I was able to whip up a script that parses the report log, figures out which tests didn't run, and outputs the missing tests' identifiers so they can be re-run.

python scripts/find_missing_tests.py   results/report-log-2025-06-05T21:52-04:00.json   \
  --output-missing missing-tests.txt    --filter "chutes/ and DeepSeek"    \
  --test-path "tianshu_bench/benchmarks/test_llm_ability.py::test_execute_generated_multi_shot"

The missing tests can then be re-run with the command above, but specifying the individual tests.

 python -m pytest -svv -n 10 --dist worksteal   \
   --report-log=results/report-log-(date --iso=minutes).json \
   --alluredir=results/allure  (cat missing-tests.txt)
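
Under the hood, the script doesn't need to do much: pytest's --report-log output is one JSON object per line, so it's enough to collect the node IDs that actually reached the “call” phase and diff them against the expected test matrix. The core is roughly this (a sketch, not the script verbatim):

import json

# Collect node IDs that reached the "call" phase in a pytest report log; anything
# in the expected matrix but missing from this set never actually ran.
def completed_nodeids(report_log_path):
    seen = set()
    with open(report_log_path) as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("$report_type") == "TestReport" and record.get("when") == "call":
                seen.add(record["nodeid"])
    return seen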

Alluring Activity

Another great pytest plugin is allure-pytest, which lets pytest work with the Allure reporting system. This allows a TiānshūBench user to search, sort, and filter individual tests and see their results.

Allure results

Allure allows you to attach files to an individual test, and I use this functionality to attach a log of the complete conversation with the LLM during a test. This was how I was able to quickly and mostly painlessly diagnose problems with the test data, including the issues mentioned above.

Allure results log details
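
The attachment itself is essentially a one-liner with allure-pytest; a minimal sketch:

import allure

# Attach the full LLM conversation to the currently running test so it shows up
# in the Allure report. conversation_text is whatever transcript the harness built.
def attach_conversation(conversation_text):
    allure.attach(
        conversation_text,
        name="llm_conversation",
        attachment_type=allure.attachment_type.TEXT,
    )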

Rolling It Up

To get the benchmark stats, we run the analysis script thusly:

python scripts/analyze_multishot_report.py results/report-log-2025-06-05.json 

Statistics by Number of Shots

1 shots: 121/480 (25.21%) passed, 359 failed, 0 skipped, 480 total 
2 shots: 155/480 (32.29%) passed, 325 failed, 0 skipped, 480 total 
4 shots: 233/480 (48.54%) passed, 247 failed, 0 skipped, 480 total 
8 shots: 285/480 (59.38%) passed, 195 failed, 0 skipped, 480 total 

Unsurprisingly, the more chances an LLM gets to come up with an answer, plus some guidance as to what went wrong, the more likely the LLM is to come up with a correct solution. The trick here in the future will be to find the LLM that gives the most correct answers for the amount of money or time that you're spending on inference.
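
For reference, the multi-shot loop itself is conceptually simple. A sketch, with the model and interpreter calls stubbed out (this is not the literal TiānshūBench code):

# Sketch of a multi-shot attempt: on each failure, append the error to the
# conversation and let the model try again, up to max_shots tries.
def multi_shot(ask_model, run_and_check, language_doc, task, max_shots=8):
    messages = [{"role": "user", "content": language_doc + "\n\n" + task}]
    for _ in range(max_shots):
        program = ask_model(messages)
        passed, error = run_and_check(program)
        if passed:
            return True
        messages += [
            {"role": "assistant", "content": program},
            {"role": "user", "content": f"That program failed: {error} Please try again."},
        ]
    return False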

Statistics by Language Seed

Seed 1: 85/192 (44.27%) passed
Seed 2: 80/192 (41.67%) passed
Seed 3: 92/192 (47.92%) passed
Seed 4: 68/192 (35.42%) passed
Seed 5: 85/192 (44.27%) passed
Seed 6: 74/192 (38.54%) passed
Seed 7: 71/192 (36.98%) passed
Seed 8: 74/192 (38.54%) passed
Seed 9: 77/192 (40.1%) passed
Seed 10: 88/192 (45.83%) passed

These represent the different programming languages generated by TiānshūBench for testing the LLMs. Every model gets every test case in each of these languages. As expected, there isn't much variation in the results here, since the languages differ only in their nonsense keywords.

Statistics by Test Case

test_case0: 99/120 (82.5%) passed 
test_case1: 100/120 (83.33%) passed
test_case2: 93/120 (77.5%) passed 
test_case3: 113/120 (94.17%) passed
test_case4: 107/120 (89.17%) passed
test_case5: 36/120 (30.0%) passed 
test_case6: 39/120 (32.5%) passed 
test_case7: 34/120 (28.33%) passed
test_case8: 18/120 (15.0%) passed 
test_case9: 16/120 (13.33%) passed
test_case10: 14/120 (11.67%) passed
test_case11: 26/120 (21.67%) passed
test_case12: 27/120 (22.5%) passed
test_case13: 19/120 (15.83%) passed
test_case14: 34/120 (28.33%) passed
test_case15: 19/120 (15.83%) passed
001: Extract Odd Digits: 292/360 (81.11%) passed
002: Print String: 113/120 (94.17%) passed
003: Echo String: 107/120 (89.17%) passed
004: Test string: 109/360 (30.28%) passed
005: Integer counting: 74/480 (15.42%) passed
006: Delimiter matching: 99/480 (20.62%) passed

test_case10, the most-failed test case, is a fairly simple task that requires a loop and a conditional. It's a test of problem ID 005:

Write a program in Possions that does the following:
Reads a string from standard input.
If that string represents a single digit integer, with no other characters, the program should:
output the integer, and all of the greater integers up to 10 inclusive, with a space between each integer.
So for input "5", output should be "5 6 7 8 9 10" with no quotes.
If the input is not a single digit integer, output "None" with no quotes.

And here's a solution from DeepSeek-R1-0528:

input = ask("");
valid = 0;

ripted (len(input) == 1) {
    ripted (input su "0123456789") {
        valid = 1;
    }
}

ripted (valid == 1) {
    n = int(input);
    result = "";
    sohanidd i su n -> 10 {
        ripted (result != "") {
            result = result + " ";
        }
        result = result + str(i);
    }
    miciously(result);
}

ripted (valid == 0) {
    miciously("None");
}

But most LLMs seem to get tripped up on the syntax for this problem. There need to be many more test cases like this, and they need to be much harder, if we are to avoid benchmark saturation. Perhaps following the example of test_case10 can give us some guidance.

The Big Reveal: The Current TiānshūBench Champion!

Of the 3 tested models, DeepSeek-R1 comes out on top, with 47.34% of test cases passed. The newer DeepSeek-R1-0528 is very close behind at 46.56%, and DeepSeek-V3-0324 falls behind both at 30.16%.

0.0.X results

chutes/deepseek-ai/DeepSeek-R1: 303/640 (47.34%) passed
chutes/deepseek-ai/DeepSeek-R1-0528: 298/640 (46.56%) passed
chutes/deepseek-ai/DeepSeek-V3-0324: 193/640 (30.16%) passed

Wrapping Up

Of course, this research raises as many questions as it answers:

  • How do other models compare with DeepSeek on TiānshūBench?
  • Which models are the most efficient with regard to time and cost in generating an answer?
  • What do AI systems most commonly get confused about?
  • How can we improve artificial intelligence planning and reasoning?

Solving some of these problems will require new additions to the TiānshūBench code in the short term:

  • Error code tags in Allure to distinguish different kinds of errors, e.g. syntax errors and runtime errors.
  • Shot and token count logging
  • Calculating the number of correct answers per dollar spent (a rough sketch follows this list). I especially like this as a metric because it can keep growing as inference gets cheaper, rather than being capped the way a percentage-correct score is.
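
That metric is easy to compute once per-model token counts and prices are logged; a sketch (the inputs here are assumptions about what gets logged):

# Correct answers per dollar: unlike a percentage score, this has no ceiling and
# keeps improving as inference gets cheaper.
def answers_per_dollar(correct_answers, prompt_tokens, completion_tokens,
                       usd_per_million_prompt, usd_per_million_completion):
    cost = (prompt_tokens * usd_per_million_prompt +
            completion_tokens * usd_per_million_completion) / 1_000_000
    return correct_answers / cost if cost else float("inf")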

Questions and suggestions are welcome!

