But Can It Think? A Quick Look at Kimi K2

Heavenly Robot

The Internet's been abuzz with talk of Kimi K2, the latest open-weight LLM out of Moonshot AI. Word on the street is that it can go head-to-head with the latest frontier models. But how does it stack up on the TiānshūBench test of fluid intelligence, coding, and reasoning?

Result Graph

Not as good as we expected! Kimi K2 scores a 43.1 vs. DeepSeek R1 0528's score of 53.3 in the latest experimental version of TiānshūBench.

TiānshūBench Progress: Updates to the Benchmark

In the latest test runs, some of the easier tests are well into the saturated range, and will probably be excluded from future scores. We'll also be adding a few more tests in the “hard” range.

=== Statistics by Test Case For All Models ===
test_case0: 83.64% passed
test_case1: 90.18% passed
test_case2: 86.61% passed
test_case3: 97.5% passed
test_case4: 95.65% passed
test_case5: 32.71% passed
test_case6: 34.23% passed
test_case7: 49.52% passed
test_case8: 17.05% passed
test_case9: 21.84% passed
test_case10: 17.07% passed
test_case11: 17.44% passed
test_case12: 22.09% passed
test_case13: 28.05% passed
test_case14: 17.98% passed
test_case15: 27.71% passed
=== Statistics by Problem For All Models ===
001: Extract Odd Digits: 86.83%
002: Print String: 97.5%
003: Echo String: 95.65%
004: Test string: 38.7%
005: Integer counting: 18.37%
006: Delimiter matching: 23.82%
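
One simple way to implement that exclusion is to drop any test case whose overall pass rate sits above a saturation threshold before averaging the scores. Here's a rough sketch of the idea; the threshold value and helper function are hypothetical, not the current TiānshūBench scoring code:

# Hypothetical sketch: filter out saturated test cases before scoring.
SATURATION_THRESHOLD = 0.90  # assumed cutoff, not an official TiānshūBench value

def filter_saturated(pass_rates):
    """Keep only test cases whose overall pass rate is below the threshold."""
    return {name: rate for name, rate in pass_rates.items()
            if rate < SATURATION_THRESHOLD}

# A few of the per-test pass rates from the run above, as fractions.
pass_rates = {
    "test_case1": 0.9018, "test_case3": 0.9750, "test_case4": 0.9565,
    "test_case5": 0.3271, "test_case8": 0.1705, "test_case10": 0.1707,
}

kept = filter_saturated(pass_rates)
print(sorted(kept))  # the three saturated cases (1, 3, 4) are dropped from scoring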

Also of note on this most recent run: the free Nvidia API is not very reliable. It kept timing out on calls to the DeepSeek-R1 endpoint. Chutes is better but still unreliable, and they've started charging for more models. I may seek out faster, more dependable LLM providers to tighten the development loop.


Challenging Nomad: TiānshūBench Experimental Release 0.0.Y

Arguing With AI

Greetings, cosmic calligraphers, and welcome to the TiānshūBench version 0.0.Y benchmark for July 2025.

Since the last update, I've added a few new features:

  • Using Nvidia NIM in addition to Chutes.ai, which means results for more models
  • Cleaned up tests with better error handling, etc.

Nvidia NIM was a great find, because it is yet another free service that allows testing against various open-weight models. For instance, it let me test Llama-4 Maverick, which never worked right for me on Chutes. It also let me check multiple implementations of the same model to verify that the results I was getting were correct.
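
Both providers offer OpenAI-compatible chat endpoints, so switching between them mostly comes down to choosing a base URL and API key per provider prefix. Here's a minimal sketch of that routing; the environment variable names and the helper are my own illustration, not TiānshūBench's actual configuration:

# Illustrative only: route a "provider/org/model" ID to the right OpenAI-compatible
# endpoint. Env var names and the helper are assumptions, not TiānshūBench config.
import os
from openai import OpenAI

PROVIDERS = {
    "nvidia": {"base_url_env": "NVIDIA_NIM_BASE_URL", "key_env": "NVIDIA_API_KEY"},
    "chutes": {"base_url_env": "CHUTES_BASE_URL", "key_env": "CHUTES_API_KEY"},
}

def client_for(model_id: str):
    """Split 'provider/org/model' and build a client for that provider."""
    provider, model = model_id.split("/", 1)
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=os.environ[cfg["base_url_env"]],
                    api_key=os.environ[cfg["key_env"]])
    return client, model

client, model = client_for("nvidia/deepseek-ai/deepseek-r1-0528")
reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(reply.choices[0].message.content)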


=== Statistics by Model ===
chutes/deepseek-ai/DeepSeek-R1: 306/627 (48.8%) passed, 321 failed, 13 skipped, 640 total 
chutes/deepseek-ai/DeepSeek-R1-0528: 335/635 (52.76%) passed, 300 failed, 5 skipped, 640 total 
chutes/deepseek-ai/DeepSeek-V3-0324: 244/633 (38.55%) passed, 389 failed, 7 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1: 268/593 (45.19%) passed, 325 failed, 47 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1-0528: 316/585 (54.02%) passed, 269 failed, 55 skipped, 640 total 
nvidia/deepseek-ai/deepseek-r1-distill-qwen-32b: 127/505 (25.15%) passed, 378 failed, 135 skipped, 640 total 
nvidia/meta/llama-4-maverick-17b-128e-instruct: 256/631 (40.57%) passed, 375 failed, 9 skipped, 640 total 
nvidia/meta/llama-4-scout-17b-16e-instruct: 68/638 (10.66%) passed, 570 failed, 2 skipped, 640 total 
nvidia/microsoft/phi-4-mini-instruct: 0/291 (0.0%) passed, 291 failed, 349 skipped, 640 total 
nvidia/qwen/qwen3-235b-a22b: 290/634 (45.74%) passed, 344 failed, 6 skipped, 640 total 

I am worried about the “skipped” results. Digging into the logs, they are mostly due to what look like read timeouts or network errors. This is tough to get around: the code already retries on network failures, so these skips mean we still weren't getting results even after backing off and retrying several times over the course of 15 minutes.

I'm also a bit worried about the results from Microsoft Phi: 0% passed! How can that be?

Some of Phi's responses are, frankly, genuine garbage:

 Sure, 1, and then, a, a, a, and, and, and, and, and, and, and, and, and, and, and, and, and, and,
  *. 1,
  * 1, and, and, and, and, and, 1, are, and, and. 1, 1, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and,,1,1,, 1,1,1, and, and,1,1. 1. 1. 1,, 1, and,1,,1, and, and, and,1, and,1,1, and, and, and, and, and, and, and,1, and,1, and,,,,,,,1,1, and,1,1,1,1,1,1,1,1
  1,1,1,,,,,,,,1,,1,1, 1,1,1,1,1,1,1,1,1,1,1.

This reminded me of the time Captain Kirk talked the AI-powered space probe Nomad into blowing itself up.

Phi also gets confused by the TiānshūBench language requirement that each statement end with a semicolon. I noticed that the language documentation doesn't explicitly call this out; in fact, there's a “Hello World”-style example in the documentation that omits the trailing semicolon. I'll tighten this up on the next go-around.

Of course, increasing test-time compute with a more agentic approach and allowing retries lets the models succeed much more often:

=== Statistics by Number of Shots ===
1 shots: 329/1449 (22.71%) passed, 1120 failed, 151 skipped, 1600 total 
2 shots: 431/1510 (28.54%) passed, 1079 failed, 90 skipped, 1600 total 
4 shots: 634/1403 (45.19%) passed, 769 failed, 197 skipped, 1600 total 
8 shots: 816/1410 (57.87%) passed, 594 failed, 190 skipped, 1600 total 

That's all for now. I think next up is to test some more models, perhaps even some of the paid ones!


Open The Pod Bay Doors: TiānshūBench Intermediate Release 0.0.X

Mysterious AI Compute

Greetings, celestial scribes! When we last left off, we were looking at how TiānshūBench (天书Bench) could test reasoning, fluid intelligence, and coding ability in LLMs by dynamically generating programming languages and challenging the LLM system to solve simple problems using those new programming languages.

Since that time, I've added:

  • New tests
  • The ability to test against LLM providers Chutes and Sambanova
  • The ability to run the tests in parallel
  • Retries in case of network errors
  • Enhanced reporting and logging
  • Multi-shot testing

Much of the code for the latest version of TiānshūBench was created with the help of aider, backed by Google Gemini and Anthropic Claude.

Trials and Tribulations

One big shift in this benchmark release is the change to Chutes as the model inference provider instead of the local ollama instance used before. The big challenge with ollama, as I found out, is that when you demand a model's full context length, it becomes EXTREMELY SLOW, to the point of complete unusability. Long context is critical for the multi-shot tests, because the context has to include previous failures in addition to the somewhat lengthy language description.

In the meantime, I've switched to Chutes as the inference provider because:

  1. It provides a nice selection of models, including some of the top open-weight models.
  2. It's free to use once you sign up.

Free to use is extremely important, because, as we'll see, a lot of the tests didn't work right the first time through (my fault). I would hate to spend hundreds or thousands of dollars in inference costs, only to find, for example, that I had mislabeled one of the tests, causing it to fail every time because the poor LLM had no chance to get it right from the beginning.

For a free service, it's shockingly reliable, and they only gripe about my usage when I'm hitting their API really hard, at which point their servers respond with a 429 (Too Many Requests). TiānshūBench now backs off for a random length of time and tries again on this or other network errors.
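
The retry logic is conceptually just a loop with jittered sleeps. Here's a simplified sketch of the idea, not the exact TiānshūBench implementation; the attempt count and delay range are illustrative:

# Simplified sketch of retry-with-random-backoff on 429s and network errors.
# Not the exact TiānshūBench code; delays and attempt counts are illustrative.
import random
import time
import requests

def call_with_backoff(url, payload, headers, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=120)
            if resp.status_code == 429:
                raise RuntimeError("rate limited")
            resp.raise_for_status()
            return resp.json()
        except (requests.RequestException, RuntimeError):
            if attempt == max_attempts - 1:
                raise
            # Back off for a random interval that grows with each failure.
            time.sleep(random.uniform(1, 2 ** (attempt + 2)))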

Like all stuff that's good and free, I don't expect it to last forever.

If you've got some free LLM usage credits to throw at an independent researcher, perhaps to prove that YOUR model is the world leader in TiānshūBench, hit me up.

I'm also contemplating switching back to local models with vLLM or a similar system, which I understand handles long context better.

Model Madness

In addition to the DeepSeek-R1 and -V3 models, Chutes also offers free API access to Qwen, GLM, Llama, and the hybrid DeepSeek-R1-0528-Qwen3-8B models. Unfortunately, these models are not included in this report, because they either:

  1. Came back with blank responses once the instructions got too long.
  2. Started taking way too long to respond on multi-shot tests.

I've got some ideas about what needs to happen in these cases. For one thing, we need to separate scoring out by number of shots. I also need to experiment with the temperature and other request parameters to find a combination that each model likes.

Problematic Parameters

While testing, I noticed that certain test cases were failing over 99% of the time. Normally this would be good, because it would mean we'd discovered a task that the LLMs aren't really good at yet.

However, upon further inspection, I discovered that my test data just wasn't set up correctly. For example:

  • One set of input and output values was marked with the wrong problem definition ID.
  • Another set of expected output values had a trailing space, causing all output string comparisons to fail.

The latter case led to 99% failures for that test. There was a bit of cleverness from DeepSeek-R1-0528 on one run, though. A couple of shots into the run, it managed to pass the test by figuring out that the test itself was wrong.

    So the test should now expect "mayd" without space?
    But the error message said it expected 'mayd ' (with a space). 
    This discrepancy might be because the test case has a trailing space 
    in the expected output? 
    But the problem says no delimiters. 
    We are outputting without any space.
    Therefore, if the test expects a space, then the test is flawed. 

It then went on to produce a program that passed the test. This is definitely one case where the LLM was smarter than the experimenter.
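
One cheap guard against this class of test-data bug is to normalize trailing whitespace on both sides before comparing a program's output to the expected value. A hypothetical helper along these lines (not necessarily how TiānshūBench compares outputs today) would have sidestepped the problem:

# Hypothetical output check that ignores trailing whitespace on each line and at
# the end of the output, so a stray trailing space in test data can't fail every run.
def outputs_match(actual: str, expected: str) -> bool:
    def normalize(s: str) -> str:
        return "\n".join(line.rstrip() for line in s.rstrip().splitlines())
    return normalize(actual) == normalize(expected)

assert outputs_match("mayd", "mayd ")      # stray trailing space tolerated
assert not outputs_match("mayd", "may d")  # real differences still fail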

Precious Parallelism

Rather than coming up with my own testing framework, I decided to build TiānshūBench around pytest, a unit testing package for Python code. This has paid off in spades, as pytest is quite mature and features a number of great plugins for working with tests.

For instance, for this benchmark report we're running 3 LLMs × 10 synthetic test languages × 16 test cases × 4 multi-shot levels, for a total of 1920 tests. If I had to run these tests one at a time, it would take over two and a half days to complete the whole suite. This is where the pytest-xdist plugin comes to the rescue, as it allows you to easily run any number of tests in parallel. The command I used to run the test suite is

python -m pytest -svv -n 200 --dist worksteal \
  --report-log=results/report-log-(date --iso=minutes).json \
  --alluredir=results/allure \
  tianshu_bench/benchmarks/test_llm_ability.py::test_execute_generated_multi_shot \
  -k "chutes/ and DeepSeek"

Run this way, the tests run in 200 simultaneous processes, which means that the suite completes in about 2 hours.

Mystery Tests

Some tests still don't complete when run this way. I suspect it has something to do with pytest-xdist. Anyway, I was able to whip up a script that figures out which tests didn't run and outputs their identifiers so they can be re-run.

python scripts/find_missing_tests.py   results/report-log-2025-06-05T21:52-04:00.json   \
  --output-missing missing-tests.txt    --filter "chutes/ and DeepSeek"    \
  --test-path "tianshu_bench/benchmarks/test_llm_ability.py::test_execute_generated_multi_shot"

The missing tests can then be re-run with the command above, but specifying the individual tests.

 python -m pytest -svv -n 10 --dist worksteal   \
   --report-log=results/report-log-(date --iso=minutes).json \
   --alluredir=results/allure  (cat missing-tests.txt)
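
Conceptually, the missing-test script is straightforward: the pytest report log is a JSON-lines file, so you can collect the node IDs that actually reported results and diff them against the full list of collected tests. Here's a rough sketch of that idea, assuming the pytest-reportlog format and a pre-collected node ID list; the real scripts/find_missing_tests.py may differ in its details:

# Rough sketch of the "find missing tests" idea, not the actual script.
# Assumes the report log is JSON lines (pytest-reportlog) and that
# all_tests.txt holds one node ID per line (e.g. from `pytest --collect-only -q`).
import json
import sys

def completed_nodeids(report_log_path):
    done = set()
    with open(report_log_path) as f:
        for line in f:
            entry = json.loads(line)
            # A call-phase TestReport means the test actually produced a result.
            if entry.get("$report_type") == "TestReport" and entry.get("when") == "call":
                done.add(entry["nodeid"])
    return done

if __name__ == "__main__":
    with open("all_tests.txt") as f:
        all_tests = {line.strip() for line in f if line.strip()}
    missing = sorted(all_tests - completed_nodeids(sys.argv[1]))
    print("\n".join(missing))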

Alluring Activity

Another great plugin for pytest is allure-pytest, which lets pytest feed its results into the Allure reporting system. This allows a TiānshūBench user to search, sort, and filter individual tests and see their results.

Allure results

Allure allows you to attach files to an individual test, and I use this functionality to attach a log of the complete conversation with the LLM during a test. This was how I was able to quickly and mostly painlessly diagnose problems with the test data, including the issues mentioned above.
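
Attaching the conversation is essentially a one-liner with the allure-pytest API; something along these lines (the variable and attachment names are illustrative):

# Illustrative: attach the full LLM conversation log to the current Allure test.
import allure

def attach_conversation(conversation_text: str) -> None:
    allure.attach(
        conversation_text,
        name="llm_conversation",
        attachment_type=allure.attachment_type.TEXT,
    )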

Allure results log details

Rolling It Up

To get the benchmark stats, we run the analysis script thusly:

python scripts/analyze_multishot_report.py results/report-log-2025-06-05.json 

Statistics by Number of Shots

1 shots: 121/480 (25.21%) passed, 359 failed, 0 skipped, 480 total 
2 shots: 155/480 (32.29%) passed, 325 failed, 0 skipped, 480 total 
4 shots: 233/480 (48.54%) passed, 247 failed, 0 skipped, 480 total 
8 shots: 285/480 (59.38%) passed, 195 failed, 0 skipped, 480 total 

Unsurprisingly, the more chances an LLM gets to come up with an answer, plus some guidance as to what went wrong, the more likely the LLM is to come up with a correct solution. The trick here in the future will be to find the LLM that gives the most correct answers for the amount of money or time that you're spending on inference.
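
Conceptually, each multi-shot test is a feedback loop: generate a program, run it, and if it fails, feed the error back and ask again, up to the shot limit. Here's a stripped-down sketch of that loop; the function names and prompt strings are placeholders, not the real TiānshūBench internals:

# Stripped-down sketch of a multi-shot attempt loop. `llm.generate` and
# `run_program` are placeholders, not real TiānshūBench functions.
def attempt_problem(llm, run_program, language_guide, problem,
                    test_input, expected, max_shots=8):
    history = [{"role": "user", "content": language_guide + "\n\n" + problem}]
    for shot in range(1, max_shots + 1):
        program = llm.generate(history)                    # placeholder LLM call
        output, error = run_program(program, test_input)   # placeholder interpreter call
        if error is None and output == expected:
            return True, shot                              # solved within `shot` shots
        feedback = error if error else f"Expected {expected!r}, got {output!r}"
        history.append({"role": "assistant", "content": program})
        history.append({"role": "user", "content": "That attempt failed:\n" + feedback})
    return False, max_shots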

Statistics by Language Seed

Seed 1: 85/192 (44.27%) passed
Seed 2: 80/192 (41.67%) passed
Seed 3: 92/192 (47.92%) passed
Seed 4: 68/192 (35.42%) passed
Seed 5: 85/192 (44.27%) passed
Seed 6: 74/192 (38.54%) passed
Seed 7: 71/192 (36.98%) passed
Seed 8: 74/192 (38.54%) passed
Seed 9: 77/192 (40.1%) passed
Seed 10: 88/192 (45.83%) passed

These represent the different programming languages generated by TiānshūBench for testing the LLMs. Every model is run on every test case in each of these languages. As expected, there's not much variation in the results here, as the languages differ only by their nonsense keywords.

Statistics by Test Case

test_case0: 99/120 (82.5%) passed 
test_case1: 100/120 (83.33%) passed
test_case2: 93/120 (77.5%) passed 
test_case3: 113/120 (94.17%) passed
test_case4: 107/120 (89.17%) passed
test_case5: 36/120 (30.0%) passed 
test_case6: 39/120 (32.5%) passed 
test_case7: 34/120 (28.33%) passed
test_case8: 18/120 (15.0%) passed 
test_case9: 16/120 (13.33%) passed
test_case10: 14/120 (11.67%) passed
test_case11: 26/120 (21.67%) passed
test_case12: 27/120 (22.5%) passed
test_case13: 19/120 (15.83%) passed
test_case14: 34/120 (28.33%) passed
test_case15: 19/120 (15.83%) passed
001: Extract Odd Digits: 292/360 (81.11%) passed
002: Print String: 113/120 (94.17%) passed
003: Echo String: 107/120 (89.17%) passed
004: Test string: 109/360 (30.28%) passed
005: Integer counting: 74/480 (15.42%) passed
006: Delimiter matching: 99/480 (20.62%) passed

Test_case_10, the most-failed test case, is a fairly simple task that requires a loop and a conditional. It's a test involving problem ID 005:

Write a program in Possions that does the following:
Reads a string from standard input.
If that string represents a single digit integer, with no other characters, the program should:
output the integer, and all of the greater integers up to 10 inclusive, with a space between each integer.
So for input "5", output should be "5 6 7 8 9 10" with no quotes.
If the input is not a single digit integer, output "None" with no quotes.

And here's a solution from DeepSeek-R1-0528:

input = ask("");
valid = 0;

ripted (len(input) == 1) {
    ripted (input su "0123456789") {
        valid = 1;
    }
}

ripted (valid == 1) {
    n = int(input);
    result = "";
    sohanidd i su n -> 10 {
        ripted (result != "") {
            result = result + " ";
        }
        result = result + str(i);
    }
    miciously(result);
}

ripted (valid == 0) {
    miciously("None");
}

But most LLMs seem to get tripped up on the syntax for this problem. There need to be many more of these test cases and they need to be much harder if we are to avoid benchmark saturation. Perhaps following the example of test_case_10 can give us some guidance.

The Big Reveal: The Current TiānshūBench Champion!

Of the 3 tested models, DeepSeek-R1 comes out on top, with 47.34% of test cases passed. The newer DeepSeek-R1-0528 is very close behind with 46.56% of test cases passed. Finally DeepSeek-V3-0324 falls behind these two at 30.16%.

0.0.X results

chutes/deepseek-ai/DeepSeek-R1: 303/640 (47.34%) passed
chutes/deepseek-ai/DeepSeek-R1-0528: 298/640 (46.56%) passed
chutes/deepseek-ai/DeepSeek-V3-0324: 193/640 (30.16%) passed

Wrapping Up

Of course, this research raises as many questions as it answers:

  • How do other models compare with DeepSeek on TiānshūBench?
  • Which models are the most efficient with regard to time and cost in generating an answer?
  • What do AI systems most commonly get confused about?
  • How can we improve artificial intelligence planning and reasoning?

Solving some of these problems will require new additions to the TiānshūBench code in the short term:

  • Error code tags in Allure to distinguish different kinds of errors, e.g. syntax errors and runtime errors.
  • Shot and token count logging
  • Calculating the number of correct answers per dollar spent. I especially like this as a benchmark, because it can continue to grow as inference gets cheaper, rather than being capped like percentage of questions correct.

Questions and suggestions are welcome!


Introducing TiānshūBench (天书Bench)

TiānshūBench Logo

TiānshūBench (天书Bench) is a benchmark which tests the ability of LLMs to generate code in a unique way: every test is performed in a unique programming language which is created for the purpose of that test run.

For example, here is a valid program generated by an LLM in response to one of TiānshūBench’s test questions, with a programming language generated with a random seed of 1:

input_str = ask(); 
sohanidd char su input_str { 
	ripted char >= '0' ? char <= '9' { 
			digit = int(char); 
			ripted digit % 2 != 0 { 
				miciously(char); 
		} 
	} 
} 

TiānshūBench certainly serves as a test of programming ability, one of the most widely used applications of LLMs today. More importantly, TiānshūBench serves as a test of the system’s ability to use new and unfamiliar tools to solve problems. By giving the programming language definition in the prompt, TiānshūBench pulls the LLM out of the familiar space of common programming languages such as Python or Java, which any modern LLM surely has been extensively trained upon.

Tool use and the ability to adapt to novel situations are fundamental to what we consider “intelligence”. Some existing benchmarks, such as ARC-2, have attempted to provide novel situations and avoid dataset contamination by showing the AI system unique sets of visual symbols, which it is then expected to decode. This approach, however, has two difficulties:

1. The symbols in the ARC-2 benchmark problems have no agreed-upon real-world meaning. This makes the problem statement, as presented to the AI system, ambiguous. As we shall see, the problem statements in TiānshūBench are written in ordinary language, in which the symbols (words) have agreed-upon meanings.

2. The benchmark tool has no ability to generate new, unique problems.

In this blog post, I hope to give an overview of TiānshūBench’s origins, how it works, and some of the design decisions made along the way. In subsequent articles, I’ll talk about the difficulties I faced at each step in getting things to run smoothly and coming up with appropriate measurements of LLM abilities. TiānshūBench is still in early development, so you will be able to follow along as I improve and expand upon the tests, refine the implementation, and benchmark today’s most popular LLMs.

Origins

Part of TiānshūBench was inspired by the observation that the more esoteric the programming language, the harder time an LLM has coming up with a working example. For instance, here's a program in AtariSoft BASIC, a language hardly used since the 1980s, generated by one of today's leading LLMs:

10 REM ATARISOFT BASIC PROGRAM
20 REM PRINT ODD DIGITS FROM A STRING
30 REM DEFINE THE CONSTANT STRING OF 10 DIGITS
40 LET DIGITSTRING$ = "3141592653"
50 REM PRINT A HEADER
60 PRINT "ORIGINAL STRING: "; DIGITSTRING$
70 PRINT "ODD DIGITS FOUND:"
80 REM LOOP THROUGH EACH CHARACTER OF THE STRING
90 FOR I = 1 TO LEN(DIGITSTRING$)
100 REM EXTRACT ONE CHARACTER (DIGIT)
110 LET CHAR$ = MID$(DIGITSTRING$, I, 1)
120 REM CONVERT THE CHARACTER TO ITS NUMERIC VALUE
130 LET DIGITVALUE = VAL(CHAR$)
140 REM CHECK IF THE NUMERIC VALUE IS ODD
150 IF DIGITVALUE MOD 2 = 1 THEN PRINT CHAR$; " ";
160 REM CONTINUE TO THE NEXT CHARACTER
170 NEXT I
180 REM PRINT A NEWLINE AT THE END FOR CLEANLINESS
190 PRINT
200 REM END OF PROGRAM
210 END

Astute readers will note that it’s the same program, more or less, as the TiānshūBench response program above. However, it contains several mistakes, including not declaring a string variable correctly, and confabulating a MOD operator which does not exist in this language.

Generating a New Programming Language

For each benchmark test question, a new programming language is generated. The language is derived from the somewhat obscure Mamba programming language (no relation to the deep learning architecture). We create a set of keyword replacements randomly chosen from a pool of nonsense words. These replacements are then dynamically substituted into the Mamba language parser and used to update the human-readable Mamba language guide. The language name is randomized as well. Here's an excerpt of the generated language guide, covering functions and flow control.

## Possions programming language
Possions is a simple dynamic typed, programming language
...
#### Functions ####

functions are declared via the following grammar

    thresyncess func_name( [<arguments>,] ){
        < statements >
    }

    thresyncess random(){
        naritrannument 4;
    }

return value is specified with the `naritrannument` keyword which, as expected, 
immediately halts function execution upon being called. Functions can have 
their private functions which are inaccessible to the outer scope.

#### Flow control ####

Possions supports `ripted` statements sohanidd flow control via the following syntax

    ripted < expression > {
        < statements >
    }

Mamba also includes a number of other useful features, such as the ability to read from standard input and write to standard output, looping constructs, and so on.

The language definition is given to the LLM to be benchmarked as part of the first client request.
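
In code, the renaming step boils down to a seeded mapping from Mamba keywords to nonsense words, applied to both the parser's keyword table and the documentation text. Here's a simplified sketch of that idea; the keyword subset, nonsense-word pool, and function names are illustrative, not the actual generator:

# Simplified sketch of seeded keyword renaming; not the actual generator.
import random
import re

MAMBA_KEYWORDS = ["if", "for", "func", "return", "print"]    # illustrative subset
NONSENSE_WORDS = ["ripted", "sohanidd", "thresyncess",
                  "naritrannument", "miciously", "glorbex"]   # illustrative pool

def make_language(seed: int) -> dict:
    """Return a keyword -> replacement mapping for one generated language."""
    rng = random.Random(seed)                 # same seed -> same language every run
    replacements = rng.sample(NONSENSE_WORDS, len(MAMBA_KEYWORDS))
    return dict(zip(MAMBA_KEYWORDS, replacements))

def rewrite_guide(guide: str, mapping: dict) -> str:
    """Apply the same substitutions to the human-readable language guide."""
    for original, replacement in mapping.items():
        guide = re.sub(rf"\b{re.escape(original)}\b", replacement, guide)
    return guide

print(make_language(seed=1))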

An Aside: Where the Name Comes From

As I was working on the benchmark, it occurred to me that since the tests were not based on a particular language, the thoughts of the LLM might be said to be the “language of the gods”. In Chinese, tiānshū, 天书, means “heavenly language” or “celestial script”, and indicates the writing system of the gods. But there’s a wonderful double meaning: idiomatically, tiānshū is used to describe difficult to understand documents, like complicated mathematics or law. In English, we might say “It’s Greek to me!”, but in Chinese, it’s “天书!”

Benchmark Problem Set

The benchmark problem set consists of programming problems which vary in difficulty.

Here is one example problem:

 Write a program in Possions that does the following:                 
                                                                      
 Reads a string from standard input that contains a series of decimal 
 digits.                                                              
                                                                      
 Prints the digits of that string that are odd.                       
                                                                      
 Do not output anything other than those digits, including prompts or 
 delimiters.                                                          

The problem description is concatenated to the end of the language description, and the whole request is sent to the LLM system.
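
In other words, the first request is just the two pieces of text glued together. A trivial sketch of that assembly (the message structure here is assumed, not taken from the actual code):

# Trivial sketch: the first request is the language guide plus the problem text.
def build_first_request(language_guide: str, problem_statement: str) -> list:
    prompt = language_guide.rstrip() + "\n\n" + problem_statement.strip()
    return [{"role": "user", "content": prompt}]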

Running the Benchmark Tests

The TiānshūBench suite is invoked via the pytest framework. This gives us access to all of pytest’s useful features like reporting, test selection, and parameterization.
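
As an illustration of what that parameterization looks like, a test can be generated for every combination of model, language seed, and shot count with a few decorators. This is just an example shape; the real test module's parameters and fixtures likely differ:

# Example of pytest parameterization over model, language seed, and shot count.
# Names and values are illustrative, not the real tianshu_bench test module.
import pytest

MODELS = ["chutes/deepseek-ai/DeepSeek-R1", "chutes/deepseek-ai/DeepSeek-V3-0324"]
SEEDS = range(1, 11)
SHOTS = [1, 2, 4, 8]

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("seed", SEEDS)
@pytest.mark.parametrize("shots", SHOTS)
def test_example_multi_shot(model, seed, shots):
    # A real test would generate the language from `seed`, query `model`,
    # and allow up to `shots` attempts; here we just assert the parameters exist.
    assert model and seed and shots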

LLM Providers

For initial development, ollama was used with a few smaller models on a local server with a relatively low-end 12 GB 3060 GPU. This allowed me to try a variety of models and test cases to figure out what worked with the test suite without breaking the bank on inference costs.

Later, I realized that performance was not going to keep up as we scaled to a more robust test suite, so I switched to Chutes as the provider for further development. They have a free-to-use inference API with many popular models and reasonable performance.

TiānshūBench Version 0.0 Results

I recently released version 0.0 of the test results, an initial proof of concept for TiānshūBench. It includes 5 single shot test cases, and 10 generated languages, for a total of 50 tests per LLM. I found 3 LLMs that were able to run on the hardware I had and give reasonable responses.

Statistics by LLM

ollama/deepseek-r1:14b: 18/50 passed (36.0%)

ollama/phi4:14b-q4_K_M: 10/50 passed (20.0%)

ollama/qwen3:14b: 23/50 passed (46.0%)

Statistics by Test Case

Test Case 0: 3/30 passed (10.0%)

Test Case 1: 8/30 passed (26.67%)

Test Case 2: 7/30 passed (23.33%)

Test Case 3: 18/30 passed (60.0%)

Test Case 4: 15/30 passed (50.0%)

TiānshūBench Results Chart

Ollama’s qwen3:14b takes the lead here, with a 46% pass rate on the tests.

Current enhancements

TiānshūBench is undergoing active development on GitHub. Current updates include:

  • More tests
  • More supported models
  • Coarse parallelism for performance
  • Network error recovery
  • Multi-shot tests
  • Support for inference via SambaNova and Chutes

These will be applied to the next official benchmark results.

Future Enhancements

  • Performance
    • Finer-grained parallelism in running the test cases
    • Finding the fastest inference providers
  • Adding new and interesting models to the test
  • Checking and fixing tests whose failures are due to infrastructure and configuration problems
  • More tests across a range of difficulty
  • More randomization of the generated programming languages
    • Randomize remaining tokens and symbols
    • Add the ability to add or remove language features
  • Improved programming language documentation to further reduce ambiguity

Future blog articles may include the following topics:

  • Issues faced setting up a local LLM
  • Working with Mamba to create unique programming languages
  • Pytest setup and operation
  • Common stumbling blocks when benchmarking LLMs

How You Can Help

Follow JeepyTea on Twitter/X for updates on the latest TiānshūBench news. I'm also looking for contributors on GitHub to help out with coding tasks and benchmark problems. Finally, if you've got API access to powerful LLMs or are willing to contribute inference credits, please let me know!

