TiānshūBench Logo

TiānshūBench (天书Bench) is a benchmark which tests the ability of LLMs to generate code in a unique way: every test is performed in a unique programming language which is created for the purpose of that test run.

For example, here is a valid program generated by an LLM in response to one of TiānshūBench’s test questions, with a programming language generated with a random seed of 1:

input_str = ask();
sohanidd char su input_str {
    ripted char >= '0' ? char <= '9' {
        digit = int(char);
        ripted digit % 2 != 0 {
            miciously(char);
        }
    }
}

TiānshūBench certainly serves as a test of programming ability, one of the most widely used applications of LLMs today. More importantly, TiānshūBench serves as a test of the system’s ability to use new and unfamiliar tools to solve problems. By giving the programming language definition in the prompt, TiānshūBench pulls the LLM out of the familiar space of common programming languages such as Python or Java, which any modern LLM surely has been extensively trained upon.

Tool use and the ability to adapt to novel situations are fundamental to what we consider “intelligence”. Some existing benchmarks, such as ARC-2, have attempted to provide novel situations and avoid dataset contamination by showing the AI system unique sets of visual symbols, which it is then expected to decode. This approach, however, has two difficulties:

1. The symbols in the ARC-2 benchmark problems have no agreed-upon real-world meaning. This makes the problem statement, as presented to the AI system, ambiguous. As we shall see, the problem statements in TiānshūBench are written in ordinary language, in which the symbols (words) have agreed-upon meanings.

2. The benchmark tool has no ability to generate new, unique problems.

In this blog post, I hope to give an overview of TiānshūBench’s origins, how it works, and some of the design decisions made along the way. In subsequent articles, I’ll talk about the difficulties I faced at each step in getting things to run smoothly and coming up with appropriate measurements of LLM abilities. TiānshūBench is still in early development, so you will be able to follow along as I improve and expand upon the tests, refine the implementation, and benchmark today’s most popular LLMs.

Origins

Part of TiānshūBench was inspired by the observation that the more esoteric the programming language, the harder a time an LLM has coming up with a working example. For instance, here’s a program in AtariSoft BASIC, a language hardly used since the 1980s, generated by one of today’s leading LLMs:

10 REM ATARISOFT BASIC PROGRAM
20 REM PRINT ODD DIGITS FROM A STRING
30 REM DEFINE THE CONSTANT STRING OF 10 DIGITS
40 LET DIGITSTRING$ = "3141592653"
50 REM PRINT A HEADER
60 PRINT "ORIGINAL STRING: "; DIGITSTRING$
70 PRINT "ODD DIGITS FOUND:"
80 REM LOOP THROUGH EACH CHARACTER OF THE STRING
90 FOR I = 1 TO LEN(DIGITSTRING$)
100 REM EXTRACT ONE CHARACTER (DIGIT)
110 LET CHAR$ = MID$(DIGITSTRING$, I, 1)
120 REM CONVERT THE CHARACTER TO ITS NUMERIC VALUE
130 LET DIGITVALUE = VAL(CHAR$)
140 REM CHECK IF THE NUMERIC VALUE IS ODD
150 IF DIGITVALUE MOD 2 = 1 THEN PRINT CHAR$; " ";
160 REM CONTINUE TO THE NEXT CHARACTER
170 NEXT I
180 REM PRINT A NEWLINE AT THE END FOR CLEANLINESS
190 PRINT
200 REM END OF PROGRAM
210 END

Astute readers will note that it’s more or less the same program as the TiānshūBench response above. However, it contains several mistakes, including declaring a string variable incorrectly and confabulating a MOD operator that does not exist in this language.

Generating a New Programming Language

For each benchmark test question, a new programming language is generated. The language is derived from the somewhat obscure Mamba programming language (no relation to the deep learning architecture). We create a set of keyword replacements randomly chosen from a set of nonsense words. These replacements are then dynamically substituted into the Mamba language parser and used to update the human-readable Mamba language guide. The language name is randomized as well. Here’s an excerpt of the generated guide describing functions and flow control.

## Possions programming language
Possions is a simple dynamic typed, programming language
...
#### Functions ####

functions are declared via the following grammar

    thresyncess func_name( [<arguments>,] ){
        < statements >
    }

    thresyncess random(){
        naritrannument 4;
    }

return value is specified with the `naritrannument` keyword which, as expected, 
immediately halts function execution upon being called. Functions can have 
their private functions which are inaccessible to the outer scope.

#### Flow control ####

Possions supports `ripted` statements sohanidd flow control via the following syntax

    ripted < expression > {
        < statements >
    }

Mamba also includes a number of other useful features, such as the ability to read from standard input and write to standard output, looping constructs, and so on.

The language definition is given to the LLM to be benchmarked as part of the first client request.
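
As a rough illustration of the keyword-substitution step described above, here is a minimal sketch in Python. The keyword list is only a small sample, and the function names and nonsense-word generator are my own assumptions rather than TiānshūBench’s actual code:

import random
import re

# Small sample of Mamba keywords; the real benchmark swaps out many more,
# in both the parser and the human-readable language guide.
MAMBA_KEYWORDS = ["def", "return", "if", "for", "in", "print"]

def nonsense_word(rng, min_len=6, max_len=12):
    """Build a pronounceable nonsense word by alternating consonants and vowels."""
    consonants, vowels = "bcdfghklmnprst", "aeiou"
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(consonants if i % 2 == 0 else vowels)
                   for i in range(length))

def make_replacements(seed):
    """Map each keyword to a unique nonsense word, deterministically per seed."""
    rng = random.Random(seed)
    replacements, used = {}, set()
    for keyword in MAMBA_KEYWORDS:
        word = nonsense_word(rng)
        while word in used:
            word = nonsense_word(rng)
        used.add(word)
        replacements[keyword] = word
    return replacements

def rewrite_guide(guide_text, replacements):
    """Apply the same substitutions to the human-readable language guide."""
    for keyword, word in replacements.items():
        guide_text = re.sub(rf"\b{re.escape(keyword)}\b", word, guide_text)
    return guide_text

Because the substitutions are driven by a seed, the same language can be regenerated exactly for every model under test, which keeps results comparable across runs.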

An Aside: Where the Name Comes From

As I was working on the benchmark, it occurred to me that since the tests were not based on a particular language, the thoughts of the LLM might be said to be the “language of the gods”. In Chinese, tiānshū, 天书, means “heavenly language” or “celestial script”, and indicates the writing system of the gods. But there’s a wonderful double meaning: idiomatically, tiānshū is used to describe difficult to understand documents, like complicated mathematics or law. In English, we might say “It’s Greek to me!”, but in Chinese, it’s “天书!”

Benchmark Problem Set

The benchmark problem set consists of programming problems which vary in difficulty.

Here is one example problem:

    Write a program in Possions that does the following:

    Reads a string from standard input that contains a series of decimal digits.

    Prints the digits of that string that are odd.

    Do not output anything other than those digits, including prompts or delimiters.

The problem description is concatenated to the end of the language description, and the whole request is sent to the LLM system.
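
A minimal sketch of that assembly step (the helper name here is hypothetical):

def build_prompt(language_guide, problem_description):
    """Append the problem statement to the generated language guide,
    forming the full request sent to the LLM."""
    return f"{language_guide}\n\n{problem_description}\n"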

Running the Benchmark Tests

The TiānshūBench suite is invoked via the pytest framework. This gives us access to all of pytest’s useful features like reporting, test selection, and parameterization.
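
For example, a parameterized test along the following lines expands into one test per (language seed, problem) pair. The helper functions are stand-ins for the suite’s real plumbing, not its actual API:

import pytest

SEEDS = range(10)        # e.g., 10 generated languages
PROBLEM_IDS = range(5)   # e.g., 5 problems -> 50 tests per model

def generate_language(seed):
    """Stand-in: build the keyword-substituted language guide for this seed."""
    raise NotImplementedError

def solve_and_check(language_guide, problem_id):
    """Stand-in: prompt the model, run its program, and verify the output."""
    raise NotImplementedError

@pytest.mark.parametrize("seed", SEEDS)
@pytest.mark.parametrize("problem_id", PROBLEM_IDS)
def test_generated_language_problem(seed, problem_id):
    guide = generate_language(seed)
    assert solve_and_check(guide, problem_id)

With this structure, pytest’s standard reporting and -k test selection work unchanged.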

LLM Providers

For initial development, ollama was used with a few smaller models on a local server with a relatively low-end 12 GB RTX 3060 GPU. This allowed me to try a variety of models and test cases to figure out what worked with the test suite without breaking the bank on inference costs.
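
A request through the ollama Python client looks roughly like this; the wrapper function is my own, and the model tag matches one of the models benchmarked below:

import ollama

def ask_model(prompt, model="qwen3:14b"):
    """Send the language guide + problem prompt to a locally served model
    and return its reply text."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]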

Later, I realized that performance was not going to keep up as I scaled to a more robust test suite, so I switched to Chutes as a provider for further development. Chutes offers a free-to-use inference API with many popular models and reasonable performance.

TiānshūBench Version 0.0 Results

I recently released version 0.0 of the test results, an initial proof of concept for TiānshūBench. It includes 5 single-shot test cases and 10 generated languages, for a total of 50 tests per LLM. I found 3 LLMs that were able to run on the hardware I had and give reasonable responses.

Statistics by LLM

ollama/deepseek-r1:14b: 18/50 passed (36.0%)

ollama/phi4:14b-q4_K_M: 10/50 passed (20.0%)

ollama/qwen3:14b: 23/50 passed (46.0%)

Statistics by Problem ID

Test Case 0: 3/30 passed (10.0%)

Test Case 1: 8/30 passed (26.67%)

Test Case 2: 7/30 passed (23.33%)

Test Case 3: 18/30 passed (60.0%)

Test Case 4: 15/30 passed (50.0%)

TiānshūBench Results Chart

Qwen3 14B (ollama/qwen3:14b) takes the lead here, with a 46% pass rate on the tests.

Current Enhancements

TiānshūBench is undergoing active development on GitHub. Current updates include:

  • More tests
  • More supported models
  • Coarse parallelism for performance
  • Network error recovery
  • Multi-shot tests
  • Support for inference via SambaNova and Chutes

These will be applied to the next official benchmark results.

Future Enhancements

  • Performance
    • Finer-grained parallelism in running the test cases
    • Finding the fastest inference providers
  • Adding new and interesting models to the test
  • Checking and fixing tests that fail due to infrastructure and configuration problems
  • More tests across a range of difficulty
  • More randomization of the generated programming languages
    • Randomize remaining tokens and symbols
    • Add the ability to add or remove language features
  • Improved programming language documentation to further reduce ambiguity

Future blog articles may include the following topics:

  • Issues faced setting up a local LLM
  • Working with Mamba to create unique programming languages
  • Pytest setup and operation
  • Common stumbling blocks when benchmarking LLMs

How You Can Help

Follow JeepyTea on Twitter/X for updates on the latest TiānshūBench news. I’m also looking for contributors on GitHub to help out with coding tasks and benchmark problems. Finally, if you’ve got access to powerful LLMs via API or are willing to contribute inference credits, please let me know!