TiānshūBench (天书Bench) 0.0.1-mini Results Are Here: GPT-5, Claude Opus 4.1!
Note that, as mentioned in the previous blog post, the results for the mini version of the test suite are less accurate than those of the full TiānshūBench suite; they are designed to give a timely sense of which models perform best on fluid intelligence and coding tasks. The correct interpretation of these results is that the top third of models all perform within the same order of magnitude of one another. Check out the earlier blog post for more details on this version of the test.
Models listed with “thinking/” in their names on the chart above were run with a high thinking budget setting, whatever that means for each individual model.
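For readers unfamiliar with thinking budgets: here is a minimal sketch of how one is set via the Anthropic Python SDK, where the budget is an explicit token count. The model ID and budget value below are illustrative only, not the settings used in the benchmark, and other providers expose the knob differently.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-1",  # illustrative model ID
    max_tokens=16000,         # must exceed the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # an illustrative "high" thinking budget
    },
    messages=[
        {"role": "user", "content": "Write a parser for a simple expression grammar."}
    ],
)

# With thinking enabled, the response interleaves thinking blocks and text
# blocks; print only the final text output.
for block in response.content:
    if block.type == "text":
        print(block.text)
```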
Notably, despite the bad early reviews, GPT-5 hangs in near the top of the heap and is certainly more cost-effective than the Claude models.
Keep an eye out for the next version of TiānshūBench, with more and better tests, and the latest models.
As always, if you have LLM compute credits or hardware to donate to help TiānshūBench grow, please let me know!