The different components of the ChemBench framework. The frameworks foundation is the benchmark corpus comprising thousands of questions and answers that we manually or semi-automatically compiled from various sources in a format based in the one introduced in the BIG-bench benchmark (Extended Data Fig. 1 ). Questions are classified on the basis of topics, required skills (reasoning, calculation, knowledge and intuition) and difficulty levels. We then used this corpus to evaluate the performance of various models and tool-augmented systems using a custom framework. To provide a baseline, we built a web application that we used to survey experts in chemistry. The results of the evaluations are then compiled in publicly accessible leaderboards (Supplementary Note 15 ), which we propose as a foundation for evaluating future models.