Get It Built

Today, we're introducing TradesBench, a new benchmark that evaluates large language models (LLMs) on real-world construction and trade questions. From locksmithing to decorating, TradesBench poses practical queries sourced from the internet, each paired with an expert-verified reference answer. Comparing each model's generated response against its reference answer lets us assess how well LLMs handle real-world trade scenarios.

TradesBench uses LLMs as judges, evaluating the accuracy and relevance of model-generated responses against the reference expert answers.
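
To make the judging step concrete, here is a minimal sketch of how an LLM-as-judge loop could look. It is not TradesBench's actual implementation: the prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT`, `Item`, `parse_score`, `evaluate`, and `stub_judge` are all assumptions made for illustration. The judge is passed in as a plain function so that any model client can be plugged in, and a fixed-reply stub stands in for a real model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical prompt; the actual TradesBench judge prompt is not given in the post.
JUDGE_PROMPT = (
    "You are grading an answer to a trades question.\n"
    "Question: {question}\n"
    "Expert reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Rate the candidate's accuracy and relevance against the reference "
    "on a scale of 1 to 5. Reply in the form 'Score: <n>'."
)


@dataclass
class Item:
    question: str   # practical trade question
    reference: str  # expert-verified reference answer
    candidate: str  # response generated by the model under test


def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def evaluate(items: List[Item], judge: Callable[[str], str]) -> float:
    """Return the mean judge score over all items with a parseable reply."""
    scores = []
    for item in items:
        prompt = JUDGE_PROMPT.format(
            question=item.question,
            reference=item.reference,
            candidate=item.candidate,
        )
        score = parse_score(judge(prompt))
        if score is not None:  # skip replies the parser cannot read
            scores.append(score)
    if not scores:
        raise ValueError("judge produced no parseable scores")
    return sum(scores) / len(scores)


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same score."""
    return "Score: 4"


if __name__ == "__main__":
    sample = [
        Item(
            question="How do I stop a door latch from sticking?",
            reference="Check strike plate alignment and lubricate the latch.",
            candidate="Lubricate the latch and realign the strike plate.",
        )
    ]
    print(evaluate(sample, stub_judge))
```

In practice `stub_judge` would be replaced by a function that sends the prompt to a judge model and returns its text reply. Keeping the score parser strict means malformed replies are dropped rather than silently counted as a score.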
