Beyond generic benchmarks: How YourBench lets enterprises evaluate AI models against real data




Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation metric.

However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it is harder to evaluate how well an agent or model actually understands their specific needs.

Model repository Hugging Face has launched YourBench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.

Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X, saying the feature offers "custom benchmarking and synthetic data generation from any of your documents. It's a big step towards improving how model evaluations work."

He added, "In many use cases, what really matters is how well a model performs a specific task. YourBench lets you evaluate models on what is important to you."

Creating a custom evaluation

In a paper, Hugging Face explained that YourBench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark.

Organizations need to preprocess their documents before YourBench can work. This involves three stages (a rough sketch of the pipeline follows the list):

  • Document ingestion, which "normalizes" file formats.
  • Semantic chunking, which breaks documents down to fit context-window limits and focus the model's attention.
  • Document summarization.
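
Hugging Face describes these stages at a conceptual level rather than as a specific API, so the snippet below is only a rough Python sketch of what such a pipeline could look like. The function names, the character-based chunk limit and the `llm` callable are illustrative assumptions, not YourBench's actual code.

```python
# Minimal, illustrative preprocessing sketch -- NOT the actual YourBench API.
# Function names, the chunk-size limit and the `llm` callable are assumptions.
from pathlib import Path


def ingest_document(path: Path) -> str:
    """Normalize a source file into plain text (here we assume it already is)."""
    # A real pipeline would first convert PDFs, DOCX, HTML, etc. into text.
    return path.read_text(encoding="utf-8")


def chunk_semantically(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into pieces small enough for a model's context window."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


def summarize(chunk: str, llm) -> str:
    """Ask an LLM of your choice for a short summary of one chunk."""
    return llm(f"Summarize the following passage in two sentences:\n\n{chunk}")
```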

Next comes a question-and-answer generation process, in which questions are created from the information in the documents. This is where users bring in the LLMs of their choice and see which one answers the questions best.
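
Continuing the same illustrative sketch, question generation and model comparison might look roughly like this, with a separate "judge" model grading each candidate's answer against the source chunk. The prompts, the 0-10 scoring scale and the function names are assumptions for illustration, not YourBench's implementation; in a real run, `candidates` would map model names to API clients for the models being compared.

```python
# Illustrative continuation of the sketch above -- prompts, scoring scale and
# the judge-model approach are assumptions, not YourBench's implementation.
def generate_questions(chunk: str, summary: str, llm) -> list[str]:
    """Draft benchmark questions grounded in a single document chunk."""
    prompt = (
        "Using only the passage below, write three questions that test "
        "understanding of its content, one per line.\n\n"
        f"Summary: {summary}\n\nPassage:\n{chunk}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def compare_models(question: str, chunk: str, candidates: dict, judge) -> dict:
    """Ask each candidate model the question and have a judge model grade it."""
    scores = {}
    for name, model in candidates.items():
        answer = model(question)
        verdict = judge(
            f"Passage:\n{chunk}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
            "Rate the answer's factual accuracy from 0 to 10. "
            "Reply with a number only."
        )
        scores[name] = float(verdict.strip())
    return scores
```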

Hugging Face tested YourBench with DeepSeek V3 and R1, Alibaba's Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral Small 3.1, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o mini and o3-mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.

According to Shashidhar, Hugging Face also ran a cost analysis of the models, finding that Qwen and Gemini 2.0 Flash "create incredible value at a very low cost."

Compute limitations

However, creating custom LLM benchmarks based on an organization's internal documents comes at a cost. YourBench requires significant compute power to run. Shashidhar said on X that the company is "adding capacity" as fast as it can.

Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat has reached out to Hugging Face about YourBench's compute usage.

Benchmarks aren't perfect

Benchmarks and other evaluation methods give users a sense of how well models perform, but they do not fully capture how the models will work day to day.

Benchmark tests show a model's limitations, and some have expressed skepticism that they could lead to false conclusions about its safety and performance. One study also warned that benchmarking agents could be "misleading."

However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders need to justify the rising costs of using AI models. This has led to a variety of methods for testing a model's performance and reliability.

Google DeepMind introduced FACTS Grounding, which tests a model's ability to generate factually accurate responses based on information from documents. Some Yale and Tsinghua University researchers developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.


