OpenCUA’s Open Source Computer-Use Agents Rival Proprietary Models From OpenAI and Anthropic




A new framework from researchers at the University of Hong Kong (HKU) and collaborating institutions provides an open source foundation for building robust AI agents that can operate computers. The framework, called OpenCUA, includes the tools, data and recipes needed to scale the development of computer-use agents (CUAs).

Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open source models and closely rivaling closed agents from major AI labs such as OpenAI and Anthropic.

The challenge of building computer-use agents

Computer-use agents are designed to autonomously complete tasks on a computer, from navigating websites to operating complex software, and they hold promise for automating enterprise workflows. However, the most capable CUA systems are proprietary, with important details about their training data, architectures and development processes kept private.

“As the lack of transparency limits technical advancements and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations and risks,” the researchers write in the paper.


At the same time, open source efforts face their own set of hurdles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects disclose too little about their methods, making it difficult for others to replicate their work.

According to the paper, “These limitations collectively hinder the advancement of general-purpose CUAs and limit meaningful investigation of their scalability, generalization and potential learning approaches.”

Introducing OpenCUA

The OpenCUA framework. Source: HKU’s XLANG Lab

OpenCUA is an open source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is the AgentNet Tool, an application for recording human demonstrations of computer tasks across different operating systems.

The tool runs in the background on an annotator’s personal computer, streamlining data collection by capturing screen video, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories” that pair a screenshot of the computer (the state) with the user’s corresponding action (a click, keypress, etc.). Annotators can review, edit and submit these demonstrations.
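
To make the format concrete, a state-action trajectory can be pictured as an ordered list of records, each pairing a screenshot (plus the accessibility tree) with the action taken on it. The following Python sketch is illustrative only; the class and field names are assumptions, not OpenCUA’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A single user action recorded by the capture tool."""
    kind: str     # e.g. "click", "keypress", "scroll"
    params: dict  # e.g. {"x": 512, "y": 300} or {"key": "Enter"}

@dataclass
class Step:
    """One state-action pair: what the screen showed, what the user did."""
    screenshot_path: str      # the state: a frame captured at action time
    accessibility_tree: dict  # structured description of on-screen elements
    action: Action

@dataclass
class Trajectory:
    """A full task demonstration as an ordered sequence of steps."""
    task_description: str
    steps: list[Step] = field(default_factory=list)
```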

The AgentNet Tool. Source: HKU’s XLANG Lab

Using this tool, the researchers collected the AgentNet dataset, which includes over 22,600 task demonstrations across Windows, macOS and Ubuntu, spanning more than 200 applications and websites. “This dataset authentically captures the complexity of human behavior and environmental dynamics from a user’s personal computing environment,” the paper states.

Recognizing that screen-recording tools raise serious data privacy concerns for businesses, the researchers designed the AgentNet Tool with security in mind. Xinyuan Wang, a doctoral student at HKU and co-author of the paper, explained that the team implemented a multi-layered privacy protection framework. “First, annotators can fully observe the data they generate before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by large models to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments that handle sensitive customer or financial data,” Wang added.
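
The description above suggests a simple gating pipeline: a demonstration is released only if it survives all three layers of review. Here is a minimal sketch under that assumption; the function names and return fields are hypothetical, as the article specifies only the three stages.

```python
# A minimal sketch of the layered privacy review described above.
# `annotator_approves`, `human_review`, and `model_scan` are hypothetical
# callables standing in for the three stages the article names.
def release_pipeline(demonstration, annotator_approves, human_review, model_scan):
    if not annotator_approves(demonstration):  # layer 1: annotator self-review
        return None
    if human_review(demonstration).has_privacy_issue:  # layer 2: manual check
        return None
    if model_scan(demonstration).flags_sensitive_content:  # layer 3: LLM scan
        return None
    return demonstration  # safe to include in the released dataset
```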

To speed up evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
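
Scoring against multiple correct actions per step could look like the sketch below: a prediction counts as correct if it matches any acceptable ground-truth action for that step. The exact-match rule is an assumption for illustration; AgentNetBench’s real matching logic is not described in the article.

```python
# Offline per-step scoring: each step has a set of acceptable actions,
# and the agent's prediction is correct if it matches any of them.
def step_accuracy(predictions, acceptable_actions):
    """Fraction of steps where the prediction matches an accepted action."""
    hits = 0
    for pred, accepted in zip(predictions, acceptable_actions):
        if any(pred == option for option in accepted):
            hits += 1
    return hits / len(predictions)
```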

A new recipe for training agents

The OpenCUA framework introduces a new pipeline for processing data and training computer-use agents. The first stage transforms raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

OpenCUA’s chain-of-thought data pipeline. Source: HKU’s XLANG Lab

A key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, encompassing planning, memory and reflection. The structured reasoning is organized into three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next step, and finally the concise, executable action. This approach helps the agent develop a deeper understanding of its tasks.
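
As an illustration, a single annotated step might carry all three levels alongside the recorded action. The keys and wording below are assumptions based on the article’s description, not the paper’s actual annotation format.

```python
# A sketch of the three-level reasoning attached to one step.
# Level names follow the article; exact keys and text are assumed.
annotated_step = {
    "observation": "A file-save dialog is open with the filename field focused.",
    "thought": (
        "The task asks for a CSV export. The dialog defaults to .xlsx, "
        "so the file type should be changed before confirming."  # reflect + plan
    ),
    "action": {"kind": "click", "target": "file-type dropdown"},  # concise, executable
}
```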

“[This reasoning] is crucial for a generalizable computer-use foundation model, helping CUAs internalize cognitive abilities,” the researchers write.

This data synthesis pipeline is a general framework that companies can adapt to train agents on their own internal tools. According to Wang, an enterprise can record demonstrations of its own workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap high-performing agents tailored to their internal tools without manually writing the reasoning annotations,” he explained.
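
A rough sketch of that synthesis loop, reusing the hypothetical Trajectory structure from earlier: a “generator” drafts reasoning that explains each recorded action, and a “reflector” reviews the draft before it becomes training data. The `ask_vlm` callable and both prompts are stand-ins, not OpenCUA’s actual pipeline.

```python
# Synthesize CoT annotations for recorded demonstrations, assuming a
# generic VLM client exposed as `ask_vlm(prompt, image=...)`.
def synthesize_reasoning(trajectory, ask_vlm):
    """Attach generated observation/thought annotations to each raw step."""
    annotated = []
    for step in trajectory.steps:
        # "Generator": propose reasoning that explains the recorded action.
        draft = ask_vlm(
            f"Task: {trajectory.task_description}\n"
            f"Action taken: {step.action}\n"
            "Write the observation and the thought that would lead to this action.",
            image=step.screenshot_path,
        )
        # "Reflector": check the draft against the recorded step and revise it.
        revised = ask_vlm(
            f"Does this reasoning correctly justify the action {step.action}? "
            f"If not, rewrite it.\n{draft}",
            image=step.screenshot_path,
        )
        annotated.append({"step": step, "reasoning": revised})
    return annotated
```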

Putting OpenCUA to the test

The researchers applied the OpenCUA framework to train several open source VLMs, including Qwen and Kimi-VL variants, with parameter counts ranging from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks testing their ability to perform tasks and understand GUIs.

The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open source models on the OSWorld benchmark. It also outperformed OpenAI’s GPT-4o-based CUA and significantly narrowed the performance gap with Anthropic’s leading proprietary model.

OpenCUA shows large improvements over base models (left) while competing with leading CUA models (right). Source: HKU’s XLANG Lab

For enterprise developers and product leaders, the study offers several important findings. The OpenCUA method is broadly applicable, improving the performance of models across different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a diverse range of tasks and operating systems.

According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. “For example, the AgentNet dataset already captures demonstrations of launching an EC2 instance on Amazon AWS and configuring annotation parameters in Amazon Mechanical Turk,” he told VentureBeat. “These tasks involve many consecutive steps but follow repeatable patterns.”

However, Wang noted that closing the gap to live deployment requires addressing key safety and reliability challenges. “The biggest challenges in real-world deployment are safety and reliability: agents must avoid mistakes that inadvertently alter system settings or cause harmful side effects beyond the intended task,” he said.

The researchers have released the code, datasets and model weights.

As open source agents built on frameworks like OpenCUA become more capable, they could fundamentally reshape how knowledge workers interact with computers. Wang envisions a future in which proficiency with complex software matters less than the ability to clearly articulate goals to AI agents.

He described two main modes of working: “offline automation, where agents leverage broad software knowledge to carry out tasks,” and “online collaboration, where agents respond in real time and work side by side with humans, like coworkers.” In essence, humans provide the strategic “what,” while increasingly sophisticated AI agents handle the operational “how.”


