In February 2024, Reddit signed a $60 million deal with Google, allowing the search giant to train its artificial intelligence models on data from the platform. Notably absent from the discussions were Reddit users, whose data was being sold.
The deal reflects the reality of the modern internet: Large tech companies own almost all of our online data and can decide what to do with it. Unsurprisingly, many platforms monetize their data, and the fastest-growing way to do so today is to sell it to AI companies, which are themselves large tech companies that use the data to train ever more powerful models.
The decentralized platform Vana, which began as a class project at MIT, is on a mission to return power to users. The company has created a fully user-owned network where individuals can upload their data and control how it is used. AI developers can pitch new model ideas to users, and if users agree to provide their data for training, they receive proportional ownership of the model.
The idea is to give everyone a stake in the AI systems that will increasingly shape our society, while also unlocking new pools of data to advance the technology.
“This data is needed to create better AI systems,” says Vana co-founder Anna Kazlauskas ’19. “We have created a decentralized system to get better data, which today sits inside the largest tech companies, while still letting users keep ultimate ownership.”
From economics to blockchain
Many high school students have photos of pop stars or athletes on their bedroom walls. Kazlauskas had a photo of former U.S. Treasury Secretary Janet Yellen.
Kazlauskas was sure she would become an economist, but she ended up being one of five students to join the MIT Bitcoin Club in 2015. That experience led her into the world of blockchains and cryptocurrency.
From her dorm room in MacGregor House, she began mining the cryptocurrency Ethereum. She even occasionally scoured campus dumpsters in search of discarded computer chips.
“I was interested in everything about computer science and networking,” says Kazlauskas. “From a blockchain perspective, that involved decentralized systems and the way they could shift economic power to individuals, as well as artificial intelligence and econometrics.”
Kazlauskas met Art Abal, who was then attending Harvard University, in the former Media Lab class Emergent Ventures.
“Our question was: How could you have a large number of people contributing to these AI systems using more of a distributed network?” Kazlauskas recalls.
Kazlauskas and Abal were trying to address the status quo, in which most models are trained by scraping public data from the internet. Large tech companies often buy large datasets from other companies as well.
The founders’ approach has evolved over the years and was informed by Kazlauskas’ experience working at the financial blockchain company Celo after graduation. But Kazlauskas credits her time at MIT with helping her think about these problems, and Ramesh Raskar, the instructor for Emergent Ventures, still helps Vana think through AI research questions today.
“It was great to have an open-ended opportunity to just build, hack, and explore,” says Kazlauskas. “I think that spirit at MIT is really important. It’s just about building things, seeing what works, and keeping at it.”
Today, Vana takes advantage of little-known laws that allow users of most large technology platforms to export their data directly. Users can upload that information to encrypted digital wallets in Vana and contribute it to train models as they see fit.
AI engineers can propose ideas for new open-source models, and people can pool their data to help train them. In the blockchain world, these data pools are called data DAOs, where DAO stands for decentralized autonomous organization. The data can also be used to create personalized AI models and agents.
Vana uses data in a way that preserves user privacy, because the system does not expose identifiable information. Once a model is created, users retain ownership, so each time it is used, they are rewarded in proportion to how much their data helped train it.
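As a rough illustration of what a proportional reward could look like, here is a minimal sketch. The article does not describe how Vana measures a user's contribution or pays out rewards, so everything here is an assumption: the function name `split_reward`, the user IDs, and the use of a simple record count as the contribution weight are all hypothetical.

```python
from fractions import Fraction


def split_reward(contributions: dict[str, int], reward: Fraction) -> dict[str, Fraction]:
    """Split `reward` among contributors in proportion to their weights.

    `contributions` maps a hypothetical user ID to a non-negative weight,
    for example the number of records that user supplied. How contribution
    is actually measured is not stated in the article; this is an assumption.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    # Exact rational arithmetic, so the shares always sum to the full reward.
    return {
        user: reward * Fraction(weight, total)
        for user, weight in contributions.items()
    }


# Illustrative numbers only, not real Vana data.
pool = {"alice": 600, "bob": 300, "carol": 100}
payouts = split_reward(pool, Fraction(50))
for user, amount in payouts.items():
    print(f"{user}: {float(amount):.2f}")
```

With these made-up weights, a user who supplied 60 percent of the pool would receive 60 percent of each reward. A real system might weight by data quality or usefulness rather than raw volume.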
“From a developer’s perspective, you can now build these hyper-personalized health applications that take into account exactly what you ate, how you slept, and how you exercise,” says Kazlauskas. “Those applications are not possible today because of the walled gardens of the big tech companies.”
Crowdsourced, user-owned AI
Last year, a machine-learning engineer proposed using Vana user data to train an AI model that could generate Reddit posts. More than 140,000 Vana users contributed their Reddit data, which included posts, comments, messages, and more. Users decided the terms under which the model could be used, and they maintained ownership of the model after it was created.
Vana has enabled similar initiatives using user-controlled data from the social media platform X, sleep data from sources like Oura rings, and more. There are also collaborations that combine data pools to create broader AI applications.
“Suppose users have Spotify data, Reddit data, and fashion data,” Kazlauskas explains. “Normally, Spotify isn’t going to collaborate with those types of companies, and there is actually regulation against that. But users can do it if they grant access, so these cross-platform datasets can be used to create very powerful models.”
Vana has over 1 million users and over 20 live data DAOs. More than 300 additional data pools have been proposed by users on Vana’s system, and Kazlauskas says many will go into production this year.
“I think there is a lot of promise in generalized AI models, personalized medicine, and new consumer applications, because it’s difficult to combine all of that data or get access to it in the first place,” says Kazlauskas.
Data pools allow groups of users to accomplish something that even the most powerful tech companies struggle with today.
“Today, the biggest tech companies have built these data moats, so the best datasets are not available to anyone,” says Kazlauskas. “It’s a collective action problem: my data on its own isn’t that valuable, but a data pool of tens of thousands or millions of people is really valuable. Vana allows those pools to be built.”