Google Slapped With A Lawsuit For 'Secretly Stealing' Data To Train Bard

Another lawsuit involving AI and data theft. Credit: Getty Images

A California law firm has filed a class-action lawsuit against Google for “secretly stealing” vast amounts of data from the web to train its AI technologies.

Clarkson Law Firm is suing the tech giant for negligence, invasion of privacy, larceny, copyright infringement, and profiting from personal data that was illegally obtained. “Google has taken all our personal and professional information, our creative and copywritten works, our photographs, and even our emails—virtually the entirety of our digital footprint—and is using it to build commercial Artificial Intelligence (‘AI’) Products like ‘Bard,'” said the complaint, which was filed on July 11 in the Northern District of California.

The lawsuit comes on the heels of Google quietly updating its privacy policy last week to state that any public information can be used to train its AI products like Bard. Google is essentially saying anything published on the web is fair game, but the law firm argues that scraping data without compensation or consent, for the express purpose of training AI models, is a massive invasion of privacy. The lawsuit alleges that Google, a multibillion-dollar company with over a billion users worldwide, is putting users in an “untenable” position: “either use the internet and surrender all your personal and copyrighted information to Google’s insatiable AI models — or avoid the internet entirely.”

In a statement to Reuters, Google general counsel Halimah DeLaine Prado called the claims “baseless,” saying, “we use data from public sources — like information published to the open web and public datasets – to train the AI models behind services like Google Translate, responsibly and in line with our AI Principles.”

Recently, Clarkson filed a similar class-action lawsuit against OpenAI, the company that created ChatGPT, for “theft and misappropriation of personal data,” alleging the same kind of data-scraping operation. Both Bard and ChatGPT rely on large language models, which need huge amounts of training data to make the chatbots conversational and intelligent. That appetite for data has raised concerns about the use of private data as well as copyright infringement.

The most recent lawsuit says Google has misappropriated datasets like the Common Crawl, which is maintained by a non-profit that makes its data free for research and education purposes, as well as data from sites like Medium and Kickstarter. Google also uses its own data from Gmail and Google Search to feed its models. Other scraped data includes copyrighted works, such as e-books from digital libraries and even from piracy websites, that the company is using without compensating artists and authors.

The key to Clarkson’s lawsuit is the issue of public domain. But, “‘publicly available’ has never meant free to use for any purpose,” the complaint said. Yes, some data is free or available to purchase, but whether it can be used depends on the context of its use and on user consent. Yes, users consent to privacy policies when they publish content on the web, but they have a right to know if it’s being used somewhere else. In other words, Clarkson says, “Google must understand, once and for all: it does not own the internet.”

Cecily is a tech reporter at Mashable who covers AI, Apple, and emerging tech trends. Before getting her master’s degree at Columbia Journalism School, she spent several years working with startups and social impact businesses for Unreasonable Group and B Lab. Before that, she co-founded a startup consulting business for emerging entrepreneurial hubs in South America, Europe, and Asia. You can find her on Twitter at @cecily_mauran.
