How Does AI Use the Data You Give It?
By: Jake Gray
The development and public release of generative Artificial Intelligence (AI) models has introduced a wave of work-complementing and work-supplementing tools that let individuals and organizations streamline or expand their workflows.
Generative AIs are algorithms, most prominently large language models (LLMs), trained on massive amounts of input data to produce original text or image outputs by predicting the next most probable word or pixel in response to a prompt. In so doing, LLMs let individuals expedite work processes, especially tedious or time-consuming ones. In general, these tools are trained on information from publicly accessible websites as well as other data sources.
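To make “predicting the next most probable word” concrete, the sketch below shows the core generation loop in miniature. Everything in it is a toy assumption: the five-word vocabulary, the fake_logits stand-in for a trained network, and the greedy decoding strategy. Real LLMs score tens of thousands of tokens with billions of learned parameters.

```python
import math
import random

# Toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["the", "data", "model", "privacy", "."]

def fake_logits(context):
    # Stand-in for a trained neural network: deterministic pseudo-scores
    # for each vocabulary token given the context so far.
    rng = random.Random(hash(tuple(context)) % (2**32))
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, steps=5):
    # Repeatedly append the single most probable next token (greedy decoding).
    context = prompt.split()
    for _ in range(steps):
        probs = softmax(fake_logits(context))
        context.append(VOCAB[probs.index(max(probs))])
    return " ".join(context)

print(generate("the model"))
```

However trivial the scoring function here, the loop itself is faithful: the model only ever chooses the next token from a probability distribution shaped by its training data, which is why whatever data went into training can surface in its outputs.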
The term “other data sources” is of critical importance here. In some cases, it may be input data from users of the AI platform. In other cases, it is licensed proprietary data that the AI company purchased from another company, as many technology companies collect and sell their users’ data without making that intent obvious to users. [1] A few natural concerns arise here for both individuals and organizations: Where does the data come from? Whose is it? Are they using mine? If I use the platform, will they train the AI on my information? Do they, then, own my data, or do I retain ownership?
And these concerns are legitimate, as a few recent class-action lawsuits illustrate.
Data Lawsuits and AI Platforms
In June 2023, the Clarkson Law Firm, based in California, filed a class-action lawsuit against OpenAI for using allegedly “stolen” private information, including personal identifying information, from “hundreds of millions of internet users including children of all ages, without their informed consent or knowledge” to train ChatGPT. [2] The suit claims that OpenAI did so secretly and “without registering as a data broker” under applicable California law.
The next day, another class-action suit was filed against OpenAI in California. The Joseph Saveri Law Firm filed on behalf of a class of plaintiffs seeking compensation for damages and an injunction to prevent future harms because OpenAI allegedly used the plaintiffs’ books to train ChatGPT. The lawsuit seeks to prevent AI from replacing the authors “whose stolen works power these AI products with whom they are competing” and to ensure that the products follow the same rules as other technologies and people.
In Authors Guild v. OpenAI Inc., filed in September 2023 in the U.S. District Court for the Southern District of New York, the Authors Guild brought a class action against OpenAI on behalf of authors such as George R.R. Martin, John Grisham, and Jodi Picoult. [3] The Authors Guild seeks an injunction to block OpenAI from continuing to use the authors’ works to train ChatGPT, as well as unspecified monetary damages and a financial penalty of up to $150,000 per infringed work.
Two similar lawsuits were filed against Meta and OpenAI for the same reason in July, on behalf of comedian/actress Sarah Silverman and two other authors. [4] Both were filed in the U.S. District Court for the Northern District of California and seek class-action status as well as unspecified monetary damages.
In a turn toward the music industry, Universal Music Group and other major record labels filed a lawsuit against Anthropic, creator of the AI product Claude, for using Claude to distribute copyrighted music lyrics without a licensing deal. [5] The publishers demand that Anthropic pay the publishers’ legal fees, publicly disclose how Claude was trained, destroy all infringing material, and pay a financial penalty of up to $150,000 per infringed work.
Privacy Concerns Around User Information and Inputs
The three major, publicly available generative text AI platforms are OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Bard, each of which discloses in its terms of service how and to what extent user information and user inputs, or prompts, are used to further train its models.
Generally, users retain ownership and intellectual property rights over the information they submit as prompts to the AI and over the content the AI produces in response. Further, following the Digital Millennium Copyright Act, each company generally provides a means by which an individual or entity can notify it of copyright infringement. But some companies, such as OpenAI and Google, reserve the right to use user data for their own benefit, particularly to train their AI algorithms or improve other aspects of their business. So, while a user’s submission of proprietary data does not necessarily endanger their ownership of that data, it is likely that the data will be leveraged by the platform to improve the LLM and/or be incorporated into the built-in information base from which outputs draw. This latter function means that personal, proprietary, or otherwise sensitive data you input as a prompt could be used in responses to other people.
For instance, OpenAI states that it may use content submitted to ChatGPT or other OpenAI services like DALL-E to improve LLM performance, including the user’s prompts and other content such as images and files. [6] Further, OpenAI explicitly states that user conversations may be reviewed by OpenAI’s trainers to improve the LLM’s data sets, meaning user inputs can become part of the LLM’s built-in information base. OpenAI asserts that it “takes steps to reduce the amount of personal information in [its] training datasets before they are used to improve the models,” though it does not specify how this is done or to what extent. [7]
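OpenAI does not disclose how it reduces personal information, but a crude version of that kind of pre-processing might look like the hypothetical sketch below, which scrubs obvious identifiers from text before it is reused. The patterns and the redact helper are illustrative assumptions, not OpenAI’s actual pipeline; production systems typically pair rules like these with machine-learned entity detection.

```python
import re

# Hypothetical PII-reduction pass -- NOT OpenAI's actual method.
# Each pattern targets one obvious class of personal identifier.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace every match with a typed placeholder so the surrounding
    # text remains usable without the identifier itself.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

sample = "Reach Jane at jane.doe@example.com or 555-867-5309."
print(redact(sample))
# Prints: Reach Jane at [EMAIL REDACTED] or [PHONE REDACTED].
```

Note how much slips through even in this illustration: the name “Jane” survives, as would addresses, account numbers, or anything else the patterns do not anticipate, which is one reason an unspecified promise to “reduce” personal information warrants healthy skepticism.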
Google’s Terms of Service and its “Generative AI Additional Terms of Service” permit Google to retain a license to use content from users, an arrangement to which the user presumably consents by virtue of using Google’s services and platforms. The license allows Google to: host, reproduce, distribute, communicate, and use user content; publish or display content if the user has made it visible to others; modify and create derivative works based on user content; and sublicense these rights to other users and Google’s contractors. The license is intended to, among other things, improve Google’s services, develop new technologies and services for Google, and use content shared publicly by the user to promote Google’s services.
According to Bard’s terms, when users interact with Bard, a variety of information about the interaction is stored for 18 months under the default settings, such as the conversation, the user’s location and IP address, any feedback, and other usage information. Additionally, conversations may be subject to human review. As a result, Google warns that confidential information should not be entered into Bard, given the possibility of review by a Google employee. [8]
Anthropic’s Claude is a notable outlier here; by default, conversations and prompts are not used to train its LLMs. Anthropic’s Terms of Service state that it does not use inputs users submit to Claude to train its algorithms, except when users specifically opt in by explicitly providing feedback or when inputs are flagged for trust and safety review to detect and enforce violations of Anthropic’s Acceptable Use Policy. [9] This is not to say that Anthropic may not use Claude conversations to maintain or improve Claude and other services (it can), but it will not do so in a way that compromises the security of sensitive data input into Claude. Even under the two delineated exceptions, only “a limited number of staff members have access to conversation data and they will only access this data for explicit business purposes.” [10]
As such, a user’s prompts won’t be, or at least shouldn’t be, in Claude’s information base unless the information is also publicly available or was entered earlier in the same conversation. In other words, what happens in your conversation with Claude stays in your conversation with Claude, unless you open up the conversation yourself.
When a potential violation of the Acceptable Use Policy is flagged for review, Anthropic trains its classifiers to make its services safer using data inferred from or referred to in the input, which does not mean that the input data itself is used to train Claude.
ChatGPT and Bard do, however, give users the option to “opt out” of having their data analyzed and used to train the program’s LLM.
OpenAI provides Data Controls settings in which users may toggle whether their data is incorporated into the LLM’s data set. Further, ChatGPT Enterprise, a version of the platform tailored for businesses, entirely excludes business data and user conversations from model training and prohibits any learning from the business’s usage.
Similarly, Bard gives users the option to “turn off Bard Activity,” which ensures that “future conversations won’t be sent for human review or used to improve Google’s generative machine-learning models.”[11] Users are also able to delete previous Bard data or activity, but if such data was reviewed or annotated by human reviewers, then Google retains access to those portions of the data as well as related data such as the user’s language, device type, location, or feedback information, all of which can be stored for up to three years.[12]
Conclusion
Considering the ongoing arms race between technology companies to achieve AI-industry hegemony, users ought to carefully review the practices and policies such companies have put in place to protect (or not) their users’ data. With the rise of generative AI platforms comes the potential for violations of data privacy, as LLMs require massive amounts of data to function well. As such, the need for data-privacy hygiene has never been higher for users of AI platforms, lest they find their proprietary, confidential, or otherwise sensitive data distributed to millions of users.
[1] Their legal right to do so may be buried in their terms of service, which few people, if any, read.
[2] https://clarksonlawfirm.com/wp-content/uploads/2023/06/0001.-2023.06.28-OpenAI-Complaint.pdf
[5] https://www.scribd.com/document/678529515/UMG-v-Anthropic
[6] https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq
[7] Ibid.
[8] https://support.google.com/bard/answer/13594961?hl=en&sjid=9793988515864558164-NA
[11] https://support.google.com/bard/answer/13594961
[12] Ibid.