How your content might find itself in a training database
I recently attended a writers conference, and a hot topic among attendees was copyright and AI. However, they weren’t asking about the copyrightability of AI-generated content or whether AI-generated content might contain someone else’s copyrightable material. Their big concern was whether their own content was already part of an AI training database, or could end up in one. It’s an excellent question, so let’s cover how this might happen.
1. Your content was collected into a data aggregator. Some companies collect data from several different sources and pool all this data into a big database. They then sell this data to other companies. Some AI companies (not all, mind you) purchased licenses to copy all this data into their training databases. So if your content made it into one of these databases, it would have been incorporated into the AI program. There are places online that will tell you which databases were copied by which AI companies; you may even be able to tell if your content is in a specific database.
2. Your content was added to an AI training database. This is most likely to occur when you are hired to write something, e.g., you’re paid to write a blog or magazine article, but it could also happen with a publisher. Essentially, the company takes your content and adds it to an AI training database to further its own goals. To combat this practice, the Authors Guild encourages all writers to have a provision in their contracts stating that their content cannot be used to train an AI program. There are some model provisions on the Authors Guild website that can be used for this purpose. (If you’re concerned about asking for this provision, many companies now include a provision stating that the author cannot use AI to write content for them, so asking for this in return would be a fair exchange.)
3. You give your content to an AI training database. Some AI companies add all user inputs to their training databases, so every time you ask the program a question or give it some content to critique, you are giving it more content to train on. This is not true of all AI programs, and a little due diligence will tell you whether a specific program does this.
Keep in mind that different AI companies have different policies. Some AI companies built their databases using only their own content, and others have shown varying levels of care in what content they use for training their programs. With the legal landscape shifting against the use of copyrighted content to train AI, more companies are adding ways to “opt out” of having your content in their program. Change is coming quickly, so stay tuned for further updates.
In the meantime, if you want to learn more about copyright and AI, you are welcome to email me at kaway@kawaylaw.com.
