We have written several times about the need for companies to reduce the amount of data that they collect and to get rid of old data. Data minimization lowers the legal, cybersecurity and privacy risks associated with companies having lots of confidential information that they do not need stored on their systems or with their vendors. But just as many companies are starting to comply with new regulatory obligations to get rid of old data, they are also implementing AI and Big Data projects that require large volumes of confidential information, which may complicate their data-minimization efforts. This tension is particularly challenging when considering how long to keep confidential data that was used to train an AI model that is currently in operation.
Regulatory Requirements to Get Rid of Old Data
Data-minimization laws generally provide that nonpublic data should be kept only until it is no longer needed for legitimate business purposes or legal reasons, such as pending litigation or a regulatory requirement. For example, under certain circumstances, the Federal Trade Commission (“FTC”) considers retaining personal data for longer than is necessary for a legitimate business or legal purpose an “unfair” business practice under Section 5 of the FTC Act. Similarly, the New York SHIELD Act requires companies to dispose of private information of New York residents within a reasonable amount of time after it is no longer needed for business purposes. In addition, Section 4 of the California Privacy Rights Act of 2020 (the “CPRA”) prohibits businesses from retaining a consumer’s personal information, or sensitive personal information, for longer than is reasonably necessary, and similar data-minimization requirements exist in the NYDFS cybersecurity rules, the Virginia and Colorado state privacy laws, the GDPR and the UK Data Protection Act 2018.
Data-Minimization Requirements Applied to AI Training Data
The obvious questions that arise when applying these data-minimization requirements to large volumes of confidential company or personal information that were used to train an AI model are: (i) does maintaining the training set while the model is in operation constitute a legitimate business purpose; and (ii) is preserving the training set required by any laws or regulatory guidance?
Taking the second question first, there do not appear to be any specific legal requirements as to how long AI training data must be maintained. Some state insurance regulators, however, have stated that they expect to be able to review data used to develop certain models and to be able to audit the models themselves, which could, under certain circumstances, include the data used to train the models. For example, the Connecticut Insurance Department has stated that it has the authority to require that insurers provide the department with access to data used to build models or algorithms that are included in underwriting filings. Similarly, the Colorado Draft AI Insurance Rules, if adopted, would require life insurance companies operating in Colorado to keep detailed documentation for their models, including a description of the dataset used to train the model.
Factors to Consider in Training Data Retention
As to the first question (whether maintaining AI training data constitutes a legitimate business purpose), there are several factors to consider. In striking a balance between, on the one hand, reducing privacy and cybersecurity risks by getting rid of old sensitive training data and, on the other, having access to an AI model’s training data in case the model’s performance comes under scrutiny, companies should weigh:
- The nature of the training data, including whether it includes large volumes of sensitive personal or biometric data about customers or employees, as well as the jurisdictions in which those people live and where the company deploys these models;
- Any general retention or deletion requirements that apply to the underlying training data (e.g., if sensitive customer personal information is used to train an AI model, is there a regulatory requirement or company policy that such data should be deleted after a certain period of time, irrespective of its use as AI-training data);
- Any cybersecurity, privacy and regulatory risk that may be associated with retaining the training data, including having to look through the data when responding to data-access requests or in responding to discovery requests in litigation;
- The ability to test the performance of a model without access to the full set of training data;
- The benefits and risks of retaining the training data if the model associated with that data were to be the subject of litigation or a regulatory inquiry;
- The storage and processing costs associated with retaining the training data; and
- The administrative and compliance burdens associated with various retention options applicable to the training data.
These factors will apply differently for different types of models. For continuous learning (“CL”) models, which continue to learn after deployment, there may be reasons to retain model outputs for longer periods. Since these models generally evolve, regulators and courts might argue that—in the event of a performance issue or other regulatory concern—the model’s earlier outputs are important to understanding its later performance. And generative AI models pose unique challenges in terms of retaining training data in light of the continuous nature of their training, the extremely large volume of training data and the lack of access that most users have to that training data.
Examples of Data-Retention Frameworks for AI Training Documents
Maintain While Model Operating. A company could preserve the training data for a model until a reasonable time (e.g., one year) after the model has been decommissioned.
One benefit of this approach is that it ensures that the training data will be available for analysis, responding to regulatory inquiries or defending against civil claims for the entire life of the model. Another benefit is the simplicity of the policy, which would make compliance relatively easy. In terms of risks, this approach increases cybersecurity and privacy risks by having the data available for what may be a very long time, which also can involve significant storage costs. In addition, long-term retention of training data could increase burdens associated with data-access and data-deletion requests, as well as litigation discovery.
Maintain Offline or Delete after One Year. Another approach would be to maintain training data for a model until a reasonable time (e.g., one year) after the model is operational, keeping a description of the training data, along with the fields and the relevant metadata, and then either deleting the actual training data or storing it in an offline storage location until the associated model has been fully decommissioned.
The advantages of this approach include reducing cybersecurity and privacy risks and lowering the storage costs associated with retaining the data. Under this approach, an offline copy of the training data would be available in the event that mere descriptions of training data and metadata are not sufficient to respond to regulatory requests relating to the model or to defend against civil claims. Potential downsides include the possibility that storing data offline could complicate privacy audits, as well as data-access and data-deletion requests. Bringing offline data online for regulatory compliance or litigation may be costly, and offline data may still be subject to discovery.
Anonymizing Data after One Year. Another approach involves maintaining training data until a reasonable time (e.g., one year) after the model is operational and then maintaining a description of the training data, along with the fields and the relevant metadata, and storing an anonymized version of the training data until the model has been fully decommissioned.
The primary advantage of this approach is that anonymized data, along with descriptions of training data and its metadata, may be sufficient to meet regulatory requirements and to support defenses against civil claims, while simultaneously reducing privacy and cybersecurity risks, especially because anonymized data is generally not subject to privacy and data-minimization requirements. On the other hand, it may be difficult and costly for some companies to satisfy the standards for de-identification that are necessary to avoid privacy obligations. This more complicated retention policy could also make consistent compliance across the organization more challenging. Finally, there is some risk that anonymized data, descriptions of training data and metadata together may not be sufficient to meet all regulatory requirements or support defenses against civil claims.
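For teams that want to encode one of these frameworks in a data-retention tool, the three approaches described above can be sketched as a simple decision function. This is an illustrative sketch only, not a recommended or legally sufficient implementation: the names `Policy`, `Action`, `Model` and `retention_action` are hypothetical, the 365-day grace period simply mirrors the "e.g., one year" examples, and the handling of a model retired within its first year is an assumption that the text does not address.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Policy(Enum):
    """The three example frameworks (hypothetical labels)."""
    MAINTAIN_WHILE_OPERATING = "maintain while model operating"
    OFFLINE_OR_DELETE = "maintain offline or delete after one year"
    ANONYMIZE = "anonymize after one year"


class Action(Enum):
    RETAIN_ONLINE = "retain full training data online"
    ARCHIVE_OR_DELETE = "keep description and metadata; move data offline or delete it"
    KEEP_ANONYMIZED = "keep description, metadata and an anonymized copy"
    DISPOSE = "dispose of any remaining training data"


# Assumption: the "reasonable time (e.g., one year)" is modeled as 365 days.
GRACE = timedelta(days=365)


@dataclass
class Model:
    operational_on: date
    decommissioned_on: Optional[date] = None  # None while still in operation


def retention_action(policy: Policy, model: Model, today: date,
                     grace: timedelta = GRACE) -> Action:
    """Return what the chosen framework calls for on a given date."""
    retired = model.decommissioned_on is not None and today >= model.decommissioned_on

    if policy is Policy.MAINTAIN_WHILE_OPERATING:
        # Full data stays until the grace period after decommissioning ends.
        if model.decommissioned_on is not None and today > model.decommissioned_on + grace:
            return Action.DISPOSE
        return Action.RETAIN_ONLINE

    # Both remaining frameworks keep the full data for the first grace period.
    # Assumption: this holds even if the model is retired within that period.
    if today <= model.operational_on + grace:
        return Action.RETAIN_ONLINE
    if retired:
        return Action.DISPOSE
    if policy is Policy.OFFLINE_OR_DELETE:
        return Action.ARCHIVE_OR_DELETE
    return Action.KEEP_ANONYMIZED


if __name__ == "__main__":
    live = Model(operational_on=date(2023, 1, 1))
    for p in Policy:
        print(p.value, "->", retention_action(p, live, date(2024, 6, 1)).value)
```

A function like this does not decide whether retention is lawful in a particular jurisdiction. It only makes the chosen policy explicit and testable, so that scheduled disposal jobs apply it consistently. Any legal hold or general deletion requirement that applies to the underlying data would need to be layered on top.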
The cover art used in this blog post was generated by DALL-E.