Creating training datasets that reflect lived experience

Creating training datasets that reflect lived experience is essential for ensuring that AI systems are inclusive, empathetic, and applicable to real-world scenarios. It requires going beyond abstract or theoretical data and embracing a more nuanced approach to data collection. Below are key considerations for curating such datasets.

1. Inclusive Data Collection

Lived experiences encompass a wide range of identities, circumstances, and contexts. To reflect these experiences, it’s crucial to collect data from diverse groups. This includes:

Demographic Diversity: Ensure the dataset includes individuals from various age groups, genders, races, ethnicities, sexual orientations, and socio-economic backgrounds.
Geographical Diversity: Include data from different regions, recognizing that lived experiences may differ based on location, culture, and societal structures.
Disability Representation: Represent individuals with physical, mental, and cognitive disabilities to ensure AI systems are accessible and equitable.
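
One way to check the dimensions above is to measure how each group is represented before training begins. The sketch below is a minimal illustration, not a prescribed method: the function name `coverage_report`, the `region` field, and the 10% threshold are all assumptions made for the example, and a real project would choose its thresholds together with the communities involved.

```python
from collections import Counter


def coverage_report(records, attribute, min_share=0.05):
    """Count each value of `attribute` and flag values below `min_share`.

    Records that lack the attribute are grouped under "unknown" so that
    missing metadata is visible rather than silently dropped.
    """
    counts = Counter(record.get(attribute, "unknown") for record in records)
    total = sum(counts.values())
    return {
        value: {"count": n, "share": n / total, "flagged": n / total < min_share}
        for value, n in counts.items()
    }


# Hypothetical sample: 18 urban records, 1 rural record, 1 with no region.
sample = [{"region": "urban"}] * 18 + [{"region": "rural"}] + [{}]

report = coverage_report(sample, "region", min_share=0.10)
for value, stats in report.items():
    print(value, stats)
```

A flagged group is a prompt to revisit recruitment, not a reason to duplicate or synthesize records for that group.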

2. Engage Communities in Data Creation

A top-down approach to data collection risks oversimplifying or misrepresenting the experiences of marginalized or underrepresented groups. Involving these communities directly in the creation of datasets leads to richer, more authentic data. Methods include:

Participatory Research: Involve community members as active partners in the research process. This ensures their lived experiences are accurately captured and prevents the imposition of external viewpoints.
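Crowdsourcing: Use crowdsourcing platforms to let people from varied backgrounds contribute their own experiences. Crowdsourcing alone does not guarantee a representative dataset, because contributors may skew toward particular groups, so check who actually took part.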
Interviews and Surveys: Conduct qualitative research that asks open-ended questions to capture deeper personal stories and insights from individuals about their experiences.

3. Contextual Sensitivity

Lived experiences are shaped by context. When collecting data, account for:

Environmental Factors: The specific environments that affect behavior and decision-making, such as urban vs. rural settings, or corporate vs. family dynamics.
Temporal Factors: Experiences change over time, and historical context matters. For instance, people’s views and behaviors may differ before and after the pandemic.

4. Data Annotation with Care

Lived experiences often involve emotional, personal, and cultural nuances. Careful and thoughtful data annotation ensures that these complexities are captured without stripping away their depth:

Contextual Tags: Use tags or labels that capture the subtleties of lived experience, like “resilience,” “fear,” “hope,” or “empowerment.”
Bias Minimization: Train annotators to be aware of their own biases and ensure the annotations remain true to the original context and intent of the experience.
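
One possible way to spot accounts that annotators read differently is to compare the tags two annotators assigned to the same item and send low-overlap items back for discussion. The sketch below uses Jaccard overlap for that comparison. This is an assumption chosen for the example, not a method the text prescribes, and the names `jaccard` and `flag_for_review`, the item IDs, and the 0.5 threshold are hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Jaccard overlap of two tag collections (1.0 means identical sets)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 1.0  # two empty tag sets agree trivially
    return len(a & b) / len(a | b)


def flag_for_review(annotations, threshold=0.5):
    """Return IDs of items whose two annotators overlap less than `threshold`.

    `annotations` maps item_id -> (tags from annotator 1, tags from annotator 2).
    """
    return sorted(
        item_id
        for item_id, (first, second) in annotations.items()
        if jaccard(first, second) < threshold
    )


# Hypothetical annotations using the example tags from the text.
annotations = {
    "story-1": ({"resilience", "hope"}, {"hope", "resilience"}),
    "story-2": ({"fear"}, {"empowerment", "hope"}),
    "story-3": ({"hope", "fear"}, {"hope"}),
}

needs_review = flag_for_review(annotations)
print(needs_review)
```

Low overlap does not show which annotator is wrong. It marks an item where the original context and intent deserve a second look, ideally with input from the person who shared the experience.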

5. Ethical Considerations

Gathering data based on lived experiences requires ethical sensitivity to privacy, consent, and representation:

Informed Consent: All participants should understand how their data will be used and have the ability to withdraw at any time.
Anonymization: Lived experiences are often deeply personal, so anonymizing data to protect individual identities is essential.
Representation Consent: In cases where sensitive or culturally specific experiences are involved, ensure that the groups involved consent to their stories being used for AI training.
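
The consent and anonymization points above can be sketched as a preprocessing step. The example assumes a hypothetical record layout (`participant_id`, `consented_to_training`, `text`) and hypothetical function names. It replaces IDs with a keyed hash, which is pseudonymization rather than full anonymization: the free text itself can still identify someone and needs separate review.

```python
import hashlib
import hmac


def pseudonymize(participant_id, secret_key):
    """Replace an ID with a keyed hash so records stay linkable but not nameable."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_training(records, secret_key, withdrawn_ids=frozenset()):
    """Keep only consented, non-withdrawn records and drop the raw ID."""
    prepared = []
    for record in records:
        if not record.get("consented_to_training", False):
            continue  # no recorded consent: exclude by default
        if record["participant_id"] in withdrawn_ids:
            continue  # participant has withdrawn since consenting
        prepared.append(
            {
                "pid": pseudonymize(record["participant_id"], secret_key),
                "text": record["text"],
            }
        )
    return prepared


# Hypothetical records; in practice the key would come from a secrets store.
key = b"example-key-not-for-production"
raw = [
    {"participant_id": "p-001", "consented_to_training": True, "text": "Story one."},
    {"participant_id": "p-002", "consented_to_training": False, "text": "Story two."},
    {"participant_id": "p-003", "consented_to_training": True, "text": "Story three."},
]

clean = prepare_for_training(raw, key, withdrawn_ids={"p-003"})
print(clean)
```

Treating missing consent as a refusal, and checking the withdrawal list on every run, keeps the dataset in line with a participant's right to withdraw at any time.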

6. Avoiding Stereotyping

While it’s important to reflect diverse lived experiences, it’s equally important to avoid reinforcing harmful stereotypes. Datasets built from lived experience should:

Challenge Biases: Counter stereotypes and misrepresentations rather than amplify them.
Emphasize Agency: Capture moments where individuals have exercised agency or resilience, rather than focusing solely on victimhood or marginalization.

7. Iterative Feedback and Refinement

Once AI systems are deployed, they should be updated regularly based on real-world feedback, so that the model keeps reflecting lived experiences and adapts to changing societal dynamics. Two practices support this:

Real-world Validation: After deploying models, actively engage with communities and users to see how well the AI aligns with their lived experiences.
Continuous Data Updates: Ensure datasets are regularly updated to reflect evolving social, cultural, and economic shifts.

8. Incorporating Intersectionality

Lived experiences are not monolithic but are often shaped by the intersection of multiple identities (e.g., race, gender, class). AI models must account for this complexity in both data collection and analysis:

Multi-dimensional Data: Data collection processes should gather intersecting identities and contextual factors, rather than reducing people to a single category.
Intersectional Analysis: Use data analysis methods that allow for the exploration of how different factors (e.g., race and gender) combine to shape unique experiences.
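
To illustrate why intersectional analysis matters, the sketch below computes an outcome rate per combination of attributes. The data is synthetic and invented for this example, as are the function names and the `judged_accurate` field (standing in for whether a participant judged a model output accurate). It is built so that each single attribute looks balanced while the combinations do not.

```python
from collections import defaultdict


def rates_by_group(records, attributes, outcome_key):
    """Share of records with a truthy outcome, per combination of attributes."""
    tallies = defaultdict(lambda: [0, 0])  # group -> [record count, positive count]
    for record in records:
        group = tuple(record.get(attr, "unknown") for attr in attributes)
        tallies[group][0] += 1
        if record[outcome_key]:
            tallies[group][1] += 1
    return {
        group: {"n": n, "rate": positives / n}
        for group, (n, positives) in tallies.items()
    }


def make_records(gender, race, n, positives):
    """Build n synthetic records; the first `positives` have a positive outcome."""
    return [
        {"gender": gender, "race": race, "judged_accurate": i < positives}
        for i in range(n)
    ]


# Synthetic data for illustration only; not drawn from any real study.
records = (
    make_records("A", "X", 10, 8)
    + make_records("A", "Y", 10, 2)
    + make_records("B", "X", 10, 2)
    + make_records("B", "Y", 10, 8)
)

by_gender = rates_by_group(records, ["gender"], "judged_accurate")
by_both = rates_by_group(records, ["gender", "race"], "judged_accurate")
print(by_gender)
print(by_both)
```

In this constructed example both gender groups show the same overall rate, yet the combined gender and race groups split sharply. That gap is exactly what a single-category analysis would miss.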

9. Collaboration with Experts

For certain domains, like health or mental health, collaborating with domain experts, such as psychologists or sociologists, can help ensure that the datasets accurately reflect lived experiences, especially where complex, subjective, or sensitive data is involved.

Conclusion

Training datasets that reflect lived experience are foundational for creating AI systems that truly serve diverse populations in a human-centered, ethical, and empathetic manner. By involving diverse communities in data collection, embracing ethical considerations, and refining the models iteratively, we can move closer to developing AI that genuinely reflects the complexity of human experiences.
