Project Description: Tackling the AI Mental Health Data Crisis

Executive Summary

This project, led by Philip Resnik at the University of Maryland Computational Linguistics and Information Processing (CLIP) Laboratory with the support of an Amazon Machine Learning Research Award, will build out a computational framework for the voluntary donation of social media and questionnaire data, recruit data donors at scale, and construct a secure, ethically-governed virtual data enclave on AWS that provides AI researchers with both access to the data and the compute cycles they need to make progress.


Mental health problems are among the costliest challenges we face, in both economic and human terms. The World Health Organization has reported that mental illnesses are the leading cause of disability-adjusted life-years worldwide. In the U.S. alone, the numbers are staggering. To cite just a few: between 1996 and 2006, annual expenditures on mental disorders rose from $35.2B to $113B; some 25 million American adults will have an episode of major depression this year; and suicide is the third leading cause of death for people aged 10-14 and the second for people aged 15-34 (Pal 2015, Kliff 2012, NAMI 2014, CDC 2015). The importance of mental health as a topic of research cannot be overstated.

With a problem this important, why has AI failed to demonstrate the kinds of major advances we have seen in other domains, from widely available machine translation to everyday use of personal assistants like Siri and Alexa? When it comes to advancing AI, it’s all about the data. The greatest progress in AI happens when a whole community can work on shared problems and common datasets. As examples, consider the role of the Penn Treebank (the first large syntactically annotated dataset, Taylor et al. 2003) in advancing natural language parsing, of EuroParl (multilingual proceedings of the European Parliament) in feeding machine translation research, or of huge, widely available collections of Amazon reviews in driving the enormous advances in automated sentiment analysis. But natural language processing in healthcare, a solid decade behind other research areas, is still struggling to catch up because most NLP researchers don’t have access to large, shared data collections.

Data donation efforts like Sync for Science have recently begun moving in the right direction by creating a large dataset of electronic health records for research, but they are starting with a heavy focus on structured data (e.g. vital signs, medications, lab test results) rather than tapping into the rich, unstructured clinical language in the records. Moreover, clinical information about mental health only shows up in electronic health records for people who have made contact with a qualified clinician in the first place. That is something we can’t assume in a nation rife with mental health deserts: nearly 112 million Americans live in federally designated mental health provider shortage areas (Bureau of Health Workforce, 2017). Even when a clinician is available, many people don’t realize they need help, or may fear being stigmatized if they seek it. Particularly for mental health, the health record may not be the most important locus of information: relevant symptoms often manifest most clearly in people’s day-to-day lives and social interactions, where social media is likely to be a more valuable window than clinical records.

In this project, we will address these issues and break through the data bottleneck. The central idea is to build on and extend an existing, operating framework for donation of social media and clinical instruments data, to recruit data donors at scale, and to construct a virtual data enclave on AWS that provides AI researchers with both secure access to the data and the compute cycles they need to make progress. The target is a practical, sustainable solution to the crisis-level lack of data for AI research in mental health.