With a problem this important, why has AI failed to demonstrate the kinds of major advances we have seen in other domains, from widely available machine translation to everyday use of personal assistants like Siri and Alexa? When it comes to advancing AI, it’s all about the data. The greatest progress in AI happens when a whole community can work on shared problems and common datasets. As examples, consider the role of the Penn Treebank (the first large syntactically annotated dataset, Taylor et al. 2003) in advancing natural language parsing, of EuroParl (multilingual proceedings of the European Parliament) in feeding machine translation research, or the enormous advances in automated sentiment analysis using huge, widely available collections of Amazon reviews. Natural language processing in healthcare, by contrast, remains a solid decade behind other research areas and is struggling to catch up, because most NLP researchers still don’t have access to large, shared data collections.
Data donation efforts like Sync for Science have recently begun moving in the right direction by creating a large dataset of electronic health records for research, but they are starting with a heavy focus on structured data (e.g. vital signs, medications, lab test results) rather than tapping into the rich, unstructured clinical language in the records. Moreover, clinical information about mental health only shows up in electronic health records for people who have made contact with a qualified clinician in the first place, something we can’t assume in a nation rife with mental health deserts: nearly 112 million Americans live in federally designated mental health provider shortage areas (Bureau of Health Workforce, 2017). Even when a clinician is available, many people don’t realize they need help, or may fear being stigmatized if they seek it. Particularly for mental health, the health record may not be the most important locus of information: relevant symptoms often manifest most clearly in people’s day-to-day lives and social interactions, where social media is likely to be a more valuable window than clinical records.
In this project, we will address these issues and break through the data bottleneck. The central idea is threefold: to build on and extend an existing, operating framework for the donation of social media data and clinical instrument data, to recruit data donors at scale, and to construct a virtual data enclave on AWS that provides AI researchers with both secure access to the data and the compute cycles they need to make progress. The target is a practical, sustainable solution to the crisis-level lack of data for AI research in mental health.