UMD Reddit Suicidality Dataset

The University of Maryland Reddit Suicidality Dataset, Version 2


People who have been approved for Version 1 of the dataset, or for the CLPsych 2019 shared task, do not need to re-apply for Version 2. If we have not yet contacted you about Version 2, please write to resnik@umd.edu to request access.

Overview

The University of Maryland Reddit Suicidality Dataset was constructed using data from Reddit, an online site for anonymous discussion on a wide variety of topics, in order to facilitate research on suicidality and suicide prevention. The dataset was derived from the 2015 Full Reddit Submission Corpus, using postings in the r/SuicideWatch subreddit to identify (anonymous) users who might represent positive instances of suicidality.

We introduced Version 1 of the dataset in Shing et al. (2018). As reported there, annotation of users in this dataset by experts for level of suicide risk (on a four-point scale of no risk, low, moderate, and severe risk) yielded what is, to our knowledge, the first demonstration of reliability in risk assessment by clinicians based on social media postings. The paper also introduces and demonstrates the value of a new, detailed rubric for assessing suicide risk, compares crowdsourced with expert performance, and presents baseline predictive modeling experiments using the new dataset.

Subsequently, we updated the dataset for the shared task on predicting degree of suicide risk from Reddit posts, run as part of the 2019 Computational Linguistics and Clinical Psychology Workshop (CLPsych 2019) held at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (Zirikly et al. 2019). Updates included automatic de-identification of post titles and bodies, and the definition of a standard training/test split used during the shared task to facilitate head-to-head comparisons of system performance. We also filtered out some posts from the Version 1 dataset because of encoding issues.

The currently available Version 2 of the dataset includes the training and test data from the 2019 CLPsych shared task (with consensus annotations based on crowdsourcing) plus the expert-annotated data (which was not used in the shared task). We recommend using the crowdsourcing train/test split for direct comparison with 2019 shared task papers, and using the full expert-annotated dataset for final testing since the expert annotations have strong inter-rater reliability.

The dataset is accompanied by documentation about its format. Briefly, it contains one subdirectory with data pertaining to 11,129 users who posted on SuicideWatch, and another for 11,129 users who did not. For each user, we have full longitudinal data from the 2015 Full Reddit Submission Corpus, including, for each post, the post ID, anonymized user ID, timestamp, subreddit, de-identified post title, and de-identified post body. In addition, we have two sets of human risk-level annotations for subsets of the users: crowdsourced annotations (621 users who posted on SuicideWatch and 621 who did not) and expert annotations (245 users who posted on SuicideWatch, paired with 245 control users who did not). In both cases we generated a user-level consensus label using the Dawid-Skene (1979) model for discovering true item states/effects from multiple noisy measurements (Passonneau and Carpenter, 2014; see discussion in Shing et al., 2018).
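For readers unfamiliar with Dawid-Skene consensus labeling, the following is a minimal illustrative sketch of the general technique, not the code used to produce the labels in this dataset. It assumes risk levels are encoded as integers 0 to 3 (no, low, moderate, severe); the function names, the smoothing constant, and the toy annotations are all hypothetical and entirely synthetic.

```python
from collections import defaultdict


def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """EM for the Dawid-Skene model (illustrative sketch).

    labels: list of (item, annotator, label) triples, label in range(n_classes).
    Returns {item: posterior probabilities over the true class}.
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({a for _, a, _ in labels})
    by_item = defaultdict(list)
    for i, a, l in labels:
        by_item[i].append((a, l))

    # Initialise each item's posterior from its vote proportions.
    post = {}
    for i in items:
        counts = [0.0] * n_classes
        for _, l in by_item[i]:
            counts[l] += 1.0
        total = sum(counts)
        post[i] = [c / total for c in counts]

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per annotator,
        # where conf[a][k][l] estimates P(annotator a says l | true class k).
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        conf = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for i in items:
            for a, l in by_item[i]:
                for k in range(n_classes):
                    conf[a][k][l] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row_sum = sum(conf[a][k])
                conf[a][k] = [v / row_sum for v in conf[a][k]]

        # E-step: recompute the posterior over each item's true class.
        for i in items:
            scores = []
            for k in range(n_classes):
                p = prior[k]
                for a, l in by_item[i]:
                    p *= conf[a][k][l]
                scores.append(p)
            z = sum(scores)
            post[i] = [s / z for s in scores]
    return post


def consensus(post):
    """Pick the most probable class for each item."""
    return {i: max(range(len(p)), key=p.__getitem__) for i, p in post.items()}


# Synthetic toy annotations: (user, annotator, risk level 0-3).
toy_labels = [
    ("u1", "a", 3), ("u1", "b", 3), ("u1", "c", 3),
    ("u2", "a", 0), ("u2", "b", 0), ("u2", "c", 0),
    ("u3", "a", 3), ("u3", "b", 3), ("u3", "c", 0),
    ("u4", "a", 0), ("u4", "b", 0), ("u4", "c", 3),
]
toy_consensus = consensus(dawid_skene(toy_labels, n_classes=4))
```

Unlike simple majority voting, the model estimates how reliable each annotator is for each true class and weights their votes accordingly, which matters when annotators differ in accuracy.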

In addition to reading and citing the papers below, people using this dataset may wish to read Gaffney and Matias (2018). Published subsequent to Shing et al. (2018), this article provides caveats regarding the use of the 2015 Reddit Corpus related to missing data, which we discuss in Zirikly et al. (2019).

Papers to Cite when Using the Dataset

Han-Chin Shing, Suraj Nair, Ayah Zirikly, Meir Friedenberg, Hal Daumé III, and Philip Resnik, "Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings", Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 25–36, New Orleans, Louisiana, June 5, 2018.
@inproceedings{shing2018expert,
  title={Expert, crowdsourced, and machine assessment of suicide risk via online postings},
  author={Shing, Han-Chin and Nair, Suraj and Zirikly, Ayah and Friedenberg, Meir and {Daum{\'e} III}, Hal and Resnik, Philip},
  booktitle={Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic},
  pages={25--36},
  year={2018}
}

Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead, "CLPsych 2019 Shared Task: Predicting the Degree of Suicide Risk in Reddit Posts", Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (CLPsych'19), Minneapolis, June 6, 2019.

@inproceedings{zirikly2019clpsych,
  title={{CLPsych} 2019 Shared Task: Predicting the Degree of Suicide Risk in {Reddit} Posts},
  author={Zirikly, Ayah and Resnik, Philip and Uzuner, {\"O}zlem and Hollingshead, Kristy},
  booktitle={Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology},
  location={Minneapolis},
  month={June},
  day={6},
  year={2019}
}

Dataset Availability and Governance Plan

Reddit is designed to be a site where people "detach from their real-world identities" and post anonymously (Gutman, 2018). The construction of this dataset adds a further layer of anonymization by replacing user names with unique identifiers, since a hypothetical user could still have chosen a username such as maryjanesmith1973.collegepark, which reveals name, birth year, and location. As of Version 2, post text is also automatically de-identified as described in Zirikly et al. (2019). In terms of formal human subjects research protections, the University of Maryland College Park’s Institutional Review Board has reviewed the use and sharing of this dataset and designated it as Exempt Category 4, i.e., research involving the collection or study of existing data if they are available or if information is recorded such that subjects cannot be identified.

Even with IRB approval for sharing, however, and even for an anonymous and/or de-identified dataset, particular care needs to be taken with sensitive data of this kind (Benton et al., 2017; Chancellor et al., 2019). We have therefore established a collaboration with the American Association of Suicidology (AAS) to put in place a governance process for researcher access to the dataset, described below. Under this process, applications for access are reviewed by a governance committee of five volunteers established by AAS, which includes Philip Resnik (lead investigator at University of Maryland) and four people affiliated with and/or designated by AAS. The AAS contact person regarding this dataset is Tony Wood, chair of the AAS Board of Directors.

Three of the five members of the governance committee, selected per availability, will review requests for access submitted in the format specified below. Outcomes of the review include the following responses:

Note that the governance process has been established as part of a collaboration between Prof. Resnik and AAS. It may be changed at any time by mutual agreement, and Prof. Resnik or AAS can end this collaboration at any time.

The governance committee will attend to and encourage diversity and inclusion with respect to the set of reviewers and the community of researchers using the dataset.

How to Request Access

Although we have to be careful to make sure all appropriate steps are followed, we are very eager to share this resource with other researchers! Please send requests for access to the dataset to Philip Resnik (resnik@umd.edu). Requests should be based on this sample application, which has two parts: