2016 NLP
Shared Task

Announcement of Data Release and Call for Participation

2016 CEGS N-GRID Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data

Tentative Timeline
Registration: begins May 2016
Data Release for Sight Unseen Track: 6th June 2016
System Outputs Due for Sight Unseen Track: 10th June 2016
Training Data Release: 11th June 2016
Test Data Release: 10th August 2016 (12am Eastern Time)
System Outputs Due: 12th August 2016 (11:59pm Eastern Time)
Abstract Submission: 1st September 2016
Workshop: 11th November 2016, Chicago, IL, USA
Journal Submissions: TBD


The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) challenge, also known as the RDoC for Psychiatry challenge, aims to extract symptom severity from neuropsychiatric clinical records. Research Domain Criteria (RDoC) is a framework developed under the aegis of the National Institute of Mental Health (NIMH) that facilitates the study of human behavior, from normal to abnormal, in various domains. The goal of the challenge is to classify a patient's symptom severity in a domain, based on information included in their initial psychiatric evaluation.

This challenge will be conducted on initial psychiatric evaluations (one per patient), which have been fully de-identified and scored by clinical experts in a symptom domain. The data for this task are provided by Partners Healthcare and the Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) project (HMS PI: Kohane; MGH PI: Perlis) of Harvard Medical School, and will be released under a Rules of Conduct and Data Use Agreement. To obtain the data, participants must complete registration, which opens in May 2016.

All data are fully de-identified and manually annotated for RDoC.

The tracks
The 2016 CEGS N-GRID challenge consists of three NLP tracks:

Track 1: De-identification: Removing protected health information (PHI) is a critical step in making medical records accessible to more people, yet it is a difficult and nuanced task. This track addresses the problem of de-identifying medical records over a new set of ~1000 initial psychiatric evaluation records, with surrogate PHI for participants to identify. We intend to run two versions of the de-id track:

  1. Sight unseen track: this track involves running existing home-grown de-id systems on the RDoC data without any training on the data or modification to the systems, as a way of measuring how well existing systems generalize to brand new data. The RDoC data will be provided for this track without any gold standard training annotations, and system outputs will be collected within 3 days of data release.
  2. Regular track: this track will allow the development and training of de-id systems on the RDoC training data. Evaluation will be on the RDoC test data.

Track 2: RDoC classification: The goal of RDoC classification is to determine a patient's symptom severity in a domain, based on information included in their initial psychiatric evaluation. Experts have rated the domain on an ordinal scale of 0-3: 0 (absent), 1 (mild = modest significance), 2 (moderate = requires treatment), 3 (severe = causes substantial impairment). There is one judgment per document, and one document per patient.
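As a minimal sketch of how the Track 2 labels could be handled in code, the example below encodes the 0-3 scale as an ordered enumeration and compares one prediction per document against one gold judgment. The announcement does not state how system outputs are scored or formatted; the mean absolute error used here, the document IDs, and all function and variable names are assumptions chosen for illustration only.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal severity scale described for Track 2 (0-3)."""
    ABSENT = 0
    MILD = 1      # modest significance
    MODERATE = 2  # requires treatment
    SEVERE = 3    # causes substantial impairment


def mean_absolute_error(gold, predicted):
    """Average distance between gold and predicted severity.

    Assumption: this metric is illustrative, not the official
    challenge measure. Both arguments map a document ID to a
    Severity, with exactly one judgment per document.
    """
    if gold.keys() != predicted.keys():
        raise ValueError("gold and predicted must cover the same documents")
    if not gold:
        raise ValueError("no documents to evaluate")
    total = sum(abs(gold[doc] - predicted[doc]) for doc in gold)
    return total / len(gold)


# Hypothetical document IDs and labels, for illustration only.
gold = {
    "doc_001": Severity.ABSENT,
    "doc_002": Severity.MODERATE,
    "doc_003": Severity.SEVERE,
}
predicted = {
    "doc_001": Severity.MILD,
    "doc_002": Severity.MODERATE,
    "doc_003": Severity.MILD,
}

print(mean_absolute_error(gold, predicted))  # 1.0
```

Because the scale is ordinal, a distance-based measure such as this one penalizes predicting 1 for a gold 3 more heavily than predicting 2, which plain accuracy would not capture.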

Track 3: Novel Data Use: The data released for this 2016 challenge are the first set of mental health records released to the research community. These data can be used for mental health-related research questions that go beyond those posed by the challenge organizers. This track is for participants who want to build on their existing systems, or on the systems developed for Tracks 1 and 2, with the aim of addressing new research questions.

Evaluation Dates and Format
The evaluation for the NLP tracks will be conducted using withheld test data. Participating teams are asked to stop development as soon as they download the test data. Each team is allowed to upload (through this website) up to three system runs for each of the tasks. System output is expected to be submitted in the exact format of the ground truth annotations to be provided by the organizers.

Participants are asked to submit a 500-word abstract describing their methodologies. Abstracts may also include a graphical summary of the proposed architecture. The document should not exceed 2 pages, with 1.5 line spacing and 12-point font. The authors of top-performing systems or particularly novel approaches will be invited to present or demonstrate their systems at the workshop. A special issue of a journal will be organized following the workshop.

Organizing Committee
Ozlem Uzuner, co-chair, SUNY at Albany
Amber Stubbs, co-chair, Simmons College
Michele Filannino, co-chair, SUNY at Albany
Tianxi Cai, Harvard School of Public Health
Susanne Churchill, Harvard Medical School
Isaac Kohane, Harvard Medical School
Thomas H. McCoy, MGH, Harvard
Roy H. Perlis, MGH, Harvard
Peter Szolovits, MIT
Uma Vaidyanathan, NIMH
Philip Wang, American Psychiatric Association

Please see the announcements for more information. Questions about the challenge can be addressed to the organizers.


©2005 - 2017
Partners Healthcare