Announcement of Data Release and Call for Participation
2016 CEGS N-GRID Shared-Tasks and Workshop
on Challenges in Natural Language Processing for Clinical Data
Registration: begins May 2016
Data Release for Sight Unseen Track: 6th June 2016
System Outputs Due for Sight Unseen Track: 10th June 2016
Training Data Release: 11th June 2016
Test Data Release: 10th August 2016 (12am Eastern Time)
System Outputs Due: 12th August 2016 (11:59pm Eastern Time)
Abstract Submission: 1st September 2016
Journal Submissions: TBD
The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) challenge,
also known as the RDoC for Psychiatry challenge, aims to extract symptom severity from neuropsychiatric clinical records.
Research Domain Criteria (RDoC) is a framework developed under the aegis of the National Institute of Mental Health (NIMH) that
facilitates the study of human behavior, from normal to abnormal, in various domains. The goal of the challenge is to classify
symptom severity in a domain for a patient, based on information included in their initial psychiatric evaluation.
This challenge will be conducted on initial psychiatric evaluations (1 per patient), which have been fully de-identified and scored by
clinical experts in a symptom domain. The data for this task are provided by Partners Healthcare and the Neuropsychiatric Genome-Scale and
RDoC Individualized Domains (N-GRID) project (HMS PI: Kohane; MGH PI: Perlis) of Harvard Medical School, and will be released under
a Rules of Conduct and Data Use Agreement. Obtaining the data requires completing registration, which opens in May 2016.
All data are fully de-identified and manually annotated for RDoC.
The 2016 CEGS N-GRID challenge consists of three NLP tracks:
Track 1: De-identification: Removing protected health information (PHI) is a critical step in making medical records accessible
to more people, yet it is a very difficult and nuanced task. This track addresses the problem of de-identifying medical records over a
new set of ~1000 initial psychiatric evaluation records, with surrogate PHI for participants to identify. We intend to run two versions
of the de-id track:
Sight unseen track: this track involves running existing home-grown de-id systems on the RDoC data without any training or
modification of the systems, as a way of measuring how well existing systems generalize to brand new data. The RDoC data will be
provided for this track without any gold standard training annotations, and system outputs will be collected within 3 days of data
release.
Regular track: this track will allow the development and training of de-id systems on the RDoC training data. Evaluation
will be on the RDoC test data.
Track 2: RDoC classification:
The goal of RDoC classification is to determine symptom severity in a domain for a patient, based on information included in their
initial psychiatric evaluation. The domain has been rated by experts on an ordinal scale of 0-3 as follows: 0 (absent), 1 (mild=modest
significance), 2 (moderate=requires treatment), 3 (severe=causes substantial impairment). There is one judgment per document, and one
document per patient.
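For participants setting up a Track 2 pipeline, the 0-3 severity scale described above could be represented in code roughly as follows. This is only an illustrative sketch: the names `Severity` and `parse_severity` are hypothetical and are not part of any format or tooling provided by the organizers.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal symptom-severity scale (0-3) as described in the announcement.

    The class and member names are illustrative assumptions, not an
    official label set.
    """
    ABSENT = 0    # absent
    MILD = 1      # modest significance
    MODERATE = 2  # requires treatment
    SEVERE = 3    # causes substantial impairment


def parse_severity(value: int) -> Severity:
    """Map an integer score to a Severity, rejecting values outside 0-3."""
    try:
        return Severity(value)
    except ValueError:
        raise ValueError(
            f"severity must be an integer from 0 to 3, got {value!r}"
        ) from None
```

Because `IntEnum` members compare as integers, the ordering of the scale is preserved (for example, `Severity.MILD < Severity.SEVERE`), which may be convenient when treating the task as ordinal rather than purely categorical.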
Track 3: Novel Data Use: The data released for this 2016 challenge are the first set of mental health records released to the
research community. These data can be used for mental health-related research questions that go beyond what is posed by the challenge
organizers. This track is for participants who want to build on their existing systems, or the systems developed for Tracks 1 and 2,
with the aim of addressing new research questions.
Evaluation Dates and Format
The evaluation for the NLP tracks will be conducted using withheld test data. Participating teams are asked to stop development as soon as
they download the test data. Each team is allowed to upload (through this website) up to three system runs for each of the tasks. System
output is expected to be submitted in the exact format of the ground truth annotations to be provided by the organizers.
Participants are asked to submit a 500-word abstract describing their methodologies. Abstracts may also include a graphical summary of
the proposed architecture. The document should not exceed 2 pages, with 1.5 line spacing and a 12-point font. The authors of either
top-performing systems or particularly novel approaches will be invited to present or demonstrate their systems at the workshop. A special issue of a
journal will be organized following the workshop.
Ozlem Uzuner, co-chair, SUNY at Albany
Amber Stubbs, co-chair, Simmons College
Michele Filannino, co-chair, SUNY at Albany
Tianxi Cai, Harvard School of Public Health
Susanne Churchill, Harvard Medical School
Isaac Kohane, Harvard Medical School
Thomas H. McCoy, MGH, Harvard
Roy H. Perlis, MGH, Harvard
Peter Szolovits, MIT
Uma Vaidyanathan, NIMH
Philip Wang, American Psychiatric Association
Please see the announcements for more information. Questions about the challenge can be addressed to the organizers.