i2b2: Informatics for Integrating Biology & the Bedside - A National Center for Biomedical Computing
NLP Research
Data Sets

i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly improve healthcare. In our many use cases (Driving Biology Projects), it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. To enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,500) from the first four i2b2 Challenges as i2b2 NLP Research Data Sets. A similar set of notes from the most recent i2b2 Challenge will be released on the one-year anniversary of that Challenge. These data sets have already enabled hundreds of journal and conference articles by the research community.

To access these notes, please use the Registration link to your left. We will do an expedited review of your proposal and, if it is acceptable, will ask you to sign and return our standard Data Use Agreement before releasing the notes to you. Given the goal of leveraging the entire community, we ask that you share your annotations with us following publication of your results or one year after first access, whichever comes first.

Inclusion of patient notes in presentations and publications:

Please note that, to fully ensure patient privacy, Partners' policy explicitly forbids reproducing any portion of these notes in oral presentations and/or publications.

Please acknowledge i2b2 in your publications as follows:

"Deidentified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY."

We also welcome colleagues who are interested in developing new tools that could be wrapped as i2b2 tools and made available to those academic health centers adopting i2b2 as their data warehouse infrastructure. 

Questions?  schurchill@partners.org


2006 Deidentification and Smoking Challenge

Deidentification and Smoking Challenge Participants

NLP Data Set #1:

  • NLP Data Set #1A:  889 unannotated, de-identified discharge summaries

Please cite as:

  • NLP Data Set #1B:  889 de-identified discharge summaries with de-identification challenge annotations, training and test sets and ground truth.

 Please cite as:

Other related publications:

  • NLP Data Set #1C:  A subset of the above 889 (N = 502) de-identified discharge summaries with smoking challenge annotations, training and test sets and ground truth.

Please cite as:   

  • Uzuner Ö, Goldstein I, Luo Y, Kohane I. "Identifying patient smoking status from medical discharge records".  J Am Med Inform Assoc.  2008; 15(1):14-24. www.jamia.org/cgi/content/short/15/1/14.

Other related publications:      

2008 Obesity Challenge

Obesity Challenge Participants

NLP Data Set #2:

Please cite as:   

2009 Medication Challenge

Medication Challenge Participants

NLP Data Set #3:

Please cite as:   

  • Uzuner Ö, Solti I, Xia F, Cadag E. (2010). "Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge".  Journal of the American Medical Informatics Association. 2010;17:519-523 doi:10.1136/jamia.2010.004200. http://jamia.bmj.com/content/17/5/519.full.pdf.
  • Uzuner Ö, Solti I, Cadag E. (2010). "Extracting Medication Information from Clinical Text".  Journal of the American Medical Informatics Association. 2010;17:514-518 doi:10.1136/jamia.2010.003947. http://jamia.bmj.com/content/17/5/514.full.pdf.

2010 Relations Challenge

Relations Challenge Participants

NLP Data Set #4:

Please cite as:   

  • Uzuner Ö., South B., Shen S., DuVall S. (2011). "2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text".  Journal of the American Medical Informatics Association. 2011;18:552-556 Published Online First: 16 June 2011 doi:10.1136/amiajnl-2011-000203. http://jamia.bmj.com/content/18/5/552.abstract.

2011 Coreference Challenge

Coreference Challenge Participants



©2005 - 2015
Partners Healthcare