HITEx Manual v2.0

What is HITEx
Terms of Use
System requirements
Download
Installation
Examples
Contact
References

What is HITEx

HITEx (Health Information Text Extraction) is an open-source natural language processing (NLP) software application developed by a group of researchers at the Brigham and Women's Hospital and Harvard Medical School. HITEx is built on top of the GATE framework and uses GATE as a platform. It consists of a collection of GATE plug-ins that were developed to solve problems in the medical domain, such as principal diagnosis extraction, discharge medication extraction, smoking status extraction and others. HITEx works by assembling these plug-ins into pipeline applications, along with other standard NLP plug-ins (some of which are part of GATE, such as the Part-of-Speech tagger or the Noun Phrase Chunker). Each plug-in in a pipeline may use the output of the previous plug-in. Power users are given full control over the plug-in parameters and the order of plug-ins in the application. General users may benefit from the pre-configured pipeline applications that solve these common medical problems. Please refer to the Examples section for the complete list of sample pipeline applications.
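
The pipeline idea described above can be sketched in plain Java. This is only an illustration and is not HITEx or GATE code: the Step interface, the PipelineSketch class and the two toy steps are hypothetical names invented for this example. Each step receives the annotations produced so far and may add its own, so a later step can build on an earlier one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a plug-in pipeline; none of these types exist in HITEx or GATE.
final class PipelineSketch {
    // Runs the steps in the given order over one document text.
    static List<String> run(String text, List<Step> steps) {
        List<String> annotations = new ArrayList<String>();
        for (Step step : steps) {
            step.process(text, annotations);
        }
        return annotations;
    }

    public static void main(String[] args) {
        List<Step> steps = new ArrayList<Step>();
        steps.add(new TokenStep());
        steps.add(new CountStep());
        System.out.println(run("patient denies smoking", steps));
    }
}

// Stand-in for a plug-in: reads the text and the annotations produced so far,
// and may append new annotations.
interface Step {
    void process(String text, List<String> annotations);
}

// Toy step: records every whitespace-separated token.
final class TokenStep implements Step {
    public void process(String text, List<String> annotations) {
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            annotations.add("Token:" + tokens[i]);
        }
    }
}

// Toy step: relies on the output of TokenStep and counts its annotations.
final class CountStep implements Step {
    public void process(String text, List<String> annotations) {
        int count = 0;
        for (String annotation : annotations) {
            if (annotation.startsWith("Token:")) {
                count++;
            }
        }
        annotations.add("TokenCount:" + count);
    }
}
```

Swapping the order of the two steps changes the result, which is why the order of components matters in a pipeline application.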

Benefits of HITEx

Open source. Please refer to the i2b2 software license agreement for details.
Easy to adapt, generalize and reuse. HITEx is a data-driven application. We have successfully used HITEx to process medical records from two different institutions. The only change we had to make was in the configuration files. We believe HITEx can also be used outside the medical domain.
Extensible. New plug-ins can be added to the set of known plug-ins if they conform to the GATE plug-in architecture.

Terms of Use

HITEx is open-source software. By downloading HITEx you agree that the software is subject to the terms of the i2b2 software license agreement. Please read the license agreement carefully before downloading the files. Because HITEx uses the UMLS as its term/concept dictionary, this distribution includes a UMLS database. It is your responsibility to obtain and maintain a valid UMLS license from the National Library of Medicine when using HITEx.

System requirements

CPU: Pentium III 700MHz or higher to run the examples. A faster CPU is strongly recommended for a better user experience.
RAM: 256MB of RAM to run examples. More memory is strongly recommended to run long batch processing jobs.
Hard Drive Space:

Required tools/libraries	500 MB
Component resources / config	50 MB
UMLS database tables + indexes	1.45 GB
TOTAL	2 GB

Network: required if using an external UMLS database server
Operating system: Theoretically, any OS on which Java runs. However, HITEx was tested on Windows XP Professional only.

Download

You can download the latest version of HITEx here. Note: the file size is approx. 140MB.

Installation

1. HITEx is written entirely in Java and requires a Java runtime environment to run. If you don't have Java installed, download and install the Java JRE or SDK from Sun's website (http://java.sun.com). The application was tested with Java 1.4 and Java 5. Create a JAVA_HOME environment variable and add the $JAVA_HOME/bin directory to your PATH.

2. Download and install the GATE natural language processing framework from the University of Sheffield's website (http://www.gate.ac.uk). The application was tested with GATE versions 3.0 and 3.1. Version 3.1 is preferred. IMPORTANT: install GATE in a location whose path doesn't contain any spaces. Spaces in the path may prevent the application from working correctly. Create a GATE_HOME environment variable. Add the following GATE jar files to the CLASSPATH (create the CLASSPATH variable if necessary):

$GATE_HOME/bin/gate.jar
$GATE_HOME/lib/ontotext.jar
$GATE_HOME/lib/jasper-compiler-jdt.jar

3. Download and install WEKA, an open-source machine learning package. WEKA is written in Java and can be downloaded from the SourceForge website: http://sourceforge.net/projects/weka. IMPORTANT: the application was tested with, and works only with, WEKA version 3.4.4, because the classification models HITEx uses were created with WEKA 3.4.4. When installing WEKA, choose a custom installation path that DOES NOT contain any spaces. Add the weka.jar file to your CLASSPATH. If you plan to run applications in graphic mode in the GATE GUI, also copy weka.jar to the $GATE_HOME/lib directory. Note that the $GATE_HOME/lib directory may already contain a WEKA jar, likely named weka-3.4.6.jar or similar. Remove this existing jar file, making a backup copy of it if needed.

4. Set up the MySQL database server. A MySQL database is required to store the UMLS database tables. The application was tested with version 5.0 of the MySQL database server. Download and install Connector/J version 5.0 or 3.1 from http://dev.mysql.com/downloads. Connector/J is required to access a MySQL database from Java. Add the Connector/J jar file to your CLASSPATH. Also, put a copy of this jar file in your $GATE_HOME/lib directory if you plan to run the application using the GATE GUI. Download and install MySQL GUI Tools (http://dev.mysql.com/downloads). You will need MySQL Administrator from this suite of tools, as the UMLS database backup is distributed in the MySQL Administrator SQL backup file format.

5. Create a database to hold the UMLS data. If you do not have a license to use UMLS, obtain one from the National Library of Medicine. HITEx is pre-configured to use umls_2004aa as the database name, but this name can be changed. Set up a database account to access the UMLS database, and give this account read permissions on the UMLS database. Unzip and restore the UMLS database backup using the MySQL Administrator restore functionality, selecting your desired target database. This operation creates and populates all required tables and builds the important indexes. It can be time-consuming, as the backup file is very large.

6. Download and install JDOM from http://www.jdom.org. Certain parts of HITEx, such as configuration file management, depend on JDOM. Add jdom.jar to your CLASSPATH. Also, copy jdom.jar to your $GATE_HOME/lib directory if you plan to run HITEx using the GATE GUI. Note that $GATE_HOME/lib already contains a jdom.jar file. You will need to replace this file with the latest copy of the jar. As usual, make a backup of any files you replace.

7. Add the following jars to your CLASSPATH:

umls.jar
config.jar
ngram.jar
hitex.jar

umls.jar, config.jar and ngram.jar (but not hitex.jar) should also be copied to $GATE_HOME/lib directory to support running HITEx inside of GATE GUI.

8. Unzip the contents of the resources folder (resources.zip) to some accessible location on your machine.
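
After completing the steps above, a small check along the following lines may help catch a missing jar or a path with spaces before running the examples. This is a sketch, not part of the HITEx distribution: the SetupCheck class is a hypothetical name, and the jar list is simply copied from steps 2, 3, 6 and 7. The Connector/J jar is left out because its file name depends on the version you installed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not shipped with HITEx.
final class SetupCheck {
    // Jar names taken from installation steps 2, 3, 6 and 7.
    static final String[] REQUIRED_JARS = {
        "gate.jar", "ontotext.jar", "jasper-compiler-jdt.jar", "weka.jar",
        "jdom.jar", "umls.jar", "config.jar", "ngram.jar", "hitex.jar"
    };

    // Returns the required jar names that do not appear as a file name
    // in any entry of the given classpath string.
    static List<String> missingJars(String classpath, String separator) {
        List<String> present = new ArrayList<String>();
        String[] entries = classpath.split(Pattern.quote(separator));
        for (int i = 0; i < entries.length; i++) {
            present.add(fileName(entries[i]));
        }
        List<String> missing = new ArrayList<String>();
        for (int i = 0; i < REQUIRED_JARS.length; i++) {
            if (!present.contains(REQUIRED_JARS[i])) {
                missing.add(REQUIRED_JARS[i]);
            }
        }
        return missing;
    }

    // Last path component, accepting both '/' and '\' as separators.
    static String fileName(String entry) {
        int cut = Math.max(entry.lastIndexOf('/'), entry.lastIndexOf('\\'));
        return entry.substring(cut + 1);
    }

    // True if the path contains a space, which the installation notes warn against.
    static boolean hasSpaces(String path) {
        return path != null && path.indexOf(' ') >= 0;
    }

    public static void main(String[] args) {
        String gateHome = System.getenv("GATE_HOME");
        if (gateHome == null) {
            System.out.println("GATE_HOME is not set");
        } else if (hasSpaces(gateHome)) {
            System.out.println("GATE_HOME contains spaces: " + gateHome);
        }
        String classpath = System.getProperty("java.class.path");
        System.out.println("Missing jars: " + missingJars(classpath, File.pathSeparator));
    }
}
```

The check only compares file names, so it cannot tell whether weka.jar is really version 3.4.4 or whether the jars in $GATE_HOME/lib have been replaced.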

Examples

To run the examples, modify the XML configuration files inside of the resources directory to adjust to your settings. At a bare minimum, make the below changes to the configuration file entries (where applicable):

INPUT_DIR: change to represent the location of the folder containing sample EMR records.
GATE_HOME: change to whatever GATE installation directory you have chosen. Default is C:/java/tools/gate31.
COMPONENTS_ORDER: the order of GATE components in the execution pipeline. You shouldn't change this to run the examples.
SECTIONIZER: change the headersURL value to point to the section header file on your machine.
TEXT_TOKENIZER: change the rulesURL value to point to the tokenization rules file on your machine.
SENTENCE_SPLITTER: nothing to change.
POS_TAGGER: adjust the values of lexiconURL and rulesURL to match your setup.
NOUN_PHRASE_SPLITTER: adjust the values of dictionaryURL and rulesURL to match your setup.
UMLS_CONCEPT_FINDER: adjust the UMLS term-CUI mapping filter's database connection properties (filterHostname, filterPort, filterDbname, filterUsername, filterPassword), and UMLS database server's connection properties (umlsHostname, umlsPort, umlsDatabaseName, umlsUsername, umlsPassword).
NEGATION_FINDER: adjust the value of rulesURL to match your setup.
TEMPORAL_FINDER: adjust the values of temporalPatternsURL, temporalWordsURL, temporalRulesURL, attributeFileURL and modelFileURL to match your setup.
REGEX_CONCEPT_FINDER: adjust the values of sectionCriteriaURL and expressionListURL to match your setup.
SMOKING_CLASSIFIER: adjust the values of sectionCriteriaURL, filterExpressionsURL, attributeFileURL and modelFileURL to match your setup.
FAMILY_HISTORY_UMLS: adjust the value of grammarURL to match your setup.
FAMILY_HISTORY_COMPONENT: adjust the value of grammarURL to match your setup.
FAMILY_HISTORY_MAIN: adjust the value of grammarURL to match your setup.

You don't need to modify creole.xml file to run the examples.
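
Before running the examples, it may be worth confirming that the UMLS database connection values you entered actually work. The sketch below is not part of HITEx. It assumes a MySQL server on localhost at the default port 3306 and the pre-configured database name umls_2004aa, and it takes the account name and password as command-line arguments. The UmlsConnectionCheck class is a hypothetical name. com.mysql.jdbc.Driver is the driver class provided by Connector/J 3.1 and 5.0, so the Connector/J jar must be on the CLASSPATH.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Hypothetical connectivity check; host, port and database name are assumptions.
final class UmlsConnectionCheck {
    // Builds a Connector/J connection URL of the form jdbc:mysql://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java UmlsConnectionCheck <username> <password>");
            return;
        }
        // Load the Connector/J driver explicitly, as older JDBC versions require.
        Class.forName("com.mysql.jdbc.Driver");
        String url = jdbcUrl("localhost", 3306, "umls_2004aa");
        Connection connection = DriverManager.getConnection(url, args[0], args[1]);
        try {
            // List the tables visible to this account as a basic read test.
            ResultSet tables = connection.getMetaData().getTables(null, null, "%", null);
            int count = 0;
            while (tables.next()) {
                System.out.println(tables.getString("TABLE_NAME"));
                count++;
            }
            System.out.println(count + " tables visible in " + url);
        } finally {
            connection.close();
        }
    }
}
```

If the program lists no tables, the backup was probably restored into a different database or the account lacks read permissions.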

The following examples are bundled with this HITEx distribution:

Discharge Asthma Medication Finder

Given a list of regular expressions that define the asthma medications in Java regex format, and a list of document section categories in which discharge medications can be found, this pipeline application discovers asthma medications, adds them to the document as annotations, and prints the results to standard output. The application first finds section annotations in the document. Then, in the sections that match the criteria, the application matches the document content against the regular expressions. See the contents of med_config.xml for more details. To run the Discharge Asthma Medication Finder, execute the following command: "java hitex.examples.DischargeMedicationFinder {full URL of med_config.xml configuration file}". To print the help screen, run "java hitex.examples.DischargeMedicationFinder".
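
The matching step described above can be sketched with java.util.regex. This is a simplified illustration under stated assumptions, not the HITEx implementation: the MedicationMatchSketch class, the section name and the two expressions are hypothetical and are not taken from med_config.xml. Case-insensitive matching is also an assumption made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; section names and expressions are invented for illustration.
final class MedicationMatchSketch {
    // Returns every substring of sectionText matched by any expression,
    // but only if the section is one of the allowed section categories.
    static List<String> findMedications(String sectionName, String sectionText,
                                        List<String> allowedSections,
                                        List<String> expressions) {
        List<String> found = new ArrayList<String>();
        if (!allowedSections.contains(sectionName)) {
            return found;
        }
        for (String expression : expressions) {
            Pattern pattern = Pattern.compile(expression, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(sectionText);
            while (matcher.find()) {
                found.add(matcher.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> sections = Arrays.asList("DISCHARGE MEDICATIONS");
        List<String> expressions = Arrays.asList("albuterol", "fluticasone(\\s+propionate)?");
        String text = "Albuterol inhaler 2 puffs q4h prn; fluticasone propionate 110 mcg bid.";
        System.out.println(findMedications("DISCHARGE MEDICATIONS", text, sections, expressions));
        System.out.println(findMedications("FAMILY HISTORY", text, sections, expressions));
    }
}
```

Restricting the search to the listed sections is what keeps a medication mentioned in, say, the history section from being reported as a discharge medication.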

Smoking Status Finder

Assigns smoking status at sentence-level granularity using the classification model. Different sentences in the same document may be assigned different smoking statuses: current smoker, past smoker, never smoked, and denies smoking. See the contents of smoking_config.xml for more details. To run the Smoking Status Finder, execute the following command: "java hitex.examples.SmokingStatusExtractor {full URL of smoking_config.xml configuration file}". To print the help screen, run "java hitex.examples.SmokingStatusExtractor".

Principal Diagnosis Finder

Finds principal diagnoses in the document. To get the principal diagnoses, the pipeline application searches for UMLS concepts in the document sections where the principal diagnoses may be found. The categories of these sections are specified as a section criteria parameter. The application also filters semantic types of the concepts to retrieve only those concepts that are either findings or symptoms. See the contents of diag_config.xml for more details. To run the Principal Diagnosis Finder, execute the following command: "java hitex.examples.DiagnosisExtractor {full URL of diag_config.xml configuration file}". To print the help screen, run "java hitex.examples.DiagnosisExtractor".

Negation Finder

This application is similar to the principal diagnosis finder, except that it searches for UMLS concepts (not specifically principal diagnoses) in all sections of the document and assigns a negation status (Actual, Possible or Negated) to each UMLS concept. The negation status is produced using a modified version of the NegEx algorithm. For the technical details of the application pipeline, please refer to the negation_config.xml file. To run the Negation Finder, execute the following command: "java hitex.examples.NegationFinder {full URL of negation_config.xml configuration file}". To print the help screen, run "java hitex.examples.NegationFinder".
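
To give a feel for trigger-based negation detection, here is a heavily simplified sketch in the spirit of NegEx. It is not the modified NegEx version that HITEx uses: the NegationSketch class is a hypothetical name, the trigger lists are a tiny invented sample, and the sketch only looks at up to five words before the concept within one sentence. Only the three status names come from the description above.

```java
// Hypothetical, heavily simplified trigger lookup; not the HITEx algorithm.
final class NegationSketch {
    // Small invented trigger samples; real rule files are far larger.
    static final String[] NEGATED_TRIGGERS = { "no", "not", "denies", "without" };
    static final String[] POSSIBLE_TRIGGERS = { "possible", "probable", "rule out" };
    // Number of words before the concept that are inspected (an assumption).
    static final int WINDOW = 5;

    // Returns "Negated", "Possible" or "Actual" for the first occurrence
    // of the concept in the sentence.
    static String status(String sentence, String concept) {
        String lower = sentence.toLowerCase();
        int at = lower.indexOf(concept.toLowerCase());
        if (at < 0) {
            throw new IllegalArgumentException("concept not in sentence: " + concept);
        }
        String[] before = lower.substring(0, at).trim().split("\\s+");
        int start = Math.max(0, before.length - WINDOW);
        // Pad with spaces so that triggers match whole words only.
        StringBuilder window = new StringBuilder(" ");
        for (int i = start; i < before.length; i++) {
            window.append(before[i]).append(' ');
        }
        String context = window.toString();
        if (containsAny(context, NEGATED_TRIGGERS)) {
            return "Negated";
        }
        if (containsAny(context, POSSIBLE_TRIGGERS)) {
            return "Possible";
        }
        return "Actual";
    }

    static boolean containsAny(String context, String[] triggers) {
        for (int i = 0; i < triggers.length; i++) {
            if (context.contains(" " + triggers[i] + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(status("No fever, denies chest pain.", "chest pain"));
        System.out.println(status("Rule out pneumonia.", "pneumonia"));
        System.out.println(status("Patient has asthma.", "asthma"));
    }
}
```

The sketch ignores triggers that follow the concept, as well as phrases that look like negations but are not, both of which a full negation detector has to handle.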

Temporal Modifier Finder

This application is similar to the principal diagnosis finder, except that it searches for UMLS concepts (not specifically principal diagnoses) in all sections of the document and assigns the temporal modifier to the UMLS concepts. The temporal status is represented by a combination of more than 20 different attributes, such as date, time, time_unit, quantity, etc. The application also attempts to classify each UMLS concept as either plan/future or a fact. See the contents of temporal_config.xml for more details. To run the Temporal Modifier Finder, execute the following command: "java hitex.examples.TemporalExtractor {full URL of temporal_config.xml configuration file}". To print the help screen, run "java hitex.examples.TemporalExtractor".

Family History Modifier Finder

This application is similar to the principal diagnosis finder, except that it searches for UMLS concepts (not specifically principal diagnoses) in all sections of the document and assigns the family history modifier to the UMLS concepts. For example, asthma in the following context: "mother and sister with asthma" is assigned a family history modifier. See the contents of family_history_config.xml for more details. To run the Family History Modifier Finder, execute the following command: "java hitex.examples.FamilyHistoryFinder {full URL of family_history_config.xml configuration file}". To print the help screen, run "java hitex.examples.FamilyHistoryFinder".
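
Each of the commands above expects the full URL of a configuration file rather than a plain path. Assuming that a file: URL is what is meant (the manual does not spell out the format), a local path can be converted with the standard java.io.File API. The ConfigUrl class below is a hypothetical helper, not part of HITEx.

```java
import java.io.File;
import java.net.MalformedURLException;

// Hypothetical helper that turns a local path into a file: URL string.
final class ConfigUrl {
    static String toUrl(String path) throws MalformedURLException {
        // Resolve relative paths against the current directory first.
        return new File(path).getAbsoluteFile().toURI().toURL().toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        if (args.length < 1) {
            System.err.println("usage: java ConfigUrl <path to configuration file>");
            return;
        }
        System.out.println(toUrl(args[0]));
    }
}
```

The printed value can then be passed as the argument to one of the example classes, for instance hitex.examples.NegationFinder.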

Running the examples inside of the GATE GUI

It is possible to run HITEx in graphic mode inside the GATE GUI. Although this is useful for testing or debugging, it should not be used when performance is important. First, make sure that the required jar files are in the $GATE_HOME/lib directory (refer to the Installation section). Start the GATE GUI and go to File -> Manage CREOLE Plugins -> Add a new CREOLE repository. Browse to the directory where creole.xml resides (typically /resources). A new repository called "resources" will appear in the list of known CREOLE resources. Check "Load Now" and/or "Load always" and press OK. You should now be able to use HITEx components inside the GUI.

Contact

When contacting us, please keep in mind that HITEx is a research project and not commercial software. Although we'll try to answer your questions in a timely manner, we cannot guarantee a quick response. Please read the manual first and make sure your question is not answered there. If you have GATE-specific questions, GATE documentation is available online. You can also post your question to the GATE mailing list. If you still have questions, please write to:

sgoryachev NOSPAM [at] dsg [dot] harvard [dot] edu

(remove NOSPAM to get the actual email)

References

Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006 Jul 26;6:30. [Pubmed link]
Goryachev S, Sordo M, Zeng QT. A Suite of Natural Language Processing Tools Developed for the I2B2 Project. AMIA Annu Symp Proc. 2006;:931. [Pubmed link]
Goryachev S, Kim H, Zeng QT. Identification and Extraction of Family History Information from Clinical Reports. [DSG technical report]
Goryachev S, Sordo M, Zeng QT, Ngo L. Implementation and Evaluation of Four Different Methods of Negation Detection. [DSG technical report]