The key challenge that i2b2 is designed to address is that of creating a comprehensive software and methodological framework to enable clinical researchers to accelerate the translation of genomic and “traditional” clinical findings into novel diagnostics, prognostics, and therapeutics. Conversely, it provides a collaborative organizational and software infrastructure to allow basic researchers to leverage insights arising from clinical studies.
The following scenario has been developed to illustrate the aforementioned goals. We recognize that many investigators might disagree with the path taken by this scenario, but its utility lies in the processes and functionality that it illustrates rather than in any particular investigative approach. Although inspired by some of the research of our investigators, the particulars of the disease and gene described below are fictitious and any resemblance to real genes and diseases is purely coincidental.
Jill Genomus, MD, is a clinical researcher with a long-standing interest in a rare neuromuscular disease, X. She is well aware that gene Y, a transcription factor in which mutations are responsible for 45% of all cases of X, was cloned 5 years ago. She hypothesizes that mutations in Y also contribute to the risk of a common type of cardiomyopathy, one in which presentation occurs in early middle age. She is now ready to spend some time and energy exploring her hypothesis.
Although there are many ways she would like to approach the problem, she first wants to know whether there are known associations in any model organisms between gene Y and heart disease. She enters the gene name and the term “cardiac” in the i2b2 clinical research chart, and the following facts shortly appear, along with appropriate illustrations and links back to the original databases:
- A quantitative trait locus for cardiac disease was found in consomic rats from the Medical College of Wisconsin database that spans the chromosomal region that includes the homologue for gene Y. A mouse model of the gene Y “knockout” is available at the Jackson Laboratory, with expression data available on several tissues.1
- A cardiac phenotype was observed for the homologue of gene Y in comprehensive genome-wide RNAi “knockdown” experiments in both D. melanogaster and C. elegans.2
- Gene Y is expressed in cardiac tissue and skeletal muscle, as is its protein.3
Encouraged by these preliminary findings, she proceeds to design an association study for that gene. She consults the i2b2 Study Design tool, which determines from the HapMap database4 which haplotypes exist across that gene. Based on the existence of areas of conserved homology across a wide span of the phylogenetic tree, 30 kilobases (kb) 5’ to the gene’s start site and 20 kb 3’ to the end of the gene’s untranslated region, the Study Design tool recommends that the association study cover a considerably larger stretch of DNA to include these possible regulatory regions.5 It also enters into her clinical research chart a map of that region along with the minimal set of haplotype-tagging SNPs6 as well as a full annotation of existing knowledge on all variants in this region.7
Given Dr. G.’s estimate of the lower bound of the risk effect and the distribution of haplotypes in that region8, the Study Design tool recommends a minimum of 5,500 individuals in each arm of the study. Undaunted, Dr. G. describes the patients she is interested in, using a formal but simple ontology, to the i2b2 distributed, peer-to-peer, multi-institutional query tool9 to determine whether that many individuals with the specified disease (and an equal number of controls) are available. Her research chart fills with a table showing which institutions have appropriate patients, samples suitable for DNA analysis (i.e. blood), and cardiac tissue suitable for expression and proteomic analysis10. In addition to the numbers of patients available, this hyperlinked table shows the contact information for the institutional review boards at each institution, the age and gender of the patients, current knowledge of the patient consent level, and anonymized11 pathology reports (for a subset of the patients). Estimated costs12 for performing the study using current pricing for genotyping on different platforms are also provided.
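How strongly the recommended arm size depends on the assumed effect can be illustrated with a standard two-proportion power calculation. The sketch below is not the Study Design tool's method, which the scenario does not specify; the function names, the carrier frequency, the odds ratios, and the default significance level and power are all illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_frequency(control_freq, odds_ratio):
    """Carrier frequency in cases implied by a control frequency and an odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def per_arm_sample_size(control_freq, odds_ratio, alpha=0.05, power=0.80):
    """Individuals needed per arm for a two-sided comparison of two proportions.

    Uses the usual normal-approximation formula; all inputs are assumptions
    supplied by the investigator, not values taken from the scenario.
    """
    if odds_ratio == 1:
        raise ValueError("an odds ratio of 1 means there is no effect to detect")
    p0 = control_freq
    p1 = case_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


if __name__ == "__main__":
    # Hypothetical haplotype carrier frequency of 20% in controls.
    for assumed_or in (1.1, 1.2, 1.5):
        n = per_arm_sample_size(control_freq=0.20, odds_ratio=assumed_or)
        print(f"odds ratio {assumed_or}: {n} individuals per arm")
```

Smaller assumed odds ratios drive the required arm size up sharply, which is why a conservative lower-bound estimate of the risk effect can lead to a recommendation in the thousands.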
As the study is conducted, i2b2 helps Dr. G. manage the collation of patient materials, allows Dr. G.’s research assistants to enrich the phenotypic annotations using a structured vocabulary, and supports sample tracking and quality control of the genotyping13. When the data collection is completed, i2b2 provides Dr. G. with the tools to analyze the population study, taking into account missing data and possible confounding due to population stratification14. The study does in fact show that a haplotype block in a conserved region 25 kb 5’ to gene Y is in highly significant linkage disequilibrium with the early cardiomyopathy phenotype. This haplotype block is large and contains 35 sites with common variants. To resolve which variant may be responsible for the disease, the i2b2 clinical chart presents several options, including: 1) a scan for putative and known transcription factor binding sites at or near these variants and for how these variants might affect the function of these sites15; 2) a structural and evolutionary biology evaluation of each variant for its probability of contributing to a change in phenotype16; 3) identification of a basic biology researcher with an interest in gene Y, or its superfamily, who could be approached for a functional assay of these variants17; and 4) other approaches, each with approximate time and dollar costs.
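The reason an association signal alone cannot single out one of the 35 sites is that variants within a haplotype block tend to be inherited together. A minimal sketch of the standard pairwise linkage disequilibrium measures (D, D' and r²) for two biallelic sites follows; the function name and the haplotype counts are hypothetical and do not come from the study described here.

```python
def pairwise_ld(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from counts of the four two-site haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes observed")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total  # frequency of allele A at the first site
    p_B = (n_AB + n_aB) / total  # frequency of allele B at the second site
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("LD is undefined when a site is monomorphic")

    # D: departure of the observed AB frequency from independence.
    d = p_AB - p_A * p_B

    # D': D scaled by the largest magnitude it could take given the allele frequencies.
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max

    # r^2: squared correlation between the alleles at the two sites.
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical counts in which the two sites are almost always co-inherited.
    d, d_prime, r2 = pairwise_ld(n_AB=480, n_Ab=20, n_aB=15, n_ab=485)
    print(f"D={d:.3f}  D'={d_prime:.3f}  r^2={r2:.3f}")
```

When r² between two sites is close to 1, either site predicts the phenotype about equally well, so functional or evolutionary evidence of the kind listed above is needed to separate them.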
Dr. G. is interested in finding a collaborator to perform functional assays of at least two of the variants, first in a cell line and then in a mouse model, but before proceeding with that year-long study, she wants additional confirmation that the mouse model will be relevant to the human myopathy. She obtains several endocardial biopsy samples from humans (identified in the earlier multi-institutional query) and, using the i2b2 expression analysis tools, compares their expression pattern to that found in the Y “knockout” mouse cardiac tissue from the Jackson Laboratory (identified in the i2b2 clinical chart) and to that of known human and murine “wild types.”18 i2b2 also identifies sequence analyses19 previously done on the patients from whom the endocardial biopsies were obtained, as well as the genotypes of the mouse models. Dr. G. verifies with the structural analysis suite20 of i2b2 that these mutations result in similar local changes in the conformation of the gene product in the conserved domains.
Because endocardial biopsies are more difficult to come by than peripheral blood samples, Dr. G. runs a multi-factorial comparison using the i2b2 expression analysis suite21 to measure the degree to which changes in expression in the human cardiomyopathy are tracked by changes in expression in lymphoblasts.
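One simple way to quantify how well one tissue tracks another is to correlate per-gene log fold changes (disease versus control) between the two tissues. The sketch below uses a plain Pearson correlation, which is a much simpler stand-in for the multi-factorial comparison described above; the function names, gene labels, and expression values are all hypothetical.

```python
from math import log2, sqrt


def log2_fold_changes(disease, control):
    """Per-gene log2(disease / control) for genes measured in both conditions."""
    return {g: log2(disease[g] / control[g]) for g in disease if g in control}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def tracking_score(cardiac_lfc, lymphoblast_lfc):
    """Correlation of fold changes over the genes measured in both tissues."""
    shared = sorted(set(cardiac_lfc) & set(lymphoblast_lfc))
    if len(shared) < 2:
        raise ValueError("need at least two shared genes")
    return pearson(
        [cardiac_lfc[g] for g in shared],
        [lymphoblast_lfc[g] for g in shared],
    )


if __name__ == "__main__":
    # Hypothetical mean expression levels (arbitrary units) per gene.
    cardiac = log2_fold_changes(
        {"geneA": 8.0, "geneB": 1.0, "geneC": 3.0},
        {"geneA": 2.0, "geneB": 4.0, "geneC": 3.0},
    )
    lymphoblast = log2_fold_changes(
        {"geneA": 6.0, "geneB": 2.0, "geneC": 5.0},
        {"geneA": 3.0, "geneB": 4.0, "geneC": 4.0},
    )
    print(f"tracking score: {tracking_score(cardiac, lymphoblast):.2f}")
```

A score near 1 would suggest that lymphoblast expression could serve as a proxy for cardiac tissue for the genes examined, while a score near 0 would suggest it could not.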
One year later, after Dr. G. has found a variant that appears to cause a reproducible change in the expression of Y as well as a cardiac phenotype in the mouse model, she wonders whether other cardiomyopathies are caused by variants in genes that are in the same signaling pathway as Y or whose expression is directly or indirectly influenced by Y. i2b2 provides her with access to several public and commercial pathway databases and also compiles from the online biomedical literature (using several textual information retrieval tools) a list of possible genes22. Perusing the list, Dr. G. identifies a handful of particularly promising genes and then submits them to the i2b2 site, and the cycle continues…
1 Integration of databases from diverse locations in a gene-centric, chromosome-locus-centric, and disease-centric manner (Core 2).
2 Gene annotation from publications and on-line data sets of various formats (Core 2).
3 Ontologies of anatomic location and function.
4 Integration with population databases, ascertainment bias adjustment (Core 2, Core 1).
5 Comparative genomics, ontology/annotation of gene structure/function relationships (Core 1, 2).
6 Genomic clinical design aid (Core 1).
7 Information retrieval, natural language processing, ontologies (Core 2).
8 Study design over uncertainty and high dimensionality (Core 1).
9 Distributed, multi-institutional query systems (Core 2).
10 Information retrieval from clinical databases, quality control (Core 2).
11 Anonymization toolkit (Core 2).
12 Virtual notebook, data visualization (Core 2).
13 Data collection tools oriented towards clinical and genomic types (Core 2).
14 Robust, reproducible analysis of noisy, incomplete and high-dimensional data sets (Core 1).
15 Alignment/consensus, structure-function relationships (Core 1).
16 Comparative genomics, structure-function relationships (Core 1).
17 Resources directory (Core 2).
18 Application of microarray analysis techniques (Core 2).
19 Retrieval and “tagging” of prior studies (Core 4).
20 Structural biology, conservation, domain analysis (Core 1).
21 Pathway annotation, pathway retrieval, retrieval from large existing genomic databases (Core 2).