New $3.2M NIH Grant Supporting Dfam

Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence, mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high-quality methods for identifying and labeling them.

In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded to include TE families from 4 additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors, and despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [ 12 ]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.

The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualizing this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.

Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit, and the three are developing the infrastructure required to support future Dfam growth.

NIH R15 grant for improved sequence database search

The Wheeler lab has been awarded an NIH R15 grant from the National Institute of General Medical Sciences to develop “Improved protein-DNA models for translated sequence search with profile Hidden Markov models”. The grant is for $426K over three years, beginning April 1, 2017.

Fast and sensitive sequence database search is fundamental to modern molecular biology. The funded research will improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and a better ability to model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.

Dr. Wheeler’s group will develop new models that account for frameshifting mutations or errors that obscure the protein-coding nature of a sequence, and for splice sites that break genes or domains into distant fragments on the genome. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Institute’s Metagenomics Portal.

(See the press release.)

Pilot grant to reduce error in sequence annotation

The Wheeler lab has been awarded competitive pilot project funding through the University of Montana Center for Biomolecular Structure and Dynamics (CBSD). The grant will support early development of methods for reducing false annotation of large genomic DNA datasets caused by repetitive sequence and alignment overextension, and will fund two students in our group for the next year. CBSD is supported by a National Institute of General Medical Sciences (NIH NIGMS) IDeA program Center of Biomedical Research Excellence (COBRE) Phase II grant.