Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence, mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high-quality methods for identifying and labeling them.
In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded Dfam to include TE families from four additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors, and despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [1, 2]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.
The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualizing this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.
Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit, and the three are developing the infrastructure required to support future Dfam growth.