Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence, mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high-quality methods for identifying and labeling them.
In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded to include TE families from 4 additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors, and despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [1, 2]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.
The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes, and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualizing this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.
Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit; the three are developing the infrastructure required to support future Dfam growth.
Two Wheeler lab members recently presented work at the 2018 ACM-BCB conference in Washington DC, with corresponding papers appearing in the ACM-BCB conference proceedings.
Alex Nord presented his work on Mirage, a splice-aware tool for aligning protein isoforms within and between species. See the paper here.
Daniel Olson presented his work on ULTRA, a model-based method for labeling repetitive regions of biological sequences. See the paper here.
Several of us recently returned from attending ISMB 2018 in Chicago.
Alex Nord presented his work on Mirage, a tool for aligning protein isoforms, in a talk and later a poster. The work will appear in the Proceedings of the ACM-BCB 2018 conference.
Daniel Olson presented his work on ULTRA, a model-based method for labeling repetitive regions of biological sequences. His work will also appear in the ACM-BCB 2018 Proceedings. Genevieve Krause presented rapidly progressing work on an implementation of a profile HMM for annotating protein-coding DNA containing frameshift-inducing indels. Kaitlin Carey's poster described a method for assessing confidence in annotation when multiple related families compete to label a sequence. Tim Anderson described steps toward a new FPGA-acceleration approach to profile HMM search. Jack Roddy presented exciting early results for methods to reduce the overextension of sequence alignments into nonhomologous sequence regions.
(After reviewing the photos of everyone presenting their posters, you can see that they’re mostly doing a great job of following my instructions to talk to people while positioning your hands as if you’re holding a watermelon! 😄 )
Over the past year, several undergrads have participated in research with the group. Three of them shared the results of their work at the “UM Conference on Undergraduate Research” on April 27. Kudos to Jack Roddy, Conner Copeland, and Sarah Walling. A special congratulations to Sarah, who was awarded a “best poster” award!
Also, a belated congratulations to Joyce Liu, a high school student working in our lab, who was awarded the first-place poster award at the 63rd Annual Montana Science Fair, hosted at UM.
Jack Roddy – Machine Learning strategies for improving sequence alignment boundaries
Conner Copeland – Southeast Asian plant phylogeny
Sarah Walling – Characterizing proteins containing Alternative Reading Frames
Sarah Walling – “Best poster” award
The Wheeler lab has been awarded the CBSD COBRE Junior Investigator project grant to develop “methods for fast bio-sequence comparison with profile hidden Markov models”. The grant will provide $450K in funding over three years, starting … now!
A good chunk of the lab had the pleasure of attending ISMB 2017 in Prague, which ended just a couple of days ago. Before and during the conference, we had a chance to meet up with several lab collaborators, and to attend an endless stream of informative talks. We also managed to get a little touring in, with trips to the castle, the Charles Bridge, a unique marionette performance of Don Giovanni, and exposure to the singularly Czech style of table service.
Since its inception in 2012, Dfam has demonstrated the promise of using profile hidden Markov models (HMMs) to improve the detection sensitivity and annotation quality of transposable element (TE) families in human and subsequently in four additional reference organisms. Despite these advances, the tools used to discover new families (de novo repeat finders), improve families (extension, defragmentation, subfamily clustering), and classify TE families continue to depend on consensus sequence models. This discordance between methodologies is a direct impediment to Dfam’s expansion.
Read more: Introducing Dfam_consensus – Dfam’s consensus sequence twin