The Wheeler lab has been awarded a four-year, $1.15M grant (NIH R01) to develop machine learning approaches for improved accuracy and speed in sequence annotation.
Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop machine learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and artificial neural networks to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We will also address annotation speed by developing a custom deep learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step.
If you’re reading this, maybe you’ve caught the big picture: we’ll be looking for people to help with these important and exciting projects. If they sound fun to you, get in touch!
Lab member Anna Marbut just presented a workshop on data management for granting organizations at the Space Grants Western Regional conference. Anna is pictured on the right in the photo below. Caitlin Stainken (of Submittable) and a NASA employee are to her left.
The Dfam group met up in Palm Springs this week to attend FASEB Mobile DNA 2019. As always, the conference was terrific. Travis Wheeler talked about “Sequence Methods for Increasing Sensitivity and Reducing Errors in TE Annotation”, while Wheeler lab member Kaitlin Carey presented her cool recent work in a poster titled “Annotation Confidence Estimates Improve Transposable Element Annotation with Subfamilies”.
Meanwhile, Dfam collaborators Jeb Rosen (with help from Robert Hubley and Arian Smit, not shown) presented their poster “Dfam 3: An open community resource for transposable element annotations, consensus sequences, and profile Hidden Markov Models”.
Several of us recently attended AlCoB 2019 in Berkeley. All six attending students presented both talks and posters (sampled in pictures below). Alex Nord discussed his work on splice-aware profile HMMs, Jack Roddy presented work on reducing the nasty problem of overextension of sequence alignments, Kaitlin Carey described her cool results on using sequence annotation confidence to improve annotation (including of homologous recombination), Tim Anderson described his new FPGA accelerator for profile HMM search, Sarah Walling described progress in understanding surprising alternative splicing outcomes, and Daniel Olson presented advances in annotating tandemly repetitive sequence regions with ULTRA.
We also got a chance to visit the Computer Research Division at LBNL (where Genevieve Krause will be spending a summer). Part of that visit included an introduction to a test FPGA system (thanks, Andrew and Farzad!).
Our collaborators in the Insel lab have been awarded an R15 grant from the NIH, to study learning and neural coding of social expectations. The work will be performed mostly by folks in the Insel group, but we’re excited to help develop computational methods for classification of video and neural recordings.
Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence, mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high-quality methods for identifying and labeling them.
In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded to include TE families from four additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors, and despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [1, 2]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.
The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes, and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualizing this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.
Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit; the three are developing the infrastructure required to support future Dfam growth.
Two Wheeler lab members recently presented work at the 2018 ACM-BCB conference in Washington, DC, with corresponding papers appearing in the ACM-BCB conference proceedings.
Alex Nord presented his work on Mirage, a splice-aware tool for aligning protein isoforms within and between species. See the paper here.
Daniel Olson presented his work on ULTRA, a model-based method for labeling repetitive regions of biological sequences. See the paper here.