We were originally planning to host the 7th International Conference on Algorithms for Computational Biology (AlCoB) in Missoula back in April of this year. Then 2020 decided it didn’t want conferences in April (also, there was this pandemic; maybe you heard about it?), so we put it on ice.
Well … it’s back. Working under the optimistic assumption that in-person conferences will make sense by June 2021, we’re all set to host a new-and-improved “7th-8th International Conference on Algorithms for Computational Biology”, which will merge the scheduled program for AlCoB 2020 with a new series of papers submitted for the current year. Find out more (and submit a paper) at https://irdta.eu/alcob2020-2021/.
We’ve just been awarded a $1.05M DOE grant, in collaboration with Jason McDermott’s group at PNNL, to develop Machine Learning approaches for integrating multi-omics data, with the goal of expanding microbiome annotation.
The project is motivated by the need to understand soil communities that play a key role in the plant-soil dynamic, with impacts on food- and fuel-crop production. To understand the roles of these microbial communities, it is vital to maximally annotate their genomic and functional capacity, yet the majority of data from newly acquired microbiomes remains unannotated.
This project will focus on the development of a novel method for incorporating non-genomic information into the process of annotating genomic sequence, and two complementary strategies building on recent advances in alignment-based and alignment-free labeling. In combination, these approaches are expected to substantially increase the completeness of labeling for difficult-to-annotate microbiome datasets.
If you’re reading this, and think “hey, that sounds like fun!”, get in touch!
The Wheeler lab has been awarded a $1.15M four-year grant (NIH R01) to develop machine learning approaches for improved accuracy and speed in sequence annotation.
Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both the accuracy and speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We will also address the issue of annotation speed by developing a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step.
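For readers who like to see the shape of an idea in code, here is a minimal sketch of the "filter first, align later" pattern described above. It is not our method: a simple shared k-mer count stands in for the Deep Learning filter, a small Smith-Waterman scorer stands in for the slow alignment step, and every name and parameter (`passes_filter`, `slow_align_score`, `search`, `k`, `min_shared`, the scoring values) is a hypothetical chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_filter(query: str, target: str, k: int = 3, min_shared: int = 2) -> bool:
    """Cheap stand-in filter (an assumption, not a neural network):
    keep a pair only if it shares at least min_shared distinct k-mers."""
    return len(kmers(query, k) & kmers(target, k)) >= min_shared


def slow_align_score(query: str, target: str) -> int:
    """Stand-in for the expensive step: Smith-Waterman local alignment score
    with illustrative scoring (match +2, mismatch -1, linear gap -2)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for qc in query:
        curr = [0]
        for j, tc in enumerate(target, start=1):
            diag = prev[j - 1] + (2 if qc == tc else -1)
            score = max(0, diag, prev[j] - 2, curr[j - 1] - 2)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(
    query: str,
    targets: Iterable[Tuple[str, str]],
    k: int = 3,
    min_shared: int = 2,
) -> List[Tuple[str, int]]:
    """Run the slow aligner only on targets that survive the cheap filter."""
    hits = []
    for name, seq in targets:
        if passes_filter(query, seq, k, min_shared):
            hits.append((name, slow_align_score(query, seq)))
    return hits


if __name__ == "__main__":
    database = [("t1", "TTACGTACGTTT"), ("t2", "GGGGGGGGGG")]
    # Only t1 shares k-mers with the query, so only t1 reaches the aligner.
    print(search("ACGTACGT", database))
```

The point of the design is that the filter costs far less per comparison than the dynamic-programming step, so discarding most candidates early saves time as long as the filter rarely throws away true homologs.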
If you’re reading this, maybe you’ve caught the big picture: we’ll be looking for people to help with these important and exciting projects. If they sound fun to you, get in touch!
Lab member Anna Marbut just presented a workshop on data management for granting organizations, at the Space Grants Western Regional conference. Anna is pictured on the right in the photo below. Caitlin Stainken (of Submittable) and some NASA employee are to her left.
The Dfam group met up in Palm Springs this week to attend FASEB Mobile DNA 2019. As always, the conference was terrific. Travis Wheeler talked about “Sequence Methods for Increasing Sensitivity and Reducing Errors in TE Annotation”, while Wheeler lab member Kaitlin Carey presented her cool recent work in a poster “Annotation Confidence Estimates Improve Transposable Element Annotation with Subfamilies”.
Meanwhile, Dfam collaborator Jeb Rosen (with help from Robert Hubley and Arian Smit, not shown) presented the poster “Dfam 3: An open community resource for transposable element annotations, consensus sequences, and profile Hidden Markov Models”.
Several of us recently attended AlCoB 2019 in Berkeley. All six attending students presented both talks and posters (sampled in pictures below). Alex Nord discussed his work on splice-aware profile HMMs, Jack Roddy presented work on reducing the nasty problem of overextension of sequence alignments, Kaitlin Carey described her cool results on using sequence annotation confidence to improve annotation (including of homologous recombination), Tim Anderson described his new FPGA accelerator for profile HMM search, Sarah Walling described progress in understanding surprising alternative splicing outcomes, and Daniel Olson presented advances in annotating tandemly repetitive sequence regions with ULTRA.
We also got a chance to visit the Computer Research Division at LBNL (where Genevieve Krause will be spending a summer). Part of that visit included an introduction to a test FPGA system (thanks Andrew and Farzad!).
Our collaborators in the Insel lab have been awarded an R15 grant from the NIH, to study learning and neural coding of social expectations. The work will be performed mostly by folks in the Insel group, but we’re excited to help develop computational methods for classification of video and neural recordings.
Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence, mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high quality methods for identifying and labeling them.
In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded to include TE families from 4 additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors and, despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [1, 2]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.
The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes, and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualization of this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.
Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit, and the three are developing the infrastructure required to support future Dfam growth.
Several of us recently returned from attending ISMB 2018 in Chicago.
Alex Nord presented his work on Mirage, a tool for aligning protein isoforms, in a talk and later a poster. The work will appear in the Proceedings of the ACM-BCB 2018 conference.
Daniel Olson presented his work on ULTRA, a model-based method for labeling repetitive regions of biological sequences. His work will also appear in the ACM-BCB 2018 Proceedings. Genevieve Krause presented rapidly progressing work on an implementation of a profile HMM for annotating protein-coding DNA containing frameshift-inducing indels. Kaitlin Carey’s poster described a method for assessing confidence in annotation when multiple related families compete for labeling a sequence. Tim Anderson described steps toward a new FPGA-acceleration approach to profile HMM search. Jack Roddy presented exciting early results for methods to reduce the overextension of sequence alignments into nonhomologous sequence regions.
(After reviewing the photos of everyone presenting their posters, you can see that they’re mostly doing a great job of following my instructions to talk to people while positioning your hands as if you’re holding a watermelon! 😄 )