Uhoh

Folks in the Wheeler lab maintain versioned software using git, with repos on GitHub. I live in (minor) fear that something bad might happen – maybe a repo gets accidentally deleted at exactly the same moment that the only computer it’s on bursts into flames, or all the repos get intentionally deleted by some hacker who demands my first born child for their safe return (maybe worth it, to save the effort of cobbling together all the various projects that are currently strewn across more than a dozen computers).

That’s where Uhoh comes in. Uhoh lets you back up all your GitHub repos quickly and conveniently using, well, Git. Thanks to the incomparable George Lesica for that (see his original release post).

Uhoh queries the GitHub API for a list of repos (which can be filtered by owner and name), then checks its backup location for a clone of each one. If it finds a clone, it runs a git pull. If not, it runs a git clone. Either way, you end up with a backup copy of your repos. Run it in a nightly cron job, and you’ll have one less thing to worry about.
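For the curious, the pull-or-clone decision can be sketched in a few lines of Python. This is only an illustration of the idea, not Uhoh's actual code (Uhoh is written in Dart). The function names `backup_command` and `backup_all` are made up for this post, and the sketch assumes you already have a list of clone URLs in hand rather than fetching it from the GitHub API.

```python
import subprocess
from pathlib import Path


def backup_command(clone_url, backup_dir):
    """Return the git command that brings the backup of one repo up to date.

    Hypothetical helper for illustration; not part of Uhoh.
    """
    # Derive a local directory name from the URL, e.g. ".../uhoh.git" -> "uhoh".
    name = clone_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    target = Path(backup_dir) / name

    if (target / ".git").is_dir():
        # A clone already exists in the backup location: update it in place.
        return ["git", "-C", str(target), "pull"]
    # No clone yet: create one.
    return ["git", "clone", clone_url, str(target)]


def backup_all(clone_urls, backup_dir):
    """Run the pull-or-clone step for every repo in the list."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    for url in clone_urls:
        # check=True raises an error if git exits with a non-zero status.
        subprocess.run(backup_command(url, backup_dir), check=True)
```

Calling `backup_all(["https://github.com/TravisWheelerLab/uhoh.git"], "backups")` would clone the repo on the first run and pull on later runs. Keeping the decision in `backup_command` separate from the `subprocess` call makes the logic easy to check without touching the network.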

Uhoh is written in Dart. Check it out on GitHub: https://github.com/TravisWheelerLab/uhoh. The README contains basic instructions for use. See the releases page for pre-compiled downloads.

Conference on Algorithms for Computational Biology in Missoula, June 2021.

We were originally planning to host the 7th International Conference on Algorithms for Computational Biology (AlCoB) in Missoula back in April of this year. Then 2020 decided it didn’t want conferences in April (also, there was this pandemic; maybe you heard about it?), so we put it on ice.

Well … it’s back. Working under the optimistic assumption that in-person conferences will make sense by June 2021, we’re all set to host a new-and-improved “7th-8th International Conference on Algorithms for Computational Biology”, which will merge the scheduled program for AlCoB 2020 with a new series of papers submitted for the current year. Find out more (and submit a paper) at https://irdta.eu/alcob2020-2021/.

New $1.05M grant – more Machine Learning, plus Multi-omics!

We’ve just been awarded a $1.05M DOE grant, in collaboration with Jason McDermott’s group at PNNL, to develop Machine Learning approaches for integrating multi-omics data, with the goal of expanding microbiome annotation.

The project is motivated by the need to understand soil communities that play a key role in the plant-soil dynamic, with impacts on food- and fuel-crop production. To understand the roles of these microbial communities, it is vital to maximally annotate their genomic and functional capacity, yet the majority of data from newly acquired microbiomes remains unannotated.

This project will focus on the development of a novel method for incorporating non-genomic information into the process of annotating genomic sequence, and two complementary strategies building on recent advances in alignment-based and alignment-free labeling. In combination, these approaches are expected to substantially increase the completeness of labeling for difficult-to-annotate microbiome datasets.

If you’re reading this, and think “hey, that sounds like fun!”, get in touch!

New $1.1M NIH Grant – Machine Learning in genome annotation

The Wheeler lab has been awarded a $1.15M four-year grant (NIH R01) to develop machine learning approaches for improved accuracy and speed in sequence annotation.

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We will also address annotation speed by developing a custom Deep Learning architecture designed to quickly filter out large portions of candidate sequence comparisons before the relatively slow sequence-alignment step.

If you’re reading this, maybe you’ve caught the big picture: we’ll be looking for people to help with these important and exciting projects.  If they sound fun to you, get in touch!

Dfam @ FASEB – Mobile DNA

The Dfam group met up in Palm Springs this week to attend FASEB Mobile DNA 2019. As always, the conference was terrific. Travis Wheeler talked about “Sequence Methods for Increasing Sensitivity and Reducing Errors in TE Annotation”, while Wheeler lab member Kaitlin Carey presented her cool recent work in a poster, “Annotation Confidence Estimates Improve Transposable Element Annotation with Subfamilies”.

Meanwhile, Dfam collaborators Jeb Rosen (with help from Robert Hubley and Arian Smit, not shown) presented their poster “Dfam 3: An open community resource for transposable element annotations, consensus sequences, and profile Hidden Markov Models”.

Students present at AlCoB 2019

Several of us recently attended AlCoB 2019 in Berkeley. All six attending students presented both talks and posters (sampled in pictures below). Alex Nord discussed his work on splice-aware profile HMMs, Jack Roddy presented work on reducing the nasty problem of overextension of sequence alignments, Kaitlin Carey described her cool results on using sequence annotation confidence to improve annotation (including of homologous recombination), Tim Anderson described his new FPGA accelerator for profile HMM search, Sarah Walling described progress in understanding surprising alternative splicing outcomes, and Daniel Olson presented advances in annotating tandemly repetitive sequence regions with ULTRA.

 

We also got a chance to visit the Computer Research Division at LBNL (where Genevieve Krause will be spending a summer). Part of that visit included an introduction to a test FPGA system (thanks Andrew and Farzad!)

New $3.2M NIH Grant Supporting Dfam

Mammalian and most other eukaryotic genomes contain a large amount of repetitive sequence,  mostly the remnants of ancient duplications of DNA segments called transposable elements (TEs). TEs have played a critical role in mammalian evolution, and their presence complicates genome sequence analysis in ways that demand high quality methods for identifying and labeling them.

In 2012, we released Dfam, an open-access database of profile hidden Markov models (HMMs) and corresponding metadata for transposable elements in the human genome, and showed that the use of profile HMMs enabled annotation of an additional 5% of the human genome (>150 million nucleotides). We used the human TE families for this proof-of-principle project and shortly thereafter expanded to include TE families from 4 additional model organisms, demonstrating both the utility and viability of this resource. The Dfam datasets have been used in a wide variety of research endeavors and, despite the small number of species represented in this proof-of-principle resource, the Dfam papers have been cited nearly 200 times [ 12 ]. In addition, we integrated Dfam with RepeatMasker, using our software nhmmer, making it possible to produce high-quality annotations of TE families in complete genomes.

The Dfam consortium has now been awarded a 5-year, $3.2M NIH resource grant to build a sustainable framework for the expansion and improvement of the Dfam resource, with ~$400K supporting work in the Wheeler lab at the University of Montana. With support from this grant, we will develop the Dfam infrastructure to expand to thousands of genomes, and establish a self-sustaining TE Data Commons that enables community contribution of TE datasets with limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualizing this complex data type, and improve the modeling of TE subfamilies. By further developing this open-access database, we will provide a strong incentive to reverse the proliferation of unaffiliated, non-standard repeat datasets and ease the burden of data management for those developing TE libraries.

Progress is already underway. Kaitlin Carey, a graduate student in the Wheeler lab, has made important progress in understanding the landscape of annotation confidence in this complex domain. Meanwhile, Jeb Rosen (a recent graduate of the University of Montana Computer Science program) has joined forces with Robert Hubley and Arian Smit, and the three are developing the infrastructure required to support future Dfam growth.

Talks at ACM-BCB in DC

Two Wheeler lab members recently presented work at the 2018 ACM-BCB conference in Washington DC, with corresponding papers appearing in the ACM-BCB conference proceedings.

Alex Nord presented his work on Mirage, a splice-aware tool for aligning protein isoforms within and between species. See the paper here.

Daniel Olson presented his work on ULTRA, a model-based method for labeling repetitive regions of biological sequences. See the paper here.