Free draft ebook – Pandora’s Toolbox for Bioinformatics

I am putting together an e-book describing the tools in Pandora’s Toolbox for Bioinformatics. It is free, so feel ‘free’ to grab a copy! Beware – the current version is far from complete, but once you get a copy, you will be able to access the improved versions at no extra cost.

Pandora’s Toolbox includes a small subset of mostly recent bioinformatics programs (BLAST, HMMER, RAPsearch2, BWA-MEM, Samtools, KMC2, BCALM, Minia, SPAdes, SOAPdenovo, SOAPdenovo-Trans, Trinity, Sailfish, TopHat, Cufflinks, DALIGNER, HGAP, DBG2OLC, PHYLIP) to help a researcher solve problems in biology. One overarching goal is to let researchers get a feel for the algorithms rather than use the programs as black boxes. Therefore, each chapter describes one program and includes the following sections – (i) history, (ii) how to run, (iii) features, (iv) details of the algorithm, (v) details of the code, (vi) further readings.

Status of the genome assembly book

I had been working extensively on the genome assembly book, but decided to take a detour to put together this manual, because I kept referring to the programs in Pandora’s Toolbox again and again in the assembly book and felt that it would help to cover them well elsewhere. I plan to spend a week completing and editing the sections of this ‘Pandora’s Toolbox for Bioinformatics’ manual and then go back to the assembly book right after that.

Readers, please help

I picked a wide and representative subset of programs for Pandora’s Toolbox for Bioinformatics. Each of the programs is reasonably efficient and algorithmically innovative as well. However, if you think an important category or program is not covered, please let me know.

ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching

A new scaffolding paper has been published in Bioinformatics.

MOTIVATION:
Next-generation high-throughput sequencing (HTS) has become a state-of-the-art technique in genome assembly. Scaffolding is one of the main stages of the assembly pipeline, during which the contigs assembled from paired-end reads are merged into bigger chains called scaffolds. Due to a high level of statistical noise, chimeric reads and genome repeats, scaffolding is a challenging task. Current scaffolding software packages vary widely in their quality and are highly dependent on the read data quality and genome complexity. There are no clear winners, and multiple opportunities for further improvement of the tools still exist.
RESULTS:
This paper presents an efficient scaffolding algorithm, ScaffMatch, that is able to handle reads with both short (<600 bp) and long (>35,000 bp) insert sizes, producing high-quality scaffolds. We evaluate our scaffolding tool with the F-score and other metrics (N50, corrected N50) on 8 data sets, comparing it with the most widely available packages. Our experiments show that ScaffMatch is the tool of preference for most of the data sets.
AVAILABILITY:
The source code is available at http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch.
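
Since the abstract evaluates tools with N50 and corrected N50, a quick sketch of how plain N50 is computed may help readers unfamiliar with the metric. The function below is a minimal illustration written for this post, not code from the ScaffMatch paper.

def n50(lengths):
    """Return N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy scaffold lengths (kb): half of the 155 kb total is reached
# once the 50 kb and 40 kb scaffolds are included, so N50 = 40.
print(n50([50, 40, 30, 20, 15]))  # prints 40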

GATK Battle and BALSA versus GATK Comparison

In the context of “GATK Battle – 2015 Edition“, I requested Ruibang Luo, the first author of the BALSA paper, to send me some comparative stats of BALSA vs GATK.

Apparently, the GCAT site at Bioplanet hosts such a comparison, open to everyone. Going through the charts, I do not see any qualitative difference between the two programs. In other words, if someone removed the program names from the website and asked me to pick out GATK from the comparative charts, I could not have done so.

Here is one of the many comparative charts on their site –

[comparative chart from the GCAT site]

Given that BALSA runs far faster, produces results as good as GATK’s and comes at no cost, it is unclear to me why people would pay so much for GATK. One number quoted on Twitter was ~$40K/license (!!!), although others said the actual price is ‘not that high’.

GATK Battle – 2015 Edition

The time for the Two Minutes Hate over GATK is back.

Last time, it was due to GATK being licensed through a private company under contract with the Broad Institute (check “Sanger Dropping Broad’s GATK after License Change“). This time, people are upset to see the Broad Institute running the licensing service directly.

Direct licensing and support through Broad

Until now, we have relied on a commercial partner to provide licensing and premium support services. Starting April 16, 2015, we will be providing licensing and support directly to commercial entities that will be running the GATK or MuTect internally or as part of their own hardware offering. Current licensed users will transition to Broad Institute when their current license expires. This new model will allow licensed customers better access to the GATK and MuTect development and support teams, full support for the latest releases of our tools, and the most up-to-date Best Practice recommendations that are based on our team’s extensive analysis and R&D work.

The biggest mystery, of course, is why GATK is still in use. BWA-MEM, the most efficient component of the GATK pipeline, is available for free (GPL). The remaining components are written in Java and are reportedly quite bulky – everyone complains about them. Isn’t it time to switch to BALSA from T.-W. Lam’s group and save both (running) time and money? Keeping BWA-MEM and combining it with the other components of BALSA would be a viable option as well. Heng Li had other suggestions, mentioned in the following link.

What Will Replace GATK in the BALSA Age?

Using de Bruijn Graphs for Short-read Assembly – (ii)

Here is the next part of the first chapter of the genome assembly book that I am working on. Please feel free to grab a copy from here, if you like. You can find more details at this link.

—————————————-

Description of this book

Over the last four years, several introductory articles on de Bruijn graph-based assembly programs were published at the homolog.us blog¹, and they remain popular among our readers. For their convenience, the articles were compiled into tutorials, and this book is an expansion of those tutorials exploring the same topics in further detail. This book describes how de Bruijn graph-based assembly programs work, explains why the assemblers need so much RAM, shows the impact of sequencing errors on graph structures, discusses the differences between genome assemblers, transcriptome assemblers and metagenome assemblers, and covers many recent algorithmic developments.

Chapters

i) Chapter “The Genome Assembly Problem” explains the concept of shotgun assembly and introduces the readers to various assembly algorithms, such as the greedy strategy, the overlap-layout-consensus method and the de Bruijn graph.

ii) Chapter “De Bruijn Graph of a Genome” explains what the de Bruijn graph of an already assembled genome looks like and demonstrates how perfect reads can be assembled using the de Bruijn graph approach (a toy construction is sketched in the code after this list).

iii) Chapter “Experimental Considerations” covers sequencing technologies.

iv) Chapter “Genome Assembly from Noisy Reads with Uneven Coverage” explains genome assembly procedures for real-life reads containing errors.

v) Chapter “Assembling Transcriptomes, Metagenomes and Highly Polymorphic Genomes” shows how to solve other assembly problems.

vi) Chapter “Faster, Better and Cheaper” discusses various algorithmic improvements.

vii) Chapter “In Depth Discussion of Three de Bruijn Graph-based Assemblers” discusses the algorithms and codes of three assemblers in detail.
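
As a small taste of the material in the second chapter, here is a minimal sketch of the de Bruijn graph approach in Python, assuming short, error-free reads. The function names and toy reads are illustrative inventions for this post, not code from any assembler.

from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers following it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedily walk unvisited edges to spell out a contig."""
    contig, node = start, start
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]
    return contig

# three overlapping error-free reads from the genome ATGGCGTGCA
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_edges(reads, k=4)
print(walk(graph, "ATG"))  # reassembles ATGGCGTGCA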

Algebraic notations are avoided as much as possible

To make this book more accessible to biologists, preference is given to a pictorial style of narration over the algebraic notation commonly used in bioinformatics papers. Readers should note that our chosen style is in no way inferior to algebra and can in fact provide a better understanding of the graph-based algorithms used for assembly. Computational papers prefer algebraic notation primarily because it makes the description more amenable to computer programming.

Living electronic book
This book chooses a publishing style that allows regular updates, keeping the content current with the rapidly advancing NGS bioinformatics. Unlike other technical books, whose contents are static once published, this electronic book will continue to be updated with new information, and readers will be able to download and read the latest version at no extra cost. Leanpub, as an online publisher, allows readers to take advantage of this feature. In addition, the book provides the other advantages of electronic publishing, such as hyperlinks between chapters and the search capabilities of electronic readers.

Core concepts are reinforced through repetition

The primary aim of this book is to make sure the readers fully understand and remember the core concepts after reading it. Therefore, entire chapters in the early part of the book are dedicated to explaining ideas covered in at most one or two sentences in technical papers, whereas a later chapter, “Faster, Better and Cheaper”, condenses discoveries presented in many advanced papers. If the readers grasp the core concepts explained in the earlier chapters, they will be able to go through the advanced papers themselves with some contextual help from our explanations. Explanations of some key concepts are repeated in multiple places to facilitate understanding, and the codes and exercises are expected to further reinforce the learning.

Regarding the exercises

Code – Pandora’s Toolbox for Bioinformatics

Many useful bioinformatics programs are scattered all over the internet, and it is not easy for a newcomer to locate them without some guidance. Therefore, Pandora’s Toolbox for Bioinformatics² brings a number of useful codes together. The toolbox is named after Symbion pandora, an unusual animal discovered in 1995.

—————————–
Fig 3. Symbion pandora (image courtesy Dr. Peter Funch)

—————————–

Within Pandora’s Toolbox, three programs – kmc, bcalm and tip-cutter – will be frequently used in this book. Graphviz, a fourth program, is useful for viewing the de Bruijn graphs.

Program      Use
kmc          k-mer counting
bcalm        de Bruijn graph compaction
tip-cutter   removing tips from the graph
bandage      viewing de Bruijn graphs
bwa          aligner
samtools     processing alignments

The program kmc takes one or more read libraries as input and counts the k-mers of a chosen size (k) within them. The following commands produce two binary files – kmers.kmc_pre and kmers.kmc_suf. K-mers present only once in the libraries are removed.

% mkdir tmp
% kmc -k25 [library] kmers tmp
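
For intuition, the counting step can be pictured in a few lines of Python. This toy counter, which unlike kmc ignores reverse complements and keeps everything in memory, is purely illustrative.

from collections import Counter

def count_kmers(reads, k, min_count=2):
    """Count every k-mer in the reads and discard the ones seen
    fewer than min_count times (singletons are likely errors)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

reads = ["ATGGCGT", "TGGCGTA", "ATGGCGT"]
print(count_kmers(reads, k=5))  # GCGTA occurs once and is dropped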

The program kmc_dump_bcalm extracts the k-mers present 4 or more times from the binary output of kmc and saves them in the text file in.dot, in a format suitable for bcalm.

% kmc_dump_bcalm -ci4 kmers in.dot

The program bcalm builds a de Bruijn graph from the k-mers in in.dot and compacts its linear portions to generate out.dot. Then tip-cutter removes all tips from out.dot and saves the result in out-clean.dot.

% bcalm in.dot out.dot
% tip-cutter out.dot > out-clean.dot
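
The graph-cleaning steps can also be pictured in miniature. The sketch below imitates what tip-cutter does conceptually – clipping short dead-end chains, which typically arise from sequencing errors – on a toy digraph. It is illustrative code written for this post, not the real tool; bcalm-style compaction (merging non-branching chains into unitigs) works in the same spirit.

from collections import defaultdict

def clip_tips(succ, max_tip_len=2):
    """succ maps each node to a list of successors; dead-end chains
    of at most max_tip_len nodes are deleted as presumed error tips."""
    removed = True
    while removed:
        removed = False
        indeg = defaultdict(int)
        for outs in succ.values():
            for v in outs:
                indeg[v] += 1
        for node in list(succ):
            if node not in succ or succ[node]:
                continue                        # only start at dead ends
            path, cur = [node], node
            while len(path) <= max_tip_len:     # walk back along the chain
                parents = [u for u in succ if cur in succ[u]]
                if len(parents) == 1 and len(succ[parents[0]]) == 1 \
                        and indeg[cur] == 1:
                    cur = parents[0]
                    path.append(cur)
                else:
                    break
            if len(path) <= max_tip_len:        # short enough: clip it
                for v in path:
                    succ.pop(v, None)
                for outs in succ.values():
                    for v in path:
                        if v in outs:
                            outs.remove(v)
                removed = True
    return succ

# main path A->B->C->D->E with an erroneous tip B->X
graph = {"A": ["B"], "B": ["C", "X"], "C": ["D"], "D": ["E"],
         "E": [], "X": []}
print(clip_tips(graph))  # X is clipped; the long main path survives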

Data – E. coli and Electrophorus

The following data sources will be used in the examples given in the book.

i) ‘Hydrogen atom’
This small 1,000 nt E. coli sequence, along with associated reads, was provided by the authors of SPAdes as a toy set, and it is used here for demonstration purposes. The genome and reads can be downloaded from here³.

ii) E. coli genome and reads (SPAdes)

—————————–
Fig 4. A segment of the E. coli genome. Blue marks repetitive regions. Code for constructing similar plots for the other parts of the genome is available from the homolog.us website.

—————————–

The fully assembled E. coli genome is only 5 megabases in size and can be used to learn about the algorithms. A short-read dataset covering the E. coli genome at 1000x can be downloaded from the SPAdes website⁴.

iii) Electric eel (Electrophorus electricus) reads (SRA)
The Electrophorus genome is about 700 megabases long. The short-read sequences can be downloaded from the NCBI SRA website⁵.

Other online resources for learning

Apart from the books in our introductory series, a number of online resources exist for those interested in learning bioinformatics on their own. They are classified below as ‘introductory resources’, ‘intermediate resources’ and ‘advanced resources’.

Introductory resources

Rosalind and Coursera


Rosalind⁶, developed by a team led by Prof. Pavel Pevzner, allows students to learn bioinformatics online through problem solving. Professor Pevzner and his student Phillip Compeau also developed an online bioinformatics course through Coursera, which is freely available on YouTube⁷.

Software Carpentry
Many newcomers to bioinformatics do not have any background in running computer programs, and programs requiring experience with the command line and the Unix operating system pose particular problems. Software Carpentry has created a number of useful tutorials⁸ to simplify the learning process.

R and Ipython notebook

R⁹ is a useful computational tool with which statistical analysis and plotting of data can be done easily without deep knowledge of programming.

IPython notebook¹⁰ is a web-based interactive computational environment that helps with executing code, adding plots and writing text and equations.

Seqanswers
In the online forum Seqanswers¹¹, users ask questions about various bioinformatics programs or analysis approaches and get answers from experts.

Biostar
Biostar¹² is an online forum like Seqanswers, but it follows a different format. Each answer is ranked by the members of the community so that the correct or most useful answer rises to the top. This creates a useful knowledge base for future reference.

Intermediate resources
The following blogs provide useful information on bioinformatics algorithms.

Homolog.us
The Homolog.us¹³ blog presents useful information about the latest research in bioinformatics.

Heng Li
Dr. Heng Li of the Broad Institute maintains two informative blogs – one at GitHub¹⁴ covering bioinformatics algorithms and one at Attractive Chaos¹⁵ covering his lightweight klib library and other computer science topics.

Dazzler blog
The Dazzler blog¹⁶, maintained by renowned researcher Gene Myers, shares information about his long-read assembly tools.

C. Titus Brown
The blog maintained by Dr. C. Titus Brown¹⁷ discusses benchmarks and analyses of various useful bioinformatics programs.

Advanced resources

Arxiv and biorxiv

The preprint websites arxiv¹⁸ and biorxiv¹⁹ have become invaluable resources for those doing advanced research in bioinformatics. These days, most computational researchers submit their papers to one of those repositories long before the papers are formally accepted by journals.

Github

Many bioinformaticians doing cutting-edge research do not write blog posts describing their work, but they often maintain very active GitHub sites where they publicly post their codes. Others can look into the codes to learn about the problem-solving approaches used. Links to a number of useful GitHub pages are provided below.

Bioinformaticians and their github pages

GATB²⁰
Rayan Chikhi²¹
Heng Li²²
Jared Simpson²³
Gene Myers²⁴
Alex Bowe²⁵
Jason Chin²⁶
Dinghua Li²⁷

Acknowledgements for this book
The author is thankful to Rayan Chikhi, Jason Chin, Kevin Karplus, Anton Korobeynikov, Heng Li, Ruibang Luo and Jared Simpson for helpful discussions over the years or comments on this manuscript. The author also thanks the readers of the homolog.us blog and Twitter followers for bringing informative links on the latest research to his attention.

Further readings

—————————————–
¹http://homolog.us
²https://github.com/homologus/Pandoras-Toolbox-for-bioinformatics
³https://github.com/homologus/Hydrogen-atom
⁴http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/
⁵http://www.ncbi.nlm.nih.gov/sra/SRX554971
⁶http://rosalind.info/problems/locations
⁷https://www.youtube.com/user/bioinfalgorithms
⁸http://software-carpentry.org
⁹http://www.r-project.org
¹⁰http://ipython.org/notebook.html
¹¹http://seqanswers.com
¹²http://biostars.org
¹³http://homolog.us/blogs
¹⁴http://lh3.github.io/
¹⁵https://attractivechaos.wordpress.com/
¹⁶https://dazzlerblog.wordpress.com/
¹⁷http://ivory.idyll.org/blog/
¹⁸http://arxiv.org
¹⁹http://biorxiv.org
²⁰https://github.com/GATB/
²¹https://github.com/rchikhi
²²https://github.com/lh3
²³https://github.com/jts/
²⁴https://github.com/thegenemyers/
²⁵https://github.com/alexbowe/
²⁶https://github.com/cschin/
²⁷https://github.com/voutcn

Using de Bruijn Graphs for Short-read Assembly – (i)

Here is a part of the first chapter of the genome assembly book that I am working on. Please feel free to grab a copy from here, if you like. You can find more details at this link.

————————————————————

Chapter 1. Introduction

This book is written for the thousands of biologists and bioinformaticians eager to gain new knowledge about living systems using recent technological breakthroughs in nucleotide sequencing, known as next-generation sequencing (NGS). NGS technologies revolutionized biology by giving most research institutions around the globe, and even some well-funded individual labs, access to high-throughput sequencing instruments. Such easy access to sequencing brought a dramatic shift in the way genetics research gets done. Merely a decade back, genome sequencing of a eukaryote used to be a very long and expensive endeavor, for which hundreds of scientists had to raise funds together by writing joint white papers and petitioning various government agencies. The tasks of sequencing and assembly were handled by dedicated sequencing centers, of which only a few existed around the globe. Naturally, the capacities of those sequencing centers were significantly constrained by the high volume of requests. In contrast, today any scientist can get more nucleotides than the human genome sequenced within a week at a local facility for an insignificant price.

This rapid and somewhat unexpected democratization of sequencing capabilities left biologists unprepared for the resulting ‘data deluge’. Every downstream analysis step, including sequence assembly, mapping and even storage of data, has now become a challenge. In the previous arrangement, genome assemblies were performed by a few dedicated groups within the large sequencing centers, and those groups sheltered the biologists from various unpleasant aspects of handling large volumes of data. Many of those dedicated groups still remain the leaders in research on sequence assembly, but scientists from other institutions are venturing into assembly and analysis to make sense of locally acquired short-read libraries. In fact, it is now nearly impossible to do high-quality genetics research in the life or medical sciences without proper knowledge of the bioinformatics tools. To make matters worse for the newcomers, the technological frontier in bioinformatics itself is moving fast, with new tools introduced almost every week. This book is written to make the lives of life scientists easier by providing a simple introduction to the recent and ongoing development of algorithms related to short-read assembly.

Assembly procedure requires an understanding of sequencing technologies, algorithms and statistics

The genome is the primary informational unit of living organisms. It is packaged into multiple chromosomes, which are replicated and passed on from generation to generation to maintain the continuity of the species. Natural selection works through random changes in the genome (‘mutations’) and the fixation of advantageous mutations within the population. Therefore, genome sequencing and assembly is an important step toward understanding phenotypes and their modifications during evolution.

The genome assembly process, described in abstract terms, requires an understanding of three aspects – sequencing technology, algorithms and statistics. Let us see what roles they play.

Sequencing technologies

NGS technologies cover a wide range of sequencing instruments, each offering unique benefits and challenges during assembly. The assembly process often requires combining sequences from multiple technologies. Moreover, many researchers are restricted to certain technologies due to ease of access or cost. For these reasons, this book covers the differences between sequencing technologies and their impact on the assembly process in Chapter 4.

Algorithms

An algorithm consists of a set of rules to perform a given task. For example, young children learn algorithms for adding numbers with carrying and subtracting with borrowing. Conceptually, algorithms are not too different from the experimental protocols used by biologists. Mastering them requires two steps – understanding the concept and memorization through practice. Mechanization, such as using calculators to add and subtract numbers, cannot replace the value added by understanding the concepts. Therefore, this book focuses on explaining the algorithms involved in the assembly process. This point is further explained in the following section.

Statistics

Given that nucleotide sequencing is an experimental process, experimental variation and noise are always part of the data. Those artifacts manifest in the assembly procedure in two ways – (i) the use of cutoff parameters to limit the amount of noise and (ii) the use of statistical methods, such as averaging or hidden Markov models. Biologists are familiar with those procedures from other contexts. This book explains the statistical aspects by first assuming error-free data and presenting the core assembly algorithm in the absence of noise (Chapter 3). The impact of noise and uneven coverage is then explained in a later chapter (Chapter 5). By partitioning the assembly problem into these three components, a biologist will be able to figure out whether the assembly quality can be improved by moving to a different technology, a better algorithm or higher coverage. Moreover, this book adds an adequate amount of simulation and analysis of real data to let readers play with the various aspects and learn.
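
To make point (i) concrete: erroneous k-mers pile up at low abundance while genomic k-mers cluster around the sequencing depth, so a cutoff is commonly placed in the valley between the two peaks of the k-mer abundance histogram. The sketch below, using a made-up histogram, shows the idea; it is an illustration written for this post, not a production method.

# abundance -> number of distinct k-mers (invented for illustration)
histogram = {1: 9000, 2: 2500, 3: 400, 4: 120, 5: 300,
             6: 900, 7: 1800, 8: 2400, 9: 1900, 10: 800}

def pick_cutoff(hist):
    """Return the abundance at the first local minimum – the valley
    between the error peak and the genomic coverage peak."""
    xs = sorted(hist)
    for a, b, c in zip(xs, xs[1:], xs[2:]):
        if hist[b] <= hist[a] and hist[b] <= hist[c]:
            return b
    return xs[0]

print(pick_cutoff(histogram))  # 4: keep k-mers seen at least 4 times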

Why should biologists learn about assembly algorithms?

This book’s approach of focusing on algorithmic concepts departs from the current training practices in bioinformatics. In the current approach, biologists analyzing NGS sequences typically learn to use and benchmark various software tools, such as Velvet, ABySS, Trinity, Oases, SOAPdenovo, Minia, Ray, IDBA, SPAdes, etc., to assemble the underlying genomes, transcriptomes or metagenomes. Even though all of those programs use de Bruijn graphs to reconstruct genomes from NGS libraries, a proper understanding of assembly algorithms is not needed to run them. Is there any benefit for biologists in learning the inner workings of those assembly programs? The answer is an emphatic yes. Such an understanding helps in every step of the work, including designing experiments, purchasing computing hardware, improving assembly quality, interpreting results and, most importantly, staying abreast of technological developments. Furthermore, NGS sequencing and assembly have become routine tools for analyzing the transcriptomes (RNAseq) of non-model organisms, and the same benefits of learning about assembly algorithms apply there. The biggest advantage of understanding the algorithms is likely to come from the ability it provides to develop clever and unconventional experiments using NGS technologies. It appears that the full potential of NGS technologies is not being realized, because a large gap exists between those who think in terms of experiments and those who develop algorithms to analyze the related large data sets. This book intends to close that gap and give experimentalists the power to think through all aspects of data analysis. Each of the above points is discussed in detail below.

Software tools change, but the algorithms remain more stable

Over the last three years, computational researchers have developed many efficient programs to improve various assembly steps, but those programs remain under-utilized until they are fully incorporated into existing pipelines. Switching from one program to another, or even deciding on proper benchmarking measures to test the effectiveness of new programs, is a major challenge for those trained only on software tools. Proper knowledge of algorithms should make the task easier, because the underlying algorithms stay more stable than the codes implementing them.


Knowledge of algorithms helps in experimental design

Designing NGS experiments requires making decisions about the mix of sequencing technologies, the mix of mate-pair libraries and the read coverage. Such decisions are hard to make without a conceptual understanding of how the assembly gets done.

Understanding algorithms helps in purchasing adequate computing hardware

Bioinformaticians who know the assembly algorithms can solve larger problems without buying expensive computer hardware. Exhausting the available computer memory (RAM) is the first obstacle faced by new bioinformaticians working with de Bruijn graph-based assemblers. The most obvious and commonly sought solution, finding a computer with more RAM, does not scale well with the ever-increasing amount of data. A number of elegant memory-efficient algorithms were recently proposed by several bioinformaticians, but their implementation and incorporation into existing workflows require some knowledge of the underlying assembly process.


Improving assembly quality and interpreting results better

When a bioinformatician blindly runs a program to obtain results, it is unclear whether the gaps or errors in the assembly are due to shortcomings of the technology, lack of coverage, execution of the program or noise. Even with a familiar program, the quality of an assembly can be greatly improved by searching over the entire parameter space, but some of those parameters can only be explained with reference to the de Bruijn graph structure.

RNAseq in non-model organisms

De Bruijn graph-based algorithms have become important for a new class of problems, namely transcriptome assembly (RNAseq). Transcriptomes did not need to be assembled in the older days of Sanger sequencing, because the sizes of ESTs were comparable to those of genes, and gene expression was measured by microarray technology. High-throughput NGS can solve both problems together, but the shorter NGS reads need to be assembled into genes, especially for organisms lacking reference genomes. The de Bruijn graph-based Trinity and Oases transcriptome assemblers are popular among researchers working on RNAseq data.

Using NGS technologies to solve novel problems

In 2010, a group from the Fred Hutchinson Cancer Research Center used NGS technologies to sequence the variable regions of human immune cells, namely T cells and B cells. Their method has been commercialized by Adaptive Biotechnologies, a Seattle-based company. In 2012, a Stanford group came up with a method to sequence long reads synthetically using short-read sequencing. Their technology was commercialized by Moleculo, a company later acquired by Illumina. Those are two of many examples of creative uses of NGS technologies. In both cases, the analysis of large amounts of data was a critical component of the development of the technology, and it often required designing new algorithms. Many other unconventional uses of NGS technology are expected to be discovered in the coming years, and experimentalists with proper knowledge of algorithms will have the upper hand in thinking about creative applications.

————————————————————

Comments and suggestions are appreciated. Also note that the sections posted here are draft versions to be incorporated in the next update of the book. I try to keep the number of updates to one per month or fewer.

Someone Please Take Titus Brown for a Drink

I have been reading a blog post titled “What I’d tell myself about startups if I could go back 5 years” and came across these comments –

The programming language/ framework wars are great fun in the pub, but of limited value in real life

A good developer can pick up any language or platform in a few weeks

It is a great list and I encourage you to go through the rest.

———————————–

Given that academic labs are also like start-ups (at least when the professor is pre-tenure), I was surprised to read this comment from (post-tenure) Titus Brown –

I disagree strongly with Jared’s black & white statement that “this isn’t a language problem” — part of it absolutely is! Scripting languages enable a much more flexible and organic interaction with algorithms than languages like Java and C++, in my extensive personal experience. People can also pick up scripting languages much more easily than they can C++ or Java, because the learning curve is not so steep (although the ultimate distance climbed may be long).

———————————

It immediately started the expected pub brawl, with Christopher Hogue being the first to protest –

I find your “elitist” language and “one-true-way” argument subjective and divisive.

“Fundamentally, moving from a lightweight Python layer on top of a heavier, optimized C++ library into a standalone binary seems like a step backwards to me.”

Well, the mass of software running is lighter as a standalone binary, as there is no language interpreter binary running in memory, so you have me confused as to your definition of “lightweight” and “heavier”.

This post seems like yet another argument for software implementation tribalism (i.e. my language/tools approach is best!). I’ve been around long enough to know that these arguments are always dependent on the tribal language experience and context of the participants.

———————————–

If you had any doubt that academic labs operate like start-ups, Jared’s comment should remove it.

I think part of bioinformatics education should be learning to pick the right tool for the job. This isn’t a C++ library for accessing VCF files, its a library for performing inference using very low-level data. I think the developers that are interested in contributing in this space will be fairly computationally sophisticated and willing to use C++. I hope.

Sure it would be nice if I had already had a well-designed library, bindings to high-level language, etc, but this project is ~three months old and I have limited time. I necessarily have to focus on the most important problems (how well the program works, can anyone else run it?).

BYOB (and one for Titus) and enjoy the rest of the fight!

POA and Nanopolish

This is an old (but well-cited) paper from 2002, and the algorithm is finding plenty of use lately in SPAdes and Jared Simpson’s nanopolish. I have not checked which algorithm Jason Chin uses in his pbdagcon.

Multiple sequence alignment using partial order graphs

Motivation: Progressive Multiple Sequence Alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts.

Results: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (Partial Order Alignment (POA)) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g. an alignment of 5000 sequences in 4 h on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism.

Availability: The partial order alignment program POA is available at http://www.bioinformatics.ucla.edu/poa.

—————————
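
To make the core idea concrete, here is a minimal sketch of aligning one new sequence to a partial order graph by dynamic programming, where each node looks back at all of its predecessors instead of a single previous column. It is a toy written for this post under simplifying assumptions (linear gap penalty, no backtracking, no graph update), not code from the POA package; the function name, scores and toy graph are invented for illustration.

MATCH, MISMATCH, GAP = 2, -2, -1

def align_to_dag(chars, preds, seq):
    """chars[i] is the base at node i (nodes in topological order);
    preds[i] lists the predecessors of node i. Returns the best
    global alignment score of seq against any path in the DAG."""
    n, m = len(chars), len(seq)
    # S[i + 1][j] = best score of aligning the graph up to node i
    # with the first j characters of seq; row 0 is a virtual source.
    S = [[GAP * j for j in range(m + 1)] for _ in range(n + 1)]
    for i in range(n):
        rows = [S[p + 1] for p in preds[i]] or [S[0]]
        S[i + 1][0] = max(r[0] for r in rows) + GAP
        for j in range(1, m + 1):
            sub = MATCH if chars[i] == seq[j - 1] else MISMATCH
            S[i + 1][j] = max(
                max(r[j - 1] for r in rows) + sub,  # align node to seq[j-1]
                max(r[j] for r in rows) + GAP,      # gap in the sequence
                S[i + 1][j - 1] + GAP,              # gap in the graph
            )
    sinks = [i + 1 for i in range(n) if all(i not in p for p in preds)]
    return max(S[i][m] for i in sinks)

# MSA of ACGT and AGGT stored as a DAG: A -> {C, G} -> G -> T
chars = "ACGGT"
preds = [[], [0], [0], [1, 2], [3]]
print(align_to_dag(chars, preds, "AGGT"))  # 8 = four matches via A-G-G-T

Because the graph keeps both the C branch and the G branch of the MSA, the new sequence AGGT scores a perfect match along the A-G-G-T path – exactly the information a linear profile would have blurred.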

Jared posted his code and instructions here.

Brief usage instructions

The pipeline is still a prototype so it is fragile at the moment. It will be revised for general use after we submit the paper.

The reads that are input into the HMM must be output as a .fa file by poretools. This is important as poretools writes the path to the original .fast5 file (containing the signal data) in the fasta header. These paths must be correct or nanopolish cannot find the events for each read. Let’s say you have exported your reads to reads.fa and you want to polish draft.fa. You should run:

make -f consensus.make READS=reads.fa ASSEMBLY=draft.fa

This will map the reads to the assembly with bwa mem -x ont2d and export a file mapping read names to fast5 files.

How Much Will the Americans Suffer, If NIH Shuts Down?

This is the third and final post in our series critically reexamining Francis Collins’ claim that the existence of NIH contributes to the health improvement of Americans. Collins used life expectancy at birth as a measure of success, and we do the same.

In the previous two posts (“Francis Collins Admits NIH Under Him Has Been Failing” and “Report from Asia – Will Asia ‘Unfollow’ NIH’s Failed Research Model?“), we showed that the life expectancy at birth of Americans is lower than that of all other advanced countries. Even the residents of some countries bombed or wrecked by Americans three or four decades ago have now caught up with the USA. However, today’s number will absolutely surprise you.

How much would the life expectancy at birth of Americans fall if NIH shut down? Suppose, to add to the misery, that all the people of the country (USA) were placed in slums in an area no more than two times the size of Massachusetts and bombed periodically. The answer is that the life expectancy would fall by only five years. The evidence is in the life expectancy at birth of the Gaza Strip –

[life expectancy at birth chart for the Gaza Strip, via IndexMundi]

To confirm that the Gaza Strip number is not a measurement error, we can also check Lebanon, a country neighboring Israel and Palestine. The life expectancy at birth there is higher than in the USA.

The above numbers should destroy any myth that NIH and its precision medicine scams contribute to human health. Instead of sending misleading stats to the public, Francis Collins should dramatically scale down the wasteful programs, starting with the shutdown of NHGRI.

Let’s Discuss – Is it Time to Shut Down NHGRI?

Guangzhou Researcher Debunks ‘Duon’ Discovery of Stamatoyannopoulos

Dr. John Stamatoyannopoulos, who received the University of Washington Alumni Early Achievement Award for his ENCODE ‘discoveries’, surprised the world again in late 2013 by discovering a ‘second genetic code’ and publishing it in a glam journal.

The not-at-all-glam journal Molecular Biology and Evolution reports (emphasis ours) –

Reassessing the “Duon” Hypothesis of Protein Evolution

There are two distinct types of DNA sequences, namely coding sequences and regulatory sequences, in a genome. A recent study of the occupancy of transcription factors (TFs) in human cells suggested that protein-coding sequences also serve as the codes of TF occupancy, and proposed a “duon” hypothesis in which up to 15% of codons of human protein genes are constrained by the additional coding requirements that regulate gene expression. This hypothesis challenges our basic understanding on the human genome. We reanalyzed the data and found that the previous study was confounded by ascertainment bias related to base composition. Using an unbiased comparison in which G/C and A/T sites are considered separately, we reveal a similar level of conservation between TF-bound codons and TF-depleted codons, suggesting largely no extra purifying selection provided by the TF occupancy on the codons of human genes. Given the generally short binding motifs of TFs and the open chromatin structure during transcription, we argue that the occupancy of TFs on protein-coding sequences is mostly passive and evolutionarily neutral, with to-be-determined functions in the regulation of gene expression.

Observations –

1. ‘Not at all glam’ equals ‘respected’ when it comes to science and journals. Molecular Biology and Evolution was founded by Walter Fitch and Masatoshi Nei, two well-respected evolutionary biologists, and continues to publish serious papers.

2. The only ‘press release’ this debunking received is a mention in Dan Graur’s blog, whereas the duon ‘discovery’ was covered by all major media including Times, Forbes and possibly even Playboy.

3. The debunkers work at a Chinese university in the province located right across the border from Hong Kong. Guangzhou is the capital of Guangdong, which used to be one of the poorest provinces of China 25 years ago. Then economic liberalization made it the fastest-growing manufacturing center, and now the region is focusing on building a knowledge economy. Shenzhen is the better-known city of the province.

4. Professor Xionglei He, the senior author of the paper, did his PhD at the University of Michigan with Jianzhi Zhang, who was himself a PhD student of Masatoshi Nei. Masatoshi Nei is a colleague of Ken Weiss at Penn State University and supervised Dan Graur’s PhD thesis in the 80s. So, you can clearly see the pattern: the USA/UK are sending highly trained scientists away to China, while feeding clowns like Birney and Stamatoyannopoulos. Rinse and repeat for a few years and you get –

Homolog.us Reports from Asia – ‘Will Hong Kong be the Bioinformatics Capital of the World?’

5. We bring back the obligatory video that is reserved for Dr. Stamatoyannopoulos.