Please Sign Up for Our Membership Section

Dear readers,

Over the last few months, several of you contacted us requesting updates to various sections of the tutorials. Despite all good intentions to keep them up to date, we have not been able to do so for lack of time. Some of those who contacted us brought up the membership idea we proposed a couple of years back. We would like to give it a try to see whether the community benefits from it.

Based on readers’ emails and other discussions with various collaborators over the last year, we realize that researchers face numerous problems in using the latest and greatest bioinformatics programs.

(i) First, there are simply too many programs for solving any given problem, with no obvious way to judge which ones are the most appropriate. You can get a flavor of that from yesterday’s post on PacBio assembly.

(ii) The solution may be to install a number of programs and compare their results, but installation is the next big hurdle. It is nice that almost all bioinformatics papers deposit their code on GitHub these days, but not all of that code comes with full documentation. Moreover, very few programs have been tested in multiple environments. We tried to install the programs from Pandora’s Toolbox for Bioinformatics on a different Unix operating system, and only three compiled fully without any intervention. You may argue that bioinformatics programs should not be tried outside Linux, but we chose to do so after seeing existing programs fail following ‘upgrades’ to Linux. This is a common problem when shared libraries get replaced by newer versions.

(iii) Then comes the problem of trying to scale by adding RAM. In conversations with others, we find many researchers still relying on that approach for their data analysis. They tend to stick to the programs they know best, because doing otherwise involves a steep learning curve. Therefore, when they encounter massive amounts of data, the obvious solution is to buy a larger machine with a massive amount of RAM (~1 TB!). We wrote extensively about this expensive non-solution in 2012 and 2013, but still hear about it being practiced.

(iv) Another problem we often hear about comes from incompatible output formats. For example, a reader asked for help in solving an assembly problem, and we recommended a set of programs to be used as a pipeline. Unfortunately, the reader could not use the combination, because the output of program X could not be fed directly into program Y. This may sound trivial to those with a computing background, but it is not as trivial for biologists (see the format-conversion sketch after this list).

(v) The last problem we came across is a bit different from the ones listed above; let us call it ‘bioinformatics overuse’ for lack of a better name. If you have heard about the perils of antibiotic overuse, wait till you see a case of ‘bioinformatics overuse’! In this case, the users are able to install various programs, but do not know what is inside those black boxes. Therefore, they decide to use every single one of them, and then come up with some logic to combine the outputs. We even came across suggestions of ‘using the programs in triplicate’, just like using three experimental samples!
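
To make the format-incompatibility problem in (iv) concrete, here is a minimal Python sketch of the kind of glue code such a pipeline often needs. ‘Program X’ and ‘program Y’ are hypothetical; the sketch assumes X emits FASTQ while Y expects FASTA.

```python
import sys

def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a standard 4-line-per-record FASTQ file into FASTA."""
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        for i, line in enumerate(fq):
            if i % 4 == 0:          # '@' header line becomes '>' header
                fa.write(">" + line[1:])
            elif i % 4 == 1:        # sequence line is kept as-is
                fa.write(line)
            # the '+' separator and quality lines are dropped

if __name__ == "__main__":
    fastq_to_fasta(sys.argv[1], sys.argv[2])   # usage: x2y.py in.fastq out.fasta
```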

We plan to address all of those issues in our membership section, but the first task is to present information already posted in various parts of this site in a well-organized manner. We are also testing some very interesting ideas to help you learn programs without losing hair over installing them. The membership site is currently undergoing tests and will open in a month. If you enjoy our blog, please sign up for the membership here.

Combinatorial Scoring of Phylogenetic Networks

Several eons ago, we reviewed “Algorithms for Constructing of Phylogenetic Network (Reticulate Evolution)”. Those who are interested in the latest and greatest may check this new arXiv paper.

Construction of phylogenetic trees and networks for extant species from their characters represents one of the key problems in phylogenomics. While the solution to this problem is not always uniquely defined and there exist multiple methods for tree/network construction, it becomes important to measure how well constructed networks capture the given character relationships across the species.

In the current study, we propose a novel method for measuring the specificity of a given phylogenetic network in terms of the total number of distributions of character states at the leaves that the network may impose. While for binary phylogenetic trees this number has an exact formula and depends only on the number of leaves and character states but not on the tree topology, the situation is much more complicated for non-binary trees or networks. Nevertheless, we develop an algorithm for combinatorial enumeration of such distributions, which is applicable to arbitrary trees and networks under some reasonable assumptions.
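
The paper’s enumeration algorithm is more involved; purely as a toy illustration of the underlying counting problem, the brute-force sketch below counts how many leaf-state distributions a small binary tree can realize with at most a given number of state changes, scoring each assignment with Fitch’s parsimony algorithm. The tree encoding and the change threshold are our own illustrative choices, not the paper’s.

```python
from itertools import product

def fitch_score(tree, leaf_states):
    """Parsimony score of one leaf-state assignment on a binary tree
    (nested tuples; leaves are strings), via Fitch's algorithm."""
    def walk(node):
        if isinstance(node, str):                    # leaf: singleton state set
            return {leaf_states[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        inter = ls & rs
        if inter:
            return inter, lc + rc
        return ls | rs, lc + rc + 1                  # union costs one change

    return walk(tree)[1]

def count_distributions(tree, leaves, states, max_changes):
    """Brute-force count of leaf-state distributions realizable with
    at most `max_changes` state changes on the given tree."""
    return sum(
        fitch_score(tree, dict(zip(leaves, combo))) <= max_changes
        for combo in product(states, repeat=len(leaves))
    )

# Toy example: a 4-leaf binary tree and two character states.
tree = (("a", "b"), ("c", "d"))
print(count_distributions(tree, ["a", "b", "c", "d"], "01", max_changes=1))  # 12
```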

RECKONER: Read Error Corrector Based on KMC

Our readers are already familiar with the work of Sebastian Deorowicz and colleagues through their innovative KMC 2 k-mer-counting program. In this new paper posted on arXiv, they use KMC 2 to build an error-correction program.

Motivation: Next-generation sequencing tools have enabled the production of huge amounts of genomic information at low cost. Unfortunately, the presence of sequencing errors in such data affects the quality of downstream analyses. Their accuracy can be improved by performing error correction. Because of the huge amount of such data, correction algorithms have to be fast and memory-frugal, and provide high accuracy of error detection and elimination for organisms of various genome sizes.

Results: We introduce a new algorithm for genomic data correction, capable of processing high-error-rate data from a eukaryotic genome of ~300 Mbp using less than 4 GB of RAM in less than 40 minutes on a 16-core CPU. The algorithm corrects sequencing data at a level better than or comparable to that of its competitors. This was achieved by using the very robust KMC 2 k-mer counter, a new method of correcting erroneous regions based on both k-mer counts and FASTQ quality indicators, as well as careful optimization. Availability: The program is freely available at this http URL. Contact: [email protected]
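
RECKONER’s actual pipeline is described in the paper; as a rough sketch of the k-mer-spectrum idea such correctors build on, the toy code below counts k-mers across reads and flags read positions covered by low-count (‘weak’) k-mers as likely errors. The cutoff and k are illustrative choices, not the paper’s parameters.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers across the reads (what KMC does at scale, on disk)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmer_positions(read, counts, k, cutoff=2):
    """Start positions of k-mers with counts below the cutoff; a run of
    such 'weak' k-mers pinpoints a likely sequencing error."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < cutoff]

reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACCTAC"]   # third read has one error
counts = kmer_counts(reads, k=4)
print(weak_kmer_positions(reads[2], counts, k=4))    # -> [3, 4, 5, 6]
```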

Another memory-efficient approach was published as LoRDEC, which uses Rayan Chikhi et al.’s GATB library.

Zika-Oxitec Connection Can Be Explained without Resorting to ‘Conspiracies’

Zerohedge reported –

Zika Outbreak Epicenter In Same Area Where Genetically-Modified Mosquitoes Were Released In 2015

When examining a rapidly expanding potential pandemic, it’s necessary to leave no stone unturned so possible solutions, as well as future prevention, will be as effective as possible. In that vein, there was another significant development in 2015.

Oxitec first unveiled its large-scale, genetically-modified mosquito farm in Brazil in July 2012, with the goal of reducing “the incidence of dengue fever,” as The Disease Daily reported. Dengue fever is spread by the same Aedes mosquitoes which spread the Zika virus — and though they “cannot fly more than 400 meters,” WHO stated, “it may inadvertently be transported by humans from one place to another.” By July 2015, shortly after the GM mosquitoes were first released into the wild in Juazeiro, Brazil, Oxitec proudly announced they had “successfully controlled the Aedes aegypti mosquito that spreads dengue fever, chikungunya and zika virus, by reducing the target population by more than 90%.”

Though that might sound like an astounding success — and, arguably, it was — there is an alarming possibility to consider.

Nature, as one Redditor keenly pointed out, finds a way — and the effort to control dengue, Zika, and other viruses appears to have backfired dramatically.

——————————————————————

A poorly trained technical writer at Discover magazine argued –

No, GM Mosquitoes Didn’t Start The Zika Outbreak

A new ridiculous rumor is spreading around the internets. According to conspiracy theorists, the recent outbreak of Zika can be blamed on the British biotech company Oxitec, which some are saying even intentionally caused the disease as a form of ethnic cleansing or population control. The articles all cite a lone Redditor who proposed the connection on January 25th to the Conspiracy subreddit. “There are no biological free lunches,” says one commenter on the idea. “Releasing genetically altered species into the environment could have disastrous consequences” another added. “Maybe that’s what some entities want to happen…?”

For some reason, it’s been one of those months where random nonsense suddenly hits mainstream. Here are the facts: there’s no evidence whatsoever to support this conspiracy theory, or any of the other bizarre, anti-science claims that have popped up in the past few weeks. So let’s stop all of this right here, right now: The Earth is round, not flat (and it’s definitely not hollow). Last year was the hottest year on record, and climate change is really happening (so please just stop, Mr. Cruz). And FFS, genetically modified mosquitoes didn’t start the Zika outbreak.

——————————————————————

Let us try it differently.

Observation. A correlation between the spread of the Zika virus and Oxitec’s mosquito releases has been observed.

Correlation does not mean causation (check – “Do not Go for a Swim after Having an Ice Cream”). However, the correlation is still a valid observation that needs to be explained.

The problem with Discover Magazine’s argument is that, instead of thinking through the various ways the correlation might be explained and ruling them out, it presented its own model of why there could not be any connection between Oxitec and Zika, and then boldly claimed that its model is the only possible one.

How about another possibility? Let us say there are two subspecies of the Aedes aegypti mosquito – Aedes aegypti “dengue” and Aedes aegypti “zika” – spreading dengue and Zika respectively. Current scientific knowledge assigns both diseases to a single ‘Aedes aegypti’, but that may be a limitation of current knowledge. It is possible that Oxitec targeted the Aedes aegypti “dengue” subspecies and wiped it out. As a result, the other subspecies filled up the vacated niche. Would Zika cases explode in such a scenario?

—————————————————–

They lost half of the DATA!

Unicorn-barbecue party –

Two Potentially Important Developments on Nanopore

The first paper demonstrates selective sequencing, whereas the second improves accuracy by introducing circularization.

Real time selective sequencing using nanopore technology

The Oxford Nanopore MinION is a portable real time sequencing device which functions by sensing the change in current flow through a nanopore as DNA passes through it. These current values can be streamed in real time from individual nanopores as DNA molecules traverse them. Furthermore, the technology enables individual DNA molecules to be rejected on demand by reversing the voltage across specific channels. In theory, combining these features enables selection of individual DNA molecules for sequencing from a pool, an approach called ‘Read Until’. Here we apply dynamic time warping to match short query current traces to references, demonstrating selection of specific regions of small genomes, individual amplicons from a group of targets, or normalisation of amplicons in a set. This is the first demonstration of direct selection of specific DNA molecules in real time whilst sequencing on any device and enables many novel uses for the MinION.
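
The paper matches streamed current values against references with dynamic time warping (DTW). Below is a minimal textbook DTW sketch, with made-up traces and target names, just to illustrate the ‘Read Until’ decision of keeping or rejecting a molecule; it is not the authors’ implementation.

```python
import math

def dtw_distance(query, reference):
    """Classic O(n*m) dynamic time warping distance between two signals."""
    n, m = len(query), len(reference)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],        # skip a query sample
                                  dp[i][j - 1],        # skip a reference sample
                                  dp[i - 1][j - 1])    # match the two samples
    return dp[n][m]

# 'Read Until' style decision: match a short current trace against each
# target and keep sequencing only if some target matches well enough.
trace = [0.1, 0.5, 0.9, 0.4]
targets = {"amplicon_1": [0.1, 0.6, 0.9, 0.5],
           "amplicon_2": [0.9, 0.1, 0.2, 0.8]}
best = min(targets, key=lambda t: dtw_distance(trace, targets[t]))
print(best, dtw_distance(trace, targets[best]))
```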

INC-Seq: Accurate single molecule reads using nanopore sequencing

Nanopore sequencing provides a rapid, cheap and portable real-time sequencing platform with the potential to revolutionize genomics. Several applications, including RNA-seq, haplotype sequencing and 16S sequencing, are however limited by its relatively high single read error rate (>10%). We present INC-Seq (Intramolecular-ligated Nanopore Consensus Sequencing) as a strategy for obtaining long and accurate nanopore reads starting with low input DNA. Applying INC-Seq for 16S rRNA based bacterial profiling generated full-length amplicon sequences with median accuracy >97%. INC-Seq reads enable accurate species-level classification, identification of species at 0.1% abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system.
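
INC-Seq builds a consensus from the multiple tandem copies of each circularized molecule. As a toy illustration of why that boosts accuracy, the sketch below takes a per-column majority vote across copies; the real method segments the concatemer and runs a proper multiple alignment, which this example assumes has already been done.

```python
from collections import Counter

def majority_consensus(aligned_copies):
    """Per-column majority vote across aligned copies of one molecule."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*aligned_copies))

# One of the three copies carries a random error; the vote removes it.
copies = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA"]
print(majority_consensus(copies))   # -> ACGTTGCA
```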

ENCODE Scholar Resigns from UChicago after Sex Scandal

NY Times reports –

Chicago Professor Resigns Amid Sexual Misconduct Investigation

A prominent molecular biologist at the University of Chicago has resigned after a university recommendation that he be fired for violating the school’s sexual misconduct policy.

The professor, Jason Lieb, 43, made unwelcome sexual advances to several female graduate students at an off-campus retreat of the molecular biosciences division, according to a university investigation letter obtained by The New York Times, and engaged in sexual activity with a student who was “incapacitated due to alcohol and therefore could not consent.”

Dr. Lieb, who has received millions of dollars in federal grants over the last decade, did not respond to requests for comment.

It is unclear why the NY Times chose the word ‘prominent’ instead of ‘talented’, ‘brilliant’, or some other qualifier appropriate for a scientist. We feel it is more accurate to call him a scholar, and being an ‘ENCODE scholar’ is the pinnacle of success one can reach in the modern NHGRI-controlled scientific dystopia.

Dr. Lieb was a co-author of Ewan Birney’s 2007 ENCODE pilot paper as well as the infamous 80% paper. After 2007, he became a leader of modENCODE, which aimed to bring the ‘success’ of ENCODE to C. elegans. ENCODE and modENCODE were the sources of his millions of dollars in federal grants.

NY Times also mentioned –

He was put on staff despite potential warning signs.

Before he was hired, molecular biologists on the University of Chicago faculty and at other academic institutions received emails from an anonymous address stating that Dr. Lieb had faced allegations of sexual harassment or misconduct at previous jobs at Princeton and the University of North Carolina.

“Both U.N.C. and Princeton launched investigations,” the email read.

Dan Graur, who gained immortality by exposing ENCODE’s BS, argues that ENCODE money was the reason Lieb’s other misconduct was overlooked.

That brings us to the more important question – which university will Lieb head to after this round? Most likely his funding will double, given that NHGRI becomes more supportive of scientists after they get caught in various misdeeds.

——————————————————

Readers may also like – Let’s Discuss – Is it Time to Shut Down NHGRI?

I have a (bioinformatics) dream

…saw the following message somewhere. Was it a dream?

Are you a biologist overwhelmed by too many open-source NGS programs? Do you often give up at the installation stage, or even worse, after reading the “How to Install” section of the instruction manual? This manual and its associated training modules were developed to take away your fear of bioinformatics and to help you build confidence.

We cover a small but powerful subset of all available open-source bioinformatics tools. The set of programs included here not only solves a large spectrum of problems but is also among the best maintained. They should be the best starting point for venturing into a wider spectrum of applications.

Now here is the better part. You do not have to spend hours installing those programs to learn how to use them. Just get an account on our Amazing Cloud server, and we will have all the programs installed for you.

Finally, here is the best part. If you want to run the same programs on your own server, we make that easy. Simply download a Linux image containing the same programs and start executing that image. Voila! You have all the programs you need installed on your server in exactly two minutes (or however long the download takes).

Upgrading is equally easy. Simply download the new image to replace the old one. In fact, you can even have both the upgraded version and the old version running on your server, if you need to compare results.

MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows

A few years back, Veli Mäkinen’s group proposed using min-cost flow for RNA-seq instead of expectation-maximization (EM). I suspect this new arXiv paper is algorithmically related, but developed for a different context (metagenomics).

High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundances in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at species level. We present MetaFlow, the first method based on coverage analysis across entire genomes that also scales to HTS samples. We formulated this problem as an NP-hard matching problem in a bipartite graph, which we solved in practice by min-cost flows. On synthetic data sets of varying complexity and similarity, MetaFlow is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at species level are two to four times better in terms of L1-norm. On a real human stool data set, MetaFlow identifies B. uniformis as most predominant, in line with previous human gut studies, whereas marker-based methods report it as rare. MetaFlow is freely available at http://cs.helsinki.fi/gsa/metaflow
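
MetaFlow’s actual formulation works on read coverage across entire genomes; purely to illustrate the min-cost-flow machinery it relies on, here is a toy read-to-genome assignment in Python using networkx, with made-up reads, genomes, and costs.

```python
import networkx as nx

# Hypothetical reads, genomes, and per-assignment costs (e.g. mismatches).
reads = ["r1", "r2", "r3"]
genomes = ["gA", "gB"]
cost = {("r1", "gA"): 0, ("r1", "gB"): 3,
        ("r2", "gA"): 2, ("r2", "gB"): 1,
        ("r3", "gA"): 4, ("r3", "gB"): 0}

G = nx.DiGraph()
G.add_node("sink", demand=len(reads))            # all units must reach the sink
for r in reads:
    G.add_node(r, demand=-1)                     # each read supplies one unit
for g in genomes:
    G.add_edge(g, "sink", weight=0, capacity=len(reads))
for (r, g), c in cost.items():
    G.add_edge(r, g, weight=c, capacity=1)

flow = nx.min_cost_flow(G)                       # cheapest feasible assignment
for r in reads:
    print(r, "->", [g for g, f in flow[r].items() if f == 1][0])
```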

Genomic Analysis with Spark – A Few Examples

Yesterday’s post on Spark mentioned that the technology is not being used in bioinformatics. That is not entirely correct; we came across a (small) number of mentions here and there.

1. SparkSeq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision

Many time-consuming analyses of next generation sequencing data can be addressed with modern cloud computing. The Apache Hadoop-based solutions have become popular in genomics due to their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying.

The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them in an interactive way. SparkSeq opens up the possibility of customised ad hoc secondary analyses and iterative machine learning algorithms. This paper demonstrates its scalability and overall very fast performance by running analyses of sequencing datasets. Tests of SparkSeq also prove that cache usage and HDFS block size can be tuned for optimal performance on multiple worker nodes.

Availability and Implementation: Available under open source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/
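
SparkSeq itself is a Scala library; as a minimal flavor of the same MapReduce style in Python, the sketch below uses PySpark to count aligned reads per chromosome from a SAM file (“reads.sam” is a placeholder path; column 3 of a SAM record is the reference name).

```python
from pyspark import SparkContext

sc = SparkContext(appName="reads-per-chromosome")

counts = (sc.textFile("reads.sam")                      # placeholder path
            .filter(lambda l: not l.startswith("@"))    # drop SAM header lines
            .map(lambda l: (l.split("\t")[2], 1))       # (RNAME, 1) per record
            .reduceByKey(lambda a, b: a + b))           # sum reads per reference

for chrom, n in counts.collect():
    print(chrom, n)
```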

———————————————–

2. Genomic Analysis Using ADAM, Spark and Deep Learning

———————————————–

Readers may also take a look at Heng Li’s answer to the following question on Biostars –

Why is Hadoop not used a lot in bio-informatics?

Heng Li:

Most of the applications you mentioned can be, and have already been, implemented on top of hadoop. A good example is the ADAM format, a hadoop-friendly replacement for BAM, and its associated tools. They are under active development by professional programmers. Nonetheless, I see a few obstacles to its wider adoption:

1. It is harder to find a local hadoop cluster. My impression is that hadoop really shines in large-scale cloud computing, where we have a huge (virtual) pool of resources and can respond to users on demand. In a multi-user environment with limited resources, I don’t know if a local hadoop cluster is as good as LSF/SGE in terms of fairly balancing resources across users.
2. We can use AWS, Google cloud, etc., but we have to pay. Some research labs may find this unfamiliar. Those who have free access to institution-wide resources would be even more reluctant.
3. Some pipelines are able to call variants from 1 billion raw reads in 24 hours with multiple CPU cores. This is already good enough in comparison to the time and cost spent on sequencing, so there is not a huge need for better technologies. In addition, although hadoop frequently saves wall-clock time due to its scalability, at times it wastes CPU cycles on its extra layer. In a production setting, the total CPU time across many jobs matters more than the wall-clock time of a single job. Some argue that the compute-close-to-data model of hadoop is better, but for many analyses we only read through the data once, so the amount of data transferred over the network is the same as in the hadoop model.
4. Improvements to algorithms frequently have a much bigger impact on data processing than using a better technology. For example, there is a hadoop version of MarkDuplicates that takes much less wall-clock time (more CPU time, though) than Picard. However, recent streaming algorithms, such as SamBlaster and the new Picard, do this faster in terms of both CPU and wall-clock time. For another example, there was a concern about the technical difficulty of multi-sample variant calling, so someone developed a hadoop-based caller. By the time it came out, GATK had moved to gVCF, which solves the problem in a much better way, at least for deep sequencing. Personally, I would rather improve algorithms than adapt my working tools to hadoop.
For some large on-demand services, hadoop from massive cloud-computing providers is hugely advantageous over the traditional computing model. Hadoop may also do a better job on certain bioinfo tasks (gVCF merging and de novo assembly come to mind). However, for the majority of analyses, hadoop only adds complexity and may even hurt performance.

Heng Li works for the Broad Institute, which processes a considerable amount of data.

Given that his answer is 15 months old, maybe Broad is evaluating these technologies now. The real challenge is the lack of competent people with knowledge of both bioinformatics and fault-tolerant hardware technologies. Hopefully, our posts will generate some awareness of the usefulness of these new approaches.

Apache Spark

Yesterday I attended a seminar on Apache Spark and thought readers may find this new technology interesting for their programming and data analysis. There is no bioinformatics connection at the moment; I am simply mentioning it as a type of fault-tolerant software/hardware technology with a lot of development work going on, which may turn out to be useful in the future.

How good is Spark? An example

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record.

Description of the technology from Wikipedia –

Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.[1] By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.[2]

Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos.[3] For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS),[4] Cassandra,[5] OpenStack Swift, Amazon S3 and Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
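
The “query it repeatedly” point above is the key difference from disk-based MapReduce. Here is a minimal PySpark sketch of it (“variants.tsv” is a placeholder path): cache() pins the parsed data in cluster memory, so the second query below does not re-read the file from disk.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Parse once, then pin the parsed records in cluster memory.
rows = (sc.textFile("variants.tsv")              # placeholder path
          .map(lambda l: l.split("\t"))
          .cache())

print(rows.count())                                    # first pass reads disk
print(rows.filter(lambda r: r[0] == "chr1").count())   # answered from memory
```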
