A number of good lightweight applications are coming out that take advantage of the fast, lock-free k-mer counting of Jellyfish. Previously, we discussed Sailfish, which achieves a good speed gain over RSEM in RNAseq applications.
DEseq and Sailfish Papers for RNAseq
Good C++ Development Practices in Sailfish Code
We also talked about the minimizer concept published by Jim Yorke's group in 2004, which has been finding many uses lately.
De Novo Assembly of Human Genome with Only 1.5 GB RAM
Wood and Salzberg published a new paper in Genome Biology for rapidly classifying metagenomic sequences, where they use both of those concepts.
Kraken: Ultrafast Metagenomic Sequence Classification Using Exact Alignments
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
Salzberg, as always, continues to do very creative work. Speaking of the quality of Kraken, we are happy to rely on Nick Loman’s word on it.
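To make the abstract's "exact alignment of k-mers" concrete, here is a toy sketch of k-mer-based read classification in the spirit of Kraken. The real tool maps each k-mer to the lowest common ancestor (LCA) of all genomes containing it and scores root-to-leaf paths in the taxonomy tree; the majority vote below, and the tiny k, are simplifications for illustration only.

```python
# Toy k-mer classifier, loosely in the spirit of Kraken (not its actual algorithm).
from collections import Counter

K = 5  # real Kraken uses k=31 by default

def build_kmer_db(reference_genomes):
    """Map each k-mer to the set of taxa it occurs in, then collapse
    multi-taxon k-mers to a placeholder 'LCA' label."""
    db = {}
    for taxon, seq in reference_genomes.items():
        for i in range(len(seq) - K + 1):
            db.setdefault(seq[i:i + K], set()).add(taxon)
    # collapse ambiguous k-mers (real Kraken computes the true taxonomic LCA)
    return {kmer: next(iter(taxa)) if len(taxa) == 1 else "LCA"
            for kmer, taxa in db.items()}

def classify(read, db):
    """Vote over the taxa of all k-mers in the read that hit the database."""
    hits = Counter(db[read[i:i + K]]
                   for i in range(len(read) - K + 1)
                   if read[i:i + K] in db)
    hits.pop("LCA", None)  # ignore ambiguous k-mers in this toy version
    return hits.most_common(1)[0][0] if hits else "unclassified"
```

The speed of the real tool comes from the minimizer-ordered database layout, which this in-memory dictionary does not attempt to reproduce.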
We noticed this tweet about a talk at a ‘science’ conference.
The problem is that science does not have room for “PR” and “advertising”, as Feynman argued 40 years ago.
Cargo Cult Science
The easiest way to explain this idea is to contrast it, for example, with advertising. Last night I heard that Wesson oil doesn’t soak through food. Well, that’s true. It’s not dishonest; but the thing I’m talking about is not just a matter of not being dishonest; it’s a matter of scientific integrity, which is another level. The fact that should be added to that advertising statement is that no oils soak through food, if operated at a certain temperature. If operated at another temperature, they all will–including Wesson oil. So it’s the implication which has been conveyed, not the fact, which is true, and the difference is what we have to deal with.
I would like to add something that’s not essential to the science, but something I kind of believe, which is that you should not fool the layman when you’re talking as a scientist. I am not trying to tell you what to do about cheating on your wife, or fooling your girlfriend, or something like that, when you’re not trying to be a scientist, but just trying to be an ordinary human being. We’ll leave those problems up to you and your rabbi. I’m talking about a specific, extra type of integrity that is not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist. And this is our responsibility as scientists, certainly to other scientists, and I think to laymen.
For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of his work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing– and if they don’t support you under those circumstances, then that’s their decision.
One example of the principle is this: If you’ve made up your mind to test a theory, or you want to explain some idea, you should always decide to publish it whichever way it comes out. If we only publish results of a certain kind, we can make the argument look good. We must publish BOTH kinds of results.
I say that’s also important in giving certain types of government advice. Supposing a senator asked you for advice about whether drilling a hole should be done in his state; and you decide it would be better in some other state. If you don’t publish such a result, it seems to me you’re not giving scientific advice. You’re being used. If your answer happens to come out in the direction the government or the politicians like, they can use it as an argument in their favor; if it comes out the other way, they don’t publish at all. That’s not giving scientific advice.
Other kinds of errors are more characteristic of poor science. When I was at Cornell, I often talked to the people in the psychology department. One of the students told me she wanted to do an experiment that went something like this–it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A.
These days, he would have looked at ENCODE instead of the psychology department.
A good video -
We wrote about MaSuRCA assembler a few months back. Today a reader asked the following question by email -
I was reading the paper as well as your blog on MaSuRCA assembler and I have a simple question. If they choose the unique k-mer from k-mer counting table for read extension, then isn’t that k-mer is erroneous? The k-mer counting tools always get rid of k-mers with low frequency, right?
The answer is simple as well :). As Anton Korobeynikov commented in the previous thread, MaSuRCA builds a de Bruijn graph under the hood. That means it splits the reads into k-mers, but stores the k-mers in a counting table rather than holding an explicit graph in RAM. Once you have that, you can proceed along the de Bruijn graph one base at a time and join neighboring nodes when there is no ambiguity.
Two things can happen when counting tools remove low-frequency k-mers.
(i) Better quality assembly: If the low-frequency k-mers come from noisy reads (for example, the last nucleotide of a read being incorrect), those noisy branches disappear from the graph and you get a high-quality reconstruction. If, instead, you kept the low-frequency k-mers, read extension would stop too soon and give very short contigs.
(ii) Introduction of errors: Suppose similar sequences are present at two places in the genome and the coverage is low at one of them; removal of low-frequency k-mers will tell the assembler that there is no ambiguity, and the reconstructed contig will be erroneous.
I suspect (i) is far more dominant than (ii), but the answer always depends on the quality of the data, the type of genome, the size of the genome, etc.
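The idea above can be sketched in a few lines: count k-mers from reads, drop low-frequency ones, then extend a contig one base at a time as long as the graph is unambiguous. This is illustrative only, not MaSuRCA's actual super-read construction.

```python
# Minimal de Bruijn-style contig extension with low-frequency k-mer filtering.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def extend_right(seed, kmers, k):
    """Extend seed one base at a time while exactly one k-mer continues it."""
    contig = seed
    while True:
        suffix = contig[-(k - 1):]
        nexts = [b for b in "ACGT" if suffix + b in kmers]
        if len(nexts) != 1:       # dead end or ambiguity: stop
            break
        contig += nexts[0]
        if len(contig) > 10_000:  # guard against walking around a repeat cycle
            break
    return contig

reads = ["ACGTACGGA", "CGTACGGAT", "GTACGGATC"]
k = 5
counts = count_kmers(reads, k)
solid = {km for km, c in counts.items() if c >= 2}  # drop singleton k-mers
```

With these toy reads, extension from a solid seed stops exactly where coverage drops to a single read, which is case (i) above: the low-coverage tail is treated as noise.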
On January 1st, the Omicsomics blog asked for the perfect scaffolder.
Envisioning The Perfect Scaffolder
Rather than make any New Year’s resolutions of my own, which I would then feel guilty about not keeping, I’ve decided to make one for someone else: they will write the perfect open source scaffolder. There’s a lot of scaffolders out there, both stand-alone and integrated into various assemblers, but none are quite right.
On March 6th, a benchmarking paper from the Sanger Institute delivered it, provided…
A comprehensive evaluation of assembly scaffolding tools
We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output.
Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our…
The paper seems to have done a very thorough job, checking each scaffolder with multiple aligners. The choice of aligner matters!
1. Please note that the subject line is written in jest. Real scaffolders on real data perform quite poorly, given that ‘90% correct’ scaffolding is still a bad outcome for the 10% of the genome that is erroneously joined.
2. We wish the paper had compared SPAdes along with the other scaffolding methods.
@infoecho forwarded this arXiv paper that our readers may find interesting. Scaffolding in genome assembly is the problem we have in mind. The scaffolding graph can get big, and its connectivity possibly follows a power-law distribution. @infoecho is likely looking at the possibility of storing a large string graph for PacBio assembly.
GraphChi-DB: Simple Design for a Scalable Graph Database System — on Just a PC
We propose a new data structure, Parallel Adjacency Lists (PAL), for efficiently managing graphs with billions of edges on disk. The PAL structure is based on the graph storage model of GraphChi (Kyrola et. al., OSDI 2012), but we extend it to enable online database features such as queries and fast insertions. In addition, we extend the model with edge and vertex attributes. Compared to previous data structures, PAL can store graphs more compactly while allowing fast access to both the incoming and the outgoing edges of a vertex, without duplicating data. Based on PAL, we design a graph database management system, GraphChi-DB, which can also execute powerful analytical graph computation. We evaluate our design experimentally and demonstrate that GraphChi-DB achieves state-of-the-art performance on graphs that are much larger than the available memory. GraphChi-DB enables anyone with just a laptop or a PC to work with extremely large graphs.
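As context for the abstract, here is a minimal in-memory sketch of compact adjacency-list storage in the CSR (compressed sparse row) layout, the general family of structures PAL builds on. The real PAL adds disk-resident shards, fast online inserts, edge/vertex attributes, and access to incoming edges; none of that is shown here.

```python
# CSR-style adjacency storage: all edge targets in one flat array, plus
# per-vertex offsets into it. Compact, cache-friendly, O(1) neighbor lookup.
def build_csr(num_vertices, edges):
    """edges: list of (src, dst) pairs. Returns (offsets, targets) lists."""
    deg = [0] * num_vertices
    for s, _ in edges:
        deg[s] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + deg[v]
    targets = [0] * len(edges)
    cursor = list(offsets[:-1])  # next write position for each vertex
    for s, d in edges:
        targets[cursor[s]] = d
        cursor[s] += 1
    return offsets, targets

def out_neighbors(v, offsets, targets):
    """Slice out vertex v's outgoing edges from the flat target array."""
    return targets[offsets[v]:offsets[v + 1]]
```

A scaffolding or string graph stored this way needs only two flat arrays, which is why such layouts scale to billions of edges on modest hardware.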
Scott Edmunds (@scedmunds) is an editor of GigaScience, the new, supposedly cutting-edge journal sponsored by BGI. This is a cross-post from his blog at GigaScience. I think GigaScience is doing a lot of innovative thinking in building a scientific journal that fits the internet-centric world of the 21st century.
Endorsing Data Citation
Nicely timed for the Data Citation Principles workshop at the IDCC meeting in San Francisco yesterday, the finalized Joint Declaration of Data Citation Principles has just been posted on the Force11 website. We of course endorse these, as data citation is an area we have been promoting and practicing since our formation, using it as a mechanism to incentivize and credit the early release of data from data producers. Most of the challenges have been cultural rather than technical, and despite some setbacks (for example from Nature Genetics), for over two years now we have had generally positive interactions working closely with publishers to make sure our dataset citations have been correctly cited according to DCC and DataCite guidelines. From working very closely with the editors of Genome Biology, our sorghum dataset was our first to be correctly cited in the references of a published paper, and BioMed Central now uses this example in the formatting instructions for all of their journals. We have blogged regularly on the topic, but for a more detailed overview of our and others’ efforts in data citation, check out our paper in the BMC Research Notes Data standardization, sharing and publication series.
Amounting to more than a hill of beans: new data and functionality in GigaDB
Following in the footsteps of sorghum, the latest dataset to be published in GigaDB today is another agricultural crop important to food security in the developing world, the genome of the chickpea. As with sorghum, this is another useful example for data citation, being released just in time to showcase new functionality in our GigaScience GigaDB database. The latest release, just out this week, includes a number of new features, including some minor improvements to formatting, browsing and the submission system, and the ability to contact dataset authors directly; most relevant here, it now has citation manager support. Using functionality handily provided by DataCite, we have added new buttons that allow citation information to be downloaded in RIS, BibTeX and text formats (see the blue boxes next to the citation information in the screenshot below), suitable for most citation manager software.
Please let us know if you find any bugs in this new release. These new tools aim to make the process of citing data even simpler, reducing the technical barriers and leaving only cultural ones to overcome. We will not get into the etiquette of when to cite data or papers, but Sarah Callaghan does a fantastic job in her recent blog covering this topic. Our rationale is that if you feel that data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs, then they should be treated in the same manner. We would encourage others to sign the declaration and help spread the practice of data citation further.
1. Edmunds, S., Pollard, T., Hole, B., & Basford, A. (2012). Adventures in data citation: sorghum genome data exemplifies the new gold standard. BMC Research Notes, 5(1). DOI: 10.1186/1756-0500-5-223
2. Varshney, R.K. et al. (2014). Genomic data of the chickpea (Cicer arietinum). GigaScience Database. http://dx.doi.org/10.5524/100076
3. Force11 Data Citation Principles. http://www.force11.org/datacitation/
A recent interview of Sydney Brenner (How Academia and Publishing are Destroying Scientific Innovation: A Conversation with Sydney Brenner) is going around. It contains a link to a witty and allegorical article he wrote in Current Biology in 1996, complaining against ‘Nascence’, that readers might enjoy.
The plaintiffs claim that by not being able to publish in Nascence, they have suffered injury to their professional careers and are claiming compensatory damages. It can be argued that this is the fate of many scientists and that their claims should be rejected just as their paper was, but we intend to establish that the plaintiffs were wrongfully excluded, that they were unable to confront the negative referee directly and that the Editor was negligent in not checking the validity of this referee’s statements. Even though the Editors will claim that many factors were taken into consideration in their rejection, it is a fair implication that it was the negative comments of one referee that turned the balance.
Your Lordship may find it surprising that, in a profession that prides itself on the objectivity and rigour of scientific argument, individuals are allowed to make ex cathedra statements without any direct support and that the journals believe that they need to preserve the anonymity of such commentators. Their names have now been provided by the defendants on pain of imprisonment, since your Lordship’s ruling that failure to do so would be viewed as contempt of court.
We intend to prove by cross examining the referee that the statements had no justification. We also will show that the Editor, although possessing an academic qualification of some relevance, was essentially a lay person in this specialised field and should have sought additional opinion rather than giving undue weight to a negative view, not once but twice.
Kudos to Brenner for realizing these problems about twenty years ago. Here are his thoughts on peer review. We agree with almost everything he said here and in the rest of the interview.
I think peer review is hindering science. In fact, I think it has become a completely corrupt system. It’s corrupt in many ways, in that scientists and academics have handed over to the editors of these journals the ability to make judgment on science and scientists. There are universities in America, and I’ve heard from many committees, that we won’t consider people’s publications in low impact factor journals.
Now I mean, people are trying to do something, but I think it’s not publish or perish, it’s publish in the okay places [or perish]. And this has assembled a most ridiculous group of people. I wrote a column for many years in the nineties, in a journal called Current Biology. In one article, “Hard Cases”, I campaigned against this [culture] because I think it is not only bad, it’s corrupt. In other words it puts the judgment in the hands of people who really have no reason to exercise judgment at all. And that’s all been done in the aid of commerce, because they are now giant organisations making money out of it.
Elsewhere, Brenner gave away the secret formula for developing creativity.
The slide we agreed with most is the following one. We went through a similar experience (collaborators generating more and diverse data every few months) and decided to go for an iterative approach like his -
Please view all slides below. There is plenty to digest.
We earlier covered two of the RECOMB talks.
a) The paper by Rayan Chikhi (Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson and Paul Medvedev. On the representation of de Bruijn graphs)
De Novo Assembly of Human Genome with Only 1.5 GB RAM
b) The paper by SPAdes group (Yana Safonova, Anton Bankevich and Pavel Pevzner. DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes)
dipSpades Beats Haplomerger Hands Down in Diploid Assembly
The following talks also appear interesting -
c) Siavash Mirarab, Nam-Phuong Nguyen and Tandy Warnow. PASTA: ultra-large multiple sequence alignment
● Divide-and-conquer approach to alignment
● Decomposes the taxa set into small subsets
● Aligns the subsets using a “base” alignment
● “Merges” the subset alignments into a full MSA
● Co-estimates a phylogenetic tree
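The bullets above can be sketched structurally as a recursive divide-and-conquer loop. This is only a skeleton under stated assumptions: the "alignment" and "merge" steps below are trivial padding stand-ins, whereas real PASTA uses an external aligner (e.g. MAFFT) for subset alignments, profile-profile merging, and tree co-estimation with a tool like FastTree.

```python
# Structural skeleton of a divide-and-conquer MSA loop (NOT PASTA's algorithm;
# the align/merge steps are placeholder padding for illustration).
def pad(seqs):
    """Placeholder 'aligner': pad sequences to a common width with gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]

def pasta_like(seqs, max_subset=2):
    # 1. decompose the taxa set into small subsets (here: naive halving;
    #    real PASTA decomposes along a guide tree)
    if len(seqs) <= max_subset:
        return pad(seqs)            # 2. 'base' alignment of a small subset
    mid = len(seqs) // 2
    left = pasta_like(seqs[:mid], max_subset)
    right = pasta_like(seqs[mid:], max_subset)
    # 3. 'merge' the subset alignments into a full MSA (placeholder: re-pad)
    return pad(left + right)
```

The point of the skeleton is the control flow: subset size bounds the cost of each base alignment, so the expensive aligner never sees the full taxa set at once.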
d) Ngan Nguyen, Glenn Hickey, Daniel Zerbino, Brian Raney, Dent Earl, Joel Armstrong, David Haussler and Benedict Paten. Building a Pangenome Reference for a Population
e) Henry C.M. Leung, S.M. Yiu and Francis Chin. IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information
Code can be downloaded from here
f) Derek Aguiar, Eric Morrow and Sorin Istrail. Tractatus: an exact and subquadratic algorithm for inferring identity-by-descent multi-shared haplotype tracts
download code here
Here is what the community likes (according to Springer) -
We looked for the last paper, but found this one instead (same authors, different title, different year). How different is their approach compared to Sailfish by Patro?
An Alignment-free Regression Approach to Estimating Allele-Specific Expression in F1 Animals
The remaining ones are given below.
Accepted Papers for RECOMB 2014
Jianzhu Ma, Sheng Wang and Jinbo Xu. MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Siavash Mirarab, Nam-Phuong Nguyen and Tandy Warnow. PASTA: ultra-large multiple sequence alignment
Wen-Yun Yang, Farhad Hormozdiari, Eleazar Eskin and Bogdan Pasaniuc. A Spatial-Aware Haplotype Copying Model with Applications to Genotype Imputation
Emily Berger, Deniz Yorukoglu and Bonnie Berger. HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data
Ngan Nguyen, Glenn Hickey, Daniel Zerbino, Brian Raney, Dent Earl, Joel Armstrong, David Haussler and Benedict Paten. Building a Pangenome Reference for a Population
Zhanyong Wang, Jae-Hoon Sul, Sagi Snir, Jose A. Lozano and Eleazar Eskin. Gene-Gene Interactions Detection Using A Two-stage Model
Arne Müller, Frank Bruggeman, Brett Olivier and Leen Stougie. Fast Flux Module Detection using Matroid Theory
Shaun Mahony, Matthew Edwards, Esteban Mazzoni, Richard Sherwood, Akshay Kakumanu, Carolyn Morrison, Hynek Wichterle and David Gifford. An integrated model of multiple-condition ChIP-seq data reveals predeterminants of Cdx2 binding
Rui Wang and Scott Schmidler. Bayesian Multiple Protein Structure Alignment
Jianling Zhong, Todd Wasson and Alexander Hartemink. Learning protein-DNA interaction landscapes by integrating experimental data through computational models
Ewa Szczurek and Niko Beerenwinkel. Modeling mutual exclusivity of cancer mutations
Marinka Zitnik and Blaz Zupan. Imputation of Quantitative Genetic Interactions in Epistatic MAPs by Interaction Propagation Matrix Completion
Hamidreza Chitsaz and Mohammad Aminisharifabad. Exact Learning of RNA Energy Parameters From Structure
Shutan Xu, Shuxue Zou and Lincong Wang. A geometric clustering algorithm and its applications to structural data
Armin Töpfer, Tobias Marschall, Rowena A Bull, Fabio Luciani, Alexander Schönhuth and Niko Beerenwinkel. Viral quasispecies assembly via maximal clique enumeration
Yana Safonova, Anton Bankevich and Pavel Pevzner. DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes
Jan Hoinka, Alexey Berezhnoy, Zuben E. Sauna, Eli Gilboa and Teresa Przytycka. AptaCluster – A Method to Cluster HT-SELEX Aptamer Pools and Lessons from its Application
Mingfu Shao, Yu Lin and Bernard Moret. An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes
Kelley Harris, Sara Sheehan, John Kamm and Yun S. Song. Decoding Coalescent Hidden Markov Models in Linear Time
Chen-Ping Fu, Vladimir Jojic and Leonard Mcmillan. An Alignment-Free Regression Approach for Estimating Allele-Specific Expression using RNA-Seq Data
Henry C.M. Leung, S.M. Yiu and Francis Chin. IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information
Arun Konagurthu, Parthan Kasarapu, Lloyd Allison, James Collier and Arthur Lesk. On sufficient statistics of least-squares superposition of vector sets
Yu Zheng and Louxin Zhang. Reconciliation with Non-binary Gene Trees Revisited
Keith Noto, Carla Brodley, Saeed Majidi, Diana Bianchi and Donna Slonim. CSAX: Characterizing Systematic Anomalies in eXpression Data
Adrian Guthals, Christina Boucher and Nuno Bandeira. The generating function approach for peptide identification in spectral networks
Raunak Shrestha, Ermin Hodzic, Jake Yeung, Kendric Wang, Thomas Sauerwald, Phuong Dao, Shawn Anderson, Himisha Beltran, Mark A. Rubin, Colin Collins, Gholamreza Haffari and S. Cenk Sahinalp. HIT.nDRIVE: Multi-Driver Gene Prioritization based on Hitting Time
Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo van Iersel, Leen Stougie, Gunnar W. Klau and Alexander Schoenhuth. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads
Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson and Paul Medvedev. On the representation of de Bruijn graphs
Adam Bloniarz, Ameet Talwalkar, Jonathan Terhorst, Michael Jordan, David Patterson, Bin Yu and Yun Song. Changepoint Analysis for Efficient Variant Calling
Y. William Yu, Deniz Yorukoglu and Bonnie Berger. Traversing the k-mer landscape of NGS read datasets for quality score sparsification
Hetu Kamisetty, Bornika Ghosh, Christopher James Langmead and Chris Bailey-Kellogg. Learning Sequence Determinants of Protein:protein Interaction Specificity with Sparse Graphical Models
Ben Raphael and Fabio Vandin. Simultaneous Inference of Cancer Pathways and Tumor Progression from Cross-Sectional Mutation Data
Shay Zakov and Vineet Bafna. Reconstructing Breakage Fusion Bridge architectures using noisy copy numbers
Derek Aguiar, Eric Morrow and Sorin Istrail. Tractatus: an exact and subquadratic algorithm for inferring identity-by-descent multi-shared haplotype tracts
Hua Wang, Heng Huang and Chris Ding. Correlated Protein Function Prediction via Maximization of Data-Knowledge Consistency
Yana Safonova, Anton Bankevich and Pavel Pevzner wrote a new paper that has been accepted for RECOMB – “DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes”.
From the dipSPAdes website,
dipSPAdes is a genome assembler designed specifically for diploid highly polymorphic genomes based on SPAdes. It takes advantage of divergence between haplomes in repetitive genome regions to resolve them and construct longer contigs. dipSPAdes produces consensus contigs (representing a consensus of both haplomes for the orthologous regions) and performs haplotype assembly. Note that dipSPAdes can only benefit from a high polymorphism rate (at least 0.4%). For data with a low polymorphism rate, no improvement in terms of N50 vs conventional assemblers is expected.
The assembly pipeline consists of three steps -
1. Assembly of haplocontigs (contigs representing both haplomes).
2. Consensus contigs construction.
3. Haplotype assembly.
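Step 2 above can be illustrated with a toy example: given two aligned haplocontigs from the two haplomes, emit a consensus sequence and record the polymorphic positions that feed the haplotype assembly step. Real dipSPAdes works on the assembly graph, not on pre-aligned strings; the positionwise comparison below is an assumption made to keep the sketch short.

```python
# Toy consensus construction from two pre-aligned haplocontigs
# (illustrative only; not dipSPAdes' graph-based algorithm).
def consensus(haplome_a, haplome_b):
    assert len(haplome_a) == len(haplome_b), "toy example assumes aligned input"
    cons, snps = [], []
    for i, (a, b) in enumerate(zip(haplome_a, haplome_b)):
        cons.append(a)              # arbitrarily keep haplome A's base
        if a != b:
            snps.append((i, a, b))  # divergent site, kept for haplotype assembly
    rate = len(snps) / len(haplome_a)
    return "".join(cons), snps, rate

seq, snps, rate = consensus("ACGTTGCA", "ACGATGCA")
# dipSPAdes only helps when the divergence rate is high enough (>= 0.4%);
# this toy pair diverges at 1 of 8 positions.
```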
Here is an example of how the complex regions are resolved.
The benchmarks look very impressive, as you can find in the following table.
We expect the real competition to be between technology (all-PacBio assembly with Jason Chin’s diploid assembler) and algorithm (short reads + dipSPAdes).