On January 1st, the Omicsomics blog asked for the perfect scaffolder.
Envisioning The Perfect Scaffolder
Rather than make any New Year’s resolutions of my own, which I would then feel guilty about not keeping, I’ve decided to make one for someone else: they will write the perfect open source scaffolder. There’s a lot of scaffolders out there, both stand-alone and integrated into various assemblers, but none are quite right.
On March 6th, a benchmarking paper from the Sanger Institute delivered it, provided…
A comprehensive evaluation of assembly scaffolding tools
We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output.
Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our…
The paper appears to have done a very thorough job, testing each scaffolder with multiple aligners. The choice of aligner matters!
1. Please note that the subject line is written in jest. Real scaffolders on real data perform quite poorly, given that ‘90% correct’ scaffolding is still a bad outcome for the 10% of the genome that is wrong.
2. We wish the paper had compared SPAdes alongside the other scaffolding methods.
@infoecho forwarded this arxiv paper that our readers may find interesting. Scaffolding in genome assembly is the problem we have in mind: the scaffolding graph can get big, and its connectivity distribution possibly follows a power law. @infoecho is likely looking at the possibility of storing a large string graph for PacBio assembly.
GraphChi-DB: Simple Design for a Scalable Graph Database System — on Just a PC
We propose a new data structure, Parallel Adjacency Lists (PAL), for efficiently managing graphs with billions of edges on disk. The PAL structure is based on the graph storage model of GraphChi (Kyrola et. al., OSDI 2012), but we extend it to enable online database features such as queries and fast insertions. In addition, we extend the model with edge and vertex attributes. Compared to previous data structures, PAL can store graphs more compactly while allowing fast access to both the incoming and the outgoing edges of a vertex, without duplicating data. Based on PAL, we design a graph database management system, GraphChi-DB, which can also execute powerful analytical graph computation. We evaluate our design experimentally and demonstrate that GraphChi-DB achieves state-of-the-art performance on graphs that are much larger than the available memory. GraphChi-DB enables anyone with just a laptop or a PC to work with extremely large graphs.
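The key claim in the abstract – fast access to both incoming and outgoing edges of a vertex without duplicating edge data – can be illustrated with a toy in-memory analogue. This is NOT the actual on-disk PAL structure (which shards edges and supports online insertion); it is just a sketch, under our own naming, of storing each edge once and indexing it from both directions.

```python
# Toy analogue of the PAL idea: store each edge exactly once, sorted by
# source vertex, and keep a secondary per-destination index that points
# back into the same edge array, so both out-edges and in-edges of a
# vertex are reachable without duplicating the edge records.
from collections import defaultdict

class ToyEdgeStore:
    def __init__(self, edges):
        # edges: list of (src, dst) pairs; stored once, sorted by src
        self.edges = sorted(edges)
        self.out_index = defaultdict(list)   # src -> positions in self.edges
        self.in_index = defaultdict(list)    # dst -> positions in self.edges
        for pos, (src, dst) in enumerate(self.edges):
            self.out_index[src].append(pos)
            self.in_index[dst].append(pos)

    def out_edges(self, v):
        return [self.edges[p][1] for p in self.out_index[v]]

    def in_edges(self, v):
        return [self.edges[p][0] for p in self.in_index[v]]

g = ToyEdgeStore([(0, 1), (0, 2), (1, 2), (3, 2)])
print(g.out_edges(0))  # [1, 2]
print(g.in_edges(2))   # [0, 1, 3]
```

The real PAL structure achieves the same two-way access on disk, at billions-of-edges scale, which is what makes it interesting for large scaffolding or string graphs.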
Scott Edmunds (@scedmunds) is an editor of GigaScience, the new, supposedly cutting-edge journal sponsored by BGI. This is a cross-post from his blog at GigaScience. I think GigaScience is doing a lot of innovative thinking in building a scientific journal that fits the internet-centric world of the 21st century.
Endorsing Data Citation
Nicely timed for the Data Citation Principles workshop at the IDCC meeting in San Francisco yesterday, the finalized Joint Declaration of Data Citation Principles has just been posted on the Force11 website. We of course endorse these, as data citation is an area we have been promoting and practicing since our formation, using it as a mechanism to incentivize and credit the early release of data from data producers. Most of the challenges have been cultural rather than technical, and despite some setbacks (for example from Nature Genetics), for over two years now we have had generally positive interactions working closely with publishers to make sure our dataset citations have been correctly cited according to DCC and DataCite guidelines. From working very closely with the editors of Genome Biology, our sorghum dataset was our first to be correctly cited in the references of a published paper, and BioMed Central now uses this example in the formatting instructions for all of their journals. We have blogged regularly on the topic, but for a more detailed overview of our and others’ efforts in data citation, check out our paper in the BMC Research Notes Data standardization, sharing and publication series.
Amounting to more than a hill of beans: new data and functionality in GigaDB
Following in the footsteps of sorghum, the latest dataset to be published in GigaDB today is another agricultural crop important to food security in the developing world: the genome of the chickpea. As with sorghum, this is another useful example for data citation, being released just in time to showcase new functionality in our GigaScience GigaDB database. The latest release, just out this week, includes a number of new features – some minor improvements to formatting, browsing and the submission system, and the ability to contact dataset authors directly – but most relevant here, it now has citation manager support. Using functionality handily provided by DataCite, we have added new buttons that allow citation information to be downloaded in RIS, BibTeX and text formats (see the blue boxes next to the citation information in the screenshot below), suitable for most citation manager software.
Please let us know if you find any bugs in this new release. These new tools aim to make the process of citing data even simpler, reducing the technical barriers and leaving only the cultural ones to overcome. We will not get into the etiquette of when to cite data versus papers, but Sarah Callaghan does a fantastic job covering this topic in her recent blog. Our rationale is that if you feel that data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs, then they should be treated in the same manner. We would encourage others to sign the declaration and help spread the practice of data citation further.
1. Edmunds, S., Pollard, T., Hole, B., & Basford, A. (2012). Adventures in data citation: sorghum genome data exemplifies the new gold standard. BMC Research Notes, 5(1). DOI: 10.1186/1756-0500-5-223
2. Varshney, R.K., et al. (2014). Genomic data of the chickpea (Cicer arietinum). GigaScience Database. http://dx.doi.org/10.5524/100076
3. Force11 Data Citation Principles http://www.force11.org/datacitation/
A recent interview of Sydney Brenner (How Academia and Publishing are Destroying Scientific Innovation: A Conversation with Sydney Brenner) is making the rounds. It contains a link to a witty and allegorical article he wrote in Current Biology in 1996, complaining about ‘Nascence’, which readers might enjoy.
The plaintiffs claim that by not being able to publish in Nascence, they have suffered injury to their professional careers and are claiming compensatory damages. It can be argued that this is the fate of many scientists and that their claims should be rejected just as their paper was, but we intend to establish that the plaintiffs were wrongfully excluded, that they were unable to confront the negative referee directly and that the Editor was negligent in not checking the validity of this referee’s statements. Even though the Editors will claim that many factors were taken into consideration in their rejection, it is a fair implication that it was the negative comments of one referee that turned the balance.
Your Lordship may find it surprising that, in a profession that prides itself on the objectivity and rigour of scientific argument, individuals are allowed to make ex cathedra statements without any direct support and that the journals believe that they need to preserve the anonymity of such commentators. Their names have now been provided by the defendants on pain of imprisonment, since your Lordship’s ruling that failure to do so would be viewed as contempt of court.
We intend to prove by cross examining the referee that the statements had no justification. We also will show that the Editor, although possessing an academic qualification of some relevance, was essentially a lay person in this specialised field and should have sought additional opinion rather than giving undue weight to a negative view, not once but twice.
Kudos to Brenner for recognizing these problems some twenty years ago. Here are his thoughts on peer review. We agree with almost everything he says here and in the rest of the interview.
I think peer review is hindering science. In fact, I think it has become a completely corrupt system. It’s corrupt in many ways, in that scientists and academics have handed over to the editors of these journals the ability to make judgment on science and scientists. There are universities in America, and I’ve heard from many committees, that we won’t consider people’s publications in low impact factor journals.
Now I mean, people are trying to do something, but I think it’s not publish or perish, it’s publish in the okay places [or perish]. And this has assembled a most ridiculous group of people. I wrote a column for many years in the nineties, in a journal called Current Biology. In one article, “Hard Cases”, I campaigned against this [culture] because I think it is not only bad, it’s corrupt. In other words it puts the judgment in the hands of people who really have no reason to exercise judgment at all. And that’s all been done in the aid of commerce, because they are now giant organisations making money out of it.
Elsewhere, Brenner gave away the secret formula for developing creativity
The slide we agreed with most is the following one. We went through a similar experience (collaborators generating more and more diverse data every few months) and, like him, decided on an iterative approach -
Please view all slides below. There is plenty to digest.
We earlier covered two of the RECOMB talks.
a) The paper by Rayan Chikhi (Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson and Paul Medvedev. On the representation of de Bruijn graphs)
De Novo Assembly of Human Genome with Only 1.5 GB RAM
b) The paper by SPAdes group (Yana Safonova, Anton Bankevich and Pavel Pevzner. DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes)
dipSpades Beats Haplomerger Hands Down in Diploid Assembly
The following talks also appear interesting -
c) Siavash Mirarab, Nam-Phuong Nguyen and Tandy Warnow. PASTA: ultra-large multiple sequence alignment
● Divide-and-conquer approach to alignment
● Decomposes the taxa set into small subsets
● Aligns the subsets using a “base” alignment
● “Merges” the subset alignments into a full MSA
● Co-estimates a phylogenetic tree
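The divide-align-merge loop in the bullets above can be sketched as a short control-flow skeleton. This is NOT PASTA itself – PASTA delegates subset alignment to a real base aligner (e.g. MAFFT) and merging to dedicated tools, and co-estimates a tree – so the align/merge steps here are trivial pad-to-width placeholders, just to show the shape of the algorithm.

```python
# Skeleton of a divide-and-conquer MSA: decompose, align subsets with a
# "base" aligner, then merge subset alignments into one full alignment.
# The align/merge functions below are placeholders, not real aligners.

def align_subset(seqs):
    # placeholder "base" aligner: pad sequences to equal width with gaps
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]

def merge_alignments(alignments):
    # placeholder merge: pad all sub-alignments to a common width
    width = max(len(a[0]) for a in alignments)
    return [row.ljust(width, "-") for a in alignments for row in a]

def divide_and_conquer_msa(seqs, max_subset=2):
    # 1. decompose the taxa set into small subsets
    subsets = [seqs[i:i + max_subset] for i in range(0, len(seqs), max_subset)]
    # 2. align each subset using the "base" aligner
    sub_alignments = [align_subset(s) for s in subsets]
    # 3. merge the subset alignments into a full MSA
    return merge_alignments(sub_alignments)

msa = divide_and_conquer_msa(["ACGT", "ACG", "AC", "ACGTT"])
print(msa)  # every row has the same width
```

In PASTA, the decomposition is guided by a current tree estimate and the tree is then re-estimated from the merged alignment, iterating the whole loop.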
d) Ngan Nguyen, Glenn Hickey, Daniel Zerbino, Brian Raney, Dent Earl, Joel Armstrong, David Haussler and Benedict Paten. Building a Pangenome Reference for a Population
e) Henry C.M. Leung, S.M. Yiu and Francis Chin. IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information
Code can be downloaded from here
f) Derek Aguiar, Eric Morrow and Sorin Istrail. Tractatus: an exact and subquadratic algorithm for inferring identity-by-descent multi-shared haplotype tracts
download code here
Here is what the community likes (according to Springer) -
We looked for the last paper but found this one instead (same authors, different title, different year). How different is their approach from Sailfish by Rob Patro?
An Alignment-free Regression Approach to Estimating Allele-Specific Expression in F1 Animals
The remaining ones are given below.
Accepted Papers for RECOMB 2014
Jianzhu Ma, Sheng Wang and Jinbo Xu. MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Siavash Mirarab, Nam-Phuong Nguyen and Tandy Warnow. PASTA: ultra-large multiple sequence alignment
Wen-Yun Yang, Farhad Hormozdiari, Eleazar Eskin and Bogdan Pasaniuc. A Spatial-Aware Haplotype Copying Model with Applications to Genotype Imputation
Emily Berger, Deniz Yorukoglu and Bonnie Berger. HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data
Ngan Nguyen, Glenn Hickey, Daniel Zerbino, Brian Raney, Dent Earl, Joel Armstrong, David Haussler and Benedict Paten. Building a Pangenome Reference for a Population
Zhanyong Wang, Jae-Hoon Sul, Sagi Snir, Jose A. Lozano and Eleazar Eskin. Gene-Gene Interactions Detection Using A Two-stage Model
Arne Müller, Frank Bruggeman, Brett Olivier and Leen Stougie. Fast Flux Module Detection using Matroid Theory
Shaun Mahony, Matthew Edwards, Esteban Mazzoni, Richard Sherwood, Akshay Kakumanu, Carolyn Morrison, Hynek Wichterle and David Gifford. An integrated model of multiple-condition ChIP-seq data reveals predeterminants of Cdx2 binding
Rui Wang and Scott Schmidler. Bayesian Multiple Protein Structure Alignment
Jianling Zhong, Todd Wasson and Alexander Hartemink. Learning protein-DNA interaction landscapes by integrating experimental data through computational models
Ewa Szczurek and Niko Beerenwinkel. Modeling mutual exclusivity of cancer mutations
Marinka Zitnik and Blaz Zupan. Imputation of Quantitative Genetic Interactions in Epistatic MAPs by Interaction Propagation Matrix Completion
Hamidreza Chitsaz and Mohammad Aminisharifabad. Exact Learning of RNA Energy Parameters From Structure
Shutan Xu, Shuxue Zou and Lincong Wang. A geometric clustering algorithm and its applications to structural data
Armin Töpfer, Tobias Marschall, Rowena A Bull, Fabio Luciani, Alexander Schönhuth and Niko Beerenwinkel. Viral quasispecies assembly via maximal clique enumeration
Yana Safonova, Anton Bankevich and Pavel Pevzner. DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes
Jan Hoinka, Alexey Berezhnoy, Zuben E. Sauna, Eli Gilboa and Teresa Przytycka. AptaCluster – A Method to Cluster HT-SELEX Aptamer Pools and Lessons from its Application
Mingfu Shao, Yu Lin and Bernard Moret. An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes
Kelley Harris, Sara Sheehan, John Kamm and Yun S. Song. Decoding Coalescent Hidden Markov Models in Linear Time
Chen-Ping Fu, Vladimir Jojic and Leonard Mcmillan. An Alignment-Free Regression Approach for Estimating Allele-Specific Expression using RNA-Seq Data
Henry C.M. Leung, S.M. Yiu and Francis Chin. IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information
Arun Konagurthu, Parthan Kasarapu, Lloyd Allison, James Collier and Arthur Lesk. On sufficient statistics of least-squares superposition of vector sets
Yu Zheng and Louxin Zhang. Reconciliation with Non-binary Gene Trees Revisited
Keith Noto, Carla Brodley, Saeed Majidi, Diana Bianchi and Donna Slonim. CSAX: Characterizing Systematic Anomalies in eXpression Data
Adrian Guthals, Christina Boucher and Nuno Bandeira. The generating function approach for peptide identification in spectral networks
Raunak Shrestha, Ermin Hodzic, Jake Yeung, Kendric Wang, Thomas Sauerwald, Phuong Dao, Shawn Anderson, Himisha Beltran, Mark A. Rubin, Colin Collins, Gholamreza Haffari and S. Cenk Sahinalp. HIT.nDRIVE: Multi-Driver Gene Prioritization based on Hitting Time
Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo van Iersel, Leen Stougie, Gunnar W. Klau and Alexander Schoenhuth. WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads
Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson and Paul Medvedev. On the representation of de Bruijn graphs
Adam Bloniarz, Ameet Talwalkar, Jonathan Terhorst, Michael Jordan, David Patterson, Bin Yu and Yun Song. Changepoint Analysis for Efficient Variant Calling
Y. William Yu, Deniz Yorukoglu and Bonnie Berger. Traversing the k-mer landscape of NGS read datasets for quality score sparsification
Hetu Kamisetty, Bornika Ghosh, Christopher James Langmead and Chris Bailey-Kellogg. Learning Sequence Determinants of Protein:protein Interaction Specificity with Sparse Graphical Models
Ben Raphael and Fabio Vandin. Simultaneous Inference of Cancer Pathways and Tumor Progression from Cross-Sectional Mutation Data
Shay Zakov and Vineet Bafna. Reconstructing Breakage Fusion Bridge architectures using noisy copy numbers
Derek Aguiar, Eric Morrow and Sorin Istrail. Tractatus: an exact and subquadratic algorithm for inferring identity-by-descent multi-shared haplotype tracts
Hua Wang, Heng Huang and Chris Ding. Correlated Protein Function Prediction via Maximization of Data-Knowledge Consistency
Yana Safonova, Anton Bankevich and Pavel Pevzner have a new paper accepted at RECOMB – “DipSPAdes: Assembler for Highly Polymorphic Diploid Genomes”.
From the dipSPAdes website:
dipSPAdes is a genome assembler designed specifically for diploid highly polymorphic genomes based on SPAdes. It takes advantage of divergence between haplomes in repetitive genome regions to resolve them and construct longer contigs. dipSPAdes produces consensus contigs (representing a consensus of both haplomes for the orthologous regions) and performs haplotype assembly. Note that dipSPAdes can only benefit from high polymorphism rate (at least 0.4%). For the data with low polymorphism rate no improvement in terms of N50 vs conventional assemblers is expected.
The assembly pipeline consists of three steps -
1. Assembly of haplocontigs (contigs representing both haplomes).
2. Consensus contigs construction.
3. Haplotype assembly.
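To make step 2 concrete, here is a toy illustration of what a consensus of two haplomes means – not dipSPAdes code, and grossly simplified (it assumes the two haplocontigs are already aligned and of equal length, with no indels): matching positions are kept, divergent positions are masked.

```python
# Toy illustration of consensus-contig construction (step 2 above).
# NOT dipSPAdes code: assumes two pre-aligned, equal-length haplomes;
# keeps matching bases and marks divergent positions with 'N'.

def toy_consensus(haplome_a, haplome_b):
    assert len(haplome_a) == len(haplome_b), \
        "toy example assumes aligned, equal-length haplomes"
    return "".join(a if a == b else "N"
                   for a, b in zip(haplome_a, haplome_b))

# dipSPAdes needs >= ~0.4% divergence between haplomes to help;
# in this toy pair, two of ten positions differ:
print(toy_consensus("ACGTACGTAC", "ACGAACGTTC"))  # ACGNACGTNC
```

In the real pipeline, the divergent positions are not simply masked: they are the signal that lets dipSPAdes both separate the two haplomes (step 3) and resolve repeats that look identical within a single haplome.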
Here is an example of how the complex regions are resolved.
The benchmarks look very impressive, as you can find in the following table.
We expect the real competition to be between technology (an all-PacBio assembly with Jason Chin’s diploid assembler) and algorithm (short reads + dipSPAdes).
Dear readers, we wanted to devote this commentary to the history of genome assembly algorithms from 1995 to today, as promised in An Opinionated History of Genome Assembly Algorithms – (i). However, when we tried to qualify the sentence “if a Nobel prize is ever awarded for the human genome, he (Gene Myers) should get it alone”, it was impossible to restrict the discussion to computer-science concepts only. In the end, we decided to split the commentary into two parts – this one on the ‘biological’ aspects of the human genome project and the next one on the technical developments. Remember, everything is ‘opinionated’, and some commentaries are more opinionated than others :).
1988 video of Eric Lander talking about the Human Genome Project (click on the image to listen) – his ‘functionator’ was finally ‘demonstrated’ by Ewan Birney in the 2012 ENCODE paper
Human Genome Project Put on Five Year Plans
Anyone who grew up in the former communist world laughs on hearing the words ‘five year plan’, because those plans typically symbolized utter incompetence. The Chinese had their five year plans in Mao’s time, the Soviets had them under Stalin, and the most incompetent government of all continues to develop ‘five year plans’ today. In contrast, economic conditions in China improved dramatically when its government had the good sense to ignore its ‘plans’.
Apparently, NIH did not get the memo when it placed the human genome project on five year plans. In all fairness, it was not entirely NIH’s fault. When the project was conceived (1988), the USSR was still going strong, and all 54 of the 50 ‘Soviet experts’ in Washington DC expected the system to last forever.
Nevertheless, the human genome project itself was a giant boondoggle that got off to a rocky start. The original plan was to finish the project in 2005, after three five year plans. To convince the US public that money was not being wasted, a reputed scientist, James Watson, was placed at the head of the project. The story line fascinated the public: sixty-year-old James Watson, who discovered the DNA structure in 1953, is now heading the NIH project to decode the human genome and staking his reputation on making sure that money is not being wasted.
Once the public money was secured, keeping Watson became too much of an inconvenience, and the opportunity to fire him came soon. A ‘patent crisis’ emerged in 1991, as we covered in an earlier commentary.
AT A CONGRESSIONAL BRIEFING ON THE Human Genome Project last summer, molecular biologist Craig Venter of the National Institute of Neurological Disorders and Stroke dropped a bombshell whose repercussions are still reverberating throughout the genome community. While describing his new project to sequence partially every gene active in the human brain, Venter casually mentioned that his employer, the National Institutes of Health, was planning to file patent applications on 1000 of these sequences a month.
“I almost fell off my chair,” says one briefing participant who asked not to be named. James Watson, who directs the genome project at NIH, did more than that, exploding and denouncing the plan as “sheer lunacy.” With the advent of automated sequencing machines, “virtually any monkey” can do what Venter’s group is doing, said Watson, who in one sentence managed to insult Venter, his dismayed postdocs, and Reid Adler, the director of NIH’s Office of Technology Transfer, who advised Venter to pursue the patents. “What is important is interpreting the sequence,” insisted Watson. If these random bits of sequences can be patented, he said, “I am horrified.”
Watson was gone in 1992 (check Watson Departure Vexes Genome Experts and Why Watson Quit as Project Head), and the new head, Francis Collins, joined a year later (1993). The changes created some minor inconveniences for the planners. By the time Collins came aboard as captain, the ‘five year plans’ were off target by three whole years. Therefore, 2003 was selected as the new end year (1993 + two five year periods). It was the perfect year for a new PR story: the human genome would be decoded on the 50th anniversary of the publication of the Watson/Crick paper.
Francis Collins turned out to be the perfect leader for the boondoggle as well. By his third year as leader of the HGP, he had retracted five of his own papers and put the entire blame on a junior collaborator (‘trainee’) for falsifying data. One of those papers had only two authors – Collins and the ‘trainee’. So, clearly, Collins never bothered to think critically about a scientific paper before putting his name on it and relied entirely on his colleague. That theme – senior researchers taking all credit for good papers but no blame for bad/fake papers – emerged as a major model of NIH-sponsored science in Collins’ time.
A Quick Recap on Stalin’s Five Year Plans, If You Are Feeling Nostalgic
Readers are warned not to fall for the ‘success’ of the first five years. It was achieved through the ‘Holodomor’ (extermination by hunger) of Ukrainians, which took place in 1932-33.
The Holodomor (Ukrainian: Голодомор, “Extermination by hunger” or “Hunger-extermination”; derived from ‘Морити голодом’, “Killing by Starvation” ) was a man-made famine in the Ukrainian Soviet Socialist Republic in 1932 and 1933 that killed up to 7.5 million Ukrainians. During the famine, which is also known as the “Terror-Famine in Ukraine” and “Famine-Genocide in Ukraine”, millions of citizens of Ukrainian SSR, the majority of whom were Ukrainians, died of starvation in a peacetime catastrophe unprecedented in the history of Ukraine. Since 2006, the Holodomor has been recognized by the independent Ukraine and several other countries as a genocide of the Ukrainian people.
Continue reading An Opinionated History of Genome Assembly Algorithms – (ii)
Please stop telling us by email that Gene Myers was not pleased to read our summary of his Dazzler assembler talk. Our commentaries are not written to please anyone, and even less so, when we try to quickly ‘assemble’ a coherent narrative from over two hundred tweets.
For that matter, neither are our commentaries written to displease any serious scientist (a qualifier that excludes ENCODE ‘leaders’), and we like to present the truth as accurately as possible. Therefore, for the benefit of the community, we will dig into the history of genome assembly algorithms and explain how de Bruijn graphs, string graph, etc. all came together.
Gene Myers made major contributions to genomics as the author of the Celera assembler and as a co-author of Stephen Altschul’s BLAST paper. In fact, we argued elsewhere (but cannot locate the post, even with Google’s help) that if a Nobel prize is ever awarded for the human genome, he should get it alone. However, if the Nobel committee decides to pick a group, he will need to share the award with all Celera stockholders who bought the stock in March 2000!
Those working on next-gen sequencing also benefit greatly from another of his contributions that does not get mentioned too often. His 1990 paper with Udi Manber got the ball rolling on suffix arrays by introducing a simple, space-efficient alternative to suffix trees. That concept, combined with the Burrows-Wheeler transform and the FM index, got incorporated into all modern short-read aligners. We discussed the history two years back in the following link.
Burrows Wheeler transform, Suffix Arrays and FM Index
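The link between suffix arrays and the BWT is tight enough to show in a few lines. This is a minimal sketch, not how production aligners do it – real tools use linear-time suffix array construction and a compressed FM-index on top of the BWT – but the relationship is exactly this:

```python
# Minimal sketch: build a suffix array, then derive the Burrows-Wheeler
# transform from it. BWT[i] is the character immediately preceding the
# i-th smallest suffix (wrapping around at the start of the text).

def suffix_array(s):
    # naive construction: sort all suffix start positions lexicographically
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s, sa):
    return "".join(s[i - 1] for i in sa)

text = "banana$"            # '$' is the usual lexicographically-smallest sentinel
sa = suffix_array(text)
print(sa)                   # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_sa(text, sa))  # annb$aa
```

The FM index adds rank/occurrence tables over this BWT string, which is what lets aligners like BWA and Bowtie search reads against it in time proportional to the read length.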
How could one person make so many fundamental contributions in both genome assembly and sequence alignment? For that, you need to dig into the 1991 doctoral dissertation of John D. Kececioglu, which contains a lot of intellectual groundwork for his later contributions.
Exact and approximation algorithms for DNA sequence reconstruction
Also, it is not coincidental that Kececioglu had both Manber and Myers on his thesis committee (we will explain later).
Continue reading An Opinionated History of Genome Assembly Algorithms – (i)
These slides describe ECTools from M. Schatz and present a hybrid assembly of the rice genome.
The most over-rated genome in the world -