Note. Please check here for various commentaries on the Assemblathon paper.
—–
Over the last two days, we have been going through the Assemblathon-2 paper and its supplements carefully, and here is what we learned so far. The following text is not a coherent discussion of the paper, but rather an assortment of mostly unrelated comments.
End of Velvet Era
In Assemblathon-1, many teams used Velvet as their primary assembler. This time, we hardly see it mentioned.
Number of Independent Assemblers
At first, we were excited to find out that Baylor’s assemblies competed well and thought that they developed another kick-ass assembly program we can start parsing in the coming months. Then we realized that they used ALLPATHS-LG as the contig assembler, and their own Atlas-Link and Atlas-GapFill for scaffolding and gap-filling steps. Overall, three assemblers (ABySS, ALLPATHS-LG and SOAPdenovo2) dominated the landscape and were used by multiple groups. Other assemblers (Phusion, Ray, Meraculous, Newbler, Celera and SGA) were used primarily by their developers. Also, the French Symbiose team developed a set of assembly tools that competed very well. Wish they gave a unique name to the package for marketing purposes.
The above observation about mixing of assembly tools opens up an interesting challenge for anyone trying to interpret the results. Let us say Broad Institute, Baylor, CSHL and University of Georgia all used ALLPATHS-LG for their assemblies and ranked differently in the contest. Does that mean ALLPATHS-LG is a good and a lousy assembler at the same time? Or does it mean the associated tools used by four groups were of different quality? Or is the assemblathon measuring competence of teams and not qualities of automated assembly tools?
Cost of Genome Assembly
The fish genome (1 Gb) had 192x coverage from 8 Illumina HiSeq libraries. The library cost is ~$15-20K.
For assembly cost, we can play with numbers from various teams. Here is the one from Ray -
Computational requirements
Version: 32 computers, 8 cores per computer, 24 GB RAM per computer. Approx. running time: 36–72 hours (depending on species).
Machine cost ~ $32000.
Energy requirement for computers – 60 hours x 1 kW/machine x 32 machines = 1920 kWh (assuming each machine draws about 1 kW)
At 20c/kWh, energy cost = $384/assembly
We exclude the cooling cost of the computing center, which is expected to add to the base energy cost.
Are we doing the calculation correctly? Does a user run multiple assemblies to pick the best one, or is the 36-72 hours inclusive of all iterations?
Given that the fish genome is 1/3 the size of the human genome, does it cost $384*3=$1152 in energy to assemble a human genome?
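The arithmetic above can be redone for the whole 36–72 hour range with a few lines of Python. This is only a sketch: the 1 kW draw per machine and the 20c/kWh rate are the rough guesses used above, the linear scaling to a 3x larger genome is the same naive assumption as in our question, and the function name is ours.

```python
def energy_cost_usd(machines, hours, kw_per_machine=1.0, usd_per_kwh=0.20):
    """Return (energy in kWh, energy cost in USD) for one assembly run.

    kw_per_machine and usd_per_kwh are rough assumptions, not measured values.
    """
    kwh = machines * hours * kw_per_machine
    return kwh, kwh * usd_per_kwh


# 32 machines, as reported for Ray; 36 and 72 hours bracket the reported
# running time, 60 hours is the round figure used in the text.
for hours in (36, 60, 72):
    kwh, cost = energy_cost_usd(machines=32, hours=hours)
    print(f"{hours} h: {kwh:.0f} kWh, ${cost:.2f}")

# Naive linear scaling from the 1 Gb fish genome to a ~3 Gb human genome.
fish_kwh, fish_cost = energy_cost_usd(machines=32, hours=60)
human_cost = 3 * fish_cost
print(f"human (naive 3x scaling): ${human_cost:.2f}")
```

Cooling, storage and network costs are still left out, so the numbers are a lower bound on the real bill.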
There are other costs for storing sequences, high-speed internet, etc. that we neglected here. What is the real cost of automated assembly of a mammalian genome after all costs are accounted for?
Mirror Mirror on the Wall, Who Is the Fairest of Them All
The biggest contribution of the Assemblathon paper is likely its list of metrics to judge the quality of assemblies.
1. NG50 scaffold length: a measure of average scaffold length that is comparable between different assemblies (higher = better).
2. NG50 contig length: a measure of average contig length (higher = better)
3. Amount of scaffold sequence in gene-sized scaffolds (>= 25 Kbp): measured as the absolute difference from expected genome size, this helps describe the suitability of an assembly for gene finding purposes (lower = better).
4. CEGMA, number of 458 core genes mapped: indicative of how many genes might be present in assembly (higher = better).
5. Fosmid coverage: calculated using the COMPASS tool, reflects how much of the VFRs were captured by the assemblies (higher = better).
6. Fosmid validity: calculated using the COMPASS tool, measures the amount of the assembly that could be validated by the VFRs (higher = better).
7. VFR tag scaffold summary score: number of VFR tag pairs that both match the same scaffold multiplied by the percentage of uniquely mapping tag pairs that map at the correct distance apart. Rewards short-range accuracy (higher = better).
8. Optical map data, Level 1 coverage: a long-range indicator of global assembly accuracy (higher = better).
9. Optical map data, Levels 1+2+3 coverage: indicates how much of an assembly correctly aligned to an optical map, even if due to chimeric scaffolds (higher = better).
10. REAPR summary score: rewards short- and long-range accuracy, as well as low error rates (higher = better).
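Since NG50 appears in the first two metrics, here is a minimal sketch of how it is usually computed, next to the more familiar N50. The only difference is the denominator: N50 uses the total length of the assembly, while NG50 uses the estimated genome size, which is what makes it comparable between assemblies. The function names and the toy lengths are ours, not taken from the paper's scripts.

```python
def ng50(lengths, genome_size):
    """Length L such that sequences of length >= L together cover at least
    half of the estimated genome size. Returns 0 if the assembly is too
    small to ever reach that half."""
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(lengths):
    """N50 is NG50 with the assembly's own total length as the genome size."""
    return ng50(lengths, sum(lengths))


# Toy scaffold lengths (arbitrary units), total = 290.
scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(scaffolds))         # half of 290 is 145; 80 + 70 = 150 reaches it
print(ng50(scaffolds, 400))   # half of 400 is 200; 80 + 70 + 50 = 200 reaches it
print(ng50(scaffolds, 1000))  # half of 1000 is 500; never reached
```

On the toy data, the same set of scaffolds gives a smaller value as the assumed genome size grows, which is why an assembly that drops half the genome cannot hide behind a good N50.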
What will happen in Assemblathon-3?
Assemblathon-3 is possibly the last thing in the minds of various genome assembly teams after finishing this major paper, but is there a need for Assemblathon-3? We believe there should not be any Assemblathon-3. Instead, the next step should be to completely dismantle various assembler programs into smaller units, and explain why they performed the way they did by mixing and matching various components from competing assemblers.
Spammer’s paradise
Here is something funny. When we opened the Assemblathon-2 paper, the first thing that came to our mind was that email spammers would have a field day with so many email addresses given together. I guess it is just the cost of doing business.






Nice post !
Assemblathon 2 was really a nice project. I enjoyed being part of that.
For the Ray pricing, using IaaS (infrastructure as a service) is way cheaper if the workflow
is not done regularly. Otherwise, it’s a case-by-case analysis.
This Amazon EC2 instance type is suitable for de novo assembly with Ray:
Cluster Compute Eight Extra Large Instance
60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No*
API name: cc2.8xlarge
* On demand price for Linux/UNIX: 2.400 $ / h
* Spot instance pricing: 0.27 $ / h (price change according to demand)
In Assemblathon 2, we used 256 cores (32 machines, 8 cores each). So that’s 16 cc2.8xlarge instances.
16 cc2.8xlarge instances give:
* 256 Intel Xeon cores (16 * 16 cores)
* 968 GiB of distributed memory (16 * 60.5 GiB)
* Fast interconnect: Very High (10 Gigabit Ethernet)
I would probably try out the hyperthreading capability of the Xeon E5-2670 too, that could
boost performance without any additional cost.
Using the maximum (72 hours), that would cost 2764.80 $ ( 16 * (72 h) * (2.4 $ / h))
On Amazon EC2, kWh are included in the price.
If you use the spot instance market, you can probably get the job done with 311.04 $ (assuming
there are no bursts in the market and that the price remains at 0.27 $ / h).
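Both estimates above come from the same multiplication. A minimal sketch, using only the prices quoted in this comment (they change over time, and the spot price is assumed to stay flat for the whole run); the function name is made up for illustration:

```python
def cluster_cost_usd(instances, hours, usd_per_instance_hour):
    """Cost of running `instances` machines for `hours` at a flat hourly rate."""
    return instances * hours * usd_per_instance_hour


INSTANCES = 16         # cc2.8xlarge instances, 16 cores each = 256 cores
HOURS = 72             # upper end of the 36-72 hour running time
ON_DEMAND_RATE = 2.40  # $ per instance-hour, as quoted above
SPOT_RATE = 0.27       # $ per instance-hour, as quoted above (varies with demand)

on_demand = cluster_cost_usd(INSTANCES, HOURS, ON_DEMAND_RATE)
spot = cluster_cost_usd(INSTANCES, HOURS, SPOT_RATE)

print(f"cores: {INSTANCES * 16}")
print(f"on demand: {on_demand:.2f} $")
print(f"spot:      {spot:.2f} $")
```

A run at the lower end of the range (36 hours) would simply halve both figures.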
If data is not already in Amazon EBS (Elastic Block Storage), the cheapest way to go is to
create a volume to receive data (1 TiB for example), start a cheap Amazon EC2 instance,
attach the EBS volume on the instance and transfer the data. You don’t want to start 16 cc2.8xlarge
instances and have them waiting for the transfer to complete !
When transfer is done, kill the cheap instance and start the 16 cc2.8xlarge instances.
Attach the volume on one of the instances. Then create a network file system (either with NFS or sshfs).
Install Ray (needs openmpi and g++ on an Amazon AMI), and voilà.
You need to add the cost of EBS (Elastic Block Storage), but this will not be significant compared to
the cost of running instances (see above).
The EBS volume charges are:
Amazon EBS Provisioned IOPS volumes
$0.125 per GB-month of provisioned storage
$0.10 per provisioned IOPS-month
Assuming 1 TiB stored for 1 month: about 125 $
The number of IOPS-month is hard to estimate, but Ray does not touch disks that much so it does not matter.
1 IOPS-month (IOPS = input/output operation per second) is around 2629800.0 IOPs (plural form of input/output operation).
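The storage figures can be checked the same way. This is a sketch under stated assumptions: the two prices are the ones quoted in this comment, a month is taken as 1/12 of a 365.25-day year (which is where 2629800 comes from), the 100 provisioned IOPS is a made-up example value, and the function name is hypothetical.

```python
USD_PER_GB_MONTH = 0.125   # provisioned storage price quoted above
USD_PER_IOPS_MONTH = 0.10  # provisioned IOPS price quoted above

# Seconds in an average month: 365.25 days / 12 months.
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12


def ebs_monthly_cost_usd(gb, provisioned_iops, months=1):
    """Storage charge plus provisioned-IOPS charge for an EBS volume."""
    storage = gb * USD_PER_GB_MONTH * months
    iops = provisioned_iops * USD_PER_IOPS_MONTH * months
    return storage + iops


# One IOPS sustained for a month corresponds to this many operations.
print(f"operations in 1 IOPS-month: {SECONDS_PER_MONTH:.1f}")

# 1 TiB billed as 1000 GB vs. 1024 GB; 100 IOPS is a hypothetical setting.
for gb in (1000, 1024):
    total = ebs_monthly_cost_usd(gb, provisioned_iops=100)
    print(f"{gb} GB + 100 IOPS for 1 month: {total:.2f} $")
```

Either way the total stays small next to the cost of the compute instances.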
Something I noticed in Additional_file_3.pdf:
Note: The submitted SOAPdenovo snake assembly was generated at a time when some of the
Illumina mate-pair libraries were temporarily mislabelled (details of 4 Kbp and 10 Kbp libraries
were mistakenly switched). A new assembly based, using correct insert sizes of 4 Kbp and 10
Kbp libraries was produced and available at:
ftp://…
Note that those mate-pair libraries from flowcell 2 were not used in either the original submitted
entry or this new assembly. The scaffold N50 and contig N50 of new assembly are 7,144,364 bp
and 53,419 bp, about 4-fold and 3-fold longer than that of the submitted entry, which were
1,772,383 bp and 17,869 bp, respectively
Thanks Sébastien. That is a very informative comment.
Is the 72 hour estimate one shot deal, or do you try the assembly with different options to see which one is best? How many parameters do you need to iterate over to decide about the best assembly?
> Is the 72 hour estimate one shot deal, or do you try the assembly with different
> options to see which one is best?
It depends. Since I am a developer, I did a lot of jobs with checkpoints, but almost none with all
steps. Furthermore, I did not try with MAXKMERLENGTH=96 (default is MAXKMERLENGTH=32) because
the communication code that deals with large k-mers was not stable at the time.
> How many parameters do you need to iterate over
> to decide about the best assembly?
With Ray, you basically just play with the k-mer length (-k) and also you want to
verify that Ray detects all paired libraries correctly (this is automated and
it works fine in most cases).
But for Assemblathon 2, I basically just used -k 31. -k 51 or something like that
would be really better with the current Ray as the large k-mer code is quite robust.
From what people can see in the Assemblathon 2 paper, Ray is very good for assembling contigs
almost without any errors, but the scaffolder plugin in Ray is way too conservative.
I have a ticket opened to add QoS (Quality of Service) to the Ray scaffolder to see what’s going on.
I don’t have a clue right now why scaffolding with Ray is not very good whereas the rest is very good.
A recent human genome assembly with Ray using Illumina HiSeq 2500 2×150 data is on my blog, and I am currently working on scaling subsystems such as the Bloom filters (which are now adaptive).
http://dskernel.blogspot.ca/2013/01/assembly-of-human-genome-with-latest.html
But I think for now Ray is really targeting bacterial genomes and meta-genomes.
But Ray Platform (the parallel engine that Ray uses) scales de novo assembly regardless of whether it’s for bacteria, mammals or metagenomes.
The Assemblathon 2 datasets are small datasets, relatively speaking.
If you take the white spruce dataset SRA056234, it’s 8.5 billion reads for a 20 Gb genome.
http://www.ebi.ac.uk/ena/data/view/SRA056234&display=html
Shaun Jackman assembled that with ABySS !
I am waiting for some compute time to do a Ray run with the new adaptive Bloom filter.
For Big Data assemblies, you really need to scale on a lot of cores, a lot of nodes (or instances) and
also your data structures need to be adaptive.
Thank you Sébastien.