Broke UK Government Taxing Soldiers’ Medal for Bravery to Fund #100KGP GWAS Madness

If you check the Twitter page of the UK company Genomics England, a happy picture emerges of an ambitious company building the latest medical knowledge base. Dig a bit deeper and you find that Genomics England is just an arm of the UK government (“Genomics England was established in July 2013 as a company 100% owned by the Department of Health.”) and the project is funded by taxpayer money. One can reinterpret it as a prosperous country investing in the future of science and technology. The related news stories all reflect that ambitious future plan -

DNA project ‘to make UK world genetic research leader’

A project aiming to revolutionise medicine by unlocking the secrets of DNA is under way in centres across England.

Prime Minister David Cameron has said it “will see the UK lead the world in genetic research within years”.

The first genetic codes of people with cancer or rare diseases, out of a target of 100,000, have been sequenced.

Experts believe it will lead to targeted therapies and could make chemotherapy “a thing of the past”.

Just one human genome contains more than three billion base pairs – the building blocks of DNA.

But how prosperous is the UK really? Apparently it is so broke that the government has started to tax the medals won by soldiers for bravery in past wars (to fund GWAS among other things) !!

‘I was told to pay death duty on Dad’s medals’

Like most old soldiers, my father Jim would never talk about the horrors he witnessed in the Second World War. He served as a signals officer with the Royal Navy on convoy duty in the treacherous waters of the north Atlantic, on the bridge of the admiral’s flagship on D-Day, and in operations to search and destroy enemy submarines in the Indian Ocean and off the west coast of Africa. It was dangerous work, and the sight of men drowning in torpedo oil as merchant ships were sunk in freezing waters was too painful to recount.
Even the medals he received at the end of the war were given scant attention. My mother, Mary, kept them in a box which she placed in a brown envelope simply labelled “Dad’s Medals”, and put them in the back of a drawer.
My father only took them out once. He had, for a time, served with the Free French on a Corvette with a ship’s company of 98 French sailors and only four British. In 1992 the French Government honoured all foreign nationals who had taken up arms with their forces and awarded my father a small pension. They also invited him to receive a handsome certificate commending his bravery on behalf of their nation and a medal, Le Croix du Combattant. I accepted this on his behalf at a ceremony at Caen Town Hall, on the north French coast, at which I was kissed enthusiastically on both cheeks by the mayor.

Following this, Dad was invited to the French ambassador’s residence in Kensington on Bastille Day, for which he took out his medals, pinned them proudly to his chest and marched into the elegant mansion, saluted by Foreign Legionnaires.

Dad died 12 years ago. My mother died at Christmas. Suddenly, the medals were mine.

So I was touched when the nice young man from our solicitors, George Ide LLP in Chichester, took a special interest in Dad’s medals, asked what they were awarded for and inquired in general about his war record. He had come to assess my mother’s modest furniture and few possessions for probate. She lived in a small flat in this West Sussex town and left a modest estate.

A few days later, a financial assessment of chests, pictures, lamps and ornaments in my mother’s flat dropped through her letter box. It came to just £375. I had no issue with this until my gaze fell on a price attached to Dad’s medals.

It read: “A World War Two Medal group of five to James Gilchrist in presentation frame: £40”.

And how relevant is this gigantic 100K genome project (#100KGP) for anything in science? Let us come back to that in a future commentary.

Our New PNAS Paper Debunks the Genomics of Positivity

Readers may recall our blog post from last year -

Tragedy of the Day: PNAS Got Duped by Positivity Lady !!

Whiskey Tango Foxtrot – Arianna Huffington Embraces PNAS-published Junk Science of Positivity !!!

UC Berkeley, Lior Pachter’s University, Offers MOOC Course on ‘Science’ of Positivity

That blog post led to a formal paper that just came out in PNAS -

A critical reanalysis of the relationship between genomics and well-being

Fredrickson et al. [Fredrickson BL, et al. (2013) Proc Natl Acad Sci USA 110(33):13684–13689] claimed to have observed significant differences in gene expression related to hedonic and eudaimonic dimensions of well-being. Having closely examined both their claims and their data, we draw substantially different conclusions. After identifying some important conceptual and methodological flaws in their argument, we report the results of a series of reanalyses of their dataset. We first applied a variety of exploratory and confirmatory factor analysis techniques to their self-reported well-being data. A number of plausible factor solutions emerged, but none of these corresponded to Fredrickson et al.’s claimed hedonic and eudaimonic dimensions. We next examined the regression analyses that purportedly yielded distinct differential profiles of gene expression associated with the two well-being dimensions. Using the best-fitting two-factor solution that we identified, we obtained effects almost twice as large as those found by Fredrickson et al. using their questionable hedonic and eudaimonic factors. Next, we conducted regression analyses for all possible two-factor solutions of the psychometric data; we found that 69.2% of these gave statistically significant results for both factors, whereas only 0.25% would be expected to do so if the regression process was really able to identify independent differential gene expression effects. Finally, we replaced Fredrickson et al.’s psychometric data with random numbers and continued to find very large numbers of apparently statistically significant effects. We conclude that Fredrickson et al.’s widely publicized claims about the effects of different dimensions of well-being on health-related gene expression are merely artifacts of dubious analyses and erroneous methodology.

For a simple explanation of what it is all about, please check the well-written blog post of our co-author James Coyne -

Reanalysis: No health benefits found for pursuing meaning in life versus pleasure

Was the original article a matter of “science” made for press release? Our article poses issues concerning the gullibility of the scientific community and journalists regarding claims of breakthrough discoveries from small studies with provocative, but fuzzy theorizing and complicated methodologies and statistical analyses that apparently even the authors themselves do not understand.

1. Multiple analyses of original data do not find separate factors indicating striving for pleasure versus purpose

2. Random number generators yield best predictors of gene expression from the original data

If you need an even simpler summary, reporter Sharon Begley from Reuters interviewed our co-author and wrote an article as well.

Happiness study draws frowns from critics

More crucially, Brown said, the happiness questionnaire was flawed. People who scored high on three items meant to identify hedonists scored equally highly on 11 items meant to identify people who seek eudaimonic well-being.

“The two constructs are essentially measuring the same thing,” Brown said, so putting people in one category rather than another was “meaningless.”

Most devastating was what happened when Brown grouped the items randomly, calling those who scored high on questions 1, 7 and 8 (or any of 8,191 other combinations) one kind of person and those who scored high on others a second type.

Even with such meaningless groupings, there were patterns of gene activity seemingly characteristic of each group.

Statistics professor Andrew Gelman of Columbia University, who was not involved in either study, called Brown’s critique “reasonable.”

Flawed statistics have become such a serious problem for journals that many of the world’s top titles are adding extra levels of statistical checks in the peer-review process. Deputy executive editor Daniel Salsbury said PNAS was not changing its longstanding practice, which is “to work within our review process to ensure the work is sound in all aspects.”

In a reply to Brown, Fredrickson and coauthor Steven Cole of the University of California, Los Angeles, reject the criticism and say they have replicated their 2013 findings in a new sample of 122 people.
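As a rough check on two of the numbers quoted above, the sketch below counts the ways to split a 14-item questionnaire into two non-empty groups and computes the chance that two independent tests both pass p<0.05. Treating the two groups as unordered and the two tests as independent are assumptions made here for illustration; this is not the code used in the paper, and the function name is hypothetical.

```python
from itertools import product

N_ITEMS = 14   # items in the questionnaire, as reported in the article
ALPHA = 0.05   # conventional significance threshold (assumed)

def two_group_splits(n_items):
    """Count splits of n_items into two non-empty, unordered groups."""
    count = 0
    # Pin the first item to group 0 so that swapping labels is not counted twice.
    for rest in product((0, 1), repeat=n_items - 1):
        if any(rest):  # skip the split that leaves group 1 empty
            count += 1
    return count

splits = two_group_splits(N_ITEMS)
both_significant_by_chance = ALPHA * ALPHA

print(splits)                               # 8191
print(f"{both_significant_by_chance:.2%}")  # 0.25%
```

Under those assumptions, the enumeration lands on the 8,191 figure and the 0.25% baseline, which is what makes the reported 69.2% so striking.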

Happiness study draws frowns from critics

NEW YORK (Reuters) – A high-profile 2013 study that concluded that different kinds of happiness are associated with dramatically different patterns of gene activity is fatally flawed, according to an analysis published on Monday which tore into its target with language rarely seen in science journals.

The new paper, published like the first in Proceedings of the National Academy of Sciences, slams the research for “dubious analyses” and “erroneous methodology” and says it “conjured nonexistent effects out of thin air.”

In the 2013 study, researchers had adults answer a 14-item questionnaire meant to sort them into a couple of groups: interested in hedonic well-being (fun and selfish pleasure) or eudaimonic well-being (leading a meaningful life).

We will add all other relevant links as they come along.

Gene expression study shows those enjoying subtle humor are healthier than those who do not

Here is a riddle for our readers. Let us say you open the newspaper and read about the latest and greatest new study.

“Gene expression Study shows those enjoying subtle humor are healthier than those who do not”

“Gene expression Study shows Republican-type people are smarter than Democrat-type people”

“Gene expression study in mice shows babies receiving proper parental care are less prone to develop psychological problems later in life”

How do you figure out whether the ‘study’ is properly conducted?

Four act play

You go to the original paper being cited in that report and find that it followed a ‘four-act play’.

Act A. questionnaire or phenotypic observations

In the first act, a number of questions are given to about 50-60 people. They are classified into two groups based on their answers to those questions.

“Do you drink tea?” Yes means you enjoy subtle humor.

“Do you follow Kim Kardashian on Twitter?” Yes means you cannot enjoy subtle humor.

“Do you have a gun?” – Yes means you are a ‘republican-type’ person.

The classification can also be done using phenotypic observations. By now it is well established that mice cannot fill out a questionnaire. So, a grad student is sent to the cage five times a day to check whether the mother is taking proper care of its babies.

Act B. gene expression data collected and analyzed according to clusters

The second act is data-intensive. The researchers collect, let’s say, blood samples from those 50 people and perform gene expression measurements through one of the high-throughput technologies. Those data get heavily churned and massaged through procedures that remind one of making butter from milk, except that in the case of gene expression data, the ultimate product is ‘genes with p<0.05’.

Act C. Focus on three or four over/under-expressed genes identified by gene expression study

Next, a handful of ‘relevant genes’ are pulled out from the list of genes with p<0.05 and further analyzed. This part depends on the imagination of the researcher. Maybe he checks the 5' region for all those genes and finds a common motif. Voila !!

Act D. Publication in a highly visible peer-reviewed journal and press-release

It is not difficult to fool the editors and reviewers of highly visible journals, provided one follows the proper procedure (as listed above). After all, who is really going to dig through tons and tons of data and check all the assumptions in the analysis?

The media love to report breathlessly about any unusual human-related finding, and that is how you happened to see the ‘new study’ in the first place.
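To see why ‘genes with p<0.05’ are so easy to produce, here is a minimal simulation sketch. It assumes 2000 genes and two groups of 25 people, draws every expression value from the same distribution (so there is no real effect at all), and uses a normal approximation to a two-sample test. All names and parameters are hypothetical choices for illustration, not taken from any of the studies discussed.

```python
import random
from statistics import NormalDist

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def count_null_hits(n_genes=2000, group_size=25, alpha=0.05, seed=1):
    """Count genes passing p < alpha when both groups are pure noise."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_genes):
        a = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        b = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
        mean_a, var_a = mean_and_var(a)
        mean_b, var_b = mean_and_var(b)
        # Two-sample z statistic (normal approximation to the t-test).
        z = (mean_a - mean_b) / ((var_a + var_b) / group_size) ** 0.5
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p < alpha:
            hits += 1
    return hits

hits = count_null_hits()
print(f"{hits} of 2000 pure-noise 'genes' reach p<0.05")
```

Roughly one gene in twenty passes the threshold by chance alone, which leaves plenty of candidates for Act C to pick from.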

Is it important to debunk bad studies?

One challenge in detecting bad studies is that the four steps are often similar to those of excellent, legitimate studies. Therefore, some readers of bad papers tend to assume that anyone following the above four steps is reporting an outstanding discovery, but is that always so?

The bigger challenge is that the experts in various fields are busy with their own lives, and are not interested in reanalyzing data from someone else’s paper. The rewards tend to be very low. We asked a few researchers working on basic science whether they would be interested in taking time to re-evaluate one of these studies. Often the response was that it was not important to do so, because ‘bad science would eventually get rejected’ when others could not replicate the observations. Yes, maybe in a hundred years, but here is how things work out in real life until then.

a) Those publishing bad studies get more funding to publish even more bad studies, and those not interested in debunking do not get heard (or worse, leave science).

b) Once bad science gets published in highly visible journals, the doctors, nutritionists, psychologists, etc. feel compelled to guide their patients based on such ‘latest research’. That is a very important social issue that needs to be considered.

c) Things can be even worse. There are cases where the doctors are ‘forced to use’ recommendations from such research due to updated government regulations.

We will write more on this in a subsequent commentary posted later today.

Benchmarking of Red-colored Bridges

While being stuck in traffic at the Golden Gate bridge during my last San Francisco trip, an idea occurred to me. How about I do some research work to benchmark all suspension bridges in the world?

A quick check on Wikipedia gave me over one hundred candidates, leading me to restrict my ambitious plan to only the red-colored bridges. That made sense, because the red-colored bridges attract an unusually high number of tourists and therefore should be an ideal subset representing all traffic-heavy suspension bridges.

After carefully considering all relevant factors, I picked the following bridges -

1. 25 de Abril Bridge, Portugal

2. Xihoumen Bridge, China

3. Golden Gate Bridge, San Francisco, USA

4. Yichang Bridge, China

5. Hirado Bridge, Nagasaki, Japan

Benchmarking Criteria and Funding

For the purpose of my study, I selected the following criteria. I will drive on the above bridges through every available access road on three different days of the week and at three different times during the day – early morning, midday and later in the evening.

I collected some preliminary data on the Golden Gate bridge accordingly and applied for a grant from NIH to do benchmarking. The grant application made my hobby project appear relevant for human health, as you can see from the introduction -

Rapid access to medical care is one of the biggest concerns, and overcrowded bridges and highways can lead to fatality for those patients seeking emergency treatment. This study will investigate the time of access across the busiest suspension bridges of the world.

To make the research project even more relevant for human health, I threw in a few irrelevant and arbitrary observations, such as the fact that two of the selected bridges have the highest suicide rates. What the heck !!

The project got funded !!

Finally the Paper

After completing my study, I wrote a paper and submitted it to the most respected civil engineering journal. All three reviewers agreed about the importance of my study and replied using one-word technical jargon that I did not understand.

Reviewer 1: LOL.
Reviewer 2: LOL.
Reviewer 3: LOL.

My civil engineer friend working on construction projects helped me out by sending me a longer email explanation.

Dude,

Just like a computer program moves data from hard-disk to memory and back to hard-disk, a bridge moves people around. Before constructing a bridge, we collect an extensive amount of field data, construct a model of expected traffic flow and design the bridge. That design includes various other factors, including cost and margin of error, that you did not pay attention to.

You can surely drive up and down the bridges a few times and write travel notes about how cold Golden Gate bridge is during mid-summer, but I do not see what your paper will do to the technical aspect of our field.

I disagree, because as a user, I have a right to benchmark the bridges and as a researcher, I have a right to publish those results in a scientific journal. I am wondering whether to pay PLOS One to publish my paper or send it to open access site homolog.us and save $1350.

Legal Disclaimer

The above story is fictitious and is not related to the growing number of bioinformatics papers trying to ‘benchmark’ various software programs as black boxes without any discussion about the underlying algorithms.

An Algorithmic Comparison of BLASR/BWA-MEM, DALIGN and MHAP

The problem of assembling a large number of noisy long reads is expected to show up, no matter whether one uses PacBio or nanopore long read technology. The good news is that it is possible to do the assembly and the quality is a lot better than what can be achieved with short reads or even Sanger reads at far higher cost. The latest MHAP paper has an excellent demonstration of this point regarding the Drosophila genome.

With that knowledge, let us look at the biggest informatics challenge at hand, which is how to align many read pairs against each other. To visualize the scale of the problem, if one plans to assemble a human genome from 5 kb reads with 50x coverage, one will have to work with about 30 million reads. Aligning those 30 million reads all against all is the most time-consuming step in the assembly. That means 30 million x 30 million = 900 trillion alignments of very noisy reads, where Smith-Waterman is the most viable approach.
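The arithmetic behind those numbers can be spelled out in a few lines. The genome size is rounded to 3 billion bases, which is an assumption made here for a back-of-the-envelope estimate.

```python
GENOME_SIZE = 3_000_000_000  # bases, rounded human genome size (assumed)
COVERAGE = 50                # fold coverage
READ_LENGTH = 5_000          # bases per read

n_reads = GENOME_SIZE * COVERAGE // READ_LENGTH
all_vs_all = n_reads * n_reads                 # every read against every read
unordered_pairs = n_reads * (n_reads - 1) // 2 # each distinct pair counted once

print(f"{n_reads:,}")          # 30,000,000
print(f"{all_vs_all:,}")       # 900,000,000,000,000
print(f"{unordered_pairs:,}")  # 449,999,985,000,000
```

Counting each pair only once halves the total, but that does not change the conclusion: the number of pairs is far too large to align exhaustively.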

How to do 900 trillion alignments of long read pairs? The answer is simple – don’t do it, and that is where the main innovation of BLASR/BWA-MEM, DALIGN and MHAP comes in. All three methods pick only a subset of read pairs to do the actual alignment, and that subset is selected based on the sharing of an unusually large number of k-mers between read pairs. The difference between the algorithms is in how they find those shared k-mers. To compare the time-performances of the algorithms, one needs to understand both the mathematical aspects of the respective methods and the hardware-related issues, because a large amount of data is being swapped between hard-disk, memory and processor during the course of execution.

DALIGN

Gene Myers’ DALIGN uses the most straightforward method. It takes a subset of reads, finds all k-mers along with their locations, sorts them and identifies matching pairs. Sorting is the most time-consuming step here, and Myers uses the speed difference between L1 cache and RAM to build a clever sorting algorithm that keeps data within the L1 cache. That can give a 100x boost in access time over RAM, and moreover conflict between processors is not an issue, because each processor has its own L1 cache.
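The sort-and-match idea can be sketched as follows. This is only a toy illustration of the general approach, using Python's built-in sort rather than a cache-aware sort, and it is not DALIGN's actual implementation; the function names and the `min_shared` threshold are hypothetical.

```python
from itertools import groupby

def kmer_tuples(reads, k):
    """List every (k-mer, read id, position) triple in the read set."""
    out = []
    for rid, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            out.append((seq[pos:pos + k], rid, pos))
    return out

def shared_kmer_counts(reads, k):
    """Sort the triples so equal k-mers are adjacent, then count per read pair."""
    counts = {}
    tuples = sorted(kmer_tuples(reads, k))
    for _, group in groupby(tuples, key=lambda t: t[0]):
        rids = sorted({rid for _, rid, _ in group})
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                pair = (rids[i], rids[j])
                counts[pair] = counts.get(pair, 0) + 1
    return counts

def candidate_pairs(reads, k, min_shared):
    """Keep only read pairs sharing at least min_shared distinct k-mers."""
    counts = shared_kmer_counts(reads, k)
    return sorted(pair for pair, c in counts.items() if c >= min_shared)

reads = ["ACGTACGTTGCA", "CGTACGTTGCAA", "TTTTTTTTTTTT"]
print(candidate_pairs(reads, k=5, min_shared=3))  # [(0, 1)]
```

Only the pairs that survive this step would be passed on to the expensive alignment.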

BLASR/BWA-MEM

BLASR and BWA-MEM build a Burrows-Wheeler transform (BWT) and FM index of the collection of reads and then find shared k-mers by doing BWT-based perfect-match alignment.

There are three difficulties here compared to DALIGN.

(i) Unlike the human genome, where the reference is fixed, the BWT of the reads has to be constructed again and again for every read library, and that adds to the cost of alignment.

(ii) The BWT-based FM index is kept in RAM, and it randomizes the locations compared to the genomic locations. Therefore, the RAM access is effectively random and slows down the process.

(iii) If one has many processors aligning many reads in parallel, memory bandwidth becomes the major bottleneck, because each processor wants to align its own segment with the index in RAM.
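For readers who have not seen an FM index in action, here is a minimal sketch of exact-match counting by backward search. It builds the BWT naively by sorting rotations, which is fine for a toy string but is not how BLASR or BWA-MEM construct their indexes; all function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text += "$"  # end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(bwt_str):
    """Return (first, occ): first[c] counts characters smaller than c,
    occ[c][i] counts occurrences of c in bwt_str[:i]."""
    alphabet = sorted(set(bwt_str))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt_str:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ

def count_matches(pattern, first, occ, n):
    """Backward search: number of exact occurrences of pattern in the text."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Each step jumps to an unrelated part of the occ table (random access).
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("ACGTACGTTGCA")
first, occ = fm_index(transformed)
print(count_matches("ACGT", first, occ, len(transformed)))  # 2
```

The jumps in `lo` and `hi` at each step are the source of the effectively random memory access mentioned in point (ii).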

MHAP

MHAP creates a hash table of k-mers in a read and attempts to find other matching ones based on the number of hash collisions. I think that is essentially what their algorithm boils down to after you remove all the bells and whistles about the MinHash sketch. If my interpretation is correct, maybe one can even get rid of the MinHash sketch and simply use a one-dimensional hash.

Here is the problem. The number of comparisons in their approach will scale as N^2, whereas the scaling will be closer to N or N.log(N) for the other two approaches (ignoring the insignificant cost of BWT construction for BWA-MEM and BLASR). [Edit: the claim about N^2 is wrong, as Adam points out in the supplement.]

Am I missing something here?

———————–

Edit.

1. The author of MHAP responds:

[Screenshot of the MHAP author’s response]

2. I asked Gene Myers about this post, and he mentioned that there is a third component (‘verifier’) after ‘k-mer filtering’ and ‘alignment’ that my blog post did not mention. He also pointed out that the verifier takes the most time among all, and should be subject to algorithm development and optimization. Here is his full email -

One thing that isn’t correct though is that the majority of the time taken by DALIGN and all of these tools, is for computing and verifying an alignment between two reads once a pair of reads has been identified as possibly matching. That is, the computation is divided into a “filter” that as efficiently as possible identifies pairs of reads that might have a local/overlap alignment, followed by a “verifier” that determines if the pair actually share a local alignment and delivers the alignment.

The verifier takes 80%-90% of the time for DALIGN. MHAP and BLASR both spend a lot of effort focused on making the filter fast, which is not the main issue because most of the time is in the verifier. You do not describe the verifiers for the three methods and that is the important part of the problem ! Also the authors of MHAP and BLASR don’t seem to understand that this is where the optimization effort should be going.
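To make the filter/verifier split concrete, here is a minimal sketch of a verifier based on a plain Smith-Waterman local alignment score. It fills a table with one cell per pair of positions, which hints at why this step dominates the run time. The scoring values and the `min_score` threshold are arbitrary assumptions, and none of the three tools implements its verifier this way.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def verify(a, b, min_score):
    """Accept a candidate pair only if its local alignment score is high enough."""
    return smith_waterman_score(a, b) >= min_score

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 8
```

The work grows with the product of the two read lengths, so for reads of several kilobases every candidate pair that reaches the verifier is costly.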

Biology Student Faces 8 Years in Jail for Posting Scientist’s Thesis on Scribd

Remember TPP? Here is a good example of what ‘free trade’ will look like after all countries adopt the rules written by multinational corporations.

A Colombian biology student is facing up to 8 years in jail and a fine for sharing a thesis by another scientist on a social network.

Diego Gómez Hoyos posted the 2006 work, about amphibian taxonomy, on Scribd in 2011. An undergraduate at the time, he had hoped that it would help fellow students with their fieldwork. But two years later, in 2013, he was notified that the author of the thesis was suing him for violating copyright laws.

The ‘why’ part is explained in the following paragraph –

But according to prosecutors, the move was criminal. Colombian copyright law was reformed in 2006 to meet the stringent copyright protection requirements of a free trade agreement that the country signed with the United States. Yet while the US has few criminal penalties for copyright infringement, Colombia allows only for a few exceptions.

To learn more about TPP, check -

TPP: One More Attempt to End Internet Freedom after Failed SOPA/PIPA

New Bioinformatics Blog to Keep an Eye on – James Knight

James Knight (@knightjimr) joined Yale as a research scientist and director of bioinformatics, and also started a blog, where you can find a lot of useful information. For those who do not know him, he developed the Newbler assembler for 454 reads and also collaborated with Eugene Myers in the 90s, before Myers became world famous (or rather made Craig Venter world famous).

Graphs, Alignments, Variants and Annotations, pt. 1

Note: This post, and following posts, were triggered by the recent post by Heng Li, describing his proposed GFA format for assembly graph data. After a twitter request for comments, my first tweet was that this was the right person to do the format. My second tweet, after reading the post, was that the format “was a miss…” Heng rightly asked me to explain, and this begins that explanation.

In recent years, there has been a growing trend toward developing a standard file format for assemblies, specifically a graph-based format (since all of the assemblers’ internal data structures hold a graph of one sort or another). A number of years ago, it was the AMOS project that made the attempt, two to three years ago, it was FASTG, and now this GFA format proposal.

When the FASTG format was really being talked about at the several conferences two years ago, where by luck or by happenstance most of the lead authors of the major NGS assemblers were present, I came really close to commenting on the format, but refrained. The main reason is that I didn’t see anything wrong with the format itself…it does a very good job capturing the structure that nearly all of the assemblers build, and is a usable format for organizing the assemblies’ data. The thoughts in my mind were all about who was doing the design, and, in a related way, why the FASTA format that we all know and love (or don’t love) is called the FASTA format.

Graphs, Alignments, Variants and Annotations, pt. 2

Instead of heading towards the more theoretic graph design, the day after writing part 1 of what is turning out to be a series, I focused on concrete software changes that might answer the first question I posed in the previous post (“If you recast the problem as that of calling ‘annotated variants’, can you speed up the current pipelines?”), and that is what this post will be about. The more theoretic graph design will be coming, but I’m still thinking through Heng Li’s update describing previous work on graph algorithms, as well as a recent slide deck he presented (the picture of where he is coming from is becoming clearer…).

BWT Construction Parallelized in ‘Parabwt’

Last year, we commented that a large number of bioinformatics groups were working on constructing BWT from huge Illumina libraries. The group of Yongchao Liu, Thomas Hankeln and Bertil Schmidt (who previously worked on various GPU algorithms) was not mentioned, but they also joined the fray with their parabwt release.

ParaBWT is a new and practical parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data, which has a linear space complexity with a small constant factor. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome, on a workstation with two Intel Xeon X5650 hex-core CPUs and 96 GB RAM, running the Ubuntu 12.04 LTS operating system. Our performance comparison to FMDindex and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2 times faster than FMD-index, reducing the runtime from 26.56 hours to 12.34 hours for a sequence of about 60 billion nucleotides, and up to 99.0 times faster than Bwt-disk.

The paper is still forthcoming. Please be satisfied with the SourceForge page for the time being.

After HGAP And SPAdes Comes New PacBio Assembler – MHAP

Adam Phillippy and collaborators submitted a new paper to bioRxiv -

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.

A few comments -

1. Primary innovation

They use MinHash sketch.

In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997),[1] and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]

A large scale evaluation has been conducted by Google in 2006 [10] to compare the performance of Minhash and Simhash[11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling[12] and using Minhash and LSH for Google News personalization.[13]
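A minimal sketch of the idea, applied to k-mer sets of sequences: keep only the smallest hash value under each of several hash functions, and use the fraction of agreeing minima as an estimate of set similarity. Seeding BLAKE2 to simulate a family of hash functions is an assumption made here for illustration, and the names are hypothetical; MHAP's own implementation differs.

```python
import hashlib

def kmers(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_hash(kmer, seed):
    """One member of a hash family, simulated by prefixing a seed (assumption)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(seq, k, num_hashes):
    """MinHash sketch: the minimum hash value under each hash function."""
    ks = kmers(seq, k)
    return [min(seeded_hash(x, seed) for x in ks) for seed in range(num_hashes)]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    same = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return same / len(sketch_a)

def exact_jaccard(a, b, k):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

read_a = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACA"
read_b = read_a[5:] + "GGTCA"  # overlaps read_a, shifted by five bases
est = estimate_jaccard(sketch(read_a, 4, 200), sketch(read_b, 4, 200))
print(round(est, 2), round(exact_jaccard(read_a, read_b, 4), 2))
```

The sketch has a fixed size no matter how long the read is, so comparing two reads costs the same regardless of read length.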

Here is a figure from the paper, explaining how it works -

[Figure from the paper]

2. SPAdes vs MHAP comparison

Using PBcR-MHAP, microbial genomes can be completely assembled from long reads in roughly the same time required to generate incomplete assemblies from short reads. For example, PBcR-MHAP was able to accurately resolve the entire genome of E. coli K12 using 85X of SMRT reads in 4.6 CPU hours, or 20 minutes using a modern 16-core desktop computer. In comparison, the state-of-the-art [39] SPAdes assembler [40] required 4.1 CPU hours to assemble 85X Illumina reads from the same genome. Both short- and long-read assemblies are highly accurate at the nucleotide level (>99.999%), but the short-read assembly is heavily fragmented and contains more structural errors (Supplementary Table S4, Supplementary Fig. S3). Our initial SMRT assembly does contain more single-base insertion/deletion (Indel) errors, but polishing it with Quiver (requiring an additional 6.6 CPU hours) resulted in the lowest number of consensus errors of all assemblies (11 vs. 96 for SPAdes).

3. Assembly cost

Exponentially lower costs have democratized DNA sequencing, but assembling a large genome still requires substantial computing resources. Cloud computing services offer an alternative for researchers that lack access to institutional computing resources. However, the cost of assembling long-read data using cloud computing has been prohibitive. For example, using Amazon Web Services (AWS), the estimated cost to generate the D. melanogaster PBcR-BLASR assembly is over $100,000 at current rates, an order of magnitude higher than the sequencing cost. With MHAP, this cost is drastically reduced to under $300. To expand access to the PBcR-MHAP assembly pipeline, we have provided a free public AWS image as well as supporting documentation for non-expert users that reproduces the D. melanogaster assembly presented here in less than 10 hours using AWS. Allocating additional compute nodes, which would marginally increase costs, could further reduce assembly time. For E. coli, the total cost of PBcR-MHAP assembly and Quiver polishing is currently less than $2. With MHAP, assembly costs are now a small fraction of the sequencing cost for most genomes, making long-read sequencing and assembly more widely accessible.

4. Overall assessment

a) The introduction of the MinHash sketch in assembly is very innovative. Also, it is very helpful that they gave a full demonstration of their assembly technique for small and large genomes.

b) Aligner comparison -

In addition to speed, MHAP is a highly sensitive overlapper. We evaluated the sensitivity and specificity of MHAP versus BLASR [32], the only other aligner currently capable of overlapping SMRT reads. BWA-MEM, SNAP, and RazerS were also evaluated, but their current versions were unable to reliably detect noisy overlaps (Supplementary Note 2).

Not sure why they overlooked DALIGN. Also, BWA-MEM is tuned to near perfect alignment (k=19) and a small change in parameter will give what they are looking for. So, “their current versions were unable to reliably detect noisy overlaps” can be easily and painlessly corrected by sending a quick email to Heng Li.

c) The cost section is somewhat sloppy and is probably written for the marketing department of PacBio, not a serious audience. Why is the cloud cost of E. coli given with Quiver and Drosophila without Quiver? Also, what can one project from $300 for Drosophila to the cloud cost of assembling a human-sized genome using the same method? Does it grow as O(N) or O(N^2) with respect to size? We do not need exact numbers, but the order of growth of assembly time with genome size is something any user will look for.

Edit. A comment from @aphillippy

[Screenshot of the comment]

d) With regard to the general field of bioinformatics, it is indeed great news that assembling complete genomes, which is one of the long-standing problems, is going to be solved satisfactorily. That means bioinformaticians will have to rethink their strategy for the future -

Bioinformatics at a Crossroad Again – Which Way Next?

Who Has Legal Control Over Your ‘Big’ Data?

The Biggest Lesson from Microsoft’s Recent Battle with the US Government

A court ruling involving Microsoft’s offshore data storage offers an instructive lesson on the long reach of the US government – and what you can do to mitigate this political risk.

A federal judge recently agreed with the US government that Microsoft must turn over its customer data that it holds offshore if requested in a search warrant. Microsoft had refused because the digital content being requested physically was located on servers in Ireland.
Microsoft said in a statement that “a US prosecutor cannot obtain a US warrant to search someone’s home located in another country, just as another country’s prosecutor cannot obtain a court order in her home country to conduct a search in the United States.”

The judge disagreed. She ruled that it’s a matter of where the control of that data is being exercised, not of where the data is physically located.

This ruling is not at all surprising. It’s long been crystal clear that the US will aggressively claim jurisdiction if the situation in question has even the slightest, vaguest, or most indirect connection. Worse yet, as we’ve seen with the extraterritorial FATCA law, the US is not afraid to impose its own laws on foreign countries.

One of the favorite pretexts for a US connection is the use of the US dollar. The US government claims that just using the US dollar – which nearly every bank in the world does – gives it jurisdiction, even if there were no other connections to the US. It’s quite obviously a flimsy pretext, but it works.

Recently the US government fined (i.e., extorted) over $8 billion from BNP Paribas for doing business with countries it doesn’t like. The transactions were totally legal under EU and French law, but illegal under US law. The US successfully claimed jurisdiction because the transactions were denominated in US dollars – there was no other US connection.