Highlights of Gene Myers’ Talk at #AGBT15

We covered Myers’ talk at AGBT last year (Dazzler Assembler for PacBio Reads – Gene Myers), where he announced the Dazzler assembler. This time you can view his talk through live-streaming, thanks to PacBio.

His first talk this year has two major highlights.

1. Perfect assembly

Myers argued that perfect assembly is possible as long as the average read length is larger than the longest repeat. To quote him – “repeats less than 13kb is solved in pacbio. 13kb from median reads length minus 2 flanking lengths.” Error rate is not a factor.
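The arithmetic behind that quote can be sketched in a few lines. This is only our illustration of the "read length minus two flanking lengths" rule. The function name and the example values (a 15 kb read and 1 kb flanks) are hypothetical, chosen so that the result matches the 13 kb in the quote. They are not figures from the talk.

```python
def longest_resolvable_repeat(read_length: int, flank_length: int) -> int:
    """Longest repeat a single read can span while still keeping
    `flank_length` bases of unique sequence on each side (hypothetical helper)."""
    return max(0, read_length - 2 * flank_length)


# Hypothetical values, NOT taken from the talk: 15 kb read, 1 kb flank on each side.
print(longest_resolvable_repeat(15_000, 1_000))
```

A repeat longer than this value cannot be anchored on both sides by a single read, which is why the argument depends on read length rather than on error rate.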



In this context, readers are encouraged to take a look at various papers (e.g. Bresler et al. and more recent Shomorony et al.) from David Tse’s group at Berkeley.

2. Release of DAscrubber

This year, he is planning to release the second module of the DAZZLER pipeline, which will be used to clean reads. Here is his slide on various existing assembly pipelines.


Regarding the Daligner released last year, he said “I pulled every trick I ever learned” to get a 25x to 40x speed improvement over BLASR.

We will add to this post after his talk tomorrow.


Gene Myers’ second talk described the DAscrubber module and proposed a new, efficient standard for storing alignment data in place of the BAM format.


The purpose of this step is to clean the reads before finishing the final assembly.

On what DAscrubber does

“DAscrub “TracePoint” really cleans up data; removes hairpins; patches gaps; finds chimeras; etc”

“Scrubber favors editing rather than not editing”

“using intrinsic QV (not from the instrument, but from other reads)”

“99% reads are now perfect; double-scrubbing increases this to 99.99%”.

On rejection of bam format

“encoding the edit script (BAM) is hugely space inefficient but computing alignment as needed is hugely time inefficient”

“new standard of sequencing file : trace of each point + difference between successive points. Space & time efficient” [as opposed to bam files]

“Keep distance between points, keep # of differences, both space and time efficient, encoding alignments. 60x faster”

“describes the ideas of “Trace point” to compress alignment data for efficiency, allowing reconstruction alignment in linear time”
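The trace point idea in the quotes above can be sketched roughly as follows. This is our own minimal illustration, not the DALIGNER implementation or its file format. The op alphabet (`=` match, `X` mismatch, `D` base only in A, `I` base only in B), the function names and the spacing of 4 are all assumptions made for the example. Instead of storing the full edit script, we keep one small pair per fixed-width window of read A (the number of B bases and the number of differences) and rebuild the alignment by realigning each window on demand.

```python
def encode_trace(ops, a_start, spacing):
    """Compress an edit script into (b_bases, diffs) pairs, one per A-window.

    A window is closed when the A coordinate sits on a multiple of `spacing`
    and another A base is about to be consumed.
    """
    trace, a_pos, a_len, b_len, diffs = [], a_start, 0, 0, 0
    for op in ops:
        consumes_a = op in "=XD"
        if consumes_a and a_len > 0 and a_pos % spacing == 0:
            trace.append((b_len, diffs))
            a_len = b_len = diffs = 0
        if consumes_a:
            a_pos += 1
            a_len += 1
        if op in "=XI":
            b_len += 1
        if op != "=":
            diffs += 1
    trace.append((b_len, diffs))
    return trace


def align(a, b):
    """Plain edit-distance DP with traceback; returns (ops, diffs)."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append("=" if a[i - 1] == b[j - 1] else "X")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return "".join(reversed(ops)), dist[n][m]


def decode_trace(a, b, a_start, a_end, b_start, trace, spacing):
    """Rebuild an edit script by realigning each window independently."""
    first = (a_start // spacing + 1) * spacing
    cuts = [a_start] + list(range(first, a_end, spacing)) + [a_end]
    ops, b_pos = [], b_start
    for (lo, hi), (b_len, _diffs) in zip(zip(cuts, cuts[1:]), trace):
        seg_ops, _ = align(a[lo:hi], b[b_pos:b_pos + b_len])
        ops.append(seg_ops)
        b_pos += b_len
    return "".join(ops)


# Toy example (made-up sequences): one mismatch and one deleted base.
a, b = "ACGTACGTAC", "ACGAACGTC"
ops = "===X====D="
trace = encode_trace(ops, 0, 4)
print(trace)
print(decode_trace(a, b, 0, len(a), 0, trace, 4))
```

Because every window has a fixed width in A, each small realignment costs a roughly constant amount of work, so the total cost grows linearly with alignment length. The realigned windows are optimal but may break ties differently from the original edit script. The stored difference counts are not used when decoding in this sketch; a real implementation could use them to limit the search.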

Pacbio Talks at #AGBT – What to Expect

About a year and a half ago, I wrote a two-part blog post that started with the following comment. The full posts are linked below.

I will go out on a limb and make a bold call. The world of genomics is on the verge of seeing another set of major transformations, and many algorithms, tools, pipelines and methodologies developed for short reads over the last 3-4 years will be useless. In my opinion, the era of short-read sequencing is reaching a peak, or to be kind to its users, short read technologies are shining like the full moon. Related to peaking of the short read era, we will see two other changes – (i) end of “genome sequencing and genome paper” era and (ii) end of big data bioinformatics. For further explanation of the last sentence, please read the detailed explanation in the later part of the commentary.

End of Short-Read Era? – (Part I)

End of Short-Read Era? – (Part II)

Nobody believed me at that time, but things are changing in small ways and big ways. As an anecdote of small change, a friend of mine showed me the reviews of his recent proposal, where he requested funds for genome assembly using Illumina PE and mate-pair reads. The reviewer asked him to get rid of mate pairs and use PacBio reads. Speaking of big change, you have probably noticed the words ‘perfect assembly’ in the title of Gene Myers’ talk.

On this developing story, here is what I expect to hear from the talks at Pacbio.


Two battles have been going on on the algorithmic front, and both of them can be described as ‘Gene Myers vs Gene Myers’.

The first one is on alignment, where the goal was to do away with the BWT and suffix array (SA) to save time compared with BLASR. The DALIGNER built by Myers solved that problem.

In DALIGN Paper, Gene Myers Delivers a Major Blow to His Biggest Competitor

The second goal is to replace the Celera assembler with a more direct approach for building the string graph and assembling from it. Jason Chin’s Falcon does that, and I expect Myers to announce a similar algorithm in his talk.


What is the best application of long reads that can place them way ahead of the various short-read technologies? I display a few random tweets from AGBT to show you where the short-read people are stuck.



As you can see, they are so bogged down with finding indels and solving coverage issues that developing algorithms to find the sequences of two independent chromosomes would take a long time. The words ‘diploid’ or ‘highly polymorphic’ rarely enter the short read bioinformatics literature. So, in a masterful stroke, Pacbio is focusing on pulling out two copies of the chromosomes and a disease where the information matters.

Oxford Nanopore – are they going to compete?

Previously, I have been critical of Oxford Nanopore, and the main drawback, as I argued, was the absence of anyone showing a nanopore-only genome assembly. As I discussed, the entire premise of carrying a USB-stick-sized sequencer falls apart if you also have to carry a MiSeq along. Now that Jared Simpson and collaborators have completed an E. coli assembly from nanopore data only, the previous objection no longer holds. So, that is definitely a major step ahead IMHO.

Now the question of which technology is better will be decided based on quality+cost, with portability being an added benefit of nanopore sequencing. Will servers win in the long run or iPhones? Will servers win in the long run or Palm Pilots? As you know, SGI servers are out of business, Dell servers are hot, the iPhone has been a winner and Palm Pilots can be found in junkyards. So, do not buy/sell stocks based on anything you hear in this blog. I can only discuss where things stand on the technological front.

Perspectives on rare diseases – from a patient and a researcher

Feb 28th is Rare Disease Day. What role will genomic technologies play in the detection and cure of rare diseases?

The daughter of Ken Weiss and Anne Buchanan, the authors of The Mermaid’s Tale blog, wrote a post that readers will find thought-provoking, on her experience with hypokalemic periodic paralysis, a neuromuscular disease.

The story of a rare disease

Despite being the product of two of the authors of this blog – two people skeptical about just how many of the fruits of genetic testing that we’ve been promised will ever actually materialize – I have been involved in several genetic studies over the years, hoping to identify the cause of my rare disease.

February 28 is Rare Disease Day (well, Feb 29 technically; the last day of February which is, every four years, a rare day itself!); the day on which those who have, or who advocate for those who have, a rare disease publicly discuss what it is like to live with an unusual illness, raise awareness about our particular set of challenges, and talk about solutions for them.

I have hypokalemic periodic paralysis, which is a neuromuscular disease; a channelopathy that manifests itself as episodes of low blood potassium in response to known triggers (such as sodium, carbohydrates, heat, and illness) that force potassium from the blood into muscle cells, where it remains trapped due to faulty ion channels. These hypokalemic episodes cause muscle weakness (ranging from mild to total muscular paralysis), heart arrhythmias, difficulty breathing or swallowing and nausea. The symptoms may last only briefly or muscle weakness may last for weeks, or months, or, in some cases, become permanent.

I first became ill, as is typical of HKPP, at puberty. It was around Christmas of my seventh grade year, and I remember thinking to myself that it would be the last Christmas that I would ever see. That thought, and the physical feelings that induced it, were unbelievably terrifying for a child. I had no idea what was happening; only that it was hard to breathe, hard to eat, hard to walk far, and that my heart skipped and flopped all throughout the day. All I knew was that it felt like something terrible was wrong.

Throughout my high school years I continued to suffer. I had numerous episodes of heart arrhythmia that lasted for many hours, that I now know should’ve been treated in the emergency department, and that made me feel as if I was going to die soon; it is unsettling for the usually steady, reliable metronome of the heart to suddenly beat chaotically. But bound within the privacy teenagers are known for, my parents struggled to make sense of my new phobic avoidance of exercise and other activities as I was reluctant to talk about what was happening in my body.

HKPP is a genetic disease and causal variants have been found in three different ion channel genes. Although my DNA has been tested, the cause of my particular variant of the disease has not yet been found. I want my mutation to be identified. Knowing it would likely not improve my treatment or daily life in any applicable way. I’m not sure it would even quell any real curiosity on my part, since, despite having the parents I have, it probably wouldn’t mean all that much to this non-scientist.

Continue reading here.

You may also read their following related post from Feb 27 of last year. It seems like

Genome sequencing for rare diseases


On the other end of the spectrum are the dedicated researchers looking for diagnosis and cure for rare diseases. I had a few email discussions with Dr. Gholson Lyon of CSHL and am quite fascinated to hear about the research he is doing.


Before I go into the specifics of his research, here is something I learned from his comments regarding the difficulties in working on rare diseases. Each disease affects only a small number of people, and that means a researcher needs to have breadth of knowledge across different fields to work effectively. If he is an expert in only one area of the work (let’s say bioinformatics or large-scale sequencing or cell biology or biochemistry), it is very unlikely that he will find another expert dedicated to studying another aspect of the same rare disease. To a large extent, the same is true for researchers working on rare (non-model) organisms, and I see similar dedication among my collaborators working on electric fish.

Dr. Lyon identified a rare disease among a small group of residents in Ogden, Utah and he tentatively named it Ogden syndrome. The following interview covers his personal account of the discovery.

Lyon, Gholson J. (2011) Personal account of the discovery of a new disease using next-generation sequencing. Interview by Natalie Harrison. Pharmacogenomics, 12(11) pp. 1519-1523.

His full bio from his lab website is given below. You can find his publications here.

Gholson Lyon’s lab focuses on analyzing human genetic variation and its role in severe neuropsychiatric disorders and rare diseases, including Tourette syndrome, attention-deficit hyperactivity disorder (ADHD), obsessive compulsive disorder (OCD), intellectual disability, autism, and schizophrenia. By recruiting large groups of related individuals living in the same geographic location (e.g., Utah), Lyon’s lab can study the breadth and depth of genetic variants in a similar environmental background. Using the exome—the parts of the genome that code for protein—and whole-genome sequencing, his lab looks for mutations that segregate with syndromes in the various populations. A second focus of the Lyon lab is to study the mechanistic basis of a new rare disease that they described in 2011. This is the first human disease involving a defect in the amino-terminal acetylation of proteins, a common modification of eukaryotic proteins carried out by amino-terminal acetyltransferases (NATs). The team has been using several different cellular model systems to better understand the disease pathophysiology and the basic process of amino-terminal acetylation. This year, Lyon collaborated with a team of researchers from other universities and companies to use precision medicine to successfully treat a patient with severe OCD. His symptoms were treated with deep brain stimulation, and the team used whole-genome sequencing to try to understand the molecular basis of his disease. The patient experienced significant relief from his symptoms and his quality of life returned, suggesting that similar methods may hold tremendous promise in the future.

Pevzner and Compeau Are Splitting Their Bioinformatics Course into Three

Here is great news for busy people interested in learning bioinformatics from the masters. This year, Pevzner and Compeau are splitting their Bioinformatics I course on Coursera into three independent courses. Self-learners can also try the YouTube videos from last year here, and use the problem-solving tool Rosalind. We reviewed their excellent book here.

The three courses are listed below –

Finding Hidden Messages in DNA (Bioinformatics I)

This course begins a series of classes illustrating the power of computing in modern biology. Please join us on the frontier of bioinformatics to look for hidden messages in DNA without ever needing to put on a lab coat. After warming up our algorithmic muscles, we will learn how randomized algorithms can be used to solve problems in bioinformatics.

Assembling Genomes and Sequencing Antibiotics (Bioinformatics II)

Biologists still cannot read the nucleotides of an entire genome or the amino acids of an antibiotic as you would read a book from beginning to end. However, they can read short pieces of DNA and weigh small antibiotic fragments. In this course, we will see how graph theory and brute force algorithms can be used to reconstruct genomes and antibiotics.

Comparing Genes, Proteins, and Genomes (Bioinformatics III)

After sequencing genomes, we would like to compare them. We will see that dynamic programming is a powerful algorithmic tool when we compare two genes (i.e., short sequences of DNA) or two proteins. When we “zoom out” to compare entire genomes, we will employ combinatorial algorithms.

Everything You Need to Know to Follow #AGBT15


The AGBT conference is taking place on beautiful Marco Island, Florida, from Feb 25 to Feb 28, 2015.


My friend Keith Robison, who writes the Omics!Omics! blog, posted an excellent preview of the conference, listing what to expect (and not expect). You can read it here.

Keith is very up-to-date and knowledgeable about the latest genomics technology, and his blog was selected by our readers and judges as the ‘Best of 2013’. Here are two other informative posts from Keith related to AGBT15.

Illumina Launches NeoPrep (#agbt15)

Can Ion Torrent Buzz Again?

Conference information

You can see the talks and posters at this website. One nice aspect of AGBT is that they allow outsiders to see all abstracts on their website. For some strange reason, CSHL conferences have not mastered this technology.

On twitter

You can follow the conference on Twitter using the hashtag #AGBT15, or simply click on this link. A large number of attendees are live-tweeting the talks. In fact, this year they have made tweeting the default option for all talks, unless the speaker requests otherwise.


Is perfect bioinformatician possible?

I am not attending the conference, but if I wanted to list my reasons for being there, listening to Gene Myers would have been at the top. Over the last year, I read almost all of his papers going all the way back to the 1980s and found them to be incredibly good.

As you may have noticed, Gene Myers came back to the topic of genome assembly last year to revolutionize the field again.

Myers has two talks –

Is Perfect Assembly Possible?
Gene Myers, Ph.D.
Founding Director, Systems Biology Center, Max Planck Institute

Saturday, February 28

Low Coverage, Correction-Free Assembly for Long Reads
Gene Myers, Max Planck Institute – CBG
11:20 a.m. – 11:40 a.m.
Plenary Session: Genomics II
Islands Ballroom


Pacbio workshop and live-streaming

If you are not at the conference, you can still view Gene Myers’ talk and a number of other talks through live-streaming by PacBio. It requires registration at the Pacbio website.

Pacbio organized a fantastic line-up of speakers for their workshop. They are listed below, and you can see more details on posters from their website.


Palms Ballroom
Friday, February 27, 12:00 p.m. – 2:00 p.m.
Lunch will be served

Towards Comprehensive Genomics – Past Present and Future
Introduction by Michael Hunkapiller
President and Chief Executive Officer, Pacific Biosciences

The Human Genome: From One To One Million
J. Craig Venter, Ph.D.
Co-founder, Chairman, and Chief Executive Officer, Human Longevity Inc.

Is Perfect Assembly Possible?
Gene Myers, Ph.D.
Founding Director, Systems Biology Center, Max Planck Institute

Finishing Genomes: Why Does It Matter?
Deanna Church, Ph.D.
Senior Director of Genomics and Content, Personalis

De Novo Assembly of a Human Diploid Genome for the Asian Genome Project
Jeong-Sun Seo M.D.,Ph.D.
Director Genomic Medicine Institute, Seoul National University College of Medicine
Founder and Chairman, Macrogen Inc.

PacBio Long Read Sequencing and Structural Analysis of a Breast Cancer Cell Line
W. Richard McCombie, Ph.D.
Professor, Cold Spring Harbor Laboratory


Thursday, February 26

Resolving the Complexity of Human Genetic Variation by Single-Molecule Sequencing
Evan Eichler, University of Washington
9:00 a.m. – 9:30 a.m.
Plenary Session: Genomics I
Islands Ballroom

Anchored Assembly: Accurate Structural Variant Detection Using Short-Read Data
Jeremy Bruestle, Spiral Genetics
8:30 p.m. – 8:50 p.m.
Concurrent Session: Informatics
Salons E & F

Sequencing-Based Approaches for Genome-Scale Functional Annotation
Matthew Blow, D.O.E. Joint Genome Institute
8:50 p.m. – 9:10 p.m.
Concurrent Session: Biology
Islands Ballroom

Friday, February 27

Neural Circular RNAs are Derived from Synaptic Genes and Regulated by Development and Plasticity
Wei Chen, Max-Delbruck-Centrum (MDC)
8:50 p.m. – 9:10 p.m.
Concurrent Session: Transcriptomics
Salons G through J

A Genome Assembly of the Domestic Goat from 70x Coverage of Single Molecule Real Time Sequence
Tim Smith, U.S. Meat Animal Research Center
8:50 p.m. – 9:10 p.m.
Concurrent Session: Technology
Islands Ballroom

PacBio Application – Influenza Viral RNA-Seq
Amy Ly, The Genome Institute at Washington University
9:10 p.m. – 9:30 p.m.
Concurrent Session: Technology
Islands Ballroom

Saturday, February 28

Low Coverage, Correction-Free Assembly for Long Reads
Gene Myers, Max Planck Institute – CBG
11:20 a.m. – 11:40 a.m.
Plenary Session: Genomics II
Islands Ballroom

E. coli Genome Assembled using Nanopore Data Only

Finally –

A complete bacterial genome assembled de novo using only nanopore sequencing data

A method for de novo assembly of data from the Oxford Nanopore MinION instrument is presented which is able to reconstruct the sequence of an entire bacterial chromosome in a single contig. Initially, overlaps between nanopore reads are detected. Reads are then subjected to one or more rounds of error correction by a multiple alignment process employing partial order graphs. After correction, reads are assembled using the Celera assembler. We show that this method is able to assemble nanopore reads from Escherichia coli K-12 MG1655 into a single contig of length 4.6Mb permitting a full reconstruction of gene order. The resulting assembly has 98.4% nucleotide identity compared to the finished reference genome.

Github repos are available here and here.

They are using DALIGNER by Gene Myers, as we suggested some six months back. We wonder why the HGAP pipeline did not work, given that it would have been the easiest solution.

Details of the method –

(i) Four MinION runs were used and only the 2D reads were considered. Average read length – 4kb-8kb. Accuracy – 78-85%.

“In total, 22,270 2D reads were used comprising 133.6Mb of read data, representing ˜29x theoretical coverage of the 4.6 megabase E. coli K-12 MG1655 reference genome.”

(ii) Assembly method – DALIGNER –> multiple rounds of POA (a correction tool using a method similar to pbdagcon) –> Celera assembler.

The reference given for POA is – “Lee, C., Grasso, C. & Sharlow, M. F. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002).”

(iii) Entire genome was captured in one contig.
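As a quick sanity check on the numbers quoted in (i), the coverage and the mean read length can be recomputed from the figures in the paper. The helper names below are ours and only illustrate the arithmetic.

```python
def theoretical_coverage(total_read_bases: float, genome_size: float) -> float:
    """Average number of times each genome base is covered by read data."""
    return total_read_bases / genome_size


def mean_read_length(total_read_bases: float, n_reads: int) -> float:
    """Average read length implied by the total bases and the read count."""
    return total_read_bases / n_reads


# Figures quoted from the paper: 22,270 2D reads, 133.6 Mb of data, 4.6 Mb genome.
total_bases = 133.6e6
genome_size = 4.6e6
n_reads = 22_270

print(round(theoretical_coverage(total_bases, genome_size), 1))
print(round(mean_read_length(total_bases, n_reads)))
```

The coverage comes out at about 29x, matching the quote, and the implied mean read length of roughly 6 kb sits inside the 4kb-8kb range mentioned above.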

Overall, a clean and well-written paper.

Also coming –



Criticisms of the assembly paper have started appearing.



Grab your popcorn. It is going to be fun !!

The Conspiracy of Epigenome?


“How does this conspiracy of genes work?” said Eric Lander to describe the latest epigenome project to New York Times reporter Gina Kolata. Conspiracy seems like a curious choice of word, because we are unsure which definition applies here.

1. An agreement to perform together an illegal, wrongful, or subversive act.
2. A group of conspirators.
3. Law – An agreement between two or more persons to commit a crime or accomplish a legal purpose through illegal action.
4. A joining or acting together, as if by sinister design: a conspiracy of wind and tide that devastated coastal areas.

We would have had to scratch our heads less if Lander had used ‘conspiracy of epigenome’ instead. Nothing managed to derail this expensive boondoggle over the last four years, including powerful critics of the scientific principle behind it, a fraud allegation against the leader, public humiliation of its sister project ENCODE, NIH cost-cutting and protests by scientists, and so on. What appears even more puzzling is that despite all those prior events, the ‘leaders’ of the epigenome project ended up making the same mistakes as ENCODE. Don’t these clowns learn anything?

Speaking of history of the epigenome project –

1. When the original project was launched, Mark Ptashne, Oliver Hobert and Eric Davidson disputed the scientific validity of the project using very strong words (Questions over the scientific basis of epigenome project). In the field of gene regulation, each of them is important in his own right. Oliver Hobert works on gene regulatory mechanisms of the nervous system, and his 2011 Cell paper – ‘Transgenerational Inheritance of an Acquired Small RNA-Based Antiviral Response in C. elegans’ shows one rare example of epigenetic mechanisms in animals. Eric Davidson is among the most respected developmental biologists, and works on gene regulatory mechanisms in the early embryo. Mark Ptashne was the first scientist to demonstrate DNA-protein binding, in his 1967 paper “Specific Binding of the λ Phage Repressor to λ DNA”. Criticism from even one of them would be enough to raise eyebrows about a scientific project, but the epigenome project mysteriously managed to survive all three.

They were not the only ones. Ptashne et al wrote –

A letter signed by eight prominent scientists (not including us), and an associated petition signed by more than 50, went into these matters in greater detail, and expressed serious reservations about the scientific basis of the epigenome project.

2. In another unexpected blow, Manolis Kellis, the leader of the epigenome project, was accused of fraud by Berkeley mathematician Lior Pachter (check here and here). In the scientific world, a fraud allegation is far more serious than someone’s scientific theory being called wrong, yet nothing happened in this case. Given that Pachter is a very good and well-respected scientist, why was his case completely ignored?

3. ENCODE, its sister project, was publicly shamed for its media splash and hyped up claims by respected population geneticist Dan Graur. (Check “On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE”). The ENCODE-backers called Graur’s ‘tone’ rude, but respected scientists like Gary Ruvkun said that he was too gentle and used even stronger words. Here is the comment Ruvkun left in Graur’s blog.

I have enjoyed your devastating critiques of Encode and ModEncode. I forwarded your “On the immortality of television sets” to dozens. But you are too gentle here. I get Nature now for free, since I review quite a bit for them—it is a sign of the decline of journals and the decline of the science that they publish that they now give it away to their reviewers—sort of like the discounted subscriptions for magazines which sell high end merchandise if you live in the right zip code. So, I was well rested today and decided to give the 5 papers of the modEncode a chance to blow me away. Not one interesting finding. Not one. Worse than junk food, which at least satisfy in the moment. Hundreds of millions of dollars on genome scale observations of transcripts and transcription factor binding. Grind up a whole animal and look at the average of hundreds of cell types pooled together. A few weak correlations between this and that transcription factor. A few differentially spliced mRNAs. I would say that about 1% of the modEncode budget was appropriate for getting better transcript annotation, but the rest was misguided genomics. A waste of research dollars! The so-called brain trust of the NIH does not understand the difference between big science that is forever—the thousands of genome sequences that are precise to one part in a billion and will be used for the rest of time—and big science datasets that will simply fade away because they are so imprecise and uninteresting to anyone, like RNA seq. On the other hand, the good news about projects such as these is that the hundreds of authors on those papers are distracted from serious science, and the bizarre sexiness of these papers attracts other marginal intellects to those fields, and thus fields that are important becomes much less competitive, ignoring the bloated budgets of these Soviet-style five-year plans.

4. ENCODE made a big climb-down from its position that 80% of the human genome was functional. Given that this was supposed to be ENCODE’s earth-shattering discovery, ENCODE appeared to have discovered nothing and ended up being the most expensive ‘resource paper’. The statistical paper backing ENCODE’s primary claim was strongly disputed by Nicolas Bray and Lior Pachter.

Ward and Kellis (Reports, September 5 2012) identify regulatory regions in the human genome exhibiting lineage-specific constraint and estimate the extent of purifying selection. There is no statistical rationale for the examples they highlight, and their estimates of the fraction of the genome under constraint are biased by arbitrary designations of completely constrained regions.

Note the name Kellis there? He is the leader of the epigenome project.

5. After the huge waste of $350M by ENCODE, scientists are up in arms against all such expensive projects. The controversy section of ENCODE’s wiki page is now bigger than every other section, and you cannot imagine what scientists say in private.


Despite all this, the $250M epigenome project came out with another media splash and public claim of major discovery, as if ENCODE never happened. Reuters reports –

Scientists unveil map of ‘epigenome,’ a second genetic code

A second genetic code???? Please note that these articles are usually handed to reporters by a university’s internal press team. So, we should not blame the reporters for misunderstanding the science.

If you do not believe it, here is the leader of the project showing he is no better.

“A lifetime of environmental factors and lifestyle factors” influence the epigenome, including smoking, exercising, diet, exposure to toxic chemicals and even parental nurturing, Kellis said in an interview. Not only will scientists have to decipher how the epigenome affects genes, they will also have to determine how the lives people lead affect their epigenome.

Why not add victims of the Holocaust and slavery as well, like the paper mentioned in this post? Needless to say, not a single claim is backed by any science. We have gone through many of the relevant studies, and they were often based on poor-quality association studies of 30 or 40 persons with no further study of causal mechanism.

The article further says –

“The only way you can deliver on the promise of precision medicine is by including the epigenome,” said Manolis Kellis of the Massachusetts Institute of Technology, who led the mapping that involved scientists in labs from Croatia to Canada and the United States.

Given that ‘precision medicine’ and ‘epigenome’ are both vaporware, there is no doubt that one will help with delivery of the promise of the other.

In this entire media circus, the only truth is spoken by Anshul Kundaje (emphasis ours) –

But while the researchers are confident that their discoveries will be revelatory, they also see a long road ahead. They will find circuits, another author, Anshul Kundaje, an assistant professor of genetics at Stanford said. But, he added: “Making sense of them is a whole different story.”

In other words, nobody understands what is going on, but they all agree that they made a major discovery (‘second genetic code’). Apart from resorting to conspiracies, as Lander did, can anyone provide a sane explanation of how this freak show is sustained?

Occasionality or Probability – What is the Right Term?

The concept of probability comes from mathematics and it has a rigorous mathematical definition. It is a measure of the distribution of outcomes of truly independent trials. Wikipedia describes it as

When dealing with experiments that are random and well-defined in a purely theoretical setting (like tossing a fair coin), probabilities can be numerically described by the statistical number of outcomes considered favorable divided by the total number of all outcomes (tossing a fair coin twice will yield head-head with probability 1/4, because the four outcomes head-head, head-tails, tails-head and tails-tails are equally likely to occur).
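The coin example in that definition is simple enough to check by brute-force enumeration. The short sketch below, with variable names of our own choosing, counts the favorable outcomes among all equally likely outcomes of two fair tosses.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing a fair coin twice.
outcomes = list(product(["head", "tails"], repeat=2))

# Outcomes considered favorable: head followed by head.
favorable = [o for o in outcomes if o == ("head", "head")]

p_head_head = Fraction(len(favorable), len(outcomes))
print(p_head_head)
```

The counting only works because every outcome in the list is equally likely and each toss is independent of the other, which is exactly the condition the definition insists on.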

Do you see the emphasis on ‘truly independent’ or ‘well-defined’? Consider tossing a coin for example. Based on our understanding of the physical laws governing its motion, it can be said that the motion is well-defined and each outcome is truly independent. The concept of probability is used extensively in theoretical physics (e.g. wave function as ‘probability amplitude’ in quantum mechanics, or distribution of gas atoms or atoms on vibrating springs in statistical physics to derive the concept of temperature), but physicists always had to back up their assumption of true independence with further experimental confirmation. The probabilistic theory of quantum mechanics confused even heavyweights like Einstein, but nobody managed to come up with an experiment to prove it wrong.

When it comes to genetics, we are increasingly seeing the term ‘probability’ used to describe experiments that are far from well-defined. This is not only confusing, but also harmful when applied in a medical context. Professor Ken Weiss wrote an excellent blog post arguing that the term ‘probability’ in genetics should be replaced with ‘occasionality’ to make sure people understand that they are dealing with a different beast altogether.

Occasionality: a more appropriate alternative concept–where there’s no oh!

When many factors contribute to some measured event, and these are either not all known or measured, or in non-repeatable combinations, or not all always present, so that each instance of the event is due to unique context-dependent combination, we can say that it is an ‘occasional’ result. In the usual parlance, the event occasionally happens and each time the conditions may or may not be similar. This is occasionality rather than probability, and there may not be any ‘o-value’ that we can assign to the event.

This is, in fact, what we see. Of course, regular processes occur all around us, and our event will involve some regular processes, just not in a way to which probability values can be assigned. That is, the occasionality of an event is not an invocation of mystic or immaterial causation. The word merely indicates that instances of the event are individually unique to an extent not appropriately measured, or not measured with knowable accuracy or approximation, by probabilistic statistical (or tractable deterministic) approaches. The assumption that the observations reflect an underlying repeated or repeatable process is inaccurate to an extent as to undermine the idea of estimation and prediction upon which statistical and probabilistic concepts are based. The extent of that inaccuracy is itself unknown or even unknowable.

There are clearly genetic causal events that are identifiable and, while imperfect because of measurement noise and other unmeasured factors, sufficiently repeatable for practical understanding in the usual way and even treated with standard probability concepts. Some variants in the CFTR gene and cystic fibrosis fall into that category. Enough is known of the function of the gene and hence of the causal mechanism of the known allele that screening or interventions need not take into account other contextual factors that may contribute to pathogenesis but in dismissible minor ways. But this seems to be the exception rather than the rule. Based on present knowledge, I would suggest that that rule is occasionality.

There is another problem that he does not mention. We discussed it in A Sequel to Heng Li’s Mysterious New Program, but it is worth repeating here.

In the current trend of bioinformatics, theoretical researchers are resorting to being ‘tool providers’, and experimentalists are then using those ‘benchmarked tools’ to analyze experimental data. That creates distance between those who develop ‘software tools’ (note – not theoretical models) and those who interpret data. If the tool generates a ‘p-value’, then the p-value becomes scientific truth. Moreover, the science becomes diluted, because tools do not leave much room for discussion of fundamental principles. Try to argue with your computer program and see who wins :)

Chromosome-scale Shotgun Assembly using an in vitro Method for Long-range Linkage

This appears to be a promising paper for chromosome-scale scaffolding. (h/t:@lexnederbragt) The main technology is explained in the following paragraph.

We demonstrate here that DNA linkages up to several hundred kilobases can be produced in vitro using reconstituted chromatin rather than living chromosomes as the substrate for the production of proximity ligation libraries. The resulting libraries share many of the characteristics of Hi-C data that are useful for long-range genome assembly and phasing, including a regular relationship between within-read-pair distance and read count. Combining this in vitro long-range mate-pair library with standard whole genome shotgun and jumping libraries, we generated a de novo human genome assembly with long-range accuracy and contiguity comparable to more expensive methods, for a fraction of the cost and effort. This method, called “Chicago” (Cell-free Hi-C for Assembly and Genome Organization), depends only on the availability of modest amounts of high molecular weight DNA, and is generally applicable to any species. Here we demonstrate the value of this Chicago data not only for de novo genome assembly using human and alligator, but also as an efficient tool for the identification of structural variations and the phasing of heterozygous variants.

Here is the abstract –

Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
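For readers who are not familiar with the ‘scaffold N50’ metric quoted in the abstract, here is a minimal sketch of how it is usually computed: sort the scaffold lengths from longest to shortest and report the length at which the running sum first reaches half of the total assembly size. The function name and the toy lengths below are our own illustration and are not taken from the paper or from HiRise.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths, made up purely for illustration.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 80 + 70 = 150 is half of 300, so this prints 70
```

Going from an N50 of 508 kb to 10 Mb, as reported for the alligator, therefore means that half of the assembled bases now sit in scaffolds of at least 10 Mb.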

The technology will most likely be offered as a commercial service.

Competing financial interests
The authors have applied for patents on technology described in this manuscript, and Dovetail Genomics LLC is established to commercialize this technology. R.E.G. is Founder and Chief Scientific Officer of Dovetail Genomics. D.H. and D.S.R are members of the Scientific Advisory Board.

Repetitive Elements May Comprise Over Two-Thirds of the Human Genome

This paper from 2011 is a good example of what scientists do when they do not get showered with money from various government agencies. (Hint: they end up discovering the truth.)

Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo “clouds”). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%–69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Although short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed “element-specific” P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using it we identified ∼100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.


Quite the opposite happens with too much money.