Scripting for Biology – Online Virtual Classroom-based Module

Hello all,

I am building a number of online virtual classroom-based modules for researchers working on biological data. The description of the first one is attached below, and a beta test starts Sept 14. Please feel free to pass it on to anyone interested. The beta test is free, and all course materials (including a cloud account) will be provided. I currently have only a small number of spots left for this one. If interested, please email pandora at

This course will be useful for anyone with no programming knowledge, as well as for those who know one language (e.g. Python) well and want to learn another (say, Go).


Scripting for Biology – 1

Time: Sept 14 – Oct 9 (four weeks)

The purpose of this four-week module is to help biology researchers learn scripting languages (Python, Perl, Go). You are expected to learn only one language, not all three.

Whether you have never programmed or already have some programming experience, this course will help you greatly improve your skills. Moreover, our examples are chosen from relevant bioinformatics problems, so you can apply them to understand recent papers. Some of them will be covered in the second scripting module and in other modules related to RNAseq, assembly, metagenomics, etc.

How the classes work

I will first give you a short introduction to each topic and then let you start writing your own code to solve problems. When you solve one problem, you will get another, more difficult one, and so on. If you get stuck at any step, I will help you overcome the obstacle.

We will keep the class size small (~10) so that I can monitor the work done by every student. Each student will solve problems at their own pace without being affected by the rest of the class. So, if someone learns fast, they can finish the modules quickly or go on to solve more difficult problems.

The course is primarily based on chat. We will keep the option of using Google Hangouts for video discussions, but I plan to use it sparingly. I am also thinking about preparing some introductory videos, though it appears that a text-based method of delivering content is more effective than a video-based one.

We will provide a cloud account for each student with the relevant programs installed. All other course-related materials will be provided.


Week 1 – three 75-minute classes.

D1. i) Introduction to computer hardware, operating system and programming. ii) UNIX practices. iii) Entrance test.

D2. i) Introduction to Python/Perl/Go. ii) Coding practices.

D3. i) Solving the first bioinformatics puzzle. ii) Fun with debugging.

Week 2 – project 1 (k-mer counting).
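To give a flavor of project 1, here is a minimal k-mer counter in Python, one of the three course languages; the example sequence and k value are made up for illustration:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ATCGATCGA", 3)
print(counts["ATC"])  # ATC occurs twice in this toy sequence
```

The real project works with sequencing reads instead of a single toy string, but the core idea (a sliding window feeding an associative array) is the same.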

Week 3 – three 75-minute classes.

D1. i) Strings, ii) Regular expression, iii) Associative arrays.

D2. i) Solving another bioinformatics puzzle.

D3. i) Loops and conditions. ii) Advanced debugging.

Week 4 – project 2 (translation and genetic code) and exit test.
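For a taste of project 2, here is a minimal sketch of DNA-to-protein translation in Python. The codon table below is a deliberately tiny subset of the standard genetic code, just for illustration; the real project uses the full 64-codon table:

```python
# A small subset of the standard genetic code, for illustration only.
CODON_TABLE = {
    "ATG": "M", "TTT": "F", "AAA": "K", "TAA": "*",  # "*" marks a stop codon
}

def translate(dna):
    """Translate a DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" = codon not in our table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTAAATAA"))  # MFK
```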

Beautiful Japan


I am reading Morris Berman’s ‘Neurotic Beauty – An Outsider’s Look at Japan’. It is a great book, with many insights into the people and their history. Strangely, I cannot find any link on Amazon, although I remember seeing one a few weeks back.

Based on his earlier books, I find Berman far more insightful than George Friedman. In fact, the difference is so great that even making such a comparison would be insulting to Berman. No wonder he packed up and moved outside the USA, while Friedman became the chief adviser to various government agencies.

Predictions about the Next 100 Years


If you want to know what the next 100 years will look like, here is the perfect book for you. Remember that you need to take all predictions in ‘contrarian mode’, i.e. expect the exact opposite of what the author says will happen.

George Friedman predicts the collapse and breakdown of Russia in 2020, the complete crash of the ‘paper tiger’ Chinese economy in 2030, and the rise of three great empires – Japan, Poland and Turkey – which will all go to world war with the USA in 2060, followed finally by a war between Mexico and the USA in 2090, with Mexico claiming a large part of the southern USA. Talk about the sheer arrogance of trying to predict events in 2090. But then, George Friedman was the brain behind the security analysis organization Stratfor, which gets huge subscription money from the CIA and other US government bigshots. So what he writes is the conventional wisdom. Bet against it and you will be on the right track.

“There is a Black Hole in the Middle of Evolutionary Biology”


I finished reading Nick Lane’s “The Vital Question: Energy, Evolution, and the Origins of Complex Life”. It is a great book, but I encourage readers to first start with the paper “Early bioenergetic evolution” by Filipa L. Sousa et al.

Nick Lane works on the origin of life in collaboration with Bill Martin. In the first chapter of the book, he mentions a ‘black hole’ in evolutionary theory. What is this black hole? It is that nobody knows how eukaryotes evolved from prokaryotes. Evolutionary biologists seemed to have many explanations, including gradual evolution, serial endosymbiosis, etc., but none of them was supported by genome sequencing. Meanwhile, most scientists moved on, assuming that the evolutionary origin of eukaryotes had been explained.

Removing such a ‘black hole’ in fundamental understanding is needed not only to satisfy scientific curiosity, but also for practical reasons. Many human genes are shared by a large number of eukaryotes and were therefore present in the last eukaryotic common ancestor. How can anyone think about tweaking those genes to cure diseases without knowing why they were there in the first place?

In his book, Lane expands on his work with Bill Martin proposing that energy was the major factor in the evolution of eukaryotes (the “hydrogen hypothesis”). In this scenario, eukaryotic cells evolved through a sudden jump, when a bacterial cell entered the cell of a methanogenic archaeon, became its source of energy and eventually its mitochondria. That is only a part of the book, however; in the rest, he provides explanations of the other unanswered questions on the origin of life (check his “Life Ascending: The Ten Great Inventions of Evolution”) based on energy considerations. In this scenario, life originated in thermal vents, and energy was the major factor in that process. He writes about the chemiosmotic theory of Peter D. Mitchell and explains why thermal vents were natural places for such a process to originate and evolve.

If the first few chapters do not convince you to visit a thermal vent during your next summer vacation, the last chapter definitely will. In it, he shows electron microscope images of a strange creature discovered in a thermal vent near Japan. The 2012 paper by Yamaguchi et al. was titled ‘Prokaryote or eukaryote? A unique microorganism from the deep sea’, and it indeed delivered what it promised in the title.

There are only two kinds of organisms on the Earth: prokaryotes and eukaryotes. Although eukaryotes are considered to have evolved from prokaryotes, there were no previously known intermediate forms between them. The differences in their cellular structures are so vast that the problem of how eukaryotes could have evolved from prokaryotes is one of the greatest enigmas in biology. Here, we report a unique organism with cellular structures appearing to have intermediate features between prokaryotes and eukaryotes, which was discovered in the deep sea off the coast of Japan using electron microscopy and structome analysis. The organism was 10 μm long and 3 μm in diameter, having >100 times the volume of Escherichia coli. It had a large ‘nucleoid’, consisting of naked DNA fibers, with a single nucleoid membrane and endosymbionts that resemble bacteria, but no mitochondria. Because this organism appears to be a life form distinct from both prokaryotes and eukaryotes but similar to eukaryotes, we named this unique microorganism the ‘Myojin parakaryote’ with the scientific name of Parakaryon myojinensis (‘next to (eu)karyote from Myojin’) after the discovery location and its intermediate morphology. The existence of this organism is an indication of a potential evolutionary path between prokaryotes and eukaryotes.

Overall a great book and well worth reading.

Nick Lane’s “The Vital Question: Energy, Evolution, and the Origins of Complex Life”

I am starting to read Nick Lane’s new book – “The Vital Question: Energy, Evolution, and the Origins of Complex Life”.


Nick Lane is a UK-based biochemist, who writes about the unsolved fundamental problems in evolutionary biology. I earlier read his ‘Life Ascending: The Ten Great Inventions of Evolution’ and enjoyed it. Here is a short description of the new book.

To explain the mystery of how life evolved on Earth, Nick Lane explores the deep link between energy and genes.

The Earth teems with life: in its oceans, forests, skies and cities. Yet there’s a black hole at the heart of biology. We do not know why complex life is the way it is, or, for that matter, how life first began. In The Vital Question, award-winning author and biochemist Nick Lane radically reframes evolutionary history, putting forward a solution to conundrums that have puzzled generations of scientists.

For two and a half billion years, from the very origins of life, single-celled organisms such as bacteria evolved without changing their basic form. Then, on just one occasion in four billion years, they made the jump to complexity. All complex life, from mushrooms to man, shares puzzling features, such as sex, which are unknown in bacteria. How and why did this radical transformation happen?

The answer, Lane argues, lies in energy: all life on Earth lives off a voltage with the strength of a lightning bolt. Building on the pillars of evolutionary theory, Lane’s hypothesis draws on cutting-edge research into the link between energy and cell biology, in order to deliver a compelling account of evolution from the very origins of life to the emergence of multicellular organisms, while offering deep insights into our own lives and deaths.

Both rigorous and enchanting, The Vital Question provides a solution to life’s vital question: why are we as we are, and indeed, why are we here at all?

Lane and Martin first proposed the above ideas in their 2010 Nature paper – The energetics of genome complexity.

Slides from Gene Myers


Gene Myers kindly shared his slides with our blog readers. They will help you understand the notes from yesterday’s talk better.

Gene Myers at the PacBio Bioinformatics Workshop – Follow #SMRTBFX on Twitter

A PacBio bioinformatics workshop is currently going on at NIST in Maryland, where Gene Myers (who is not as smart as David Haussler :) ) is the keynote speaker. Interested readers are encouraged to follow the #SMRTBFX hashtag on Twitter. The slides from the talk are posted here.


Based on Twitter reports, Gene Myers said –

1. The noise in PacBio reads is basic thermodynamic noise – almost totally random.

2. Perfect assembly is possible iff: 1) Poisson sampling, 2) random error, 3) reads longer than repeats.

3. Longer reads take away some of the fun of solving the repeat problem.

4. The repeats can be resolved with long reads by leveraging the heterogeneity of repeats. No assembler has reached the theoretical limits yet.

5. The future is here – reference genomes right out of the box are now possible.

6. It is easier to share data interfaces than software interfaces. Let’s define interfaces so we can play together — good idea, Gene! The same principle guided the software group at Celera.

7. GM talked about the “classic time-space trade-off” for sequence alignment. Using trace points saves time with minimal space overhead.

8. Gene Myers’ Dazzler blog has all the code.

9. Trace points scale well in both time and space.

10. The trace point concept is better than BAM and SAM for PacBio reads. We would need converters to BAM and SAM for other tools.

11. No evidence aligning to A implies A is bad. Vote for the quality of each A segment.

12. Voting over this gives a statistic that carries information about the quality.

13. Votes of consensus between B reads (aligned reads) and the A read (anchor read) can be used as intrinsic QVs.

14. Near-perfect assembly is within reach: the Dazzler DB framework for the assembly pipeline, with trace points for intrinsic quality values.

15. All the data needed for the assembler now fits in 2 TB.

16. Uses patching to create uniform-quality reads and fix the artifacts, to get a good string graph.

17. Scrubbers should remove as little real data as possible while removing all the artifacts from the data.

18. No real-world string graph looks perfect, because of insufficient scrubbing.

19. The challenge for DAscrub is incorporating repeat analysis. It is available for collaborative use, but not stable enough for wide distribution.

20. DAscrub: assemble @PacBio data without an error-correction step. Coming soon.

…and possibly a snub at the doctor who is currently running the NIH and was his competitor in the Human Genome Project.


Thanks to @infoecho and others for covering the workshop for us.

High-throughput Pairing of T Cell Receptor Alpha and Beta Sequences

Readers may find this paper by Bryan Howie and colleagues interesting. It solves an important problem (finding the pairing pattern of TCR alpha and beta chains) using high-throughput sequencing and an elegant mathematical model.

The T cell receptor (TCR) protein is a heterodimer composed of an alpha chain and a beta chain. TCR genes undergo somatic DNA rearrangements to generate the diversity of T cell binding specificities needed for effective immunity. Recently, high-throughput immunosequencing methods have been developed to profile the TCR alpha (TCRA) and TCR beta (TCRB) repertoires. However, these methods cannot determine which TCRA and TCRB chains combine to form a specific TCR, which is essential for many functional and therapeutic applications. We describe and validate a method called pairSEQ, which can leverage the diversity of TCR sequences to accurately pair hundreds of thousands of TCRA and TCRB sequences in a single experiment. Our TCR pairing method uses standard laboratory consumables and equipment without the need for single-cell technologies. We show that pairSEQ can be applied to T cells from both blood and solid tissues, such as tumors.

Their probabilistic model is very familiar to me, because twelve years ago I used it in a different context – finding clustered proteins in noisy large-scale protein-protein interaction data. Here is the basic idea. If A has 10 friends and B has 10 friends, what is the probability that they share 9 friends by chance alone? If the computed probability is very low and the actual measurement shows that they do share 9 friends, then A and B are strongly associated. In the case of the TCR alpha and beta chains in their experimental set-up, that strong association implies combining to form heterodimers. For large-scale protein-protein interaction data in yeast, significant association appeared to indicate functional similarity.
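The chance-overlap calculation behind that idea is hypergeometric. Here is a minimal sketch (not taken from the paper; the population size of 1000 is a made-up parameter for illustration): if B’s 10 friends were drawn at random from a population containing A’s 10 friends, the probability of a given overlap is

```python
from math import comb

def p_shared(n_pop, deg_a, deg_b, shared):
    """Probability that exactly `shared` of B's deg_b friends fall among
    A's deg_a friends, if B's friends are drawn at random from n_pop people
    (hypergeometric distribution)."""
    return (comb(deg_a, shared) * comb(n_pop - deg_a, deg_b - shared)
            / comb(n_pop, deg_b))

# Probability of exactly 9 common friends out of 10 each, population 1000:
p = p_shared(1000, 10, 10, 9)
print(f"{p:.3g}")  # vanishingly small, so observing 9 shared friends is significant
```

A low chance probability combined with a high observed overlap is exactly the signal the pairSEQ-style analysis turns into a pairing call.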

The work for the above TCR paper was done by Adaptive Biotechnologies, a very creative Seattle-based company founded by scientists from the Fred Hutchinson Cancer Research Center.

How Much of Evolution is Influenced by Extinction?

I finished reading a wonderful book that I recommend to everyone trying to understand evolution. It was written by paleontologist David M. Raup, who passed away last month. My curiosity was piqued by the Sandwalk blog, which recommended it as one of the top five books on evolution.


Raup argued that, given that a very high fraction (99.9%) of all species that ever lived on earth has died out (gone extinct), the mode of extinction should play a dominant role in evolution. But what if a large fraction of those extinctions took place due to rare natural catastrophes (such as rocks hitting the earth from the sky)? Shouldn’t biologists then pay less attention to the fitness of genes and more to unpredictable natural events? For a quick introduction to the topic, readers may take a look at Raup’s 1994 PNAS paper – “The role of extinction in evolution”.

The extinction of species is not normally considered an important element of neodarwinian theory, in contrast to the opposite phenomenon, speciation. This is surprising in view of the special importance Darwin attached to extinction, and because the number of species extinctions in the history of life is almost the same as the number of originations; present-day biodiversity is the result of a trivial surplus of originations, cumulated over millions of years. For an evolutionary biologist to ignore extinction is probably as foolhardy as for a demographer to ignore mortality. The past decade has seen a resurgence of interest in extinction, yet research on the topic is still at a reconnaissance level, and our present understanding of its role in evolution is weak. Despite uncertainties, extinction probably contains three important elements. (i) For geographically widespread species, extinction is likely only if the killing stress is one so rare as to be beyond the experience of the species, and thus outside the reach of natural selection. (ii) The largest mass extinctions produce major restructuring of the biosphere wherein some successful groups are eliminated, allowing previously minor groups to expand and diversify. (iii) Except for a few cases, there is little evidence that extinction is selective in the positive sense argued by Darwin. It has generally been impossible to predict, before the fact, which species will be victims of an extinction event.

Achilles Heel of ‘Big Data’ Science

The promoters of ‘Big Data’ science argue that by collecting increasingly large amounts of data and processing them with clever algorithms, they can make fundamental scientific discoveries (or other social contributions). Many others point out the lack of discoveries compared to what the same people have been promising for years, to which the ‘Big Data’ supporters respond that they have not collected enough data yet.

In this article, I present three rules to show that the basic premise of ‘Big Data’ is faulty, and explain them with examples. The rules are –

Rule 1.

Quality does not scale, but noise scales.

Rule 2.

The noise can only be reduced by high-quality algorithms.

Rule 3.

Rules 1 and 2 are valid at all scales.


Let me start with the simple example of genome assembly from short reads using de Bruijn graphs. What happens when one throws in tons and tons of reads to get higher coverage? The number of high-quality k-mers (i.e. those truly matching the genome) remains the same, but the number of noisy k-mers scales with more data. As you add more coverage, the noisy k-mers start to overwhelm the system. The first thing you see is your computer’s RAM filling up, leading to periodic crashes. At even higher coverage, the de Bruijn graph has all kinds of tips and bubbles formed in addition to the real contigs.
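You can see Rule 1 in a toy simulation. The sketch below is my own illustration, not from any assembler: the genome size, read length, and error rate are made-up parameters. True k-mers are capped by the genome size, while noisy k-mers keep growing with coverage:

```python
import random

random.seed(1)
GENOME = "".join(random.choice("ACGT") for _ in range(10_000))
K, READ_LEN, ERR = 21, 100, 0.01  # illustrative parameters

def sequencing_kmers(coverage):
    """Distinct k-mers seen after sampling reads with a ~1% error rate."""
    kmers = set()
    n_reads = coverage * len(GENOME) // READ_LEN
    for _ in range(n_reads):
        start = random.randrange(len(GENOME) - READ_LEN)
        # Each base is replaced by a random base with probability ERR.
        read = "".join(b if random.random() > ERR else random.choice("ACGT")
                       for b in GENOME[start:start + READ_LEN])
        kmers.update(read[i:i + K] for i in range(READ_LEN - K + 1))
    return kmers

true_kmers = {GENOME[i:i + K] for i in range(len(GENOME) - K + 1)}
for cov in (10, 100):
    seen = sequencing_kmers(cov)
    print(cov, len(seen & true_kmers), len(seen - true_kmers))
```

Running this, the count of true k-mers saturates near the genome size, while the count of noisy k-mers grows roughly in proportion to coverage – quality does not scale, but noise does.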

Both problems are solvable, but they need high-quality algorithms.


Think carefully about the implication of the last sentence within the context of Rule 3. Rule 3 says that Rules 1 and 2 apply at all scales. That means they apply to the de Bruijn graph, but they also apply to researchers developing high-quality algorithms.

Let us say the government throws in a lot of money to get high-quality algorithms. What happens? Well, high-quality algorithms reach a plateau, but noise scales with money. As a result, the space of new bioinformatics tools looks similar to a de Bruijn graph at 200x coverage. How do you figure out what is good and what is not? Well, maybe you need high-quality algorithms to figure out which algorithms are worth using :)


I thought about these rules for months and could not find any way out of the constraints they impose. Meanwhile, the scientific world keeps marching to the tune of ‘Big Data’ in all aspects. Every aspect of it, including ranking papers based on citations (or, God forbid, retweets), is vulnerable. The same goes for automated annotation of public databases based on existing data. This last point will be the focus of a forthcoming post.



Stefano Lonardi posted a link to his paper, and believe it or not, I was looking for the same paper in my directory while writing this blog post today, but could not locate it. That is no coincidence, because my initial thinking six months back was influenced by his paper posted at biorxiv. At that time, I was working on the 1000x E. coli data posted by the SPAdes group and made a similar observation in the assembly stats. The explanation (the dBG picking up more noise) seemed obvious, but it is also known that SPAdes manages to produce a good assembly from the 1000x data. That observation inspired the rest of the thinking about the need for high-quality algorithms to overcome noise.

You may also realize that throwing more money at the assembly problem would not have produced the high-quality solution, but would instead have polluted the space with too much noise (i.e. low-quality algorithms). It was rather Pevzner’s work over two decades that got us there. That is the essence of Rule 3 in one human context.