

Bangkok – a wild ride

The crazy rain had barely stopped when I stepped off the airport express train at Phaya Thai station in Bangkok. It was the last day of a two-week trip to various Asian countries to meet with NGS researchers. The day was reserved for buying small gifts for my kids, but the rain kept me inside the hotel until midday.

I was trying to pick my way through the flooded sidewalks when a motorcycle taxi driver spotted me.

“Hey, where do you want to go?”
Me: “Any shopping mall nearby?”
“What do you like to buy?”
Me: “Clothes and small things for kids.”
“Pixie, you need to go to Pixie. 40 Bahts – I take you.”

For 40 bahts (~$1), I got not only a ride but a truly adventurous wild ride that no theme park can match. Traffic moved slowly in all lanes of the highway, but the motorcycle taxi raced ahead, swerving in and out of lanes and taking advantage of everything from the sidewalk to the lanes for oncoming traffic. A big bus nearly kissed my bag just before the taxi swerved out of the oncoming lane.

“Hey, do you need a lady for tonight?” he asked rather nonchalantly after this feat. “I am preparing my mind for 72 virgins,” I had to say. Thankfully, the ‘Pixie’ mall (= Big C) arrived by then, and I got out alive.

Although I did not plan to meet any researchers in Bangkok itself, I passed through the city at least three times during the trip. Bangkok is fast becoming a major international hub in Asia, playing the role Singapore did in earlier days. The airport is well connected across Asia through many discount airlines. As an added bonus, you can venture into the city to buy cheap clothes and electronics. Moreover, a growing Thailand is pulling its hinterlands in Laos, Cambodia and Myanmar forward. That part of the world is ready for dramatic transformations in the years ahead.

[coming up]

Hong Kong – will it become the bioinformatics capital?


Singapore – fast forward of Lee Kuan Yew’s vision

[visits to A*Star Research Institute and National University of Singapore, and general comments about Singapore]

Kolkata – a new beginning?

[visit to newly established NGS research center in Kalyani near Kolkata]

Seoul – what will the unification bring?

[comments on transit through Seoul and upcoming Korean unification]

Back to the land of exceptionalism

[the day started with exceptionally arrogant immigration officers, as usual]

The Current Status of the Introductory Book on Genome Assembly


Dear readers,

In mid-February, I made a somewhat finished draft of the book available for purchase here, but did not announce it on the blog, because I was not happy with the quality. Since then, I have been working on improving a number of sections. Over the next few days, I will post the text of those sections as separate blog commentaries, and then publish an updated version of the book around April 15th incorporating all of them. I am quite satisfied with the changes and expect the finalized book to be useful for newcomers.

I am planning to set the price at $20 after the next release, but if you grab the book now at the leanpub site, you will be able to get it for $15. As I mentioned earlier, once you make the initial purchase from leanpub, you will always be able to access the latest version from their site at no extra cost. They provide pdf, mobi and epub formats.

If you would like to see a copy of the released version, please feel free to email me at [email protected], and I will send you a pdf file.


Earlier posts –

An Easy-to-follow Introductory Book on NGS Assembly Algorithms
An Update on the Introductory Book on Genome Assembly

rnaQUAST: Quality Assessment Tool for Transcriptome Assemblies

The Algorithmic Biology Lab in St. Petersburg, of SPAdes fame, has developed a new tool for evaluating the quality of transcriptome assemblies using a reference genome and annotation (h/t: anton). Stay tuned for more information.

3 Options

3.1 Input data options

To run rnaQUAST, one needs to provide either FASTA files with transcripts (recommended), or align the transcripts to the reference genome manually and provide the resulting PSL files. rnaQUAST also requires a reference genome and, optionally, an annotation.
-r , –reference
Single file with reference genome containing all chromosomes/scaffolds in FASTA format.

-gtf , –annotation
File with annotation in GTF/GFF format.

-c , –transcripts
File(s) with transcripts in FASTA format separated by space.

-psl , –alignment
File(s) with transcripts alignments in PSL format separated by space.

3.2 Basic options

-o , –output_dir
Directory to store all results. Default is rnaQUAST_results/results_.

Run rnaQUAST on the test data from the test_data folder; the output directory is rnaQUAST_test_output.

-d, –debug
Report detailed information, typically used only when detecting problems.

-h, –help
Show help message and exit.

3.3 Advanced options

-t , –threads
Maximum number of threads. Default is the number of CPU cores (detected automatically).

-l , –labels
Names of assemblies that will be used in the reports separated by space.

-ss, –strand_specific
Set if transcripts were assembled using strand-specific RNA-Seq data, in order to benefit from knowing whether a transcript originated from the + or the − strand.

Minimal alignment size to be used, default value is 50.

Do not draw plots (makes rnaQUAST run a bit faster).

FASTG Viewer Bandage

This FASTG viewer appears very impressive (h/t: Anton). Have you used it?

De novo assembly graphs contain assembled contigs (nodes) but also the connections between those contigs (edges), which are not easily accessible to users. Bandage is a program for visualising assembly graphs using graph layout algorithms. By displaying connections between contigs, Bandage opens up new possibilities for analysing de novo assemblies that are not possible by looking at contigs alone.


    A layout algorithm is used to automatically position graph nodes.
    Manually reposition and reshape nodes.
    Zoom, pan and rotate the view using either mouse or keyboard controls.
    View the entire assembly graph or only a region of interest.
    Copy node sequences to the clipboard or save them to file.
    Nodes can be coloured using built-in colour schemes or user-defined colours.
    Nodes can be labelled using node number, length, coverage or a user-defined label.
    Find nodes quickly in a large graph using node numbers.
    Specify the thickness of nodes and allow thickness to reflect the node’s coverage.
    Define the relationship between the length of a node and the length of its sequence.
    Two possible styles for handling reverse complement nodes:
        Single: nodes and their reverse complements are drawn as one object with no direction.
        Double: nodes are drawn using arrow heads to indicate direction, and reverse complement nodes are drawn separately with arrow heads pointing in the opposite direction.
    Integrated BLAST search allows for highlighting specific sequences in an assembly graph.
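The "double" style above draws each node and its reverse complement as separate, oppositely-oriented objects. A minimal sketch of that pairing, with illustrative names rather than Bandage's actual internals:

```python
# Minimal sketch of pairing each assembly-graph node with its reverse
# complement, as in a "double" display style. Names are made up for
# illustration; this is not Bandage's real code.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def double_mode_nodes(nodes):
    """Expand each node into a forward (+) and reverse (-) pair."""
    expanded = {}
    for name, seq in nodes.items():
        expanded[name + "+"] = seq
        expanded[name + "-"] = reverse_complement(seq)
    return expanded

print(double_mode_nodes({"node1": "ATGGCC"}))
# {'node1+': 'ATGGCC', 'node1-': 'GGCCAT'}
```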

Ultrafast SNP Analysis using the Burrows-Wheeler Transform of Short-read Data

This recent paper appears quite interesting (h/t: Ruibang). It starts with the BWT of a short-read library (e.g., computed with BCR) and skips the alignment step altogether, going straight to SNP determination.

Motivation: Sequence-variation analysis is conventionally performed on mapping results that are highly redundant and occasionally contain undesirable heuristic biases. A straightforward approach to SNP analysis, using the Burrows-Wheeler transform (BWT) of short-read data, is proposed.

Results: The BWT makes it possible to simultaneously process collections of read fragments of the same sequences; accordingly, SNPs were found from the BWT much faster than from the mapping results. It took only a few minutes to find SNPs from the BWT (with supplementary data, FDC) using a desktop workstation in the case of human exome or transcriptome sequencing data and twenty minutes using a dual-CPU server in the case of human genome sequencing data. The SNPs found with the proposed method almost agreed with those found by a time-consuming state-of-the-art tool, except for the cases in which the use of fragments of reads led to sensitivity loss or sequencing depth was not sufficient. These exceptions were predictable in advance on the basis of minimum length for uniqueness (MLU) and fragment depth of coverage (FDC) defined on the reference genome. Moreover, BWT and FDC were computed in less time than it took to get the mapping results, provided that the data was large enough.

Availability: A proof-of-concept binary code for a Linux platform is available on request to the corresponding author.
Contact: [email protected]


The authors are not new to using BWT and suffix arrays for analyzing genomic data. Here are a few of their previous papers –

2009 –

Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences

2009 –

Localized suffix array and its application to genome mapping problems for paired-end short reads

2011 –

Seed-set construction by equi-entropy partitioning for efficient and sensitive short-read mapping

Outrageous Prediction 2015 – USA Will Start to See Exodus of Russian Scientists

Over the years, our blog has made a number of forecasts that seemed outrageous when we made them but turned out to be correct over time. Please ponder today's one and feel free to comment. I am leaving for the airport and will add the full post after finding good wifi somewhere.



Who is your ruler?

One of the first things newcomers to the USA notice is that state capitals are not located in the largest and best-connected cities of their states. The capital of New York state is Albany, not New York. The capital of Illinois is little-known Springfield, not Chicago, and the capital of California is neither San Francisco nor Los Angeles, but Sacramento. Boston, Massachusetts is possibly the only exception. The reason, as it is usually explained, is that the largest cities are already burdened with trade, commerce, manufacturing and transportation. Therefore, it is prudent to decentralize and avoid piling the extra load of government activities on top of the existing ones.

That explanation presents another puzzle – why did no one else think of it before? The act of moving the government away from the largest city ended in disastrous failure in two rare instances in the last thousand years of Indian history. One got the ruler labeled a crazy man forever, and the other marked the beginning of the end of the British Raj in India. Rulers and their associates love to be the center of attention and stay right in the middle of the largest cities.

It took me a decade to figure out the real answer. As it turns out, the rulers of the US states and nation are indeed located right at the centers of the biggest cities, but we do not (or rather did not) know them as such. If you go to the centers of the big cities in the USA, you see large banks. The USA is a country run by banks and bankers, with politics subordinate to them. That is not a unique insight any more in 2015, and many have come to the same conclusion since the financial crisis of 2008.


Peculiarities of banking-led empires

Within the US, the northern states have followed the above model since the early days of the country, and the South was fully brought in after the Civil War. Post-WWII, the US has been trying to take over most other countries by expanding this banking empire. The process involves –

(i) control of central bank through organizations like IMF

(ii) control of money and credit through central bank

(iii) control of media

(iv) support of ‘democracy’ without losing control of (i-iii).

Once the banks are controlled, the remaining industries can be easily taken over through a process of credit inflation, deflation and selective violence.

One can write an entire book on the above process, but I do not need to because many such books are already out there. John Perkins’ Confessions of an Economic Hit Man is fairly good.

According to his book, Perkins’ function was to convince the political and financial leadership of underdeveloped countries to accept enormous development loans from institutions like the World Bank and USAID. Saddled with debts they could not hope to pay, those countries were forced to acquiesce to political pressure from the United States on a variety of issues. Perkins argues in his book that developing nations were effectively neutralized politically, had their wealth gaps driven wider and economies crippled in the long run. In this capacity Perkins recounts his meetings with some prominent individuals, including Graham Greene and Omar Torrijos. Perkins describes the role of an economic hit man as follows:

Economic hit men (EHMs) are highly paid professionals who cheat countries around the globe out of trillions of dollars. They funnel money from the World Bank, the U.S. Agency for International Development (USAID), and other foreign “aid” organizations into the coffers of huge corporations and the pockets of a few wealthy families who control the planet’s natural resources. Their tools included fraudulent financial reports, rigged elections, payoffs, extortion, sex, and murder. They play a game as old as empire, but one that has taken on new and terrifying dimensions during this time of globalization.


Violence in a banking-led empire

Banking-led empires cyclically go through violent phases coinciding with the terminal phases of debt deflation. These phases are essentially part of the debt-collection process. The way it works is that the bankers, through their control of politicians, pass the debt on to the people. The people eventually revolt and are brought under control by force. A country can also externalize the violence, and WW II is a classic example for the banks of New York.


Russia in the above context

The above introduction is necessary to provide context for the main topic of this blog post. The banking-led empire is going through another cycle of debt deflation, and everyone realizes that it will end in a war. The question is where.

The US learned from the experience of WW II that it became strong when the Europeans fought each other. So, the US has been trying hard to start a war between European countries and Russia. The Russian government, on the other hand, has figured out the game plan of the banking-led empire and is trying to neutralize it. One such action, which will surprise everyone when it is announced in 2-3 months, is the unification of Korea.

These failures to start a large-scale war in Europe are frustrating Americans immensely, and they are ramping up the propaganda. As an example of extreme propaganda, in the video attached at the top of the post, a Fox News analyst announced that the US government should start killing Russians. They realize that if they cannot externalize the deflationary war to Europe, it will come to the US. Either way, Russian scientists and engineers who emigrated to the US after 1990 will increasingly feel pressured to move back to Russia.


Russia as a country

At present, Russia is one of the strongest countries culturally, socially and scientifically. Therefore, Russians moving back to Russia under political humiliation in the USA will not be the same as Somali Muslims moving to Somalia after being harassed after 9/11. Population growth in Russia will soon exceed that of the US, and a growing country always offers more opportunities.


To summarize, the combination of the above dynamics – (i) deflation and declining quality of life in the US, (ii) hostility toward Russians due to US elites' desire to externalize war, and (iii) improving quality of life in Russia – will result in an exodus of Russians from the US.

‘Coding’ in Genetics is not the Same as ‘Coding’ in Computing

While going through various discussions on non-coding DNA, I realized that some of the confusion is possibly caused by the dual use of the word ‘coding’. Yan Wong explained it elegantly on the sandwalk blog –

I think the root is a grammatical confusion between “coding” in the sense of computer code, and “coding” in the sense of the 3-letter mapping between DNA and aa (the “genetic code”). It was perhaps an unfortunate choice to call this mapping “the genetic code”, as it implies to the layman that all functionality comes through this route. Perhaps a less leading phrase would be “translated DNA”?

Those coming from the computing world may not realize that, with the genetics definition of ‘coding’, ‘non-coding’ is not synonymous with non-functional. Many components of the non-coding genome are functional, and that has been known for decades.
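The "genetic code" that Wong refers to is just the 3-letter codon-to-amino-acid mapping. A tiny sketch makes the distinction concrete; only a handful of the 64 codons are shown here, but the assignments are the standard ones.

```python
# 'Coding' in the genetics sense: the codon-to-amino-acid mapping applied
# to translated DNA. Only four of the 64 standard codons are listed here,
# purely for illustration.

GENETIC_CODE = {
    "ATG": "M",  # start codon, methionine
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "TAA": "*",  # stop codon
}

def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = GENETIC_CODE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCAAATAA"))  # MAK
```

DNA that never passes through this mapping is "non-coding" in the genetics sense, regardless of whether it has a function.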

Known components of non-coding DNA (from the sandwalk blog) –

    ribosomal RNA genes
    tRNA genes
    genes for small RNAs (e.g spliceosome RNAs, P1 RNA, 7SL RNA, linc RNA etc.)
    5′ and 3′ UTRs in exons
    SARs (scaffold attachment regions)
    origins of DNA replication
    regulatory regions of DNA
    transposons (SINES, noncoding regions of LINES, LTRs)
    defective transposons

In fact, those components are so well known that sandwalk blog even includes the relative percentages of various categories of non-coding sequences in the human genome in another post.

When Francis Collins and the ENCODE clowns funded by him make statements like the following, they make a complete mockery of the scientific process.

In January, Francis Collins, the director of the National Institutes of Health, made a comment that revealed just how far the consensus has moved. At a health care conference in San Francisco, an audience member asked him about junk DNA. “We don’t use that term anymore,” Collins replied. “It was pretty much a case of hubris to imagine that we could dispense with any part of the genome — as if we knew enough to say it wasn’t functional.” Most of the DNA that scientists once thought was just taking up space in the genome, Collins said, “turns out to be doing stuff.”

The consensus has moved based on ENCODE science, or junk science, and not real science. One can easily go through the various components in sandwalk's list and their relative percentages to see whether any of them has newly been shown to be functional. Transposons, members of the biggest group, are ‘doing stuff’, and that ‘stuff’ is invading other parts of the genome to grow in number. Has anyone shown any other function for a large percentage of transposons? The scientific evidence is not there, but that does not stop Collins from making false statements.

Nanopore Assembly Improves with HMM Polishing of Signal Level Data

A few days back, we posted about the assembly of the E. coli genome using nanopore-only data. At that time, Jared Simpson, the corresponding author, tweeted about a later version of the paper that would include error correction based on the actual electrical signals instead of nucleotide data. Today he notified us that the new version has been uploaded to arXiv. The changes in the abstract are emphasized below, and it is great that the accuracy is now at 99.4%.

A complete bacterial genome assembled de novo using only nanopore sequencing data

A method for de novo assembly of data from the Oxford Nanopore MinION instrument is presented which is able to reconstruct the sequence of an entire bacterial chromosome in a single contig. Initially, overlaps between nanopore reads are detected. Reads are then subjected to one or more rounds of error correction by a multiple alignment process employing partial order graphs. After correction, reads are assembled using the Celera assembler. Finally, the assembly is polished using signal-level data from the nanopore employing a novel hidden Markov model. We show that this method is able to assemble nanopore reads from Escherichia coli K-12 MG1655 into a single contig of length 4.6Mb permitting a full reconstruction of gene order. The resulting draft assembly has 98.4% nucleotide identity compared to the finished reference genome. After polishing the assembly with our signal-level HMM, the nucleotide identity is improved to 99.4%. We show that MinION sequencing data can be used to reconstruct genomes without the need for a reference sequence or data from other sequencing platforms.
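The 98.4% and 99.4% figures in the abstract are nucleotide identities. As a simplified illustration of the metric (real whole-genome comparisons run a proper aligner first), identity is just the fraction of matching columns in an aligned pair:

```python
# Simplified illustration of nucleotide identity: the fraction of matching
# columns in an already-aligned pair of sequences (gap characters count as
# mismatches). Whole-genome comparisons compute this over real alignments;
# this sketch only shows the metric itself.

def nucleotide_identity(aln_a, aln_b):
    """Percent identity over two equal-length aligned sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aln_a, aln_b))
    return 100.0 * matches / len(aln_a)

# 9 of 10 columns match:
print(nucleotide_identity("ACGTACGTAC", "ACGTACGTAT"))  # 90.0
```

At the scale of a 4.6 Mb chromosome, the jump from 98.4% to 99.4% identity means roughly 46,000 fewer mismatched or gapped positions.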

Junk DNA Debate – T. Ryan Gregory versus Francis Collins


Readers may take a look at the New York Times article written by Carl Zimmer – Is Most of Our DNA Garbage?. It features the work of T. Ryan Gregory, whose name should be familiar to those following our evolutionary biology section (e.g. check Evolution of Genome Size – Scientific and Religious Analysis). Dr. Gregory's research consists of measuring the genome sizes of various organisms and looking for any evolutionary pattern among those sizes. Here is a brief summary of what he found after decades of work.

1. Based on data collected on ~5000 animal genomes so far, genome sizes were found to vary 7000-fold. That is an enormous range.

2. Animals with the largest genomes – lungfish (~80-120Gbp), salamanders (similar order), sharks (~10-20Gbp), grasshoppers, flatworms and crustaceans.

3. The only phenotypic correlates of genome size are larger cell size and longer cell-division time in organisms with large genomes.

4. Ecological correlation – “An emerging trend from animal (and more specifically, crustacean) genome size studies is the positive relationship between genome size and latitude.”

5. Correlation with intron size – “Intron size and genome size are known to be positively correlated between species of Drosophila (Moriyama et al. 1998), within the class of mammals (Ogata et al. 1996), and across eukaryotes in general (Vinogradov 1999).”

6. No relationship between genome size and animal complexity has ever been found. Researchers have been looking into this for over four decades.

The entirety of the collected evidence points to only one explanation for large genome size – ‘junk DNA’. That means if organisms A and B are evolutionarily close and have a similar level of complexity (as defined by the number of different cell types), and the genome of organism B is much larger than that of organism A, then a large part of the genome of organism B consists of nonfunctional DNA. As a good example, fugu and zebrafish are evolutionarily related, but their genome sizes are fugu 390 Mb and zebrafish 1.7 Gb. Therefore, the genome of zebrafish likely has 1.3 Gb more junk DNA than fugu. In fact, humans are evolutionarily not that distant from either of those two fish species, which suggests that the human genome has even more junk DNA. That is the core of Gregory's argument, and nobody has refuted it so far except Francis Collins and the ENCODE clowns funded by him.
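The fugu/zebrafish arithmetic in the paragraph above, spelled out:

```python
# Genome-size comparison from the paragraph above: the ~1.3 Gb difference
# between zebrafish and fugu is the likely extra junk DNA.
fugu_bp = 390e6        # ~390 Mb
zebrafish_bp = 1.7e9   # ~1.7 Gb

extra_bp = zebrafish_bp - fugu_bp
print(round(extra_bp / 1e9, 2))  # 1.31, i.e. ~1.3 Gb of likely junk DNA
```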

In January, Francis Collins, the director of the National Institutes of Health, made a comment that revealed just how far the consensus has moved. At a health care conference in San Francisco, an audience member asked him about junk DNA. “We don’t use that term anymore,” Collins replied. “It was pretty much a case of hubris to imagine that we could dispense with any part of the genome — as if we knew enough to say it wasn’t functional.” Most of the DNA that scientists once thought was just taking up space in the genome, Collins said, “turns out to be doing stuff.”

Sadly, neither he nor the clowns he backs have provided any solid evidence of ‘turns out to be doing stuff’. They back away from hyped-up claims when challenged.

John Rinn is one such hype-star featured in the NY Times article. Between 2002 and 2007, my co-authors and I wrote a number of papers showing that noncoding regions of the genome were differentially expressed in several organisms under different conditions. Tiling arrays were our mode of measurement in those experiments. However, the main criticism was that those expressed regions were likely transcriptional noise, and one needed to demonstrate actual function to claim they were functional. I did that in 2006 in yeast for one noncoding RNA, and Sid Altman, who received a Nobel prize for discovering catalytic noncoding RNA, confirmed our work in a later paper. Still, our result showed the function of only one novel RNA, and we could not say anything about the remaining expressed regions.

Neither could John Rinn, our ex-collaborator. In 2009, he published a couple of papers showing the functionality of only one non-coding RNA in the human genome. That was not much of an achievement beyond ours, but he has an amazing talent for hyping things and saying what NHGRI likes to hear. His one functional RNA seemed to have ‘proved’ the functionality of the entire human genome, and now it is time to search for diseases there as well !!

Interestingly, this ‘scientific consensus’ has had its effect on the corresponding Wikipedia page as well. It is now full of contrasting scientific and pseudo-scientific claims, and you may feel like you are in the middle of a war zone. For example, check this paragraph written by a scientifically trained person –

The term “junk DNA” became popular in the 1960s.[26][27] It was formalized in 1972 by Susumu Ohno,[28] who noted that the mutational load from deleterious mutations placed an upper limit on the number of functional loci that could be expected given a typical mutation rate. Ohno predicted that mammal genomes could not have more than 30,000 loci under selection before the “cost” from the mutational load would cause an inescapable decline in fitness, and eventually extinction. This prediction remains robust, with the human genome containing approximately 20,000 genes.
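Ohno's cap on functional loci follows from simple genetic-load arithmetic. A back-of-envelope sketch, with illustrative rates that are my assumptions rather than Ohno's exact figures:

```python
# Back-of-envelope version of Ohno-style mutational-load reasoning.
# Both rates below are illustrative assumptions, not Ohno's published
# numbers: if each functional locus suffers deleterious mutations at
# rate u per generation, and the population can only tolerate a total
# deleterious rate of about U per genome per generation, the number of
# loci under selection is capped near U / u.

u = 1e-5   # assumed deleterious mutation rate per locus per generation
U = 0.3    # assumed tolerable deleterious mutations per genome per generation

max_functional_loci = U / u
print(round(max_functional_loci))  # 30000
```

With these assumed rates, the cap lands at the ~30,000-locus order of magnitude quoted in the paragraph above.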

You learn that a scientist named Ohno came up with the description ‘junk DNA’ and made a prediction that remains robust. Yet the introduction says –

Initially, a large proportion of noncoding DNA had no known biological function and was therefore sometimes referred to as “junk DNA”, particularly in the lay press. However, it has been known for decades[citation needed] that many noncoding sequences are functional.

That block makes absolutely no sense, because Ohno did not write for the ‘lay press’, and his claim that a large proportion of noncoding DNA is junk is not invalidated by many noncoding sequences being functional. In fact, tRNA was discovered in the 1960s, and Francis Crick, who hypothesized its existence, is also credited with coming up with the concept of junk DNA, as mentioned in the NY Times piece.

Faced with this paradox, Crick and other scientists developed a new vision of the genome during the 1970s. Instead of being overwhelmingly packed with coding DNA, the genome was made up mostly of noncoding DNA. And, what’s more, most of that noncoding DNA was junk — that is, pieces of DNA that do nothing for us.

Therefore, the nonsensical block in Wikipedia was clearly inserted by an ENCODE pseudo-scientist. There is no better evidence of that than the section it leads to –

The Encyclopedia of DNA Elements (ENCODE) project[3] suggested in September 2012 that over 80% of DNA in the human genome “serves some purpose, biochemically speaking”.[4]

Morals of the story –
(i) The emperor has no clothes, but you can afford to say so only if you are living in Canada, like T. Ryan Gregory.
(ii) Be careful about trusting ‘science’ discussed in Wikipedia.

Perfect Genome Assembly – BGI’s Solution

We came across a paper co-authored by members of Complete Genomics (bought by BGI) and BGI.

Next generation sequencing (NGS) technologies, primarily based on massively parallel sequencing, have touched and radically changed almost all aspects of research worldwide. These technologies have allowed for the rapid analysis, to date, of the genomes of more than 2,000 different species. In humans, NGS has arguably had the largest impact. Over 100,000 genomes of individual humans (based on various estimates) have been sequenced allowing for deep insights into what makes individuals and families unique and what causes disease in each of us. Despite all of this progress, the current state of the art in sequence technology is far from generating a “perfect genome” sequence and much remains to be understood in the biology of human and other organisms’ genomes. In the article that follows, we outline why the “perfect genome” in humans is important, what is lacking from current human whole genome sequences, and a potential strategy for achieving the “perfect genome” in a cost effective manner.

Strategy –

FIGURE 1. The concept of read co-barcoding for advanced whole genome sequencing (WGS). All four critical requirements are depicted. (1) A genomic library is prepared from long DNA (e.g., 30–300 kb) representing 10 or more cells. Multiple staggered long DNA fragments for each genomic region are generated as a result of random fragmenting during cell lysis (three fragments depicted under each parental chromosome). In the co-barcoded read libraries these redundant long fragments allow variant phasing, a more accurate assembly of the genome, and ultimately de novo assembly. In this example a pair of long proximate repeat elements, longer than the read and mate-pair length, is shown by the large gray boxes. A and C denote single base differences between copies of these repeat elements. Long, overlapping, staggered genomic fragments allow for the proper placement of these repeats in the final assembly by exclusive linking of repeat members to surrounding unique sequences provided by the long DNA fragments that start or end between repeats. (2) Sequence reads generated from each long fragment (i.e., subfragments used to produce these reads) are tagged (small colored curved lines) with the same barcode (co-barcoded). There are many (usually 10s–100s) of reads per long DNA fragment, most if not all having the same barcode. Reads belonging to related (i.e., overlapped) long fragments mostly have different barcodes. Consequently maternal (red) and paternal (blue) fragments for a genomic region have different barcodes as indicated by the distinct barcode numbers (253, 112, and X for mom, 27, 367, and Y for dad). After MPS, barcodes are used to aggregate reads from the original long fragment. Such read aggregation, even without sequence assembly per long fragment, provides information for variant phasing and repeat resolving when reads from overlapping long fragments, representing the same chromosome, are used together in the assembly process. 
(3) Sequence reads must cover >30% and preferably the majority of bases in each long fragment. Consecutive continuous reads (depicted here) or overlapping mate-pair reads (two shorter reads from the ends of the same subfragment) can provide the needed coverage. Sequencing the majority of bases of each fragment with co-barcoded reads links alleles in haplotypes as, on average, 10 or more heterozygous sites occur per long DNA fragment. (4) The read or mate-pair length is longer than the frequent dispersed repeats (e.g., Alu, depicted by the small gray boxes) and are correctly assembled primarily using read level data.
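Step (2) of the figure caption, aggregating reads by their shared barcode so that each group approximates one long parental fragment, can be sketched in a few lines. The read sequences and barcode numbers below are made up for illustration (the 253/27 values echo the caption's mom/dad examples).

```python
# Toy sketch of read co-barcoding, step (2) above: reads tagged with the
# same barcode came from the same long DNA fragment, so grouping by
# barcode recovers fragment-level read sets for phasing and assembly.
# Sequences and barcodes are invented for illustration.
from collections import defaultdict

reads = [
    ("ACGTAC", 253),  # from a maternal long fragment
    ("GTACGG", 253),
    ("TTGCAA", 27),   # from a paternal long fragment
    ("GCAATT", 27),
    ("ACGTAA", 112),  # from another maternal long fragment
]

def aggregate_by_barcode(reads):
    """Group read sequences by their shared barcode."""
    groups = defaultdict(list)
    for seq, barcode in reads:
        groups[barcode].append(seq)
    return dict(groups)

groups = aggregate_by_barcode(reads)
print(sorted(groups))  # [27, 112, 253]
print(groups[253])     # ['ACGTAC', 'GTACGG']
```

Because overlapping maternal and paternal fragments carry different barcodes, these per-barcode groups are what let the assembler phase variants and place repeats.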

Cost mentioned in the paper – $200/genome