Archives

The Genome Assembly Process and How Pac Bio can Help

After playing with the Pac Bio data for a while, we figured out how to make the best use of it in our assembly pipeline. I am posting something I wrote for a collaborator, in case it helps others. I would also recommend going through the links at the bottom of my previous commentary. The slides from Michael Schatz are particularly helpful.

To understand the assembly process, you need to visualize a eukaryotic genome. The simplest picture consists of a linear chain of unique regions joined by repeat/low-complexity regions. Unique regions appear only once in the genome. Repeat regions appear many times in the genome. We can add other complications, such as haplotype differences, to this simple picture later, but they are secondary for most genomes.

The assembly process has three steps -

i) Assemble all pieces of unique regions. These are called contigs.
ii) Connect them up. The end product is called a scaffold.
iii) Fill in the gaps in between with repetitive sequences.

The first step is relatively easy with Sanger reads, but gets complicated with Illumina reads. It is generally accepted as the most difficult step in a short-read assembly process. One needs to be careful about an aggressive assembler that connects pieces which should not be connected. The possibility of making this kind of error goes up when the reads are short.

Pacific Bio Sequences

Many apologies to regular readers for not updating our blog. We got bogged down with two major tasks -

i) We got a new batch of Pacific Bio data and are trying to digest it,

ii) Our electrical engineering bug took control of our brain as we got into studying FPGA/embedded systems for use in high-throughput bioinformatics.

In this post, we will add some information on Pacific Bio sequences that may help fellow passengers.

Pacific Bio sequences fall at the other extreme from Illumina Solexa data. The reads are very long (~5 kb), but they are also very noisy. Typically reads have 85% accuracy, which means about 15 errors in every 100 nucleotides on average. Eeeks !!
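
To get a feel for what 85% accuracy means for a ~5 kb read, here is a small back-of-the-envelope sketch. It assumes errors are independent and equally likely at every base, which is our simplification and not something the sequencer guarantees; the function names are ours.

```python
# Back-of-the-envelope numbers for a noisy long read.
# Assumption (ours): errors are independent and equally likely at every base.

def expected_errors(read_length, accuracy):
    """Average number of erroneous bases in a read of the given length."""
    return read_length * (1.0 - accuracy)


def error_free_probability(k, accuracy):
    """Chance that a stretch of k consecutive bases contains no error."""
    return accuracy ** k


if __name__ == "__main__":
    read_length = 5000  # ~5 kb read, as quoted above
    accuracy = 0.85     # per-base accuracy, as quoted above
    print("expected errors per read:", expected_errors(read_length, accuracy))
    for k in (15, 25, 31):
        p = error_free_probability(k, accuracy)
        print("P(error-free %d-mer) = %.4f" % (k, p))
```

Under these assumptions a 5 kb read carries roughly 750 errors, and fewer than 2 in 100 stretches of 25 bases come out clean, which is why tools built around exact k-mer matches struggle with the raw reads.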

We are just beginning, but let me walk through the experience so far step by step. You will receive the following files from the sequencer (X is some arbitrary file name) -

Top_folder:
    X.mcd.hd5
    X.metadata.xml
    Analysis_Results:
        X-02.log
        X-03.log
        X-04.log
        X.bas.h5
        X.ccs.fasta
        X.ccs.fastq
        X.fasta
        X.fastq
        X.sts.csv
        X.sts.xml

What are the files?
The top folder contains raw data in HDF5 format, which is PacBio’s native data format. The sequencer also performs a preliminary analysis to generate some results, and those results are stored in the Analysis_Results folder.

One important file in the Analysis_Results folder is X.ccs.fasta. This is the one you need to look into first to gain some confidence in the generated data. It contains the cleanest assembled reads from the sequencing run. If you have a reference genome, you need to first align the ccs reads to your reference genome. If that step fails, something went wrong with the sequencing run.

In our case, the size of the ccs file is only 1% of the raw fasta sequence file. So, a lot more sequences have not been included in the ccs file, and to use them properly, you need the SMRT software from PacBio. You will also find the PacBioToCA tools useful.
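
If you want to check the ccs-to-raw ratio for your own run, a few lines of Python are enough. This is only a rough sketch: it assumes plain FASTA files such as X.ccs.fasta and X.fasta from the listing above, and the function names are ours.

```python
import sys


def fasta_stats(lines):
    """Return (number_of_reads, total_bases) for FASTA-formatted lines."""
    n_reads = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_reads += 1  # every header line starts a new read
        else:
            total_bases += len(line)  # sequence may span several lines
    return n_reads, total_bases


def compare(ccs_path, raw_path):
    """Print read and base counts for the ccs file relative to the raw file."""
    with open(ccs_path) as ccs, open(raw_path) as raw:
        ccs_reads, ccs_bases = fasta_stats(ccs)
        raw_reads, raw_bases = fasta_stats(raw)
    print("ccs: %d reads, %d bases" % (ccs_reads, ccs_bases))
    print("raw: %d reads, %d bases" % (raw_reads, raw_bases))
    if raw_bases:
        print("ccs/raw bases: %.2f%%" % (100.0 * ccs_bases / raw_bases))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python fasta_stats.py X.ccs.fasta X.fasta
    compare(sys.argv[1], sys.argv[2])
```

Counting bases rather than bytes on disk avoids being misled by long header lines.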

Ideally one would like to run a de novo assembly with Pac Bio data alone, but that is very hard because of the high error rate of the raw sequences. So, researchers are looking into two other possibilities -

i) Doing de novo assembly using Illumina or other short read technologies, and using PacBio to do scaffolding/extending,

ii) Running an error-correction routine on Pac Bio data using Illumina sequences, and then using those error-corrected reads for assembly. This is where PacBioToCA can help. For example, you can use Velvet to generate contigs from Illumina data, run error correction on the Pac Bio reads, and then use the error-corrected Pac Bio reads to improve the Velvet assembly.

You may find the following resources handy for making further progress-

i) Michael Schatz’s talk on SMRT assembly,

ii) Pac Bio’s webpage on how they assembled E. coli sequence from their reads,

iii) A very helpful Seqanswers thread,

iv) Pacbio’s slides on SMRT technology.

v) A closer look at Pac Bio sequence dataset.

Enjoy !!

The Best Way to Understand Algorithms

Reader KC asked:

I also tried to learn alignment algorithms, and I read some papers from Nature Reviews. But I was still confused after reading those papers.

I couldn’t learn alignment algorithms as good as you. You know the development of alignment algorithms, and you know intuitive feeling of the algorithm development. I am a lot interested in how your learn, how you find the ideas behind alignment algorithms.

Thanks Kevin for the compliment !!

The best way to understand the algorithms is to use the Feynman approach, which is to stop reading papers and try very hard to design your own algorithm. Imagine there is nothing out there and you are the first person to solve the problem.

Maybe you will find something really clever or maybe you will not, but the process will orient your brain in such a way that when you go back to read papers, you will appreciate the cleverness of other solutions very quickly.

FPGA-based Hardware Accelerators for NGS Analysis

Advances in next-generation sequencing generate so much data that researchers are looking into all types of technological solutions for rapid analysis. So far we have mostly covered software-based algorithmic approaches. However, a speed-up of several orders of magnitude can be achieved by designing custom hardware for specific tasks.

What is a hardware-based approach? In our day-to-day life as bioinformaticians, we mostly deal with computers as black boxes. We send commands to that black box using a keyboard and the computer returns answers through the monitor. Rarely do we open up the box to see what is inside or try to tweak its performance. In fact, for work on cloud computers, we do not even know where the box resides.

In a hardware-based approach, the things inside the box are redesigned to solve problems in bioinformatics. More specifically, a special card is designed for BLAST search, short-read search or another problem. Conceptually, it is similar to the wireless card or ethernet card that we plug into computers as add-ons. The process involves -
(i) opening the box to plug the card in,
(ii) adding a software driver to the operating system,
(iii) writing computer code to use the card.

Designing the card itself is the real challenge. There are two options -
(i) using FPGAs, which are reprogrammable computer chips. These chips can be programmed for various tasks. FPGAs are inexpensive and can give a 10 to 100x increase in speed compared to software solutions.
(ii) designing custom chips for the tasks and manufacturing them. Custom chips can give another 100x increase in speed compared to FPGAs, but manufacturing them is expensive.

We tend to focus too much on speed, but hardware-based solutions are very useful for another reason. They can solve the same problems using only a fraction of the power needed by a software-based solution running on a large server. The other day, we were going through the specs of a Dell server and one of us commented – ‘hey, this computer will use the same amount of power as my entire house, even when it is idle’. The cost of electricity is a big concern for companies with many servers, so much so that they even end up buying huge wind farms. So, low energy use is as important a metric for the choice of technology as the increase in speed.

FPGA-based solutions are being tried by various groups. The following two reports are helpful if you want to explore this topic further.

Short-Read DNA Sequence Alignment with Custom Designed FPGA-based Hardware

Hardware Acceleration of Short Read Mapping

Economic Recovery is on the Back of Students

Our long-time readers know that we make brief e-conomic commentaries from time to time to distract you from serious issues in bioinformatics. Our e-con section had been empty for a while, because we got too bogged down in our own research.

Recently we came across a chart that is very instructive about current conditions in the USA, and thought it would be worth sharing. The chart shows that, after excluding student loans, the USA had not seen any economic recovery over the last 4 years.


Burrows Wheeler Transform in Animation

For new readers, the easiest way to follow us is through our twitter feed. The feed is updated whenever we post a commentary here.

The Burrows Wheeler Transform is an important component of alignment algorithms like Bowtie and BWA, and of assembly algorithms such as String Graph Assembler.

A few months back, we explained how the Burrows Wheeler Transform works through an example. This time we decided to present the same information as an animation. Please check the following links. The word in the red box represents the Burrows Wheeler transform of the original text.

Link for HOMOLOG.US$ animation

Link for TATATATATA$ animation

The second animation was described in text form in this earlier commentary.
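
For readers who prefer code to animation, the transform of the two example texts can be reproduced with a short script. This is a deliberately naive sketch that sorts all rotations of the text, fine for short words but not how Bowtie or BWA build their indexes. It relies on '$' sorting before every other character, which holds for the ASCII letters and the '.' used here. The function names are ours.

```python
def bwt(text):
    """Burrows Wheeler transform by sorting all rotations of the text.

    The text must end with a single '$' marker.
    """
    if text.count("$") != 1 or not text.endswith("$"):
        raise ValueError("text must end with exactly one '$'")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then sort the rows again.
        table = sorted(last_column[i] + table[i] for i in range(n))
    for row in table:
        if row.endswith("$"):
            return row
    raise ValueError("no '$' marker found")


if __name__ == "__main__":
    for word in ("HOMOLOG.US$", "TATATATATA$"):
        transformed = bwt(word)
        print(word, "->", transformed, "->", inverse_bwt(transformed))
```

For TATATATATA$ the script prints ATTTTTAAAA$. Notice how the repetitive text turns into long runs of the same letter, which is what makes the transform attractive for compression and indexing.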

Please tell us what you think. If you like them, there will be more.

Is 1000 Genome Project an Example of Modern Day Alchemy?

Even though alchemy has a bad name now, it was a respectable practice for centuries. Alchemists were primarily inspired by one goal – finding a method to turn base metals into gold and silver. Why gold and silver? Because that is where the money is.

In the end, the only method to turn lead into gold was found by studying something far different from gold. During the early part of the 20th century, physicists developed quantum theory by modeling the hydrogen atom, which consists of only one electron and one proton. If funding agencies of the era had restricted physicists to studying primarily gold, silver and platinum, we would not have had the periodic table, semiconductors, electronic gadgets and the Watson-Crick structure.

As we sift through volumes of transcriptomic data, one pattern is very clear. Researchers are generating far more data on humans than on anything else. This trend makes perfect sense from a commercial point of view, because human disease is where the money lies. The 1000 genome project extends this philosophy one step further by sequencing not even transcriptomes, but the genomes of hundreds of individuals.

Are we missing the big picture and looking too close to where the money is?

Genome Sequencing Isn’t Predictive of Most Diseases, Study Says

Sequencing the genomes of patients to reveal what ailments might mar their futures isn’t the best predictor for the most common diseases, according to a study involving thousands of identical twins.

Researchers found that most people would get negative results from having their genome sequenced for all but one of 24 identified conditions that includes heart disease, diabetes and Alzheimer’s. While the process can help spot many rare genetic disorders, it doesn’t appear to be a good predictor of who will suffer from the majority of illnesses, the authors wrote.

Digital Normalization from C. Titus Brown

From time to time, we have referenced commentaries from Professor C. Titus Brown of Michigan State University, who writes an informative blog on next-gen sequencing algorithms. Today I received a ping from him in our comment section about ‘digital normalization’, a data reduction method that their group is ready to publish. This approach is important for anyone trying to cope with a large volume of next-gen data and wanting to reduce the data without losing any useful information.

Here is the abstract of their paper -

Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes enable the sensitive investigation of a wide range of biological phenomena. However, it is increasingly difficult to deal with the volume of data emerging from deep short-read sequencers, in part because of random and systematic sampling variation as well as a high sequencing error rate. These challenges have led to the development of entire new classes of short-read mapping tools, as well as new de novo assemblers. Even newer assembly strategies for dealing with transcriptomes, single-cell genomes, and metagenomes have also emerged. Despite these advances, algorithms and compute capacity continue to be challenged by the continued improvements in sequencing technology throughput. We here describe an approach we term digital normalization, a single-pass computational algorithm that discards redundant data and both sampling variation and the number of errors present in deep sequencing data sets. Digital normalization substantially reduces the size of data sets and accordingly decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. In doing so, it converts high random coverage to low systematic coverage. Digital normalization is an effective and efficient approach to normalizing coverage, removing errors, and reducing data set size for shotgun sequencing data sets. It is particularly useful for reducing the compute requirements for de novo sequence assembly. We demonstrate this for the assembly of microbial genomes, amplified single-cell genomic data, and transcriptomic data. The software is freely available for use and modification.

The paper is actually published as far as I am concerned, because it is available from arxiv.org and his website. Titus has been writing about his work in bits and pieces in his blog over the last 18 months. So, regular readers of his blog are 12-18 months ahead of journal readers about his cutting-edge research.
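
The abstract does not spell out the algorithm. As we understand the approach, each read is examined once: if the median abundance of its k-mers among the reads kept so far is already at or above a cutoff, the read is treated as redundant and discarded; otherwise it is kept and its k-mers are counted. The sketch below illustrates that idea only. The plain dictionary counter, the function names and the default values of k and cutoff are our assumptions for illustration, not the group's software, which would need a far more memory-efficient counter for real data.

```python
from statistics import median


def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Single pass over reads, keeping those not yet covered 'cutoff' deep.

    k and cutoff are illustrative defaults (our assumption).
    """
    counts = {}  # k-mer -> times seen in reads kept so far
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k, nothing to estimate from
        # Estimate the coverage of this read from the k-mers seen so far.
        coverage = median(counts.get(word, 0) for word in words)
        if coverage < cutoff:
            kept.append(read)
            for word in words:
                counts[word] = counts.get(word, 0) + 1
    return kept


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"] * 2
    kept = digital_normalize(reads, k=4, cutoff=3)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

In the toy run, the read that appears ten times is kept only until its k-mers reach the cutoff, while the rare read survives untouched. Using the median rather than the mean keeps a few erroneous k-mers from dragging the coverage estimate of a read down.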

Can Trinity be Used for Genome Assembly?

A reader asked by private email whether Trinity can be used for genome assembly. With his permission, we are posting his question (edited to remove personal details) and our response here.

We should mention that his ‘newbie’ description caught us a bit off-guard at first. Based on his follow-up question, it appeared that he had a well-reasoned argument for trying the approach he mentions. A few months back, we used the same rationale to run Trinity on one of our genomic libraries.

Question:

I am a newbie in transcriptome and genome assembly and analysis. While browsing regarding the genome assembly i found your Homologous blog with many interesting topics and clear explanations.

I have a query regarding the usage of trinity and please spare me if i am not able to put it clearly.

I have genome data from Illumina platform of around ~8 Million PE reads and of 125bp length. I am supposed to assemble the reads for the further analysis. Since, genome and transcriptome assembly are not same, should not i use the RNA-seq assembler Trinity for making the DNA transcripts and remove the so called “alternative isoforms” from the Trinity.fasta and make use of contig file?

Would be thankful to your answer.

Using Bloom Filter - A Simple Introduction for Bioinformaticians

Among the various techniques for computing k-mer distributions, the Bloom filter method is the most memory efficient. We set out to write an article about Bloom filters, but found several excellent sources online. Please follow the links at the bottom of the page to find them. Instead, in this space, we shall give you a very simple analogy to explain the positives and negatives of the Bloom filter.

Suppose you are writing a paper and want to include references to other papers in your article. There is no point in including title, authors, year, journal and every other detail about every reference. You like your references to be short, yet informative. One common method is to use the last name of the first author and the year of publication. This is what the ‘hashing’ approach does.

At times, you encounter two papers by two authors with the last name Smith, published in the same year 2002. The hashing method will try to break the tie by including the first name, such as Smith, John 2002 and Smith, Adams 2002. The Bloom filter method, instead, will not break the tie and will refer to both of them as Smith, 2002.
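
The ‘Smith, 2002’ behaviour can be seen in a toy implementation. This is a minimal sketch and not the code of any particular k-mer counter: the class name, the choice of SHA-256 to derive bit positions and the default sizes are our assumptions. The filter stores only bits, never the items themselves, so two different items that land on the same bits cannot be told apart.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus a few hash functions."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash with a seed.
        for seed in range(self.n_hashes):
            data = ("%d:%s" % (seed, item)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    for kmer in ("ACGT", "CGTA", "GTAC"):
        seen.add(kmer)
    print("ACGT" in seen)  # an added k-mer is always reported as present

    # A filter that is far too small cannot break ties at all.
    tiny = BloomFilter(n_bits=1, n_hashes=1)
    tiny.add("Smith, John 2002")
    print("Smith, Adams 2002" in tiny)  # reported present, a false positive
```

The price of the small memory footprint is the occasional false positive. A false negative cannot happen, because adding an item only ever turns bits on.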