
For those Running Core Facilities

[A quick note - We edited yesterday's commentary and added two 'rules' to simplify bioinformatic analysis. If you follow the rules as a pre-processing step, you will not have to think about strandedness of the reads in the rest of the analysis.]

Over the last few months, we discussed topics ranging from transcriptomics and assembly algorithms to distributed computing frameworks such as Hadoop and color space data. Today we would like to step back and survey the entire forest of NGS bioinformatics. Although the headline mentions those running core facilities, the following discussion should be helpful for anyone planning and performing analysis of NGS data. It is offered mostly as food for thought, not as a set of definitive answers. Please feel free to share your experience in the comment section.

Choice of Sequencing Technology

Often the choice of sequencing technology (454, Illumina, ABI SOLiD, etc.) is made solely on the amount of sequence produced and the cost per base. In our opinion, four factors should be taken into consideration -

(i) Sequencing volume and cost per base,

(ii) Length of individual reads (after removing low quality tails),

(iii) Maximum possible insert size for mate-paired reads,

(iv) Quality of reads.

Regarding the second factor, people often compare 75nt reads with 65nt reads, but if stretching the technology for higher read length results in many erroneous bases (typically near the 3′ end), it may be better to stick with shorter reads.
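To make the idea of "length after removing low quality tails" concrete, here is a minimal sketch of 3′ tail trimming. The function name `trim_low_quality_tail`, the Phred-style quality scores and the threshold of 20 are our assumptions for illustration, not settings of any particular instrument or tool.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Drop bases from the 3' end until one reaches min_q (assumed threshold)."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk backwards from the 3' end while the quality stays below the cutoff.
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical 10nt read whose last three bases are of poor quality.
read = "ACGTACGTAC"
quals = [38, 37, 36, 35, 30, 28, 25, 12, 9, 5]
trimmed_seq, trimmed_quals = trim_low_quality_tail(read, quals)
print(len(read), "->", len(trimmed_seq))  # 10 -> 7
```

Comparing technologies on the trimmed length rather than the nominal length is what factor (ii) asks for. Real trimmers use more careful rules, such as sliding windows, so treat this only as an illustration.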

Factor (iii) is especially important for assembling reads. After the small contigs from a library are assembled locally, their placement on larger scaffolds depends on the distances between mate pairs. Therefore, it is more cost-effective to push for larger distances between read pairs than to use technologies with longer reads.

A fifth factor, availability of software tools, is often mentioned in the choice of sequencing technology. This issue becomes important for color space data from ABI SOLiD, because most new programs are written and tested for nucleotide space data before the color space versions become available. We believe that if a new technology wins on (i)-(iv), the software gap gets filled within a reasonable time. In fact, a technology superior on (i)-(iv) eventually becomes the platform of choice for new developers, and those with other technologies need to catch up. Our biggest complaint about SOLiD data is with (iv). Those machines churn out an order of magnitude more data than a typical Illumina instrument, but the reads contain a significant amount of ‘color space noise’ and other errors. It is true that ‘color space noise’ can be cleaned and ignored for mapping analysis, but it adds a layer of complexity for assembly-type problems, where correct and incorrect bases are not known a priori.
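A small sketch may help show why color space errors are awkward when no reference is available. It assumes the usual two-base encoding, in which each color is the XOR of the indices of two adjacent bases (A=0, C=1, G=2, T=3). The helper names `encode_colors` and `decode_colors` and the example sequence are ours, chosen only for illustration.

```python
BASES = "ACGT"  # assumed index order: A=0, C=1, G=2, T=3


def encode_colors(seq):
    """Return the first base plus one color per adjacent base pair."""
    idx = [BASES.index(b) for b in seq]
    return seq[0], [a ^ b for a, b in zip(idx, idx[1:])]


def decode_colors(first_base, colors):
    """Naively translate colors back to bases, starting from the first base."""
    cur = BASES.index(first_base)
    out = [first_base]
    for c in colors:
        cur ^= c  # each base depends on every color before it
        out.append(BASES[cur])
    return "".join(out)


true_seq = "ACGGTCA"  # hypothetical read
first, colors = encode_colors(true_seq)
noisy = list(colors)
noisy[2] ^= 1  # a single miscalled color in the middle of the read

print(decode_colors(first, colors))  # ACGGTCA
print(decode_colors(first, noisy))   # ACGTGAC
```

With naive decoding, one wrong color changes every base downstream of it. A mapper can compare colors against a reference and spot the isolated mismatch, but an assembler has no such reference to lean on.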

Choice of computer hardware

Two types of analysis are done on NGS data: mapping (if a reference genome exists) and assembly (if a reference genome does not exist). The assembly problem is more challenging and requires more powerful computers.

The simplest and most expensive solution is to buy a computer with a large amount of RAM. This solution does not scale well for mammalian-sized genomes, so researchers look at two alternatives -

(i) Torque-based distributed systems (distributed computing over many machines sharing one file system),
(ii) Hadoop-based distributed systems (distributed computing over many machines with distributed file system).

A third option being pursued by the community is to design better algorithms (Bowtie, String Graph Assembler, etc.), so that more computing can be done with the same amount of RAM. From a strictly technical viewpoint, those better algorithms can be scaled further using distributed computing (Torque or Hadoop). From a practical viewpoint, whether code gets written for a given computing architecture depends primarily on the interests of the developer working on the algorithm. Not everyone is an expert in every technology, and a developer can do only so much in limited time, given that coming up with a new algorithm is itself a Herculean task. Color-space data adds yet another layer of complexity.
In this space, we are trying to provide information so that these technology gaps can be closed more easily.

Network bandwidth and Storage

The network needs to be designed properly if we plan to move gigabytes to terabytes of data between computers. When the transfer is between institutions or organizations, it often seems that no network is fast enough, and people resort to low-tech solutions such as simply mailing disks through the postal service.
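A rough back-of-the-envelope calculation shows why mailing disks remains attractive. The sketch below estimates transfer time from data size and link speed. The function name `transfer_hours`, the 5 TB data set, the link speeds and the 70% efficiency factor are all assumptions chosen for illustration, not measurements.

```python
def transfer_hours(size_tb, link_mbps, efficiency=0.7):
    """Estimate hours needed to move size_tb terabytes over a link.

    link_mbps is the nominal link speed in megabits per second, and
    efficiency is an assumed fraction of that speed achieved in practice.
    """
    bits = size_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


# Hypothetical 5 TB data set over three nominal link speeds.
for mbps in (100, 1000, 10000):
    print(f"{mbps:>6} Mbit/s: {transfer_hours(5, mbps):8.1f} hours")
```

If the estimate for a shared inter-institution link runs into days, an overnight courier may well win.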

Storage, and especially backup storage, is another important factor. We suggest storing copies of primary data, code and important secondary (derived) files off-site to guard against natural calamities and machine failure. Earthquakes do not happen very frequently, but hard-drive crashes and file corruption are a way of life for anyone dealing with large volumes of data.
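One simple guard against silent file corruption is to compare checksums of the primary and backup copies. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of` and `copies_match` are ours, and the choice of SHA-256 is an assumption, since any strong hash would serve.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary_path, backup_path):
    """Return True when the two files have identical contents by checksum."""
    return sha256_of(primary_path) == sha256_of(backup_path)
```

Calling `copies_match` on a primary file and its off-site copy, for example after each backup run, would flag a corrupted copy before the original is deleted. Recording the digests alongside the data also allows later checks without re-reading both copies.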

We will finish the above discussion by covering the following three topics in the next commentary -

Algorithms for Low Level Analysis

Algorithms for High Level Analysis

Presenting and Accessing the Data


5 comments to For those Running Core Facilities

  • Rick Westerman

    I would emphasize the need for speed and robustness of the storage disks (SAN). Or at least the working disks. I happen to have access to 7.5 TB of very expensive disk space (about $2000/TB when we purchased it a couple of years ago) and would not give it up even for a more powerful computer. I can hit that SAN disk with multiple programs using multiple datasets and it does not even blink at handling them all. File corruption and hard-drive crashes are non-issues.

    Of course off-site storage and long-term storage are required but they do not need to be that expensive nor that fast.

    We will probably go to a Hadoop system eventually but only if it is equally fast and robust. Given the size of NGS data putting it on a slow disk is just asking for slowness throughout the pipeline.

  • Rick Westerman

    I am going to disagree with the recommendation to go with a cluster and instead suggest buying the high memory machine. Clusters are nice and you may not need the large memory that often but when you do (e.g., for Trinity) then there are few alternatives to get around it. And high memory does not need to cost very much. For $16,500 USA I can purchase (university pricing)

    One 512-GB, 48 node machine
    or
    Two and a bit 192-GB, 48 node machines
    or
    Three 96-GB, 48 node machines

    While the latter two options give me more nodes (96+ and 144) I am willing to run with fewer nodes just to have that memory available. If nothing else it can be used for disk caching.

  • admin

    All good points. Thanks Rick.

    “We will probably go to a Hadoop system eventually but only if it is equally fast and robust.”

    I think we are mixing two different issues here. Hadoop distributed system is primarily for computational efficiency (amount of computation/price), not for storage. For example, Amazon cloud provides Amazon S3 service, a highly robust system for storage of data (“Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year, Designed to sustain the concurrent loss of data in two facilities”).

    They also give users a Hadoop-based system for computations, but the results of those computations get saved into S3. Hadoop is only a transient infrastructure for loading data from S3, running the computation and sending output back to S3.

  • Rick Westerman

    Yes, I made a mistake in my original post. We are in the process of going to a Lustre system instead of our BlueArc system for fast storage. Got my terms mixed up!

    The sysadmins are promising a tiered system later this year that will provide slow but large storage, fast storage and blindingly fast storage all being transparent to each other. At the moment I have to copy the resultant files from our fast BlueArc and Lustre systems to our much slower storage disks and then copy these to backup tapes. A pain!

    I haven’t used Amazon cloud nor S3 yet. In part because of the computational resources I have and in part because they provide “only” 64 GB (the last time I looked). I think that Amazon could be useful for many people who are working on, and paying for, a single project.

  • Really informative article. Thanks again.
