We are putting together an introductory book to help biologists and bioinformaticians (the terms are increasingly synonymous) understand what goes on inside RNAseq and genome assembly programs. It is built on top of our popular ‘De Bruijn Graphs for NGS Assembly’ tutorials and several other blog posts. Following the same style as the tutorials, we focus on the algorithms and the big picture, not just how to run programs.
In addition to presenting the basic concepts as in the tutorials, we are adding an entire chapter to delve into the latest cutting-edge algorithms and explain them in simple language. The goal is to help you understand conceptually where those new algorithms fit in the big picture and why they deliver 10x-100x speedups or improve assembly quality.
Most of the presentation does not discuss any specific program, but in the last chapter, we go through three commonly used programs (SOAPdenovo2, SPAdes and Trinity), explain how to run them and discuss their code and algorithms. In addition, we mention a number of other commonly used programs and explain what they do.
We are also developing small software modules to explain some of the concepts better. They will be available from our website.
Is assembly an obscure topic most biologists should stay away from?
We do not see how biologists can do any bioinformatics without a rudimentary sense of the assembly algorithms, an observation based on interactions with collaborators and their students working on RNAseq data. They all realize that by being able to assemble short-read RNAseq libraries, they can explore many different organisms apart from the handful with high-quality genome sequences. However, interpreting the Trinity or Oases assemblies is a big challenge for them. From time to time, we receive questions about why certain unrelated genes got fused together, what to do with 100 splice forms of a gene, or whether to trust the Trinity, Oases or SOAPdenovo-trans result for a gene. It is impossible to explain why a particular result can be an assembly artifact without drawing the graph structure from the intermediate steps of the assembly.
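Incidentally, the gene-fusion artifact is easy to demonstrate on paper. Below is a minimal sketch of our own (toy sequences and k = 4, chosen purely for illustration; no real assembler works on data this small): two unrelated transcripts that happen to share a k-mer create a branch point in the de Bruijn graph, and a walk through the graph can cross from one transcript into the other, emitting a chimeric contig.

```python
# Toy demonstration: a shared k-mer lets a de Bruijn graph walk cross
# between two unrelated transcripts, which is how fused contigs arise.
from collections import defaultdict

def add_edges(graph, seq, k):
    """Add (k-1)-mer -> (k-1)-mer edges for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])

k = 4
transcript_a = "ATGGCTTACG"   # toy transcript 1
transcript_b = "CCGCTTAGTT"   # toy transcript 2; shares the k-mer "CTTA"
graph = defaultdict(set)
add_edges(graph, transcript_a, k)
add_edges(graph, transcript_b, k)

# Node "TTA" now has two successors -- one from each transcript -- so a
# walk entering via transcript A can exit into transcript B (a chimera).
print(sorted(graph["TTA"]))   # → ['TAC', 'TAG']
```

Real assemblers use coverage and read-pair information to avoid crossing such branches; when those heuristics fail, a fused transcript is the result.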
Timeline and Request Page
The book will be available in electronic form in 2-3 months. If you find the topics interesting, please sign up here and we will keep you posted on our progress. You can also give us feedback on the choice of topics and suggest something we may have missed.
We would definitely like to develop a paper version if there is enough interest from the readers. Printing involves additional costs, but we will make sure that those who buy the electronic version only pay the difference between the paper price and the electronic price to get the paper copy.
Readers interested in doing genome assembly with low memory will find the following two github repositories useful. They are related to Christina Boucher et al.’s recent ‘Variable-Order de Bruijn Graphs’ paper, but research along the same line started long before that. For example, check our 2012 post ‘Succinct de Bruijn Graphs from Tetsuo Shibuya’s Group’ and Alex Bowe’s ‘Succinct de Bruijn Graphs’.
1. Succinct Data Structure Library 2.0 (simongog/sdsl-lite) can be accessed here.
SDSL – Succinct Data Structure Library
What is it?
The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting the operations of the original object efficiently. The theoretical time complexity of operations performed on the classical data structure and on the equivalent succinct data structure is (most of the time) identical.
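To give a feel for what ‘supporting operations efficiently’ means, here is a toy sketch of the rank operation (our own, for illustration only; SDSL’s rank_support structures use a cleverer two-level layout with roughly 25% space overhead): rank1(i) returns the number of 1-bits in B[0..i), answered from a small precomputed table plus a bounded scan.

```python
# Conceptual sketch of rank on a bitvector: precompute per-block counts
# so a query costs one table lookup plus a scan inside a single block.
B = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
BLOCK = 4

# block_rank[b] = number of 1-bits strictly before block b.
block_rank = [0]
for b in range(0, len(B), BLOCK):
    block_rank.append(block_rank[-1] + sum(B[b:b + BLOCK]))

def rank1(i):
    """Ones in B[0..i): one table lookup plus a bounded in-block scan."""
    b, off = divmod(i, BLOCK)
    return block_rank[b] + sum(B[b * BLOCK:b * BLOCK + off])

print(rank1(6))   # → 3 ones among the first six bits
```

The extra table costs far less space than the bitvector itself (one counter per block), which is the succinct-structure trade-off in miniature.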
2. Cosmo (alexbowe/cosmo) is a fast, low-memory DNA assembler using a succinct (variable order) de Bruijn graph and can be accessed here.
Programmers often use ‘compare and swap’ or other hardware-provided options to get atomicity, but another alternative is to use software transactional memory. Please check the following link for a good introduction.
The Basics of Software Transactional Memory
As I see it, STM is based on two fundamental concepts:
Optimism. In software terms, by optimism, we mean that we’re going to plow ahead and assume that there aren’t any errors; when we’re done, we’ll check if there was a problem, and clean up if necessary. A good example of this from my own background is source code control systems. In the older systems like RCS, you’d lock a source file before you edited it; then you’d make your changes, and check them in, and release the lock. That way, you know that you’re never going to have two people making changes to the same file at the same time. But the downside is that you end up with lots of people sitting around doing nothing, waiting to get the lock on the file they want to change. Odds are, they weren’t going to change the same part of the file as the guy who has the lock. But in order to make sure that they can’t, the locks also block a lot of real work. Eventually, the optimistic systems came along, and what they did was say: “go ahead and edit the file all you want. But before I let you check in (save) the edited file to the shared repository, I’m going to make sure that no one changed it in a way that will mess things up. If I find out that someone did, then you’re going to have to fix it.”
Transactions. A transaction is a concept from (among other places) databases. In a database, you often make a collection of changes that are, conceptually, all part of the same update. By making them a transaction, they become one atomic block – and either the entire collection succeeds, or the entire collection fails. Transactions guarantee that you’ll never end up in a situation where half of the changes in an update got written to the database, and the other half didn’t.
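The two ideas combine into a short commit-time validation loop. The sketch below is our own toy model, assuming a single transactional variable guarded by a version counter; a real STM (GCC’s libitm, Haskell’s STM) tracks full read and write sets across many memory locations, but the optimistic read/validate/retry skeleton is the same.

```python
# Toy STM-style optimistic update: read a snapshot, compute, then commit
# only if nobody else committed in between; otherwise roll back and retry.
import threading

class TVar:
    """One transactional variable guarded by a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.commit_lock = threading.Lock()

def atomically(tvar, update):
    """Retry `update` until it commits against an unchanged snapshot."""
    while True:
        snapshot_version = tvar.version        # read version BEFORE value
        snapshot_value = tvar.value
        new_value = update(snapshot_value)     # optimistic work, no locks
        with tvar.commit_lock:                 # brief lock only at commit
            if tvar.version == snapshot_version:   # validate snapshot
                tvar.value = new_value
                tvar.version += 1
                return new_value
        # validation failed: another thread committed first; retry

counter = TVar(0)
def worker():
    for _ in range(1000):
        atomically(counter, lambda v: v + 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)   # → 4000: no increment is ever lost
```

Note the contrast with the locking model: the `update` computation never blocks other threads; only the final validate-and-write step is serialized.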
A more technically oriented introduction is in the following link.
Beyond Locks: Software Transactional Memory
And when everything else fails, you can check what various people at stackoverflow suggest.
Non-Toy Software Transactional Memory for C or Java
Implementation and Language support
STM is built into Haskell and Clojure as a core language feature and is supported in GCC 4.7 onward. Python programmers are recommended to quit or join the Haskell community.
The implementation of transactional memory is transparent to the program and most of it resides in a runtime library (libitm in GCC). Transactions thus always provide the same guarantees, even though performance at runtime might be different depending on the implementation that is being used. In general, implementations come in two forms: a Software Transactional Memory (STM) system uses locks or other standard atomic instructions to do its job. A Hardware Transactional Memory (HTM) system uses multi-word synchronization operations of the CPU to implement the requirements of the transaction directly (e.g., see the Rock processor). Because most HTM systems are likely to be best effort facilities (i.e., not all transactions can be executed using HTM), practical TM implementations that incorporate HTM also have a STM component and are thus termed Hybrid Transactional Memory systems.
Or you can look into the following libraries.
TinySTM, a time-based STM, and Tanger, which integrates STMs with C and C++ via LLVM.
The Lightweight Transaction Library (LibLTX), a C implementation by Robert Ennals focusing on efficiency and based on his papers “Software Transactional Memory Should Not Be Obstruction-Free” and “Cache Sensitive Software Transactional Memory”.
LibCMT, an open-source implementation in C by Duilio Protti based on “Composable Memory Transactions”. The implementation also includes a C# binding.
TARIFA is a prototype that brings the “atomic” keyword to C/C++ by instrumenting the assembler output of the compiler.
Intel STM Compiler Prototype Edition implements STM for C/C++ directly in a compiler (the Intel Compiler) for Linux or Windows producing 32 or 64 bit code for Intel or AMD processors. Implements atomic keyword as well as providing ways to decorate (declspec) function definitions to control/authorize use in atomic sections. A substantial implementation in a compiler with the stated purpose to enable large scale experimentation in any size C/C++ program. Intel has made four research releases of this special experimental version of its product compiler.
stmmap An implementation of STM in C, based on shared memory mapping. It is for sharing memory between threads and/or processes (not just between threads within a process) with transactional semantics. The multi-threaded version of its memory allocator is in C++.
CTL An implementation of STM in C, based on TL2 but with many extensions and optimizations.
The TL2 lock-based STM from the Scalable Synchronization research group at Sun Microsystems Laboratories, as featured in the DISC 2006 article “Transactional Locking II”.
Several implementations by Tim Harris & Keir Fraser, based on ideas from his papers “Language Support for Lightweight Transactions”, “Practical Lock Freedom”, and an upcoming unpublished work.
RSTM The University of Rochester STM written by a team of researchers led by Michael L. Scott.
G++ 4.7 now supports STM for C/C++ directly in the compiler. The feature is still listed as “experimental”, but may still provide the necessary functionality for testing.
Anybody with experience with the above libraries?
A twitter discussion got us curious about the manual of the MIRA assembler, and we found it exactly as promised. The author did a lot of work on helping researchers do their experiments properly to get a high-quality assembly. Bioinformatics does not work without the ‘bio’ part done well, and bioinformatics programs cannot work miracles when the underlying data are of poor quality.
Here are two examples out of many informative sections –
For de-novo assemblies, do NOT (never ever at all and under no circumstances) use the Nextera kit, take TruSeq. The non-random fragmentation behaviour of Nextera leads to all sorts of problems for assemblers (not only MIRA) which try to use kmer frequencies as a criterion for repetitiveness of a given sequence.
13.6.1. Do not sample DNA from bacteria in exponential growth phase!
The reason is simple: some bacteria grow so fast that they start replicating themselves even before having finished the first replication cycle. This leads to more DNA around the origin of replication being present in cells, which in turn fools assemblers and mappers into believing that those areas are either repeats or that there are copy number changes.
Sample. In. Stationary. Phase!
For de-novo assemblies, MIRA will warn you if it detects data which points at exponential phase. In mapping assemblies, look at the coverage profile of your genome: if you see a smile shape (or V shape), you have a problem.
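For readers who want to automate the check, the ‘smile’ is easy to flag programmatically. The function below is our own toy heuristic (the 10% edge bins and the 1.5x ratio are made-up illustration values, not MIRA’s actual test); it assumes genome coordinates start at the replication origin, so exponential-phase DNA shows high coverage at both ends and a dip at the terminus in the middle.

```python
# Flag a V/smile-shaped coverage profile: coverage high near the origin
# (both ends of the coordinate system) and low at the terminus (middle).
def looks_exponential(coverage, ratio_threshold=1.5):
    n = len(coverage)
    edge = n // 10                                  # first/last 10% bins
    mid_lo, mid_hi = int(n * 0.45), int(n * 0.55)   # central 10% bin
    end_cov = sum(coverage[:edge] + coverage[-edge:]) / (2 * edge)
    mid_cov = sum(coverage[mid_lo:mid_hi]) / (mid_hi - mid_lo)
    return end_cov / mid_cov >= ratio_threshold

flat = [50] * 100                                  # stationary phase: flat
smile = [50 + abs(50 - i) for i in range(100)]     # exponential: V shape
print(looks_exponential(flat), looks_exponential(smile))   # → False True
```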
This is undoubtedly the most informative bioinformatics manual we came across, and is way better than many actual papers.
This arxiv paper appears to be promising (h/t: @lexnederbragt). We are not sure whether having optical map data is an absolute requirement. (Edit. Please see comment from senior author).
A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method that will enhance the quality of draft genomes by identifying and removing misassembly errors using paired short read sequence data and optical mapping data. We apply our method to various assemblies of the loblolly pine and Francisella tularensis genomes. Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in an assembly of Francisella tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in the assemblies of loblolly pine. misSEQuel can be downloaded at this http URL.
If you are wondering how it relates to other available tools, the following section is helpful.
Related Work. Both amosvalidate and REAPR are capable of identifying and correcting misassembly errors. REAPR is designed to use both short insert and long insert paired-end sequencing libraries; however, it can operate with only one of these types of sequencing data. Amosvalidate, which is included as part of the AMOS assembly package, was developed specifically for first generation sequencing libraries. iMetAMOS is an automated assembly pipeline that provides error correction and validation of the assembly. It packages several open-source tools and provides annotated assemblies that result from an ensemble of tools and assemblers. Currently, it uses REAPR for misassembly error correction.
Many optical mapping tools exist and deserve mentioning, including AGORA, SOMA, and Twin. AGORA uses the optical map information to constrain de Bruijn graph construction with the aim of improving the resulting assembly. SOMA uses dynamic programming to align in silico digested contigs to an optical map. Twin is an index-based method for aligning contigs to an optical map. Due to its use of an index data structure it is capable of aligning in silico digested contigs orders of magnitude faster than competing methods. Xavier et al. demonstrated misassembly errors in bacterial genomes can be detected using proprietary software. Lastly, there are special purpose tools that have some relation to misSEQuel in their algorithmic approach. Numerous assembly tools use a finishing process after assembly, including Hapsembler, LOCAS, Meraculous, and the “assisted assembly” algorithm. Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are highly polymorphic. Both RACA and SCARPA perform paired-end alignment to the contigs as an initial step, and thus, are similar to our algorithm in that respect.
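The paired-read signal that these tools exploit can be illustrated in miniature. The sketch below is entirely our own simplification (the 3-sigma cutoff, window size and toy data are invented for illustration; REAPR and misSEQuel are considerably more sophisticated): pairs spanning a misjoin map with implied insert sizes far from the library mean, so windows with an anomalous median insert mark candidate misassembly sites.

```python
# Flag contig windows whose median implied insert size deviates strongly
# from the sequencing library's mean -- the classic misjoin signature.
from statistics import median

def flag_misjoin_windows(pair_inserts, lib_mean, lib_sd, window=5):
    """pair_inserts: (position, implied_insert_size) pairs along a contig.
    Returns start positions of windows whose median insert is anomalous."""
    flagged = []
    inserts_sorted = sorted(pair_inserts)        # order along the contig
    for i in range(len(inserts_sorted) - window + 1):
        chunk = inserts_sorted[i:i + window]
        med = median(size for _, size in chunk)
        if abs(med - lib_mean) > 3 * lib_sd:
            flagged.append(chunk[0][0])
    return flagged

# Toy contig: library mean 300 +/- 20 bp; pairs spanning a misjoin near
# position 500 imply ~800 bp inserts instead.
pairs = [(p, 300) for p in range(0, 500, 50)] + \
        [(p, 800) for p in range(480, 530, 10)] + \
        [(p, 300) for p in range(550, 1000, 50)]
print(flag_misjoin_windows(pairs, lib_mean=300, lib_sd=20))
# → windows overlapping the misjoin near position 500
```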
The following program by Heng Li is also worth noting.
Abreak: evaluating de novo assemblies
“Abreak” is a subcommand of htsbox, which is a little toy forked from the lite branch of htslib. It takes an assembly-to-reference alignment as input and counts the number of alignment break points. An earlier version was used in my fermi paper to measure the misassembly rate of human de novo assemblies. A typical output looks like:
Number of unmapped contigs: 239
Total length of unmapped contigs: 54588
Number of alignments dropped due to excessive overlaps: 0
Mapped contig bases: 2933399461
Mapped N50: 6241
Number of break points: 102146
Number of Q10 break points longer than (0,100,200,500)bp: (28719,7206,4644,3222)
Number of break points after patching gaps short than 500bp: 94298
Number of Q10 break points longer than (0,100,200,500)bp after gap patching: (23326,5320,3369,2194)
The Reddit AMA of Weiss and Buchanan is full of interesting comments and discussions. The top rating of the following question suggests that people are coming to realize how much scientists oversold and overhyped what could be achieved through the human genome sequencing project.
Why hasn’t genetics led to the groundbreaking cures initially promised?
mermaidstale: Basically, because the diseases that most of us will get as we age turn out to be more complex than people had thought. I think we can blame the idea that we will find genes ‘for’ these kinds of diseases on Mendel, whose work with traits in peas led people to think that most traits were going to behave in the same predictable way that wrinkled, or yellow, or green peas do. Instead, it is turning out that there are many pathways to most traits, usually including many genes and environmental factors as well. So, diseases won’t be as predictable as we’ve been promised, because everyone’s genome is different, and environmental risk factors are hard to identify (is butter good or bad for us?), and future environments impossible to predict.
In this context, Gholson Lyon pointed out that he wrote a book chapter that the readers will find useful. The article is well-researched, well-written and is available open-access from biorxiv.
There are ~12 billion nucleotides in every cell of the human body, and there are ~25-100 trillion cells in each human body. Given somatic mosaicism, epigenetic changes and environmental differences, no two human beings are the same, particularly as there are only ~7 billion people on the planet. One of the next great challenges for studying human genetics will be to acknowledge and embrace complexity. Every human is unique, and the study of human disease phenotypes (and phenotypes in general) will be greatly enriched by moving from a deterministic to a more stochastic/probabilistic model. The dichotomous distinction between simple and complex diseases is completely artificial, and we argue instead for a model that considers a spectrum of diseases that are variably manifesting in each person. The rapid adoption of whole genome sequencing (WGS) and the Internet-mediated networking of people promise to yield more insight into this century-old debate. Comprehensive ancestry tracking and detailed family history data, when combined with WGS or at least cascade-carrier screening, might eventually facilitate a degree of genetic prediction for some diseases in the context of their familial and ancestral etiologies. However, it is important to remain humble, as our current state of knowledge is not yet sufficient, and in principle, any number of nucleotides in the genome, if mutated or modified in a certain way and at a certain time and place, might influence some phenotype during embryogenesis or postnatal life.
A Biostar discussion started with –
Uri Laserson from Cloudera:
“Computational Biologists are reinventing the wheel for big data biology analysis, e.g. Cram = column storage, Galaxy = workflow scheduler, GATK = scatter gather. Genomics is not special.” “HPC is not a good match for biology, since HPC is about compute and Biology is about DATA.”
Uri might be biased because of his employer and background, but I do think he has a point. It keeps amazing me that the EBIs and Sangers of this world are looking for more Perl programmers to build technology from the 2000s instead of hiring Google- or Facebook-level talent to create technology of the 2020s using the same technology Google and Facebook are using.
In 2010, we were stuck with several big assembly problems (big for those days) and associated biological questions. For assembly, Hadoop was the first thing we looked into and long-time readers may remember that Hadoop-related posts were among the first topics in our blog. Like the above author, we also expected to revolutionize biology by using technology of Google/Facebook.
Using Hadoop for Transcriptomics
Contrail – A de Bruijn Genome Assembler that uses Hadoop
Over time, we realized that Hadoop was not the right technology for assembling sequences. For assembly, one has a fixed final product, and the goal of the algorithm is to combine pieces to get there. There are faster and more economical ways to solve that problem even using commodity hardware, and Hadoop did not turn out to be our best option. We have already covered many of those efficient algorithms in our blog.
Today’s discussion is not about assembly but about the second and larger problem – figuring out the biology and whether big data can help. Figuring out the biology is the final goal of all kinds of computational analysis, but increasingly that notion seems to get forgotten. This point was highlighted in a comment by Joe Cornish in the original Biostar thread, but let us elaborate based on our experiences.
Is a human being like a computer program?
Two decades of publicity about the human genome and genomic medicine made people think that human beings work like computer programs. To them, the genome is the software code or ‘blueprint’, and by changing that code, every aspect of a person can be changed. That one-dimensional focus on the genome culminates in junk papers like the following, where the authors try to find a genomic code for romantic relations.
The association between romantic relationship status and 5-HT1A gene in young adults
Individuals with the CG/GG genotype were more likely to be single than individuals with the CC genotype.
What gets forgotten is that a person is more like an analog machine built from cells, where all kinds of chemicals work together to give it functions like digesting food, walking, maintaining body temperature and pursuing romantic relationships. The genome only provides the informational content, and until a researcher shows the causal mechanism connecting a genome-based association to the actual workings of the biochemical components, a genome-based association is just a statistical observation. Speaking of statistical observations, the following one shared by @giggs_boson looks pretty good.
Is a human being like a car, then?
A car consumes fuel, moves around, can maintain the inside temperature and can even pursue romantic relationships in Pixar movies. Jokes apart, is a human being like a car or other machines?
One would immediately point out that cars need human drivers, but that is no longer true with modern robot-driven cars. Another objection is that cars do not spawn new cars by themselves, whereas self-reproduction is the main difference between living beings and modern sophisticated machines. We can work around that objection with a ‘thought experiment’, where we put a robot inside a car along with a sophisticated digital printer and code to produce all kinds of car parts from raw materials taken from outside. Once ready, the robot inside pushes the parts out, and another robot attached to the exterior assembles them into a new car. Is the car-robot combination like a living being?
There is another level of complexity
Even though they do not tell you, many researchers working on human medicine follow the car-robot mechanical model of living beings. That is not without reason, because the mechanical model worked very well in 20th century medicine. It has been possible to replace limbs with artificial ones (analogous to changing flat tires?), control hearts with pacemakers (‘rebuilding the engine’?), fight off invaders with chemicals and even do brain surgery using a mechanical view of the system, where the surgeon is a highly skilled mechanic.
However, the mechanical view fails completely when researchers try to understand the human genome, or any genome for that matter. Here is the biggest difference: living organisms, unlike the replicating car+robot combination described above, are self-built. The system constructed itself from absolutely nothing through an iterative process lasting 3-4 billion years, and many traces of that evolutionary process can be determined from (i) fossil records, (ii) comparing cell structures and biochemistry of different living organisms and (iii) comparing informational contents (genomes and genes) of the same organisms.
By staying focused on only one or two of those three aspects, a researcher can severely handicap himself from getting the full picture. A human genome has ~20,000 protein-coding genes, and some of them code for functional blocks that stabilized in the earliest bacteria ~3 billion years back, some stabilized in the earliest eukaryotes over 2.5 billion years back, and some are more recent. The only way to understand each functional block is to compare organisms that took different paths since their divergence. That means using different models for different groups of genes. This kind of comparative analysis is missing from most analyses.
Going back to the original point, adding evolutionary context to genomic, biochemical and developmental data is the biggest challenge and missing block in bioinformatics, not the availability of computing power or programs to analyze bigger data sets. The experimental biologists, out of necessity, focus on one organism (one machine), because it is impossible to maintain a lab with ten different species. Bioinformaticians, on the other hand, are free to compare and analyze, but quite a bit of that analysis requires thinking about the functions, development and evolution of the entire machines (living organisms) rather than only the genomes, and that is often missing from the bioinformatics literature. Those who do it have isolated themselves into the subfield of evolutionary molecular biology.
In a small and insignificant way, we tried to do the entire exercise in our electric fish paper and found it quite challenging. At first, we started with the genome assembly of electric eel, but realized that the genome was not very useful for inferring the functionality of the electric organ. We also compared RNAseq data from the electric organ, muscle and other organs of electric eel. RNAseq data add a bit more functional information beyond the genome, but they are still limited to one organism. Then we decided to compare RNAseq data from different electric fish species, which evolved electric organs independently, and that added the evolutionary context to the whole picture.
Our entire exercise greatly narrowed down the list of genes on which biologists can further do their lab tests. We also worked on theoretically connecting those genes based on their known functions, and these connections can be used as hypotheses to direct the lab tests. This last part (connecting the dots through a narrative) is nearly impossible to do with software, but even if one can do that, there is another level of difficulty. The narrative needs to be updated iteratively based on small-scale experiments and additional data. Which experiments to do to challenge the story is a scientific process and we do not believe any software can replace the process.
Long story short, computational biologists need to solve real puzzles consisting of determining of functional or disease mechanisms, and then build tools based on a handful of successful paths. The current approach of focusing on human genomes and throwing all kinds of data and statistical tools (GWAS) in the hope of discovering disease mechanisms (or romantic partners) is too limited and wasteful.
Kudos to Leroy Hood, Stuart Kim and colleagues for not publishing yet another hyped up GWAS study.
Whole-Genome Sequencing of the World’s Oldest People
Supercentenarians (110 years or older) are the world’s oldest people. Seventy four are alive worldwide, with twenty two in the United States. We performed whole-genome sequencing on 17 supercentenarians to explore the genetic basis underlying extreme human longevity. We found no significant evidence of enrichment for a single rare protein-altering variant or for a gene harboring different rare protein altering variants in supercentenarian compared to control genomes.
Here is what being honest gets you – being ridiculed by a busy Berkeley math professor, who could not look up the rest of the abstract.
The authors of the GWAS study indeed tried to follow up their best hit in another 99 long-lived individuals.
We followed up on the gene most enriched for rare protein-altering variants in our cohort of supercentenarians, TSHZ3, by sequencing it in a second cohort of 99 long-lived individuals but did not find a significant enrichment.
The following part of the abstract rings the death knell for the nascent industry trying to predict the future diseases of healthy people from their genomes.
The genome of one supercentenarian had a pathogenic mutation in DSC2, known to predispose to arrhythmogenic right ventricular cardiomyopathy, which is recommended to be reported to this individual as an incidental finding according to a recent position statement by the American College of Medical Genetics and Genomics. Even with this pathogenic mutation, the proband lived to over 110 years. The entire list of rare protein-altering variants and DNA sequence of all 17 supercentenarian genomes is available as a resource to assist the discovery of the genetic basis of extreme longevity in future studies.
UK’s Independent in 2000 –
Snowfalls are now just a thing of the past
Britain’s winter ends tomorrow with further indications of a striking environmental change: snow is starting to disappear from our lives.
Sledges, snowmen, snowballs and the excitement of waking to find that the stuff has settled outside are all a rapidly diminishing part of Britain’s culture, as warmer winters – which scientists are attributing to global climate change – produce not only fewer white Christmases, but fewer white Januaries and Februaries.
The first two months of 2000 were virtually free of significant snowfall in much of lowland Britain, and December brought only moderate snowfall in the South-east. It is the continuation of a trend that has been increasingly visible in the past 15 years: in the south of England, for instance, from 1970 to 1995 snow and sleet fell for an average of 3.7 days, while from 1988 to 1995 the average was 0.7 days. London’s last substantial snowfall was in February 1991.
Global warming, the heating of the atmosphere by increased amounts of industrial gases, is now accepted as a reality by the international community. Average temperatures in Britain were nearly 0.6°C higher in the Nineties than in 1960-90, and it is estimated that they will increase by 0.2°C every decade over the coming century. Eight of the 10 hottest years on record occurred in the Nineties.
However, the warming is so far manifesting itself more in winters which are less cold than in much hotter summers. According to Dr David Viner, a senior research scientist at the climatic research unit (CRU) of the University of East Anglia, within a few years winter snowfall will become “a very rare and exciting event”.
“Children just aren’t going to know what snow is,” he said.
The effects of snow-free winter in Britain are already becoming apparent. This year, for the first time ever, Hamleys, Britain’s biggest toyshop, had no sledges on display in its Regent Street store. “It was a bit of a first,” a spokesperson said.
Fen skating, once a popular sport on the fields of East Anglia, now takes place on indoor artificial rinks. Malcolm Robinson, of the Fenland Indoor Speed Skating Club in Peterborough, says they have not skated outside since 1997. “As a boy, I can remember being on ice most winters. Now it’s few and far between,” he said.
Michael Jeacock, a Cambridgeshire local historian, added that a generation was growing up “without experiencing one of the greatest joys and privileges of living in this part of the world – open-air skating”.
Warmer winters have significant environmental and economic implications, and a wide range of research indicates that pests and plant diseases, usually killed back by sharp frosts, are likely to flourish. But very little research has been done on the cultural implications of climate change – into the possibility, for example, that our notion of Christmas might have to shift.
Professor Jarich Oosten, an anthropologist at the University of Leiden in the Netherlands, says that even if we no longer see snow, it will remain culturally important.
“We don’t really have wolves in Europe any more, but they are still an important part of our culture and everyone knows what they look like,” he said.
David Parker, at the Hadley Centre for Climate Prediction and Research in Berkshire, says ultimately, British children could have only virtual experience of snow. Via the internet, they might wonder at polar scenes – or eventually “feel” virtual cold.
Heavy snow will return occasionally, says Dr Viner, but when it does we will be unprepared. “We’re really going to get caught out. Snow will probably cause chaos in 20 years time,” he said.
The chances are certainly now stacked against the sort of heavy snowfall in cities that inspired Impressionist painters, such as Sisley, and the 19th century poet laureate Robert Bridges, who wrote in “London Snow” of it, “stealthily and perpetually settling and loosely lying”.
Not any more, it seems.
Fast forward to 2014 –
Winter 2014 set to be ‘coldest for century’ Britain faces ARCTIC FREEZE in just weeks
Meanwhile in US –
Another de Bruijn graph magic by Raluca Uricaru, Rayan Chikhi, Guillaume Rizk and co-authors.
Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, DISCOSNP, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, DISCOSNP ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, DISCOSNP requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.
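The core graph idea is simple to illustrate, even though DISCOSNP’s real machinery (coverage models, error handling, per-allele ranking) goes far beyond this. In a de Bruijn graph built from reads alone, an isolated SNP appears as a ‘bubble’: two equal-length paths sharing a start and an end node. A toy sketch of our own, with made-up sequences and k = 4:

```python
# Toy reference-free SNP detection: find "bubbles" -- branch nodes whose
# two paths reconverge after k steps -- in a read-built de Bruijn graph.
from collections import defaultdict

def debruijn(reads, k):
    g = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            g[kmer[:-1]].add(kmer[1:])
    return g

def walk(g, node, steps):
    """Follow the unique path for `steps` edges; None if it branches/ends."""
    for _ in range(steps):
        succ = g.get(node, set())
        if len(succ) != 1:
            return None
        node = next(iter(succ))
    return node

def find_bubbles(g, k):
    """Nodes whose two branches reconverge after k steps: SNP candidates."""
    bubbles = []
    for node, succ in g.items():
        if len(succ) == 2:
            a, b = succ
            end_a, end_b = walk(g, a, k - 1), walk(g, b, k - 1)
            if end_a is not None and end_a == end_b:
                bubbles.append(node)
    return bubbles

k = 4
allele_1 = "TTACCAGTGCA"   # two alleles differing at a single position:
allele_2 = "TTACCGGTGCA"   # ...A... vs ...G...
g = debruijn([allele_1, allele_2], k)
print(find_bubbles(g, k))   # → ['ACC'] (the bubble's branch point)
```

Reads carrying the two alleles populate the two arms of the bubble, so coverage on each arm doubles as an allele count, which is how a reference-free caller can also genotype the SNP.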