SOAPdenovo2 Demystified - part (a)


Library Functions

The ‘library files’ contain sets of functions that are reused in many parts of the program. They should be reusable even outside SOAPdenovo2. If you are writing a new bioinformatics program, you can borrow them rather than writing similar functions on your own.

Here is a general description of the various library files and functions. We are working through their wiki pages and will update this commentary if we find something new.

i) darray.c, stack.c

These two files have functions to create dynamic arrays and stacks. Most computer scientists know the purpose of those data structures.

darray.c: It has the functions createDarray to create a dynamic array of a given size, darrayPut to dynamically allocate space and add an element at a given index, darrayGet to get an element from a given index, emptyDarray to empty the dynamic array, and freeDarray to free the dynamically allocated space. The structure of the array is defined in darray.h.

```c
typedef struct dynamic_array
{
    void * array;
    long long array_size;
    size_t item_size;
    long long item_c;
} DARRAY;
```
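To make the put/get behaviour described above concrete, here is a minimal sketch of a growable array built on the same four fields. It is not BGI's code: the `sketch_` names, the doubling growth policy, and the meaning we give to `item_c` (highest index written plus one) are our own assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same four fields as DARRAY; the name is hypothetical. */
typedef struct sketch_darray {
    void *array;           /* backing buffer */
    long long array_size;  /* capacity, counted in items */
    size_t item_size;      /* bytes per item */
    long long item_c;      /* assumed: highest index written + 1 */
} SKETCH_DARRAY;

/* Allocate an array with room for num_items zeroed items. */
SKETCH_DARRAY *sketch_create(long long num_items, size_t unit_size) {
    assert(num_items > 0 && unit_size > 0);
    SKETCH_DARRAY *d = malloc(sizeof *d);
    if (d == NULL) return NULL;
    d->array = calloc((size_t)num_items, unit_size);
    if (d->array == NULL) { free(d); return NULL; }
    d->array_size = num_items;
    d->item_size = unit_size;
    d->item_c = 0;
    return d;
}

/* Return a pointer to slot `index`, doubling the buffer until it fits. */
void *sketch_put(SKETCH_DARRAY *d, long long index) {
    assert(d != NULL && index >= 0);
    if (index >= d->array_size) {
        long long new_size = d->array_size;
        while (new_size <= index) new_size *= 2;
        void *p = realloc(d->array, (size_t)new_size * d->item_size);
        if (p == NULL) return NULL;
        /* Zero the newly added tail so unwritten slots read as 0. */
        memset((char *)p + (size_t)d->array_size * d->item_size, 0,
               (size_t)(new_size - d->array_size) * d->item_size);
        d->array = p;
        d->array_size = new_size;
    }
    if (index >= d->item_c) d->item_c = index + 1;
    return (char *)d->array + (size_t)index * d->item_size;
}

/* Return a pointer to slot `index`, or NULL if it was never reached. */
void *sketch_get(const SKETCH_DARRAY *d, long long index) {
    assert(d != NULL);
    if (index < 0 || index >= d->item_c) return NULL;
    return (char *)d->array + (size_t)index * d->item_size;
}

/* Release the buffer and the descriptor. */
void sketch_free(SKETCH_DARRAY *d) {
    if (d == NULL) return;
    free(d->array);
    free(d);
}
```

A caller would write through the pointer returned by `sketch_put` (for example `*(int *)sketch_put(d, 9) = 42;`) and read back through `sketch_get`. Returning a slot pointer instead of copying a value keeps the array independent of the item type.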

stack.c: Dynamically allocated stack with functions createStack, emptyStack, freeStack, stackBackup, stackPop, stackPush.

ii) fib.c, dfib.c, fibHeap.c, dfibHeap.c

Fibonacci was an Italian mathematician of the late 12th and early 13th centuries, whose name was not Fibonacci (he was Leonardo of Pisa) and who is famous for a number series that he did not discover !! Isn’t that cool? Here is the more interesting part. The implications of Fibonacci numbers are under-appreciated, because they are not part of the mathematical underpinning of our physical sciences. For example, stock market data are often said to show recurring Fibonacci ratios and fractal patterns, but standard economics texts rarely analyze such patterns.

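Distraction aside, in computer science, Fibonacci heaps have better amortized running times than binary heaps for some operations: insert and decrease-key take O(1) amortized time instead of O(log n), while extracting the minimum remains O(log n) amortized.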

fib.c: A well-written set of Fibonacci heap functions from John-Mark Gurney. They are distributed under an MIT-type license. Do not forget to also grab fib.h, dfib.h and dfibpriv.h.

dfib.c: Another of Gurney’s creations.

fibHeap.c: BGI’s wrapper on Gurney’s functions in fib.c.

dfibHeap.c: BGI’s wrapper on Gurney’s functions in dfib.c.

iii) lib.c, check.c, mem_manager.c

lib.c: Functions in this file are used to check the read libraries.

check.c: Functions here are used to check whether files exist before opening them, or there is enough memory before allocating more space.
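The checking pattern described for check.c can be sketched as thin wrappers around `fopen` and `malloc`. The helper names below (`sketch_ckopen`, `sketch_ckalloc`, `sketch_file_exists`) are hypothetical, and exiting on failure is our assumption about how such checks are typically handled, not a statement about BGI's code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: report whether a file can be opened for reading. */
int sketch_file_exists(const char *name) {
    assert(name != NULL);
    FILE *fp = fopen(name, "r");
    if (fp == NULL) return 0;
    fclose(fp);
    return 1;
}

/* Hypothetical helper: open a file or stop with a clear message. */
FILE *sketch_ckopen(const char *name, const char *mode) {
    assert(name != NULL && mode != NULL);
    FILE *fp = fopen(name, mode);
    if (fp == NULL) {
        fprintf(stderr, "Cannot open %s with mode %s\n", name, mode);
        exit(EXIT_FAILURE);
    }
    return fp;
}

/* Hypothetical helper: allocate memory or stop with a clear message. */
void *sketch_ckalloc(size_t amount) {
    void *p = malloc(amount);
    if (p == NULL && amount > 0) {
        fprintf(stderr, "Cannot allocate %lu bytes\n", (unsigned long)amount);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Centralizing the checks this way means the rest of a program can call `sketch_ckopen` or `sketch_ckalloc` without repeating the same NULL test at every call site.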

mem_manager.c: Some kind of memory manager to dynamically allocate space for hashes, heaps, etc.

iv) kmer.c, kmerhash.c, newhash.c, hashFunctions.c

The following files should be self-explanatory for anyone working on de Bruijn graph-based algorithms.

kmer.c: Processes kmers. The important part here is that BGI’s library can handle kmers of length up to 127 nucleotides (2^7-1). It is done by splitting the higher and lower halves of a kmer into two separate spaces.

kmerhash.c: Various kmer-related operations.

newhash.c: A replacement for kmerhash.c mentioned above.

hashFunctions.c: Describes the hash function. This file must be the center of the universe :), because it has some embedded assembly language code. What is the crc table? Cyclic redundancy check or something else?
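If the table does turn out to be a cyclic redundancy check table (a guess from the name, not something we have confirmed in the source), it would work roughly like the sketch below: a 256-entry lookup table lets the checksum consume one byte per step, and the result can double as a hash value. The code implements the standard CRC-32 with the reflected polynomial 0xEDB88320; whether SOAPdenovo2 uses this polynomial or another variant is unknown to us, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t sketch_crc_table[256];
static int sketch_crc_table_ready = 0;

/* Fill the table: entry i is the CRC of the single byte i. */
static void sketch_crc_init(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        sketch_crc_table[i] = c;
    }
    sketch_crc_table_ready = 1;
}

/* Standard CRC-32 of a buffer, one table lookup per input byte. */
uint32_t sketch_crc32(const void *data, size_t len) {
    assert(data != NULL || len == 0);
    if (!sketch_crc_table_ready) sketch_crc_init();
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = sketch_crc_table[(crc ^ p[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical use as a hash: map a key to one of n_buckets slots. */
size_t sketch_bucket(const void *key, size_t len, size_t n_buckets) {
    assert(n_buckets > 0);
    return (size_t)sketch_crc32(key, len) % n_buckets;
}
```

Some modern x86 processors have a hardware instruction for a related checksum (CRC-32C), which is one plausible reason a hash function file would contain embedded assembly, but again that is our speculation.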

v) seq.c, readseq1by1.c

seq.c: This file has functions to write A/T/G/C sequences into tightstring format.
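We read ‘tightstring’ as a packed encoding that stores each base in 2 bits, so four bases fit in one byte. The sketch below shows that idea only; the function names, the A=0, C=1, G=2, T=3 mapping, and the choice to put the first base in the highest bits of each byte are our assumptions and may differ from what seq.c actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char sketch_int2base[4] = {'A', 'C', 'G', 'T'};

/* Pack `len` bases into (len + 3) / 4 bytes; returns -1 on a non-ACGT base. */
int sketch_pack(const char *seq, size_t len, unsigned char *packed) {
    assert(seq != NULL && packed != NULL);
    memset(packed, 0, (len + 3) / 4);
    for (size_t i = 0; i < len; i++) {
        int code;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: return -1;
        }
        /* Base i lives in byte i/4; earlier bases take the higher bit pairs. */
        packed[i / 4] |= (unsigned char)(code << (2 * (3 - i % 4)));
    }
    return 0;
}

/* Recover base number `i` from a packed buffer. */
char sketch_unpack_base(const unsigned char *packed, size_t i) {
    assert(packed != NULL);
    int code = (packed[i / 4] >> (2 * (3 - i % 4))) & 3;
    return sketch_int2base[code];
}
```

The payoff is memory: a read of 100 bases takes 25 bytes instead of 100. The cost is that the length has to be stored separately, because trailing zero bits are indistinguishable from trailing A's in this encoding.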

readseq1by1.c: Reads sequences ‘one by one’ - the purpose of this file should be very clear from its name. It has functions to read input files of different forms - fasta/fastq, bam, sam, gzip, etc.
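For readers who have not written such a reader, here is a minimal sketch of the ‘one by one’ pattern for the simplest of those formats, plain FASTA. Each call returns the next record and leaves the stream positioned at the following header. The function name and signature are hypothetical, and the sketch assumes every line is shorter than its 4096-byte buffer; it does not handle FASTQ, BAM or compressed input.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the next FASTA record from fp.
   Returns 1 if a record was read, 0 at end of input.
   Over-long names and sequences are truncated to fit the buffers. */
int sketch_next_fasta(FILE *fp, char *name, size_t name_cap,
                      char *seq, size_t seq_cap) {
    char line[4096];
    size_t seq_len = 0;
    int have_header = 0;

    assert(fp != NULL && name_cap > 0 && seq_cap > 0);
    name[0] = '\0';
    seq[0] = '\0';

    for (;;) {
        /* Peek at the first character of the next line. */
        int c = fgetc(fp);
        if (c == EOF) break;
        ungetc(c, fp);
        if (c == '>' && have_header) break;  /* next record starts here */

        if (fgets(line, sizeof line, fp) == NULL) break;
        line[strcspn(line, "\r\n")] = '\0';  /* strip the line ending */

        if (line[0] == '>') {
            have_header = 1;
            strncpy(name, line + 1, name_cap - 1);
            name[name_cap - 1] = '\0';
        } else if (have_header) {
            size_t n = strlen(line);
            if (seq_len + n >= seq_cap) n = seq_cap - 1 - seq_len;
            memcpy(seq + seq_len, line, n);
            seq_len += n;
            seq[seq_len] = '\0';
        }
    }
    return have_header;
}
```

A caller would loop with `while (sketch_next_fasta(fp, name, sizeof name, seq, sizeof seq)) { ... }`, which keeps only one read in memory at a time regardless of file size.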



Written by M.