C. Titus Brown asked in his blog - What’s the equivalent of prime number calculations for biologists learning to program?. New students of bioinformatics should check his commentary and responses to find interesting ideas.
If we had to design a course on programming for biologists, it will have the following progression.
It will start with GC-counting of a genome as ‘Hello World’ example.
Warm up questions
b) Early sessions will have the following examples -
i) Writing code to extract nucleotides 12000-14500 on chromosome 3 from human genome,
ii) Finding reverse complement of a sequence,
iii) Translating a nucleotide sequence in six frames and finding longest peptide,
Counting k-mers in sequences is equivalent to identifying all prime numbers up to 10 million in C-programming. Why do we say so? Anyone can write a lousy program for k-mer counting or prime number counting that does not scale well. So, both examples are good for teaching the differences between O(N), O(N logN), O(N^2) types of algorithms.
For the above reason, when we first wrote about Hadoop, we used the k-mer counting example. Here we described other efficient methods for counting k-mers. The good thing about k-mer counting is that the problem itself is so easy to state that it can be used to explain aspects of hardware, memory and hashing.
In our first course of programming in computer science classes, we started with simple sort algorithms like bubble-sort, quick-sort, etc.
For bioinformaticians, the equivalent topics are Smith-Waterman alignment and various other alignment programs.
After mastering search and sort, we learned few graph-algorithms. If you read our blog for the last few months, you know what the bioinformatics equivalent would be.
Use of bio libraries**
We will bring the bioperl, biopython, etc. libraries at the end. Only after someone appreciates how time-consuming it is to write code for simple-sounding tasks (finding every match from a BLAST output), it is good to learn that others already solved the problems.
Question for readers:
If you want to estimate GC in a 300 million nucleotide genome and computing is so expensive that checking each nucleotide costs $3, what is the most logical solution?