
An Easy-to-follow Introductory Book on NGS Assembly Algorithms


Quality scores for 32,000 microbial genomes, no eukaryotes

Link (h/t: @hengli)

Background
More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences.

Results
Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes.

Conclusions
The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
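Out of curiosity, here is how such a composite score might be computed. The sketch below is only my illustration – the four categories come from the abstract, but the equal weights and the normalization are my own assumptions, not the authors' published formula.

#include <algorithm>
#include <iostream>

// Hypothetical sub-scores, each already normalized to [0,1].
struct GenomeQC {
    double assembly_completeness;  // e.g. fraction of the genome in long contigs
    double rrna_score;             // presence of full-length rRNA genes
    double trna_score;             // fraction of expected tRNA isotypes found
    double conserved_gene_score;   // fraction of the 102 conserved genes found
};

// Equal weights are an assumption for illustration only; the paper combines
// the same four categories, but its exact weighting is not reproduced here.
double quality_score(const GenomeQC& g) {
    double s = 0.25 * g.assembly_completeness +
               0.25 * g.rrna_score +
               0.25 * g.trna_score +
               0.25 * g.conserved_gene_score;
    return std::clamp(s, 0.0, 1.0);
}

int main() {
    GenomeQC draft{0.95, 0.67, 0.90, 0.99};
    std::cout << "quality score: " << quality_score(draft) << "\n";
    // A threshold of 0.8, as used in the paper, would admit this genome.
    return 0;
}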

—————————————

Edit.

For eukaryotes, the assemblers are characterized instead of the genomes. For example –

Assessment of de novo assemblers for draft genomes: a case study with fungal genomes

Background
Recently, large bio-projects dealing with the release of different genomes have transpired. Most of these projects use next-generation sequencing platforms. As a consequence, many de novo assembly tools have evolved to assemble the reads generated by these platforms. Each tool has its own inherent advantages and disadvantages, which make the selection of an appropriate tool a challenging task.
Results
We have evaluated the performance of frequently used de novo assemblers namely ABySS, IDBA-UD, Minia, SOAP, SPAdes, Sparse, and Velvet. These assemblers are assessed based on their output quality during the assembly process conducted over fungal data. We compared the performance of these assemblers by considering both computational as well as quality metrics. By analyzing these performance metrics, the assemblers are ranked and a procedure for choosing the candidate assembler is illustrated.
Conclusions
In this study, we propose an assessment method for the selection of de novo assemblers by considering their computational as well as quality metrics at the draft genome level. We divide the quality metrics into three groups: g1 measures the goodness of the assemblies, g2 measures the problems of the assemblies, and g3 measures the conservation elements in the assemblies. Our results demonstrate that the assemblers ABySS and IDBA-UD exhibit a good performance for the studied data from fungal genomes in terms of running time, memory, and quality. The results suggest that whole genome shotgun sequencing projects should make use of different assemblers by considering their merits.

Today’s Rainbow Chasing Paper by ENCODE Leader-in-Exile Ewan Birney

A new paper on bioRxiv reminded us of a story shared by Ken Weiss.

Survivorship bias
Ellenberg begins his book with an illustration of how abstract logical thinking can solve important real-world problems in subtle ways. In WWII, a mathematics research group was asked by the Army to help decide where to place armor plating on fighter aircraft. The planes were returning to base with scattered bullet holes from enemy fire, and the idea was to put protective plating where it would do the most good without adding cumbersome, mileage-eating weight. The mathematician suggested putting the plating where the bullet holes weren't. This seemed strange until he explained the reasoning: the bullet holes that were observed hadn't done much damage, while bullets hitting elsewhere had brought planes down, so those hits were never observed because the planes never returned to base. The engine compartment was the case in point: a shot to the engine was fatal to the aircraft, but to the wings and body, much less so.

If you think about the genome as the body of the plane and the variants as the bullets, survivorship bias would suggest that the places with variants are less important than the places with no change. This is nicely explained by Weiss in his commentary.

How can a gene be central to the development of the basis of a trait, and yet not be found in mapping to identify variation that causes failures of the trait? Indeed, the basic finding of GWAS and most other mapping approaches is that the tens or hundreds or thousands of genome ‘hits’ have individually trivial effects.

The answer may lie in survivorship bias. Like the lethality of bullets to the engine of a fighter, most variation in the main genes, those whose sequence is more highly conserved, is lethal to the embryo or manifest in pathology so clear that it never is the subject of case-control or other sorts of Big Data mapping. In other words, genome mapping may systematically be inevitably constrained to find small effects! That’s exactly the opposite of what’s been promised, and the reason is that the promises were, psychologically or strategically, based on extrapolation of the findings of strong, single-gene effects causing severe pediatric disease–a legacy of Mendel’s carefully chosen two-state traits.

To the extent this is a correct understanding, then genomewide mapping as it’s now being done is, from an evolutionary genomic perspective, necessarily rainbow-chasing. Indeed, a possibility is that most adaptive evolution is itself also due to the effects of minor variants, not major ones. Once the constraining interaction of the major genetic factors is in place, mostly what can nudge organisms in this direction or that, whether adaptively or in relation to complex, non-congenital disease, is based on assembled effects of individually very minor variants. In turn, that could be why slow, gradualism was so obviously the way evolution worked to Darwin, and why it generally still seems that way today.

Someone needs to explain that to ENCODE clown Ewan Birney, who developed a new tool to remedy a 'frustration' that stems from a lack of understanding of science, not from a lack of tools. The new paper is available here.

Genome wide association studies provide an unbiased discovery mechanism for numerous human diseases. However, a frustration in the analysis of GWAS is that the majority of variants discovered do not directly alter protein-coding genes. We have developed a simple analysis approach that detects the tissue-specific regulatory component of a set of GWAS SNPs by identifying enrichment of overlap with DNase I hotspots from diverse tissue samples. Functional element Overlap analysis of the Results of GWAS Experiments (FORGE) is available as a web tool and as standalone software and provides tabular and graphical summaries of the enrichments. Conducting FORGE analysis on SNP sets for 260 phenotypes available from the GWAS catalogue reveals numerous overlap enrichments with tissue–specific components reflecting the known aetiology of the phenotypes as well as revealing other unforeseen tissue involvements that may lead to mechanistic insights for disease.

The abstract starts with a lie, which is unfortunate but typical of Birney. GWAS has not helped in discovering the mechanism of any complex human disease. In fact, it failed miserably at describing supposedly simple traits like height (check 'height of folly' by Weiss), yet it is being actively promoted to find genetic causes of diabetes, obesity, intelligence, and whatnot. For example, check this post by Dan Graur – GWAS Excrement Again: PNAS Paper Explains 0.02% of the Variation in an Ill-Defined Trait.
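Rant aside, the core computation in such an overlap analysis is easy to picture. The sketch below is not FORGE's actual algorithm (FORGE uses carefully matched background SNP sets and a proper test statistic); it merely counts how many SNPs from one set fall inside sorted DNase I hotspot intervals for a tissue and compares that to a background set. All data and the fold-enrichment summary are illustrative assumptions.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Interval { std::int64_t start, end; };   // half-open [start, end) hotspot

// Returns true if position p falls inside any interval.
// 'hotspots' must be sorted by start and non-overlapping.
bool in_hotspot(const std::vector<Interval>& hotspots, std::int64_t p) {
    auto it = std::upper_bound(hotspots.begin(), hotspots.end(), p,
        [](std::int64_t pos, const Interval& iv) { return pos < iv.start; });
    if (it == hotspots.begin()) return false;
    --it;
    return p >= it->start && p < it->end;
}

double overlap_fraction(const std::vector<Interval>& hotspots,
                        const std::vector<std::int64_t>& snps) {
    std::size_t hits = 0;
    for (auto p : snps) if (in_hotspot(hotspots, p)) ++hits;
    return snps.empty() ? 0.0 : double(hits) / snps.size();
}

int main() {
    // Toy data on a single chromosome; real input would come from BED files.
    std::vector<Interval> hotspots = {{100, 200}, {500, 650}, {900, 950}};
    std::vector<std::int64_t> gwas_snps       = {120, 510, 640, 910, 300};
    std::vector<std::int64_t> background_snps = {10, 260, 300, 400, 700, 810, 955, 990, 130, 520};

    double fg = overlap_fraction(hotspots, gwas_snps);
    double bg = overlap_fraction(hotspots, background_snps);
    std::cout << "GWAS overlap: " << fg << ", background overlap: " << bg
              << ", fold enrichment: " << (bg > 0 ? fg / bg : 0) << "\n";
    return 0;
}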

Compression of High Throughput Sequencing Data with Probabilistic de Bruijn Graph

@RayanChikhi mentioned this new paper on Twitter:

“A super-efficient de novo read compression algorithm (Leon) from @GuillaumeRizk et al. Paper on #arXiv”

The method is to build a de Bruijn graph and then store each read as a path within the graph. A Bloom-filter-based data structure is chosen to store the graph. The full abstract is shown below.

Motivation: Data volumes generated by next-generation sequencing technologies is now a major concern, both for storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip. Most reference-free tools developed for NGS data compression still use general text compression methods and fail to benefit from algorithms already designed specifically for the analysis of NGS data. The goal of our new method Leon is to achieve compression of DNA sequences of high throughput sequencing data, without the need of a reference genome, with techniques derived from existing assembly principles, that possibly better exploit NGS data redundancy. Results: We propose a novel method, implemented in the software Leon, for compression of DNA sequences issued from high throughput sequencing technologies. This is a lossless method that does not need a reference genome. Instead, a reference is built de novo from the set of reads as a probabilistic de Bruijn Graph, stored in a Bloom filter. Each read is encoded as a path in this graph, storing only an anchoring kmer and a list of bifurcations indicating which path to follow in the graph. This new method will allow to have compressed read files that also already contain its underlying de Bruijn Graph, thus directly re-usable by many tools relying on this structure. Leon achieved encoding of a C. elegans reads set with 0.7 bits/base, outperforming state of the art reference-free methods. Availability: Open source, under GNU affero GPL License, available for download at this http URL
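Here is a minimal sketch of the two ingredients, assuming nothing about Leon's actual implementation: a Bloom filter that answers (probabilistic) membership queries over k-mers, and an encoder that stores only an anchoring k-mer plus the choices made at bifurcations while walking the rest of the read through the graph. The hash functions, k, and the encoding format below are my own illustrative choices.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Very small Bloom filter over k-mer strings (illustration only).
struct Bloom {
    std::vector<bool> bits;
    explicit Bloom(std::size_t m) : bits(m, false) {}
    std::size_t h(const std::string& s, std::uint64_t seed) const {
        return std::hash<std::string>{}(s + char('0' + seed)) % bits.size();
    }
    void insert(const std::string& kmer) {
        for (std::uint64_t i = 0; i < 3; ++i) bits[h(kmer, i)] = true;
    }
    bool maybe_contains(const std::string& kmer) const {
        for (std::uint64_t i = 0; i < 3; ++i) if (!bits[h(kmer, i)]) return false;
        return true;   // may be a false positive
    }
};

// Encode a read as: anchor k-mer + the nucleotide chosen at every position
// where the graph offers zero or multiple extensions. A real implementation
// (Leon included) must also handle Bloom-filter false positives that make a
// unique extension disagree with the read; that bookkeeping is omitted here.
struct EncodedRead { std::string anchor; std::string bifurcations; std::size_t length; };

EncodedRead encode(const std::string& read, const Bloom& graph, int k) {
    EncodedRead enc{read.substr(0, k), "", read.size()};
    for (std::size_t i = k; i < read.size(); ++i) {
        std::string prefix = read.substr(i - k + 1, k - 1);
        int branches = 0;
        for (char c : std::string("ACGT"))
            if (graph.maybe_contains(prefix + c)) ++branches;
        if (branches != 1) enc.bifurcations += read[i];   // must record the choice
        // if exactly one extension exists, the decoder can infer read[i] itself
    }
    return enc;
}

int main() {
    const int k = 5;
    std::string reads[] = {"ACGTACGTTGCA", "CGTACGTTGCAA"};
    Bloom graph(1 << 16);
    for (const auto& r : reads)                       // build the probabilistic dBG
        for (std::size_t i = 0; i + k <= r.size(); ++i) graph.insert(r.substr(i, k));
    for (const auto& r : reads) {
        EncodedRead e = encode(r, graph, k);
        std::cout << r << " -> anchor " << e.anchor << ", "
                  << e.bifurcations.size() << " recorded bifurcation(s)\n";
    }
    return 0;
}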

Rob Patro pointed out similarities with the following paper –

Compression of short-read sequences using path encoding

Storing, transmitting, and archiving the amount of data produced by next generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty associated with managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs — a common task in genome assembly — and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, to compactly store sets of kmers that is of independent interest. Using these techniques, we are able to encode RNA-seq reads using 3% — 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.

The main difference between the two papers is in the choice of data structure (Bloom filter vs bit vector). Also, the latter paper uses some kind of reference, although it can be of very poor quality, whereas the former paper simply uses the de Bruijn graph.

Also note that the authors of KMC developed a minimizer-based compression method –

Minimizer Success – Disk-based Genome Sequencing Data Compression

Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.376 bits per base as the compression ratio.

A perceptual hash function to store and retrieve large scale DNA sequences

Link (h/t: @lexnederbragt)

This paper proposes a novel approach for storing and retrieving massive DNA sequences. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, that we adapted for DNA sequences. Perceptual hash function presented here is based on a Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray level intensity pixel and the hash is calculated from its significant frequency characteristics. This results to a drastic data reduction between the sequence and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by “avalanche effect” and thus can be compared. The similarity distance between two hashes is estimated with the Hamming Distance, which is used to retrieve DNA sequences. Experiments that we conducted show that our approach is relevant for storing massive DNA sequences, and retrieving them.

Nice concept. I wish they had checked it with real genomic sequence from E. coli instead of simulated sequence.
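For intuition, here is a hedged sketch of a DCT-sign-only hash over a DNA string: map each nucleotide to a fixed intensity (the 'gray level'), compute a naive 1-D DCT, keep only the signs of the first few non-DC coefficients as the hash bits, and compare hashes with the Hamming distance. The paper works on 2-D images of the sequence and chooses parameters I am not reproducing; everything below is an illustrative simplification.

#include <bitset>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

constexpr int HASH_BITS = 32;   // arbitrary hash length; sequences should be longer than this

// Map nucleotides to fixed "gray level" intensities (arbitrary choice).
double intensity(char c) {
    switch (c) {
        case 'A': return 0.25; case 'C': return 0.50;
        case 'G': return 0.75; case 'T': return 1.00;
        default:  return 0.0;
    }
}

// Naive O(N * HASH_BITS) type-II DCT; only the first coefficients are needed.
std::bitset<HASH_BITS> dct_sign_hash(const std::string& seq) {
    const double PI = std::acos(-1.0);
    std::vector<double> x;
    for (char c : seq) x.push_back(intensity(c));
    const double N = double(x.size());
    std::bitset<HASH_BITS> h;
    for (int k = 1; k <= HASH_BITS; ++k) {          // skip k = 0 (the DC term)
        double coeff = 0.0;
        for (std::size_t n = 0; n < x.size(); ++n)
            coeff += x[n] * std::cos(PI / N * (n + 0.5) * k);
        h[k - 1] = coeff >= 0.0;                    // keep only the sign
    }
    return h;
}

int main() {
    std::string a = "ACGTACGTACGTACGTTTTTACGTACGGGGGACGTACGTACGT";
    std::string b = a; b[10] = 'A'; b[25] = 'T';    // a slightly mutated copy
    auto ha = dct_sign_hash(a), hb = dct_sign_hash(b);
    std::cout << "Hamming distance between hashes: " << (ha ^ hb).count()
              << " of " << HASH_BITS << " bits\n";   // small distance => similar sequences
    return 0;
}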

‘Best of 2014’ – My Personal Opinion

Last year, I assembled a panel of judges, who chose the best bioinformatics contributions of the year from a number of suggestions made by the readers. You can see the results in Announcing the Results of ‘Best of 2013’.

The entire process took quite a bit of time from everyone involved. Moreover, we had an early start to make the effort visible and get the most feedback from readers. This year I got busy and did not find enough time to notify the judges and the readers. So, instead of running the ‘Best of 2014’ the same way as last year, I will share the ideas I found most interesting and encourage you to do the same in the comment section.

1. Succinct de Bruijn Graph

Variable-Order de Bruijn Graphs

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Very recently, two papers came out implementing the succinct de Bruijn graph idea of Alex Bowe et al. I found both of them quite interesting. The idea of the succinct de Bruijn graph merges two concepts – (i) de Bruijn graphs for assembly, and (ii) the XBW transform for graphs with its associated rank and select structures. The paper on variable-order de Bruijn graphs showed how the succinct de Bruijn graph naturally extends to de Bruijn graphs of multiple k-mer sizes, and I found that quite elegant.
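The workhorse behind these succinct structures is constant-time rank (and select) over bit vectors. The sketch below is a toy rank-only structure with precomputed block counts, just to illustrate the building block; it is nowhere near the BOSS/XBW machinery of the actual papers.

#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

// Toy rank structure: store the number of set bits before every 64-bit block,
// then finish the query with a popcount inside the block.
struct RankBitVector {
    std::vector<std::uint64_t> words;
    std::vector<std::uint64_t> block_rank;   // set bits strictly before each word

    explicit RankBitVector(const std::vector<bool>& bits) {
        words.assign((bits.size() + 63) / 64, 0);
        for (std::size_t i = 0; i < bits.size(); ++i)
            if (bits[i]) words[i / 64] |= 1ULL << (i % 64);
        block_rank.assign(words.size() + 1, 0);
        for (std::size_t w = 0; w < words.size(); ++w)
            block_rank[w + 1] = block_rank[w] + std::bitset<64>(words[w]).count();
    }

    // rank1(i) = number of 1s in positions [0, i)
    std::uint64_t rank1(std::size_t i) const {
        std::uint64_t r = block_rank[i / 64];
        if (i % 64 == 0) return r;
        std::uint64_t mask = (1ULL << (i % 64)) - 1;
        return r + std::bitset<64>(words[i / 64] & mask).count();
    }
};

int main() {
    std::vector<bool> bits = {1,0,1,1,0,0,1,0,1,1,1,0};
    RankBitVector rv(bits);
    std::cout << "rank1(5)  = " << rv.rank1(5)  << "\n";   // bits 0..4 contain 3 ones
    std::cout << "rank1(12) = " << rv.rank1(12) << "\n";   // total number of ones = 7
    return 0;
}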

2. Minimizer

On the representation of de Bruijn Graph

KMC 2: Fast and resource-frugal k-mer counting

Chikhi et al.'s merger of assembly and minimizers was quite elegant. The aspects of the KMC2 paper I found most interesting were – (i) the treatment of homopolymer minimizers to adjust bucket sizes, and (ii) the highly parallel code.

3. Cache-efficient common kmer finding in DALIGNER

DALIGNER: Fast and sensitive detection of all pairwise local alignments

In the DALIGNER paper, Gene Myers came up with a cache-efficient method for finding common k-mers between two reads, and the method scaled almost linearly with the number of cores. I found that quite elegant. His use of PacBio error statistics to prune unlikely paths in the O(ND) alignment, making it practical for long reads, was also very good.
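The core trick, stripped of DALIGNER's careful cache blocking and radix sorts, is simply: extract (k-mer, position) tuples from both reads, sort each list, and merge-scan to emit the common k-mers. The sketch below shows only that idea; it is not Gene Myers' implementation.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Seed { std::string kmer; std::size_t pos; };

std::vector<Seed> kmer_list(const std::string& read, int k) {
    std::vector<Seed> v;
    for (std::size_t i = 0; i + k <= read.size(); ++i) v.push_back({read.substr(i, k), i});
    std::sort(v.begin(), v.end(), [](const Seed& a, const Seed& b) { return a.kmer < b.kmer; });
    return v;
}

// Merge-scan the two sorted lists and report every common k-mer with its
// positions in both reads (the raw material for chaining a local alignment).
void common_kmers(const std::string& a, const std::string& b, int k) {
    auto la = kmer_list(a, k), lb = kmer_list(b, k);
    std::size_t i = 0, j = 0;
    while (i < la.size() && j < lb.size()) {
        if (la[i].kmer < lb[j].kmer) ++i;
        else if (lb[j].kmer < la[i].kmer) ++j;
        else {
            // report the full block of equal k-mers on both sides
            std::size_t i2 = i, j2 = j;
            while (i2 < la.size() && la[i2].kmer == la[i].kmer) ++i2;
            while (j2 < lb.size() && lb[j2].kmer == la[i].kmer) ++j2;
            for (std::size_t x = i; x < i2; ++x)
                for (std::size_t y = j; y < j2; ++y)
                    std::cout << la[x].kmer << " at A:" << la[x].pos
                              << " B:" << lb[y].pos << "\n";
            i = i2; j = j2;
        }
    }
}

int main() {
    common_kmers("ACGTTGCATTAGC", "TTGCATTAGGAAC", 5);
    return 0;
}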

————————————

The above is only a subset of a large number of unusual contributions made by bioinformatics researchers. A bigger list of interesting papers on NGS can be found in ASHG/GA4GH Special – A Recap of Interesting NGS Algorithms of the Year. Please share in the comment section any idea, paper, blog or teaching tool that you enjoyed.

Dear BGI, Please Stop ‘Revealing’ Any More ‘Insights’ :)

A long, long time ago, titles of papers told you exactly what was inside. For example, Alexander Fleming's penicillin paper (1928) was titled – On the Antibacterial Action of Cultures of a Penicillium, with Special Reference to Their Use in the Isolation of B. influenzae, not "here is the greatest drug to save humanity".

The practice continued until the late 1990s, when the first few genome papers were named as banally as "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana" or "The nucleotide sequence of Saccharomyces cerevisiae chromosome XV and its evolutionary implications." Then, somewhere along the line, someone got the idea that a genome paper needed a colorful title featuring the word 'insight', and other primates writing genome papers copied it. For a full list of insight-full genome papers, check –

Shocking Finding that a Genome by Itself Provides Little Insight

and

Clearly, the purpose of genome sequencing is to “provide insights”.

On the other hand, 'reveal' is the word commonly used in the titles of press releases for genome papers, and the practice likely started with Watson's 2001 write-up –

Watson, J.D. 2001. The human genome revealed. Genome Research 11: 1803-1804.

Fast forward thirteen years and we have BGI with their mastery of words and revealing insights. Here are the titles of the latest BGI papers –

Comparative genomics reveals insights into avian genome evolution and adaptation

Two Antarctic penguin genomes reveal insights into their evolutionary history and molecular changes related to the Antarctic environment

Oh well! Where do we go from here?

Alternative Splicing – the New Snake Oil to Explain ‘Human Complexity’

We have been noticing increasingly that various bioinformaticians are chasing alternative splicing in RNAseq data. That qualifies as harmless curiosity just by itself, but when one combines it with the additional observation that various ENCODE and GTEx clowns are using alternative splicing to 'explain' 'human complexity', the situation gets worrisome. On the latter point, look no further than a paper published by leading ENCODE clown Mike Snyder last year (covered in Human Transcriptome is Extremely Complex and Snyderome is the Most Complex of All).

The problem started in 2001, when the number of protein-coding genes in the human genome fell far short of the expected 100,000, coming in at only 25,000. That was comparable to the 20,000 genes of a measly worm. The observation deflated the egos of a few bad scientists, who have been looking for other explanations of human complexity ever since. Using alternative splicing to turn those 20,000 genes into 100,000 splice forms is one method; others include declaring the entire genome 'biochemically functional' and talking incessantly about human long non-coding RNAs as something marvelous. The Sandwalk blog has a full list.

Here’s the latest list of the sorts of things that may salvage your ego if it has been deflated.

1. Alternative Splicing: We may not have many more genes than a fruit fly but our genes can be rearranged in many different ways and this accounts for why we are much more complex. We have only 25,000 genes but through the magic of alternative splicing we can make 100,000 different proteins. That makes us almost ten times more complex than a fruit fly. (Assuming they don’t do alternative splicing.)

2. Small RNAs: Scientists have miscalculated the number of genes by focusing only on protein encoding genes. Our genome actually contains tens of thousands of genes for small regulatory RNAs. These small RNA molecules combine in very complex ways to control the expression of the more traditional genes. This extra layer of complexity, not found in simple organisms, is what explains the Deflated Ego Problem.

3. Pseudogenes: The human genome contains thousands of apparently inactive genes called pseudogenes. Many of these genes are not extinct genes, as is commonly believed. Instead, they are genes-in-waiting. The complexity of humans is explained by invoking ways of tapping into this reserve to create new genes very quickly.

4. Transposons: The human genome is full of transposons but most scientists ignore them and don’t count them in the number of genes. However, transposons are constantly jumping around in the genome and when they land next to a gene they can change it or cause it to be expressed differently. This vast pool of transposons makes our genome much more complicated than that of the simple species. This genome complexity is what’s responsible for making humans more complex.

5. Regulatory Sequences: The human genome is huge compared to those of the simple species. All this extra DNA is due to increases in the number of regulatory sequences that control gene expression. We don’t have many more protein-encoding regions but we have a much more complex system of regulating the expression of proteins. Thus, the fact that we are more complex than a fruit fly is not due to more genes but to more complex systems of regulation.

6. The Unspecified Anti-Junk Argument: We don’t know exactly how to explain the Deflated Ego Problem but it must have something to do with so-called “junk” DNA. There’s more and more evidence that junk DNA has a function. It’s almost certain that there’s something hidden in the extra-genic DNA that will explain our complexity. We’ll find it eventually.

7. Post-translational Modification: Proteins can be extensively modified in various ways after they are synthesized. The modifications, such as phosphorylation, glycosylation, editing, etc., give rise to variants with different functions. In this way, the 25,000 primary protein products can actually be modified to make a set of enzymes with several hundred thousand different functions. That explains why we are so much more complicated than worms even though we have similar numbers of genes.

What, then, is the explanation of ‘human complexity’? The answer is none. Humans are no more complex than most other animals.

Getting back to the original point, I would be very cautious about reading too much into alternative splicing data from RNAseq, unless it is backed by other biochemical confirmation.

Taiwanese Nanotech Paper Heading for Retraction, Professor/Student Punished

When we wrote about a controversial Nanopore paper last year (see Nanopore Sequencing by Divine Intervention?), we expected the authors to either receive multiple Nobel prizes or get elevated to the status of deity. The second option seemed more likely after additional details came out about the project (see Bizarro Details Coming out on ‘Nanopore Sequencing by Divine Intervention’ Story). Apparently the ‘lab’, where those cutting-edge experiments were done, did not have any suitable equipment and was located far from the university near a well-known temple.

National Chiao Tung University, whose motto is 知新致遠 崇實篤行 ("Learn New Knowledge and Reach Far; Honor the Truth and Work Hard"), did not appreciate being a laughing stock any more and pulled the plug on this project. Those who understand Chinese can read about this latest development at this link. The jumbled text from Google Translate is not worth reproducing, and the picture makes one think the professor/student got some big reward rather than being kicked out of the university.


This is a terrible tragedy for our blog, because we will have to restrict ourselves to making fun of domestic clowns, such as the ENCODE leaders and the positivity lady.

Excellent Software Engineering in KMC

I did not have much respect for 'Boosted' C++ programmers in bioinformatics until I went through the KMC code. KMC, or rather its latest version KMC2, is currently the fastest (and leanest) k-mer counting program. A k-mer counting program is conceptually very simple, which means the faster programs are reflections of the programmers' knowledge of hardware and data structures. For example, Jellyfish uses the compare-and-swap feature of modern processors to update its hash table concurrently without locks. BFCounter and a number of other programs use a Bloom filter, a data structure concept unrelated to bioinformatics.
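For readers who have not seen the compare-and-swap trick: a counter in a shared table can be bumped by many threads without a lock, by retrying an atomic compare_exchange until it succeeds. The fragment below is a generic illustration with std::atomic, not Jellyfish's actual packed hash table; for a plain counter, fetch_add would suffice, but the CAS loop is what you need when the update is more than a simple addition (for example, writing a key and a value together).

#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// A fixed-size table of atomic counters; imagine the index being a k-mer hash.
std::vector<std::atomic<std::uint64_t>> table(1024);

void increment(std::size_t slot) {
    std::uint64_t cur = table[slot].load(std::memory_order_relaxed);
    // Classic CAS loop: retry until no other thread raced us on this slot.
    while (!table[slot].compare_exchange_weak(cur, cur + 1,
                                              std::memory_order_relaxed)) {
        // 'cur' has been reloaded with the current value; just try again.
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] { for (int i = 0; i < 100000; ++i) increment(42); });
    for (auto& w : workers) w.join();
    std::cout << "slot 42 = " << table[42] << " (expected 400000)\n";
    return 0;
}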

KMC, the earlier program, was disk-based and memory-light, just like Chikhi et al.'s DSK. However, KMC showed a significant speed improvement over DSK by using extensive parallelization in various steps. KMC2 further improves over KMC through its use of the minimizer concept. In fact, KMC2 also improves over MSPcounter, another minimizer-based k-mer counting program, because it refines minimizers into 'signatures' to avoid too many subreads going to the same bucket.
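As a toy illustration of the signature idea (this is not the exact rule set in the KMC2 paper): pick, for each k-mer, the lexicographically smallest m-mer that does not look like a low-complexity homopolymer prefix, and use it as the bucket key. Disallowing such m-mers spreads the k-mers (and hence the super-k-mers built from them) more evenly across buckets.

#include <iostream>
#include <map>
#include <string>
#include <vector>

const int K = 11, M = 4;   // k-mer and signature lengths (arbitrary here)

// Toy rule: an m-mer is not allowed as a signature if it starts with "AAA"
// or contains "AA" after the first position. KMC2's real rules differ.
bool allowed(const std::string& m) {
    if (m.rfind("AAA", 0) == 0) return false;
    if (m.find("AA", 1) != std::string::npos) return false;
    return true;
}

// Signature of a k-mer: the smallest allowed m-mer inside it
// (falling back to the smallest m-mer overall if none is allowed).
std::string signature(const std::string& kmer) {
    std::string best, fallback;
    for (std::size_t i = 0; i + M <= kmer.size(); ++i) {
        std::string m = kmer.substr(i, M);
        if (fallback.empty() || m < fallback) fallback = m;
        if (allowed(m) && (best.empty() || m < best)) best = m;
    }
    return best.empty() ? fallback : best;
}

int main() {
    std::string read = "ACGTAAAATCGGCTAGCAAGTCCTAGGT";
    std::map<std::string, std::vector<std::string>> buckets;
    for (std::size_t i = 0; i + K <= read.size(); ++i) {
        std::string kmer = read.substr(i, K);
        buckets[signature(kmer)].push_back(kmer);   // consecutive k-mers sharing
                                                    // a signature form a super-k-mer
    }
    for (const auto& [sig, kmers] : buckets)
        std::cout << sig << " -> " << kmers.size() << " k-mer(s)\n";
    return 0;
}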

Those algorithmic contributions aside, the KMC2 code is a pleasure to read. The code is organized into four subdirectories (kmer_counter, kmc_dump, kmc_dump_sample, kmc_api), of which 'kmer_counter' is the most important. The following figure from the KMC2 paper displays the flow diagram of the code.

[Figure: flow diagram of the KMC2 k-mer counting pipeline, from the KMC2 paper; image not reproduced here.]

The KMC code is multithreaded and uses a number of queues to handle parallelization. What I found quite amazing is that it completed k-mer counting in about the same time that a test program of mine took merely to read through a giant FASTQ file. That means the biggest bottleneck of KMC2 is reading the FASTQ file from disk, not the actual processing and counting of k-mers. I am working through the various classes within the code, and the following blocks will be updated as I proceed. If you have any further insight, please feel free to let me know in the comment section below.
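Here is a hedged sketch of the general pattern (not KMC's own queue classes): one reader thread pushes chunks of FASTQ text into a bounded queue, and several worker threads pop and process them. If the workers drain the queue faster than the reader can fill it, the disk is the bottleneck, which is exactly the behaviour described above.

#include <atomic>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Minimal bounded, thread-safe queue of text chunks.
class ChunkQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable not_empty, not_full;
    bool closed = false;
    static constexpr std::size_t CAPACITY = 16;
public:
    void push(std::string chunk) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&] { return q.size() < CAPACITY; });
        q.push(std::move(chunk));
        not_empty.notify_one();
    }
    bool pop(std::string& out) {             // returns false when finished
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&] { return !q.empty() || closed; });
        if (q.empty()) return false;
        out = std::move(q.front()); q.pop();
        not_full.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m);
        closed = true;
        not_empty.notify_all();
    }
};

int main() {
    ChunkQueue queue;
    std::thread reader([&] {                  // stands in for the FASTQ reader thread
        for (int i = 0; i < 100; ++i) queue.push("@read\nACGT...\n+\nIIII...\n");
        queue.close();
    });
    std::vector<std::thread> workers;
    std::atomic<int> processed{0};
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] {            // stands in for the splitter threads
            std::string chunk;
            while (queue.pop(chunk)) ++processed;
        });
    reader.join();
    for (auto& w : workers) w.join();
    std::cout << "processed " << processed << " chunks\n";
    return 0;
}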

kmer_counter.cpp: The main code is in kmer_counter.cpp. It defines the class 'CApplication', which calls the various other classes through the CKMC class.

kmc.h – defines the CKMC class that parallelizes the code and calls the classes in the files below.

The CKMC class is where all the action takes place, so let me describe it in a bit more detail.

i) The public function SetParams(CKMCParams &_Params) sets all parameters.

ii) The public function Process() does all the work by synchronizing all the threads, etc. This is the most important function in the entire program.

iii) Here is the class declaration, listing all other private and public members. The names are quite descriptive of what they do.

// Note: the template parameter list and the template arguments inside the
// container types were stripped by the blog's HTML formatting; they are shown
// as /*...*/ placeholders below rather than guessed.
template </*...*/> class CKMC {
    bool initialized;

    CStopWatch w0, heuristic_time, w1, w2;

    // Parameters (input and internal)
    CKMCParams Params;

    // Memory monitor and queues
    CKMCQueues Queues;

    // Thread groups
    vector</*...*/> gr0_1, gr0_2;
    vector</*...*/> gr1_1, gr1_2, gr1_3, gr1_4, gr1_5; // thread groups for 1st stage
    vector</*...*/> gr2_1, gr2_2, gr2_3;               // thread groups for 2nd stage

    uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, n_total_super_kmers;

    // Threads (worker objects)
    vector</*...*/> w_stats_fastqs;
    vector</*...*/*> w_stats_splitters;
    vector</*...*/> w_fastqs;
    vector</*...*/*> w_splitters;
    CWKmerBinStorer *w_storer;

    CWKmerBinReader</*...*/>* w_reader;
    vector</*...*/*> w_sorters;
    CWKmerBinCompleter *w_completer;

    void SetThreads1Stage();
    void SetThreads2Stage(vector</*...*/>& sorted_sizes);

    bool AdjustMemoryLimits();
    void AdjustMemoryLimitsStage2();

    void ShowSettingsStage1();
    void ShowSettingsStage2();

public:
    CKMC();
    ~CKMC();

    void SetParams(CKMCParams &_Params);
    bool Process();
    void GetStats(double &time1, double &time2, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64& _n_total_super_kmers);
};

kmer.h – defines the classes CKmer and CKmerQuake for handling k-mers without and with quality scores.

fastq_reader.h – defines the FASTQ reader classes CFastqReader and CWStatsFastqReader. The second class is a wrapper around the first, created for multi-threading.

splitter.h – defines the class CSplitter, which splits reads into various bins. This is where minimizers are computed.

kb_storer.h – defines the class CKmerBinStorer, which stores bins of k-mers.

kb_sorter.h – defines the class CKmerBinSorter, which sorts k-mers within bins.

kb_reader.h – defines the class CKmerBinReader, which reads bins back from the distribution phase.

kb_completer.h – defines the class CKmerBinCompleter, which completes the sorted bins and stores them in a file.

kb_collecter.h – defines the class CKmerBinCollector, which collects k-mers belonging to a single bin.

The Tragic Demise of BLAST

BLAST is one of the most well-known bioinformatics programs. In fact, if I remove 'one of' from the previous sentence, that would not be a mistake. The scientists who developed it received numerous well-deserved awards. For the early history of how the program was written, please check the blog post – Once upon a BLAST.

Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).

————————————-

But that is all history from the early 1990s. What most researchers do not know is that BLAST is currently dying at the hands of NCBI. It is not dying from lack of funding, but from over-funding. The early innovators have all left, and God knows who has taken over.

Last week, I downloaded the source code of several bioinformatics programs we use frequently – BLAST, BWA, samtools, SOAPdenovo2, Minia, Trinity, KMC2, SPAdes, DAZZLER, etc. – and compiled each of them. NCBI BLAST turned out to be THE worst of them all.

Here are the specific complaints –

(i) BLAST compiled to an executable directory (ReleaseMT) of size ~890 MB! Why would an aligner need almost 1 GB of space? For comparison, bwa compiles to only 1.4 MB.

(ii) Configure and make of BLAST took over an hour on a reasonably high-end server. BWA compiled in exactly 12 seconds.

(iii) The BLAST executable directory seemed to have all kinds of ‘sophisticated’ modules, including unit tests and other junk. That is all fine and dandy, except (see next point) –

(iv) After an hour of waiting and watching my directory fill up, I desperately looked for a 'make clean' target, but could not find one. Every other program had 'make clean' to help me get back to the ground state. Why not BLAST?

If BLAST gets any more funding, it will turn into Obamacare website software and stop working altogether. NCBI should urgently place BLAST on GitHub and, if it must keep spending (aka wasting money), start derivative projects under new names. BLAST does not need any further development, and definitely no more funding to destroy it.