It is time to find the best bioinformatics contributions of 2013, just like we did in 2012 (Top Bioinformatics Contributions of 2012). The original idea came to us after noticing that the yearly reviews in Science and Nature celebrated the large experimental projects, whereas bioinformatics tools like BLAST, BWA or SOAPdenovo rarely got mentioned despite their immense contribution to biology. More importantly, papers describing elegant computational algorithms got recognized years after their publication (Pevzner’s dBG, Myers’ string graph) or never got recognized at all (Ross Lippert’s 2005 papers on using the Burrows-Wheeler Transform in genomics). So, we wanted to give recognition to the major computational discoveries in biology and to bring attention to under-appreciated contributions with potential long-term benefit.
For this year’s effort, we assembled an outstanding panel of judges.
Continue reading Top Bioinformatics Contributions of 2013
The title is a bit of an over-simplification to make a point. However, the strong connection between dBG and BWT becomes clear when you understand the ‘succinct de Bruijn graph’ method presented by Alex Bowe. We encourage you to read the linked well-written commentary first. The following discussion presents the conceptual details that may help you understand what is going on.
Burrows-Wheeler Transform
We covered the Burrows-Wheeler transform in the following two commentaries.
Finding us in homolog.us
Finding us in homolog.us – part II
Essentially, you start with a word and keep rotating it by moving one character from the front to the back until all possibilities are exhausted. You end up with a table like this.
Sort all the entries in the table and take the last column. That is the Burrows-Wheeler transform of the original word.
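To make the construction concrete, here is a minimal Python sketch of the rotation-and-sort procedure. The ‘$’ end-of-string sentinel is our addition – the usual convention that keeps the transform invertible – and is not part of the description above.

def bwt(word):
    # append a sentinel so the end of the word is marked (our assumed convention)
    word += "$"
    # build the rotation table by moving one character from front to back at a time
    rotations = [word[i:] + word[:i] for i in range(len(word))]
    # sort all the entries in the table lexicographically
    rotations.sort()
    # the last column of the sorted table is the Burrows-Wheeler transform
    return "".join(row[-1] for row in rotations)

print(bwt("banana"))  # prints 'annb$aa'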
Continue reading Burrows-Wheeler Transform (BWT) is de Bruijn Graph is BWT
The following narrative is a beautiful demonstration of how natural forces work more powerfully than human attempts to micromanage them. It has all the components: banned chemicals, honey bee colony collapse, widely disliked ants, government-funded efforts to keep them at bay and nature’s hammer on them all. It even presents a new threat to the highest form of human ascendancy – namely, the innovation of electronic gadgets, toys, video games and other circuitry. Readers are warned that the words ‘intelligent design’ are broadly defined to include the activities of various central planners, environment agencies, government bureaucrats and other demigods of contemporary society.
Fipronil is a modern pesticide that targets the central nervous system of insects. More specifically, it blocks the passage of chloride ions through the GABA receptor. The chemical does not affect humans, because one of its targets, the glutamate-gated chloride (GluCl) channel, is not present in mammals. The discovery of the chemical (~1987) was thus greatly helped by the modern understanding of cell biology. Fipronil was approved for use as an insecticide after extensive testing between 1987 and 1996.
The most important property of Fipronil is that it can act as a slow poison, killing an entire colony of bugs and not only the few bugs on which the insecticide is sprayed. From the wiki -
Fipronil is a slow acting poison. When used as bait, it allows the poisoned insect time to return to the colony or harborage. In cockroaches, the feces and carcass can contain sufficient residual pesticide to kill others in the same nesting site. In ants, the sharing of the bait among colony members assists in the spreading of the poison throughout the colony. With the cascading effect, the projected kill rate is about 95% in three days for ants and cockroaches. Fipronil serves as a good bait toxin not only because of its slow action, but also because most, if not all, of the target insects do not find it offensive or repulsive.
Toxic baiting with fipronil has also been shown to be extremely effective in locally eliminating German wasps (commonly called yellow jackets in North America). All colonies within foraging range are completely eliminated within one week.
Readers should note that although Fipronil does not affect mammals, other vertebrates are fair game. From a paper titled -
Fipronil: environmental fate, ecotoxicology, and human health concerns
One of its main degradation products, fipronil desulfinyl, is generally more toxic than the parent compound and is very persistent. There is evidence that fipronil and some of its degradates may bioaccumulate, particularly in fish.
Fipronil is highly toxic to bees (LD50 = 0.004 microgram/bee), lizards [LD50 for Acanthodactylus dumerili (Lacertidae) is 30 micrograms a.i./g bw], and gallinaceous birds (LD50 = 11.3 mg/kg for Northern bobwhite quail), but shows low toxicity to waterfowl (LD50 > 2150 mg/kg for mallard duck).
Honey Bee Colony Collapse Disorder
Six years back, when we were working on the honey bee genome paper, colleagues often mentioned one strange and recent observation. You can read the wiki on colony collapse disorder:
Colony collapse disorder (CCD) is a phenomenon in which worker bees from a beehive or European honey bee colony abruptly disappear. While such disappearances have occurred throughout the history of apiculture, and were known by various names (disappearing disease, spring dwindle, May disease, autumn collapse, and fall dwindle disease), the syndrome was renamed colony collapse disorder in late 2006 in conjunction with a drastic rise in the number of disappearances of Western honeybee colonies in North America. European beekeepers observed similar phenomena in Belgium, France, the Netherlands, Greece, Italy, Portugal, and Spain, and initial reports have also come in from Switzerland and Germany, albeit to a lesser degree while the Northern Ireland Assembly received reports of a decline greater than 50%.
The cause of colony collapse disorder is still not known, and everything from starvation, pathogens and mites to modern insecticides has been blamed. No matter what the cause is, bees are the kind of bug you do not want to see disappear. Their disappearance would wreak havoc on farming, because less pollination means less food production.
Some researchers blamed Fipronil as the potential cause of colony collapse disorder.
Fipronil is one of the main chemical causes blamed for the spread of colony collapse disorder among bees. It has been found by the Minutes-Association for Technical Coordination Fund in France that even at very low nonlethal doses for bees, the pesticide still impairs their ability to locate their hive, resulting in large numbers of forager bees lost with every pollen-finding expedition. A 2013 report by the European Food Safety Authority identified fipronil as “a high acute risk to honeybees when used as a seed treatment for maize”, and on July 16, 2013, the EU voted to ban the use of fipronil on corn and sunflowers within the EU. The ban will take effect at the end of 2013.
Also read -
BASF challenges EU ban on fipronil pesticide
German chemicals group BASF said it launched a legal challenge against the European Commission’s ban of BASF’s insecticide fipronil, imposed in July on concern its use as seed treatment is linked to declining bee populations.
The European Union in July added fipronil to its blacklist of substances suspected of playing a role in declining bee populations.
The ban follows similar EU curbs imposed in April on three of the world’s most widely-used pesticides, known as neonicotinoids, and reflects growing concern in Europe over a recent plunge in the population of honeybees critical to crop pollination and production.
Let us now switch gears to a different kind of insect that nobody loves – the fire ant. Fire ants came to the USA from Brazil through trade and spread all over the southeastern states. They are found in everyone’s backyard, front yard, school playground and other places in and around the house. Every once in a while, you come across stories like this – “Texas Boy Dead After Fire Ant Bites”.
Fire ants are usually removed with pesticides, but some people go on to take extreme measures. Do not try this at home !!
Fly Eating the Brains of Fire Ants
How does one control the fire ants? Researchers and grant agencies came up with an elegant ‘natural’ solution. An entomologist named Sanford Porter observed that a type of fly lays its eggs inside fire ants; when the eggs hatch, the maggots eat the heads of the ants.
Absurd Creature of the Week: This Fly Hijacks an Ant’s Brain — Then Pops Its Head Off
So Porter searched for a natural enemy that might be keeping southern populations in check. Following a tip from a colleague, he began seeking out fire ants fending off attacks from tiny flies. He gathered some of these besieged individuals and returned to the United States, where he soon began finding maggots in the ants’ bodies. “And around about two weeks [after that] I found that the heads would fall off,” he told WIRED, “and lo and behold I could see the pupa inside the ant’s head.”
The flies he’d observed weren’t hunting the ants. They were much too small for that. Apparently not to be bothered with the stresses of parenthood, they were infesting the creatures with their young. Here, take this for me, the flies seemed to say, I’ve got a lot going on in my life right now.
Here’s how it works. Attracted by the smell of the fire ant’s alarm pheromone, the female ant-decapitating fly hovers a few millimeters from her target. “When they get into just the right position, they dive in,” said Porter, who is now with the USDA Agricultural Research Service. The fly has a sort of lock-and-key ovipositor, the shape of which varies widely between species, “and once that’s fit onto the ant’s body, around the legs somewhere, then what happens is that there’s an internal ovipositor that looks like a hypodermic needle, and that hits probably in the membranes in between the legs,” firing a tiny torpedo-shaped egg into the ant.
The following video shows the process. It is the weirdest thing that we have seen in a while!!
Bringing a bug from South America and releasing it in Texas was no easy task. The researchers had to make sure the flies would not become a nuisance themselves. Finally, humans had found a solution to the fire ant problem – or had they?
Rasberry Crazy Ants
Nature had a different plan to take care of the fire ants, unlike the ‘natural’ solution introduced by humans. Enter the ‘Rasberry crazy ants’. About 6-7 years back, a new kind of ant arrived in Texas from South America, following the same trade route originally taken by the fire ants. These ants are aptly named ‘crazy ants’, because they are driving the fire ants crazy !!
Apparently, they are also driving the Texans crazy, because anyone experiencing a crazy ant attack wants the fire ants back as the humbler bugs !!
A few observations about crazy ants -
(i) they do not respond to commonly used insecticides,
(ii) their colonies have more than one queen, so it is much harder to destroy a colony by killing one queen ant,
(iii) they multiply very fast and are attracted to electronic circuits, and can therefore damage computer gadgets very quickly.
The ‘crazy ants’ are unlike anything people in the USA have seen before. To understand how unusual they are, readers are encouraged to go through the following two articles -
There’s a Reason They Call Them ‘Crazy Ants’
Soon ants were spiraling up the tongues of my sneakers, onto my sock. I tried to shake them off, but nothing I did disturbed them. Before long, I was sweeping them off my own calves. I kept instinctively taking a step back from some distressing concentration of ants, only to remember that I was standing in the center of an exponentially larger concentration of ants. There was nowhere to go. The ants were horrifying — as in, they inspired horror. Eventually, I scribbled in my notebook: “Holy [expletive] I can’t concentrate on what anyone’s saying. Ants all over me. Phantom itches. Scratching hands, ankles, now my left eye.” Then I got in my car and left.
The 5 craziest moments from the Times’ feature on “crazy ants”
Response of Central Planners
This part is the most hilarious. The arrival of crazy ants in Texas was noticed by Mr. Tom Rasberry, who never completed anything beyond high school but knew his bugs well through his profession as an exterminator. He alerted the central planners about the arrival of this new kind of ant in 2002, but they were too busy fighting the last war. Within five years, the crazy ants spread all over Texas, Louisiana, Mississippi, Florida and Georgia and started replacing the fire ants.
Finally the central planners took notice, and the first thing they did was to write a grant proposal. But the grant did not get approved, because -
This meeting took place on Oct. 9, 2008, just as the American economy was crumbling. Six days earlier, President Bush signed over $700 billion to the new Troubled Asset Relief Program. “I don’t think the federal government had a lot of money to spend on bugs,” one task-force participant remembered. In fact, very quickly, the conversation foundered in a maddening Catch-22: the government preferred not to release any money to research or combat the crazy ants until it knew what species it was dealing with. The scientists insisted that they needed funding to figure that out.
Finally, one man spoke up. “I said: ‘You know? You all sound like a bunch of idiots,’ ” he recalls. He was 52, with a graying, bristly mustache and leathery skin, and on paper at least, he had no business being there. He wasn’t a bureaucrat or a scientist. He’d never even gone to college. He was just an exterminator — the kind who drives around in a truck and sprays stuff. But he was the exterminator who discovered the ants. His name was Tom Rasberry. He’d named them after himself.
That turned out to be a problem. Central planners decided to give it a different name – ‘tawny crazy ants’. For a long time, they continued this name-giving game, while the ants marched on to new territories !!
From the NY Times story -
Tom Rasberry collected samples of the ant at the Pasadena chemical plant in 2003 and sent them off to a lab at Texas A&M to be identified. But taxonomy — the process of ordering living things into species — is arguably more an art than a science, and figuring out what species the ants were, and where they came from, quickly became vexing. Academics from other institutions swarmed in to debate, for example, the significance of four tiny hairs on the ant’s thorax. For years, they hurtled through a series of wrong answers, but the consensus eventually leaned toward a certain invasive ant, called Nylanderia pubens, which has been in Florida since the 1950s.
Rasberry was convinced this couldn’t possibly be the same ant. “It’s just common sense,” he said. His ant was ripping through Texas like a violent dust storm; their ant had been entrenched in Florida for more than 50 years, barely dispersing or causing any trouble. Why would the bug suddenly behave so differently? Rasberry began his own, amateur taxonomic investigation, spending thousands of hours out in the field or examining samples with a microscope in the back room of the Rasberry’s Pest Professionals office. “It was a nightmare,” he told me. He’d never had any interest or aptitude for science, didn’t find bugs that fascinating and hates reading. But he willed his way through the entomological research, looking for answers. (“It was an obsession,” his daughter, Mandy Rasberry-Ganucheau, said. For years, Rasberry would come over once a week to see his grandkids and end up talking about crazy ants.) Still, the science kept creeping toward its own conclusion. And as long as there was evidence that the ants in Texas were pubens and not something new, the government felt it was reasonable not to act. Roger E. Gold, a veteran Texas A&M entomologist working on the species, told me that the scientific uncertainty became “almost a reason for the federal government not to get involved,” even as the situation in Texas grew catastrophic. “The taxonomy thing was almost a joke,” Gold added, “if it weren’t so serious.”
State and federal agencies have now financed a very limited amount of research, and the E.P.A. has tweaked its regulations to allow the use of a high-powered pesticide against the ant. The taxonomy question was settled only in September 2012, when scientists led by a fellow at the Smithsonian looked at the molecular sequencing of a broad range of specimens and concluded that the Rasberry crazy ant is not the same ant that was collected in Florida in the 1950s. It’s Nylanderia fulva, a species native to Brazil. Rasberry, in other words, was vindicated. And yet, so many speculative plot twists and Latin names have accumulated around the ant that it’s still easy to get confused. A policy manager at the U.S.D.A.’s Animal and Plant Health Inspection Service recently explained to me that because the ants in Texas are “the same species” — pubens — that has been long established in Florida, the pest has “become too widespread to take effective action.” In short, the ant is already out of the bag.
Then, last winter, the federal research entomologist David Oi and the researcher who led the taxonomy study, Dietrich Gotzek, complicated the story further. They gave fulva a common name, via a petitioning process administered by the Entomological Society of America. Everyone was already calling it Rasberry crazy ant, but that hardly mattered: Naming a bug after a person is strongly frowned on. Besides, Oi told me, the name was too confusing: “People thought it was supposed to be the fruit.” He and his colleague rechristened it the Tawny crazy ant, a name almost no one in Texas appears to use — and especially not Tom Rasberry, who took Oi’s maneuver as a personal attack. “It may sound arrogant,” Rasberry told me, “but I think they’re totally irritated that someone without a college degree one-upped all the Ph.D.s.”
Bring back the Fipronil
Now that nature itself is on the march, humans have very few weapons to fight back with, other than those supposed to ‘damage nature and the environment’. ‘Crazy ants’ do not respond to regular insecticides, and therefore the big guns are needed.
Pesticide for SE Texas ‘crazy’ ants approved by EPA
Acting on a request by the Texas Department of Agriculture, the U.S. Environmental Protection Agency on Tuesday approved a crisis exemption for use of fipronil (Termidor SC) on crazy ant infestations. The crisis exemption is in effect until the EPA rules on the state’s request for a specific exemption so the pesticide could be used for three years.
Crazy Rasberry ants, called “crazy” because of their zigzag march and named after Tom Rasberry, the Pasadena exterminator who discovered them in 2002, now infest Harris, Brazoria, Galveston, Jefferson, Liberty, Montgomery and Wharton counties.
The rice-grain-size ants, which can bite but not sting, have a penchant for infesting electrical devices and have been blamed for the failure of computers, sewage pumps and electric gate motors.
How about the bees? The beekeepers are so threatened by the ‘crazy ant’ attack that colony collapse disorder appears to be a minor problem in comparison.
We are not sure whether this convoluted story has any simple conclusion. Humans studied biology and came up with a poison that kills only bugs and not mammals. The pesticide was tested ‘thoroughly’ and released onto the market. It worked so well on the bugs that honey bee colonies started to disappear almost overnight. The chemical was banned, while a more ‘natural’ and benign solution to the fire ant problem was introduced. In the meantime, nature released its own terror – the crazy ant, which displaced the fire ants, munched on computer circuits and threatened human civilization so much that the banned pesticide was brought back as an ‘emergency measure’.
(source of figure: AFP/File, Cesar Manso)
Another day, another picture of people digging under a cave, another ‘incredible breakthrough’, another sex story !! After the Denisovans and the extremely ancient Africans (discussed in Denisovans, Extremely Ancient Africans – the Role Cheap Sequencing Plays in Rewriting Human History) comes today’s ‘baffling finding’, which pushes the record back four-fold.
Using a thigh bone from the cave, Matthias Meyer from the Max Planck Institute for Evolutionary Anthropology has sequenced the almost complete mitochondrial genome of one of Sima de los Huesos’ inhabitants, who likely lived around 400,000 years ago. That is at least four times older than the previous record-holder—a small 100,000-year-old stretch of Neanderthal DNA.
The Discovery Channel pushes the sex angle. Sex sells, even in next-gen sequencing.
Ancient Humans Had Sex with Mystery Relatives
They even go to the extent of concocting an amorous image.
Sci-news reports that the ‘genome is sequenced’, even though it is not.
Sima de los Huesos: Scientists Sequence Genome of Enigmatic Hominin
Truth: the nuclear genome is very unlikely to be sequenced, according to the authors. Even the mitochondrial sequence was full of contamination.
Nature sells you the paper for $32 (or $199, if you are wealthy).
A Mitochondrial Genome Sequence of a Hominin from Sima de los Huesos
Excavations of a complex of caves in the Sierra de Atapuerca in northern Spain have unearthed hominin fossils that range in age from the early Pleistocene to the Holocene. One of these sites, the ‘Sima de los Huesos’ (‘pit of bones’), has yielded the world’s largest assemblage of Middle Pleistocene hominin fossils, consisting of at least 28 individuals dated to over 300,000 years ago. The skeletal remains share a number of morphological features with fossils classified as Homo heidelbergensis and also display distinct Neanderthal-derived traits. Here we determine an almost complete mitochondrial genome sequence of a hominin from Sima de los Huesos and show that it is closely related to the lineage leading to mitochondrial genomes of Denisovans, an eastern Eurasian sister group to Neanderthals. Our results pave the way for DNA research on hominins from the Middle Pleistocene.
Only Dan Graur (@dangraur) gives you what matters -
i) the full paper, and
ii) a healthy dose of skepticism.
Mike White (@genologos) is a very creative person. We already covered his paper refuting an ENCODE claim by showing that random DNA sequences displayed similar binding behavior – or rather, that the ENCODE experiments did not have a proper control. Apart from science, he expresses his creativity through the beautiful ‘The Finch and Pea’ blog, and he also writes informative columns at the Pacific Standard site.
If we had a billion dollars to spare (say, from building and selling houses on bomb-testing sites), we would definitely make him the head of a research center. Unfortunately, we do not. The only other option is to give him a shoutout so that many people can join together to pay for his research. That is where we face a problem under the existing model of centralized funding.
Continue reading The Tragedy of Centrally Funded Research
Both Rayan Chikhi (an author of the Minia assembler) and Anton Korobeynikov (an author of the SPAdes assembler) commented on our earlier post about a recent PLOS One paper that benchmarked a number of low-memory genome assembly programs. The comments are quite informative for those not familiar with the intricate details of such programs, and therefore we decided to present them in a new commentary.
The results are definitely misleading. People should stop comparing to Velvet as the single “gold reference” assembler. GAGE-B clearly showed that state-of-the-art assemblers can easily beat Velvet by 20x in terms of N50. E.g., for the R. sphaeroides dataset from GAGE-B, Velvet achieved an N50 of 24 kbp (3 kbp in the paper, with plenty of misassemblies), MaSuRCA achieved 130 kbp, and both Ray and SPAdes were able to produce contigs with an N50 of more than 500 kbp (Ray was not included in GAGE-B; this is our internal run with parameter tuning like in GAGE-B).
So… really, the results of the paper need to be redone using both recent data and recent assemblers.
There is the usual time / memory / quality tradeoff. However, for me, the paper looks like an indirect way of propagating two of their own approaches (DiMA, ZeMA, etc.), rather than a proper comparison of the assemblers.
For example, the authors claim that “… Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods”
This is not true. The quality of the results (look at N50 and misassemblies in Tables 3-5) is below the level of acceptance these days. They effectively reported results 10 times lower than in the GAGE study and ignored all the results from GAGE in order to deduce their conclusions. Note also the differences in the methodologies – GAGE tried to derive the best assemblies by selecting the best parameters. The authors simply fixed the k-mer length to 31, and that’s it.
Combining the authors’ tables and the GAGE tables, we can instead easily say “… Our experiments prove that it is NOT possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. In order to achieve reasonable quality, proper assembly methods requiring more memory are necessary”.
So, the conclusions of the paper would be the opposite! However, judging from how they worked with the GAGE results and methodologies, I won’t trust the authors anymore; instead, I demand that all the work be thoroughly redone, and the true data put in the tables.
Perhaps we can ask Rayan to run Minia on the GAGE / GAGE-B data and report the results? That way we will at least be sure that the Minia results are of GAGE-level quality.
Rayan replied -
All right Anton, good idea! Here’s my informal Minia assembly of the GAGE chromosome 14 dataset.
Minia version: 1.5938
Reads taken: all the raw reads, i.e. fragments and short jump and long jump.
Parameters k=53 and min_abundance=5 were given by Kmergenie (command line: ./kmergenie list_reads).
Command line: minia list_reads 53 5 100000000 chr14_k53_m5
Peak mem: 147 MB
Time: 73 mins
Excerpt from Quast output:
# contigs (>= 0 bp) 82433
Total length (>= 0 bp) 87199867
Largest contig 27533
# misassemblies 18
# local misassemblies 17
All those metrics look much better than in Table 5.
Here is what differs from their assembly: 1) the parameters are optimized; 2) this new Minia version yields better genome coverage than older ones; 3) I used all the GAGE reads, while they used only the fragment library.
A few more notes on the paper in general:
1) Many of the low-memory assemblers tested in the PLOS One paper include neither a scaffolder nor a gap-filler. Those tested in GAGE and GAGE-B at least have a scaffolder. Thus it is not surprising that the GAGE/GAGE-B contiguity stats look much better.
2) GAGE used the best possible error-corrected read dataset for each organism. The PLOS One paper apparently used the raw data, which in my experience gives worse assemblies.
3) GAGE also picked reasonable k-mer sizes for each dataset separately.
In summary, the benchmarks in this paper are fair, in the sense that they ran all the programs in similar conditions. But it’s too bad that those conditions led to assemblies that were either very poor (e.g. for chr14), or that looked poor in comparison to GAGE, because of the lack of tuning.
By the way, Anton, not sure if the following statement is an accurate comparison:
“E.g. for R. sphaeroides dataset from GAGE-B Velvet achieved N50 of 24kb (3 kbp in the paper with plenty of misassemblies), MaSuRCA achieved 130 kbp and both Ray and SPAdes were able to produce contigs of N50 with more than 500 kbp”
That 24 kbp N50 figure for Velvet/GAGE-B is for the MiSeq data; it drops to 13 kbp with the HiSeq data. But is the R. sphaeroides HiSeq dataset from GAGE-B the same as the one from GAGE? In my experience, the raw GAGE dataset is quite challenging to assemble. Also, for the MiSeq dataset, the SPAdes N50 is 118 kbp in the GAGE-B paper. A newer version can do 500 kbp?
When reading such exchanges, readers should not assume that one person is right and therefore the other has to be wrong. The genome assembly problem is multi-faceted, and both of them can be right in covering different aspects. Given that the exchange is based on a GAGE vs. PLOS One comparison, the same can be said about various benchmarking papers. There are so many variables in comparing two assemblers that asking which assembler is the best is too simplistic, unless one qualifies the question with machine type, RAM size, library type and a number of other parameters. On the other hand, it is often impossible to test all those parameters and still come up with a useful metric. The ‘machine type’ variable is especially hard to quantify, because it includes disk speed and RAM size, which can change the speed of execution quite a bit. Throw in different read sizes and different types of libraries (PacBio?) and you are in a complete mess.
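As an aside for readers unfamiliar with the N50 values quoted throughout the exchange, here is a minimal Python sketch of how the metric is computed; the toy contig lengths are purely illustrative.

def n50(contig_lengths):
    # N50 is the length L such that contigs of length >= L cover at least half the assembly
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

print(n50([100, 80, 50, 30, 20]))  # total is 280; 100 + 80 = 180 >= 140, so N50 = 80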
The exchange should rather be taken as an illustration of the kinds of things a bioinformatician should keep in mind while working on a problem. Conventional wisdom often fails, as we describe below.
a) Hardware Limitations:
A few days back, we were trying to benchmark DSK, a k-mer counting program written by Rayan. We ran it with the default options, and the program finished in 3 hours. The machine happened to have 32 cores and fairly big RAM. So, we thought of running it with the multi-threading option and plenty of RAM to finish the run in a fraction of that time. Lo and behold, the program continued to run for 8 hours with no result. What was going on?
With more resources, the processor cores, RAM and disk started to compete with each other to move data from here to there, and spent less time on actual computation (which is minimal for k-mer counting). So, essentially we had Craig Barrett’s breakfast problem described earlier: too many cooks were opening and closing the fridge, and very few actually got time to fry omelettes.
b) RAM:
We discuss RAM separately, because in many servers it is the biggest limiting factor. Amazon charges you an arm and a leg if you want to rent a computer with high RAM. What can you do? You can design your data structures more efficiently, but beyond that the only options are – (i) crashing, or (ii) shifting data between RAM and hard drive during assembly. The speed of the second option can be increased quite a bit by replacing the hard disk with an SSD, and SSD storage does not cost that much these days. After all, you can design a server with one smaller SSD for computation and another slower disk for permanent storage. That is a small increase in cost with a huge added benefit.
c) Multiple Cores:
Many bioinformatics programs are written to use multiple cores, but their scale-up is not always linear, as we explained earlier. How do you figure out what is going on? You may have to think about what the program is doing and break it down into smaller problems to identify the bottleneck.
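As a rough illustration of why the scale-up is sub-linear, here is a small Python sketch of Amdahl’s law. The 90% parallel fraction is an assumed, illustrative number, and the model ignores the resource contention described above, which only makes matters worse.

def speedup(parallel_fraction, cores):
    # Amdahl's law: the serial fraction of the work caps the achievable speed-up
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / cores)

for cores in (2, 8, 32):
    # even if 90% of the work parallelizes perfectly, 32 cores give under 8x
    print(cores, round(speedup(0.9, cores), 1))  # 1.8, 4.7, 7.8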
d) Cache Size – Problem with Small Pieces:
A few days back, we were trying to benchmark BWA-MEM. To understand the performance of its various steps, we decided to replace the human genome with a tiny genome. At that point, everything started to behave abnormally and we got results much faster than we expected. What happened?
In the usual model of computing, a processor gets its data from RAM. However, when the genome is too small, the processor fits the entire memory block into its cache (memory within the chip) and does not need to access RAM any more. Very few bioinformatics programs exploit this aspect of processing, even though a considerable speed gain can be achieved by using the cache effectively.
Those are just four of the factors producing machine-to-machine differences in the performance of bioinformatics programs, and we have not even gone into algorithmic differences, library differences or contig assembly vs. scaffold assembly.
Conventional wisdom says a larger sample size will make experiments more accurate, but that does not hold in many situations. More samples can result in more errors, as the following example explains.
Ask 1 million monkeys (~2^20) to predict the direction of the stock market for the year. At the end of the year, about 500K monkeys will be right and about 500K will be wrong. Remove the second group and redo the experiment for the second year. At the end of that year, you will be left with 250K monkeys who correctly called the market for two years in a row. Keep repeating the experiment, and after 20 years you will be left with one monkey who predicted the stock market correctly for 20 years in a row. Wow !
The above paragraph is taken from a YouTube talk by best-selling author Rolf Dobelli, whose best-selling book appears to be a plagiarized version of N. N. Taleb’s writing. Taleb goes deeper into the implications of the monkey experiment. Suppose you increase the sample size by asking 30x more monkeys about the stock market. At the end of twenty years, you will have 30 ‘smart’ monkeys instead of one. Now you have a large enough set to even search for the intelligence gene that helps monkeys call the stock market correctly for 20 years in a row. A bigger wow !!!
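A toy Python simulation of this survivorship filter makes the arithmetic concrete: the monkeys have no skill at all, yet roughly one perfect 20-year forecaster survives out of 2^20, and 30x more monkeys leave about 30 ‘geniuses’.

import random

def survivors(n_monkeys, years=20):
    # keep only the monkeys whose coin-flip calls were right every single year
    alive = n_monkeys
    for _ in range(years):
        alive = sum(1 for _ in range(alive) if random.random() < 0.5)
    return alive

print(survivors(2**20))       # ~1 lucky 'genius' monkey
print(survivors(30 * 2**20))  # ~30 of them (takes a few seconds to run)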
Summarizing in Taleb’s words -
The winner-take-all effects in information space correspond to more noise, less signal. In other words, the spurious dominates.
Information is convex to noise. The paradox is that increase in sample size magnifies the role of noise (or luck); it makes tail values even more extreme. There are some problems associated with big data and the increase of variables available for epidemiological and other “empirical” research.
You can read the rest of his chapter with all mathematical details here.
In another commentary along similar lines, Lior Pachter wrote -
23andme Genotypes are all Wrong
The commentary is quite informative, but we will pick out one part that describes the ‘winner-take-all’ impact (or ‘loser-take-all’, in the case of sickness) on users.
But the way people use 23andme is not to look at a single SNP of interest, but rather to scan the results from all SNPs to find out whether there is some genetic variant with large (negative) effect.
Whereas a comprehensive exam at a doctor’s office might currently constitute a handful of tests– a dozen or a few dozen at most– a 23andme test assessing thousands of SNPs and hundreds of diseases/traits constitutes more diagnostic tests on an individual at one time than have previously been performed in a lifetime.
In plain English, suppose you walk into a doctor’s office and ask for your brain, heart, lungs, kidneys, teeth, tongue, eyes, nose and one hundred other body parts to be tested. The doctor comes back and reports that 107 out of 108 tests were within limits, but your kidney test reported some problems. The ‘winner-take-all’ impact will make you remember only the result that reported a problem, even though the more tests the doctor conducts, the more likely he is to find a problem by random chance. Next, you will go through more invasive tests of your kidneys and maybe a hospital stay, making your body vulnerable in other ways. Paraphrasing Taleb, the only way to legally murder a person is to assign him a personal doctor, who will keep monitoring (‘testing’) his health 24×7.
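A back-of-the-envelope Python calculation shows how fast random chance catches up with the number of tests; the 5% per-test false-positive rate is an assumed, illustrative value.

# chance that at least one of 108 independent tests flags a spurious 'problem'
alpha = 0.05    # assumed false-positive rate of each individual test
n_tests = 108
p_spurious = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {p_spurious:.3f}")  # ~0.996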
Presenting this well-known problem of multiple testing did not win Lior Pachter many friends. He was immediately called a ‘troll’ and other names by those with vested interests.
Calling people trolls when they present a different scientific argument has become the new fashion. We went through a similar experience when we wrote a set of commentaries questioning the effectiveness of genome-wide association studies.
Battle over #GWAS: Ken Weiss Edition
Study History and Read Papers Written by ‘Dinosaurs’ (#GWAS)
Genome Wide Association Studies (#GWAS) – Are They Replicable?
Mick Watson immediately called us trolls, and both he and Daniel MacArthur blocked our Twitter accounts from following them. Readers should note that this is an extra step of censoring, as explained below.
For those unfamiliar with Twitter, it is designed in such a way that you do not read things you are not interested in reading. For example, we do not read what Kim Kardashian is doing every day, simply by choosing not to follow her channel. So, why do these two gentlemen (‘open science advocates’) take the extra step of blocking us from following them? It is done to make sure that our comments do not reach their audience – a form of Twitter censoring. We wonder what they have to fear.
On the plus side, the above exchange got us familiar with the blog of Ken Weiss and co-authors (@ecodevoevo on twitter), which is very thoughtfully written and has become our daily read. Readers may enjoy today’s commentary of theirs on big data in medicine.
The ‘Oz’ of medicine: look behind the curtain or caveat emptor!
They highlight six problems with the ‘big data’ approach. The following list is only an abbreviated version of their very detailed commentary.
Problem 1: Risks are estimated retrospectively–from the past experience of sampled individuals, whether in a properly focused study or in a Big Data extravaganza. But risks are only useful prospectively: that is, about what will happen to you in your future, not about what already happened to somebody else (which, of course, we already know).
Problem 2: We are usually not actually applying any serious form of ‘theory’ to the model or to the results. We are just searching for non-random associations (correlations) that may be just chance, may be due to the measured factor, or may be due to some other confounding but unmeasured factors.
Problem 3: Statistical analysis is based on probability concepts, which in turn are (a) based on ideas of repeatability, like coin flipping, and (b) that the probabilities can be accurately estimated. But people, not to mention their environments, are not replicable entities (not even ‘identical’ twins).
Problem 4: Competing causes inevitably gum up the works. Your risk of a heart attack depends on your risk of completely unrelated causes, like car crashes, drug overdoses, gun violence, cancer or diabetes, etc.
Problem 5: Theory in physics is in many ways the historic precedent on which we base our thinking……But life is not replicable in that way.
Problem 6: Big Data is proposed as the appropriate approach, not a focused hypothesis test. Big Data are uncritical data–by policy! This raises all sorts of issues such as nature of sample and accuracy of measurements (of genotypes and of phenotypes).
Oh, this is hilarious !!
If you do not have time to watch the entire video, the chemjobber blog covers the juiciest part.
It is not every day that science Nobelists get to take shots at the Economics prizewinners – it happened this week at the Swedish embassy compound, at the gathering of this year’s Nobelists:
Then Martin Karplus, a Harvard University chemist, interjected, “What understanding of the stock market do you really have?”
Economics – “if one wants to call it a science” – seemed unable to explain the oscillations of the market, he said.
“I see these fluctuations and they make zero sense to me,” Professor Karplus declared. “Maybe they make sense to you.”
Professor Fama dismissed the question as unsophisticated, declaring its premise “factually incorrect.”
The hard scientists, more amused than chastened, turned to mocking the economists.
“You’re asking about a very fundamental question, on what the nature of life is,” James Rothman, a professor of cell biology at Yale University and one of the three newly minted laureates in medicine, told one questioner. “I don’t think there’s anyone here — even the economists – who would have an opinion on that for sure.”