Misassembly Detection using Paired-End Sequence Reads and Optical Mapping Data

This arxiv paper appears to be promising (h/t: @lexnederbragt). We are not sure whether having optical map data is an absolute requirement.

A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method that will enhance the quality of draft genomes by identifying and removing misassembly errors using paired short read sequence data and optical mapping data. We apply our method to various assemblies of the loblolly pine and Francisella tularensis genomes. Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in an assembly of Francisella tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in the assemblies of loblolly pine. MISSEQUEL can be downloaded at this http URL.

If you are wondering how it relates to other available tools, the following section is helpful.

Related Work. Both amosvalidate [31] and REAPR [32] are capable of identifying and correcting misassembly errors. REAPR is designed to use both short insert and long insert paired-end sequencing libraries, however, it can operate with only one of these types of sequencing data. Amosvalidate, which is included as part of the AMOS assembly package [33], was developed specifically for first generation sequencing libraries [31]. iMetAMOS [34] is an automated assembly pipeline that provides error correction and validation of the assembly. It packages several open-source tools and provides annotated assemblies that result from an ensemble of tools and assemblers. Currently, it uses REAPR for misassembly error correction.

Many optical mapping tools exist and deserve mentioning, including AGORA [35], SOMA [36], and Twin [30]. AGORA [35] uses the optical map information to constrain de Bruijn graph construction with the aim of improving the resulting assembly. SOMA [36] uses dynamic programming to align in silico digested contigs to an optical map. Twin [30] is an index-based method for aligning contigs to an optical map. Due to its use of an index data structure it is capable of aligning in silico digested contigs orders of magnitude faster than competing methods. Xavier et al. [37] demonstrated misassembly errors in bacterial genomes can be detected using proprietary software. Lastly, there are special purpose tools that have some relation to misSEQuel in their algorithmic approach. Numerous assembly tools use a finishing process after assembly, including Hapsembler [38], LOCAS [39], Meraculous [40], and the “assisted assembly” algorithm [41]. Hapsembler [38] is a haplotype-specific genome assembly toolkit that is designed for genomes that are highly-polymorphic. Both RACA [42], and SCARPA [43] perform paired-end alignment to the contigs as an initial step, and thus, are similar to our algorithm in that respect.
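The "in silico digested contigs" mentioned above may be easier to picture with a small example. The sketch below is only an illustration, not code from SOMA, Twin or misSEQuel: the function name `in_silico_digest` is hypothetical, and it makes the simplifying assumption that the enzyme cuts at the first base of each recognition site and that only one strand is scanned. It turns a contig into the ordered list of fragment lengths that an aligner would then compare against the fragment lengths of an optical map.

```python
def in_silico_digest(contig, site):
    """Return the ordered fragment lengths obtained by cutting `contig`
    at every occurrence of the recognition sequence `site`.

    Assumption (for illustration only): the cut is placed at the first
    base of the site; real enzymes cut at a fixed offset within it.
    """
    cut_positions = []
    start = 0
    while True:
        idx = contig.find(site, start)
        if idx == -1:
            break
        cut_positions.append(idx)
        start = idx + 1  # allow overlapping occurrences of the site
    bounds = [0] + cut_positions + [len(contig)]
    # Drop zero-length fragments (e.g. a site at the very start of the contig).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


if __name__ == "__main__":
    # Toy contig with two GAATTC (EcoRI-like) sites.
    print(in_silico_digest("AAGAATTCTTTTGAATTCCC", "GAATTC"))
```

A contig whose fragment-length list cannot be placed anywhere on the optical map is a natural candidate for closer inspection, which is roughly where the optical map data enters misassembly detection.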

Human Genetics and Clinical Aspects of Neurodevelopmental Disorders

The Reddit AMA of Weiss and Buchanan is full of interesting comments and discussions. That the following question received the highest rating suggests people are coming to realize that scientists over-sold and over-hyped what could be achieved through the human genome sequencing project.

Why hasn’t genetics led to the groundbreaking cures initially promised?

mermaidstale: Basically, because the diseases that most of us will get as we age turn out to be more complex than people had thought. I think we can blame the idea that we will find genes ‘for’ these kinds of diseases on Mendel, whose work with traits in peas led people to think that most traits were going to behave in the same predictable way that wrinkled, or yellow, or green peas do. Instead, it is turning out that there are many pathways to most traits, usually including many genes and environmental factors as well. So, diseases won’t be as predictable as we’ve been promised, because everyone’s genome is different, and environmental risk factors are hard to identify (is butter good or bad for us?), and future environments impossible to predict.

In this context, Gholson Lyon pointed out that he wrote a book chapter that the readers will find useful. The article is well-researched, well-written and is available open-access from biorxiv.

There are ~12 billion nucleotides in every cell of the human body, and there are ~25-100 trillion cells in each human body. Given somatic mosaicism, epigenetic changes and environmental differences, no two human beings are the same, particularly as there are only ~7 billion people on the planet. One of the next great challenges for studying human genetics will be to acknowledge and embrace complexity. Every human is unique, and the study of human disease phenotypes (and phenotypes in general) will be greatly enriched by moving from a deterministic to a more stochastic/probabilistic model. The dichotomous distinction between simple and complex diseases is completely artificial, and we argue instead for a model that considers a spectrum of diseases that are variably manifesting in each person. The rapid adoption of whole genome sequencing (WGS) and the Internet-mediated networking of people promise to yield more insight into this century-old debate. Comprehensive ancestry tracking and detailed family history data, when combined with WGS or at least cascade-carrier screening, might eventually facilitate a degree of genetic prediction for some diseases in the context of their familial and ancestral etiologies. However, it is important to remain humble, as our current state of knowledge is not yet sufficient, and in principle, any number of nucleotides in the genome, if mutated or modified in a certain way and at a certain time and place, might influence some phenotype during embryogenesis or postnatal life.

Evolution, not Hadoop, is the Biggest Missing Block in Bioinformatics

A Biostar discussion started with –

Uri Laserson from Cloudera:

“Computational biologists are reinventing the wheel for big data biology analysis, e.g. Cram = column storage, Galaxy = workflow scheduler, GATK = scatter gather. Genomics is not special.” “HPC is not a good match for biology since HPC is about compute and Biology is about DATA”.…/genomics-is-not-special-towards…

Uri might be biased because of his employer and background but I do think he has a point. It keeps amazing me that the EBI’s and Sangers of this world are looking for more Perl programmers to build technology from the 2000’s instead of hiring Google or Facebook level talents to create technology of the 2020’s using the same technology Google or Facebook are using.

In 2010, we were stuck with several big assembly problems (big for those days) and associated biological questions. For assembly, Hadoop was the first thing we looked into and long-time readers may remember that Hadoop-related posts were among the first topics in our blog. Like the above author, we also expected to revolutionize biology by using technology of Google/Facebook.

Using Hadoop for Transcriptomics

Contrail – A de Bruijn Genome Assembler that uses Hadoop

Over time, we realized that Hadoop was not the right technology for assembling sequences. For assembly, one has a fixed final product and the goal of the algorithm is to combine pieces to get there. There are faster and more economic ways to solve that problem even using commodity hardware, and Hadoop did not turn out to be our best option. We already covered many of those efficient algorithms in our blog.

Today’s discussion is not about assembly but about the second and larger problem – figuring out the biology and whether big data can help. Figuring out the biology is the final goal of all kinds of computational analysis, but increasingly that notion seems to get forgotten. This point was highlighted in a comment by Joe Cornish in the original Biostar thread, but let us elaborate based on our experiences.


Is a human being like a computer program?

Two decades of publicity about the human genome and genomic medicine made people think that human beings work like computer programs. To them, the genome holds the software code or ‘blueprint’, and by changing that code, every aspect of a person can be changed. That one-dimensional focus on the genome culminates in junk papers like the following, where the authors try to find the genomic code for romantic relationships.

The association between romantic relationship status and 5-HT1A gene in young adults

Individuals with the CG/GG genotype were more likely to be single than individuals with the CC genotype.

What gets forgotten is that a person is more like an analog machine built from cells, where all kinds of chemicals work together to give it functions like digesting food, walking, maintaining body temperature and pursuing romantic relationships. The genome only provides the informational content, and until a researcher shows the causal mechanism between a genome-based association and the actual workings of the biochemical components, a genome-based association is just a statistical observation. Speaking of statistical observations, the following one shared by @giggs_boson looks pretty good.


Is a human being like a car, then?

A car consumes fuel, moves around, can maintain the inside temperature and can even pursue romantic relationships in Pixar movies. Jokes apart, is a human being like a car or other machines?

One would immediately point out that the cars need human drivers, but that is not true with modern robot-driven cars. Another objection is that the cars do not spawn out new cars by themselves, whereas self-reproduction is the main difference between living beings and modern sophisticated machines. We can work around that objection by doing a ‘thought experiment’, where we put a robot inside a car with a sophisticated digital printer and code to produce all kinds of car parts by taking raw materials from outside. Once ready, the robot inside pushes the parts out and another robot attached to the exterior assembles them into a new car. Is the car-robot combination like a living being?


There is another level of complexity

Even though they do not tell you, many researchers working on human medicine follow the car-robot mechanical model of living beings. That is not without reason, because the mechanical model worked very well in 20th century medicine. It has been possible to replace limbs with artificial ones (analogous to changing flat tires?), control hearts with pacemakers (‘rebuilding the engine’?), fight off invaders with chemicals and even do brain surgery by using a mechanical view of the system, where the surgeon is a highly-skilled mechanic.

However, the mechanical view fails completely when researchers try to understand the human genome or any genome for that matter. Here is the biggest difference between living beings and the replicating car+robot combination described above. A living organism, unlike the replicating car+robot combination, is self-built. The system constructed itself from absolutely nothing through an iterative process lasting 3-4 billion years, and many traces of that evolutionary process can be determined from (i) fossil records, (ii) comparing cell structures and biochemistry of different living organisms and (iii) comparing informational contents (genomes and genes) of the same organisms.

By staying focused on only one or two of those three aspects, a researcher can severely handicap himself from getting the full picture. A human genome has ~20,000 protein-coding genes. Some of them code for functional blocks that stabilized in the earliest bacteria ~3 billion years back, some stabilized in the earliest eukaryotes over 2.5 billion years back and some are more recent. The only way to understand each functional block is to compare organisms that took different paths since their divergence. That means using different models for different groups of genes. This kind of comparative analysis is missing from most analyses.

Going back to the original point, adding evolutionary context to genomic, biochemical and developmental data is the biggest challenge and missing block in bioinformatics, not the availability of computing power or programs to analyze bigger data sets. Experimental biologists, out of necessity, focus on one organism (one machine), because it is impossible to maintain a lab with ten different species. Bioinformaticians, on the other hand, are free to compare and analyze. Quite a bit of that analysis requires thinking about the functions, development and evolution of the entire machines (living organisms) rather than only the genomes, and that is often missing from the bioinformatics literature. Those who do it have isolated themselves in the subfield of evolutionary molecular biology.

In a small and insignificant way, we tried to do the entire exercise in our electric fish paper and found it quite challenging. We started with the genome assembly of the electric eel, but realized that the genome was not very useful for inferring the functionality of the electric organ. We also compared RNAseq data from the electric organ, muscle and other organs of the electric eel. RNAseq data adds a bit more functional information beyond the genome, but it is still limited to one organism. Then we decided to compare RNAseq data from different electric fish species, which evolved electric organs independently, and that added the evolutionary context to the whole picture.

Our entire exercise greatly narrowed down the list of genes on which biologists can further do their lab tests. We also worked on theoretically connecting those genes based on their known functions, and these connections can be used as hypotheses to direct the lab tests. This last part (connecting the dots through a narrative) is nearly impossible to do with software, but even if one can do that, there is another level of difficulty. The narrative needs to be updated iteratively based on small-scale experiments and additional data. Which experiments to do to challenge the story is a scientific process and we do not believe any software can replace the process.

Long story short, computational biologists need to solve real puzzles, namely determining functional or disease mechanisms, and then build tools based on a handful of successful paths. The current approach of focusing on human genomes and throwing all kinds of data and statistical tools (GWAS) at them in the hope of discovering disease mechanisms (or romantic partners) is too limited and wasteful.

Finally Some Honest Reporting of GWAS for a Change

Kudos to Leroy Hood, Stuart Kim and colleagues for not publishing yet another hyped up GWAS study.

Whole-Genome Sequencing of the World’s Oldest People

Supercentenarians (110 years or older) are the world’s oldest people. Seventy four are alive worldwide, with twenty two in the United States. We performed whole-genome sequencing on 17 supercentenarians to explore the genetic basis underlying extreme human longevity. We found no significant evidence of enrichment for a single rare protein-altering variant or for a gene harboring different rare protein altering variants in supercentenarian compared to control genomes.

Here is what being honest gets you – being ridiculed by a busy Berkeley math professor, who could not look up the rest of the abstract.


The authors of the study did indeed try to follow up on their best hit in another 99 long-lived individuals.

We followed up on the gene most enriched for rare protein-altering variants in our cohort of supercentenarians, TSHZ3, by sequencing it in a second cohort of 99 long-lived individuals but did not find a significant enrichment.

The following part of the abstract rings the death knell for the nascent industry trying to predict future diseases of healthy people from their genomes.

The genome of one supercentenarian had a pathogenic mutation in DSC2, known to predispose to arrhythmogenic right ventricular cardiomyopathy, which is recommended to be reported to this individual as an incidental finding according to a recent position statement by the American College of Medical Genetics and Genomics. Even with this pathogenic mutation, the proband lived to over 110 years. The entire list of rare protein-altering variants and DNA sequence of all 17 supercentenarian genomes is available as a resource to assist the discovery of the genetic basis of extreme longevity in future studies.

Prediction is very Difficult, Especially if it’s about the Future

UK’s Independent in 2000 –

Snowfalls are now just a thing of the past

Britain’s winter ends tomorrow with further indications of a striking environmental change: snow is starting to disappear from our lives.

Sledges, snowmen, snowballs and the excitement of waking to find that the stuff has settled outside are all a rapidly diminishing part of Britain’s culture, as warmer winters – which scientists are attributing to global climate change – produce not only fewer white Christmases, but fewer white Januaries and Februaries.

The first two months of 2000 were virtually free of significant snowfall in much of lowland Britain, and December brought only moderate snowfall in the South-east. It is the continuation of a trend that has been increasingly visible in the past 15 years: in the south of England, for instance, from 1970 to 1995 snow and sleet fell for an average of 3.7 days, while from 1988 to 1995 the average was 0.7 days. London’s last substantial snowfall was in February 1991.

Global warming, the heating of the atmosphere by increased amounts of industrial gases, is now accepted as a reality by the international community. Average temperatures in Britain were nearly 0.6°C higher in the Nineties than in 1960-90, and it is estimated that they will increase by 0.2°C every decade over the coming century. Eight of the 10 hottest years on record occurred in the Nineties.

However, the warming is so far manifesting itself more in winters which are less cold than in much hotter summers. According to Dr David Viner, a senior research scientist at the climatic research unit (CRU) of the University of East Anglia, within a few years winter snowfall will become “a very rare and exciting event”.

“Children just aren’t going to know what snow is,” he said.

The effects of snow-free winter in Britain are already becoming apparent. This year, for the first time ever, Hamleys, Britain’s biggest toyshop, had no sledges on display in its Regent Street store. “It was a bit of a first,” a spokesperson said.

Fen skating, once a popular sport on the fields of East Anglia, now takes place on indoor artificial rinks. Malcolm Robinson, of the Fenland Indoor Speed Skating Club in Peterborough, says they have not skated outside since 1997. “As a boy, I can remember being on ice most winters. Now it’s few and far between,” he said.

Michael Jeacock, a Cambridgeshire local historian, added that a generation was growing up “without experiencing one of the greatest joys and privileges of living in this part of the world – open-air skating”.

Warmer winters have significant environmental and economic implications, and a wide range of research indicates that pests and plant diseases, usually killed back by sharp frosts, are likely to flourish. But very little research has been done on the cultural implications of climate change – into the possibility, for example, that our notion of Christmas might have to shift.

Professor Jarich Oosten, an anthropologist at the University of Leiden in the Netherlands, says that even if we no longer see snow, it will remain culturally important.

“We don’t really have wolves in Europe any more, but they are still an important part of our culture and everyone knows what they look like,” he said.

David Parker, at the Hadley Centre for Climate Prediction and Research in Berkshire, says ultimately, British children could have only virtual experience of snow. Via the internet, they might wonder at polar scenes – or eventually “feel” virtual cold.

Heavy snow will return occasionally, says Dr Viner, but when it does we will be unprepared. “We’re really going to get caught out. Snow will probably cause chaos in 20 years time,” he said.

The chances are certainly now stacked against the sort of heavy snowfall in cities that inspired Impressionist painters, such as Sisley, and the 19th century poet laureate Robert Bridges, who wrote in “London Snow” of it, “stealthily and perpetually settling and loosely lying”.

Not any more, it seems.

Fast forward to 2014 –

Winter 2014 set to be ‘coldest for century’ Britain faces ARCTIC FREEZE in just weeks

Meanwhile in the US –


Reference-free detection of isolated SNPs

Another piece of de Bruijn graph magic by Raluca Uricaru, Rayan Chikhi, Guillaume Rizk and co-authors.

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, DISCOSNP, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, DISCOSNP ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, DISCOSNP requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.
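For readers wondering where the de Bruijn graph comes in: an isolated SNP shows up in such a graph as a small "bubble", two paths that split at one k-mer, run in parallel for k k-mers and would rejoin right after. The toy sketch below illustrates only that general idea. It is not DISCOSNP's implementation, the function names are our own, and it ignores reverse complements, sequencing errors, coverage and ranking, all of which a real tool has to handle.

```python
from itertools import combinations


def kmer_set(seqs, k):
    """All distinct k-mers occurring in the given sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def extend_right(kmer, kmers, steps):
    """Follow unique successors for `steps` steps; return the spelled
    path, or None if the walk dead-ends or branches."""
    path, cur = kmer, kmer
    for _ in range(steps):
        nxt = [cur[1:] + b for b in "ACGT" if cur[1:] + b in kmers]
        if len(nxt) != 1:
            return None
        cur = nxt[0]
        path += cur[-1]
    return path


def find_isolated_snp_bubbles(seqs, k):
    """Return pairs of (2k-1)-long paths that differ only at the middle base."""
    kmers = kmer_set(seqs, k)
    by_prefix = {}
    for km in kmers:
        by_prefix.setdefault(km[:-1], []).append(km)
    bubbles = []
    for _, group in sorted(by_prefix.items()):
        # Two k-mers sharing a (k-1)-prefix mark a branch in the graph.
        for a, b in combinations(sorted(group), 2):
            pa = extend_right(a, kmers, k - 1)
            pb = extend_right(b, kmers, k - 1)
            # The bubble closes if both paths end in the same k-1 bases.
            if pa and pb and pa[k:] == pb[k:]:
                bubbles.append((pa, pb))
    return bubbles


if __name__ == "__main__":
    reads = ["TTGACCAGTCGA", "TTGACCTGTCGA"]  # differ by one A/T base
    print(find_isolated_snp_bubbles(reads, 4))
```

Note that nothing here needs a reference genome: the two alleles are recovered purely from the k-mers shared by the read sets, which is the appeal of the approach for non-model organisms.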

Reddit AMA by Weiss and Buchanan – Please Spread the News

Tuesday, Nov 18, 1pm EST, Link

Readers of our blog, and especially those trying to connect human diseases with genetic variations, will be delighted to meet Ken Weiss and Anne Buchanan, the authors of the insightful The Mermaid’s Tale blog, through a Reddit AMA.


Professor Weiss is the Evan Pugh Professor Emeritus at Penn State University and wrote the following book in 1993. He has published several outstanding papers on related topics.


Their blog is named after their book with the same title. We have not read it yet, but found the other book written by Professor Weiss (‘Genetics and Logic of Evolution’) really fascinating.



We will post the link as soon as it is available.

Variable-Order de Bruijn Graphs

Today’s must-read algorithm paper! It describes the theory behind MEGAHIT: An Ultra-fast Single-node Solution for Large and Complex Metagenomics Assembly via Succinct de Bruijn Graph. [Correction: The algorithm is an improvement over MEGAHIT. Please see comment section.]

Our Contribution. SPAdes [2] and IDBA [21] represent the state-of-the-art for genome assemblers, producing assemblies of greatly improved quality compared to previous approaches. However, their need to construct several de Bruijn graphs of different orders over the assembly process makes them extremely slow on large genomes. In this paper we address this problem by describing a succinct data structure that, for a given K, efficiently represents all the de Bruijn graphs for k ≤ K and allows navigation within and between each graph. We have implemented our data structure and show that in practice it requires around twice the space of a graph for a single K, and incurs a modest slow down in construction time and on navigation operations.



The de Bruijn graph G_K of a set of strings S is a key data structure in genome assembly that represents overlaps between all the K-length substrings of S. Construction and navigation of the graph is a space and time bottleneck in practice and the main hurdle for assembling large, eukaryote genomes. This problem is compounded by the fact that state-of-the-art assemblers do not build the de Bruijn graph for a single order (value of K) but for multiple values of K. More precisely, they build d de Bruijn graphs, each with a specific order, i.e., G_{K_1}, G_{K_2}, …, G_{K_d}. Although, this paradigm increases the quality of the assembly produced, it increases the memory by a factor of d in most cases. In this paper, we show how to augment a succinct de Bruijn graph representation by Bowe et al. (Proc. WABI, 2012) to support new operations that let us change order on the fly, effectively representing all de Bruijn graphs of order up to some maximum K in a single data structure. Our experiments show our variable-order de Bruijn graph only modestly increases space usage, construction time, and navigation time compared to a single order graph.
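To see why building d separate graphs is costly, here is a naive baseline sketched in Python. It is emphatically not the succinct structure from the paper: it stores one plain dictionary per order, which is exactly the d-fold memory cost the authors set out to avoid. The function names are ours, and we assume a node-centric definition in which nodes are the distinct K-mers of S and an edge joins u to v whenever the last K-1 characters of u equal the first K-1 characters of v.

```python
def de_bruijn_graph(strings, k):
    """Node-centric de Bruijn graph of order k (k >= 2 assumed):
    maps each distinct k-mer to the sorted list of k-mers that
    overlap it by k-1 characters."""
    nodes = {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}
    by_prefix = {}
    for node in nodes:
        by_prefix.setdefault(node[:-1], set()).add(node)
    return {n: sorted(by_prefix.get(n[1:], ())) for n in sorted(nodes)}


def build_all_orders(strings, orders):
    """Naive multi-order approach: one independent graph per order."""
    return {k: de_bruijn_graph(strings, k) for k in orders}


if __name__ == "__main__":
    graphs = build_all_orders(["ACGTACGA"], [3, 4])
    for k, graph in graphs.items():
        print(k, len(graph), "nodes")
    # Changing order here means switching to a different dictionary;
    # the paper's contribution is doing that within one structure.
    print(graphs[3]["ACG"])
```

In this toy case every k-mer is stored once per order, so the total node count grows with the number of orders even though all the graphs are derived from the same strings.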

An Evolutionary Perspective on the Kinome of Malaria Parasites

A few days back, we posted a commentary titled Vaccinating with the genome: a Sisyphean task?, and a reader asked –

Sorry being ignorant. Can we profile the diseases based all the “omics” data integration and then find the right cure ? What prevents us from doing it ? Is it the cost, right technology or the institutions aren’t just interested in it.

There are three components of studying any living organism – the informational aspect, the physical/chemical aspect and the evolutionary aspect. The omics data sets, and especially the genome, cover only the informational aspect, whereas the other two take much more work. Moreover, they often tend to be organism-specific, which means an investigator will need to spend a lot of time thinking about data, arguing about possibilities and connecting the dots. In contrast, the modern way of doing science has been to throw technology at things and expect computers to sort everything out. If that fails, throw more money/technology and run a bigger experiment. That has not been working with the malaria parasite.


Here is an informative blog post from The Mermaid’s Tale on the evolution of ‘malaria-protective’ alleles of G6PD gene that explains why it has become difficult to connect various evolutionary dots.

Evolution of malaria resistance: 70 years on…and on….and on

Far beyond malaria: Relationship to fundamental evolutionary questions
The idea of balanced polymorphisms played into a major theoretical argument among evolutionary biologists at the time, and sickle cell anemia became a central case in point, and a stereotypical classroom example. But the broader question was quite central to evolutionary theory. Balancing selection was, for many biologists who held a strongly selectionist version of Darwinism, the explanation for why there was so much apparently standing genetic variation in humans, but generally in all, species.

The theory had been that harmful mutations (the majority) are quickly purged, so the finding that there was widespread variation (polymorphism) in nature at gene after gene, the result of the type of genotyping possible then (based on protein variation), demanded explanation; balanced polymorphism provided it. This was countered by a largely new, opposing view called ‘non-Darwinian’ evolution, or the ‘neutral’ theory; it held that much or even most genetic variation had no effect on reproductive success, and the frequency of such variants changed over time by chance alone, that is, experience ‘genetic drift’. This seemed heretically anti-Darwinian, though that was a wrong reaction and only the most recalcitrant or rabid Darwinist today denies that much of observed genomic variation evolves basically neutrally. But many saw the frequency of variants associated with what were seen as serious recessive diseases, like PKU and Cystic Fibrosis (and others) as the result of balancing selection.

In support of the selectionist view, many variants have been found in the globin and other genes for which the frequency of one or more alleles is correlated geographically with the presence (today, at least) of endemic malaria. But there are lots of variants that might be correlated with other things geographic because the latter are themselves often correlated with population history. Thus, the correlations are often empirical but not clearly causal. Indeed, not many variants have been clearly shown experimentally or clinically actually to be functionally related to malaria resistance.

Speaking of understanding the physical/chemical aspect of components within the cell, we came across this wonderful review paper on the protein kinase genes in the malaria parasite. To develop proper insight, the same kind of analysis has to be done on many other gene families, and the claims need to be backed by biochemical experiments.

Malaria parasites belong to an ancient lineage that diverged very early from the main branch of eukaryotes. The approximately 90-member plasmodial kinome includes a majority of eukaryotic protein kinases that clearly cluster within the AGC, CMGC, TKL, CaMK and CK1 groups found in yeast, plants and mammals, testifying to the ancient ancestry of these families. However, several hundred million years of independent evolution, and the specific pressures brought about by first a photosynthetic and then a parasitic lifestyle, led to the emergence of unique features in the plasmodial kinome. These include taxon-restricted kinase families, and unique peculiarities of individual enzymes even when they have homologues in other eukaryotes. Here, we merge essential aspects of all three malaria-related communications that were presented at the Evolution of Protein Phosphorylation meeting, and propose an integrated discussion of the specific features of the parasite’s kinome and phosphoproteome.

Readers may also enjoy the following informative paper from the same group. It covers Apicomplexa, of which Plasmodium is a member.

Structural and evolutionary divergence of eukaryotic protein kinases in Apicomplexa

Background: The Apicomplexa constitute an evolutionarily divergent phylum of protozoan pathogens responsible
for widespread parasitic diseases such as malaria and toxoplasmosis. Many cellular functions in these medically
important organisms are controlled by protein kinases, which have emerged as promising drug targets for parasitic
diseases. However, an incomplete understanding of how apicomplexan kinases structurally and mechanistically
differ from their host counterparts has hindered drug development efforts to target parasite kinases.

Results: We used the wealth of sequence data recently made available for 15 apicomplexan species to identify the
kinome of each species and quantify the evolutionary constraints imposed on each family of apicomplexan kinases.
Our analysis revealed lineage-specific adaptations in selected families, namely cyclin-dependent kinase (CDK),
calcium-dependent protein kinase (CDPK) and CLK/LAMMER, which have been identified as important in the
pathogenesis of these organisms. Bayesian analysis of selective constraints imposed on these families identified the
sequence and structural features that most distinguish apicomplexan protein kinases from their homologs in
model organisms and other eukaryotes. In particular, in a subfamily of CDKs orthologous to Plasmodium falciparum
crk-5, the activation loop contains a novel PTxC motif which is absent from all CDKs outside Apicomplexa. Our
analysis also suggests a convergent mode of regulation in a subset of apicomplexan CDPKs and mammalian
MAPKs involving a commonly conserved arginine in the αC helix. In all recognized apicomplexan CLKs, we find a
set of co-conserved residues involved in substrate recognition and docking that are distinct from metazoan CLKs.

Conclusions: We pinpoint key conserved residues that can be predicted to mediate functional differences from eukaryotic homologs in three identified kinase families. We discuss the structural, functional and evolutionary implications of these lineage-specific variations and propose specific hypotheses for experimental investigation. The apicomplexan-specific kinase features reported in this study can be used in the design of selective kinase inhibitors.

Nature Promotes GWAS Madness to Study ‘Mental Health’


We came across the above tweet from Magdalena Skipper, the genetics editor of Nature magazine. Nature used to be a fine journal in some distant past, but lately it has become the mouthpiece of various over-rated clowns promoting their pet projects.


Today’s self-serving article comes from Broadstar Steven Hyman, who runs the Stanley Center for Psychiatric Research at the Broad Institute. Dr. Hyman proposes to collect genetic data from more than 100,000 people and run a giant GWAS study to search for ‘the depression gene’, which he is sure will be found through his study. Needless to say, he expects the government to pay for his pet project and the Broad Institute to benefit greatly from it.

Dr. Hyman made his case about bigger GWAS studies for mental diseases so poorly that it reminded us of several popular quotes about insanity. Let us start with Einstein –

Insanity: doing the same thing over and over again and expecting different results.

Hyman –

The largest meta-analysis of genome-wide association studies of depression (approximately 9,500 cases) has yielded no significant findings. Similarly sized studies of almost all other conditions have convincingly implicated at least some genetic loci (see ‘Signal search’). So far, 108 independent loci have been found to demonstrate genome-wide significance for schizophrenia.

Nonetheless, I am convinced that genetic variants for depression can be found. [snip] More than 100,000 people with MDD will be needed to find enough loci to inform biology and therapeutics. Amassing a data set of this scale is difficult, but worthwhile and possible.

Geez. What if he fails with 100K people? We already know the answer. Dr. Hyman will propose to sequence all of humanity, including Bushmen, Peshmerga and Jarawas.

What are the metrics for failure versus success in such a large study? We know that there are none, given that in the very next sentence Dr. Hyman throws in the following canard about the ‘height GWAS’ study explaining one-fifth of the heritability.

A meta-analysis of genome-wide association studies published earlier this year for adult height included more than 250,000 subjects and has found 697 common variants thus far, explaining nearly one-fifth of the heritability.

To understand why that statement is ‘at best misleading’, readers are encouraged to look at the following four blog posts by Professor Ken Weiss. By way of introduction, Professor Weiss wrote his textbook on genetic variation and human disease twenty years ago and has long been arguing in his blog against bad science (such as the type promoted by Hyman).

The height of folly: are the causes of stature numerous but at least ‘finite’?

The most recent is an extensive study in (where else?) Nature Genetics, by a page-load of authors. In summary, the authors found, pooling all the data from many different independent studies in different populations, that at their most stringent statistical acceptance (‘significance’) level, 697 independent (uncorrelated) spots scattered across the genome each individually contributed in a significance-detectable way to stature. The individual contributions were generally very small, but together about 20% of the overall estimated genetic component of stature was accounted for. However, using other criteria and analysis, including lowering the acceptable significance level, and depending on the method, up to 60% of the heritability could be accounted for by around 9,500 different ‘genes’. Don’t gasp! This kind of complexity was anticipated by many previous studies, and the current work confirms that.

Many issues are swept under the rug here, however–that is, relegated to a sometimes obscure warren of tunneling of Supplemental information. The individuals were all of European descent so that genetic contributions in other populations are not included. The analysis was adjusted for sex and age. Subjects were all, presumably, ‘normal’ in stature (i.e., no victims of Marfan or dwarfism), and all healthy enough to participate as volunteers. The majority were older than 18, but the range was 14 to 103, and 45% were male, 55% female. The data were also adjusted for family relationships.

Hiding behind technicalities to avoid having to change

Morgan’s insight–and Morgan’s restraint

BigData: scaling up, but no Big message
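
To see how thinly the height signal is spread, here is a back-of-the-envelope sketch using only the figures quoted in the first post above (697 loci for about 20% of the heritability, and around 9,500 ‘genes’ for up to 60%). The variable and function names are ours, and the averages assume the quoted shares are simply divided evenly among the loci, which is an illustration rather than anything the study itself reports.

```python
# Figures quoted above from the height meta-analysis (approximate values).
strict_loci, strict_share = 697, 0.20      # genome-wide significant loci, share of heritability
relaxed_loci, relaxed_share = 9500, 0.60   # relaxed threshold, "up to 60%" of heritability


def mean_share_per_locus(share, n_loci):
    """Average fraction of heritability per locus, assuming an even split."""
    return share / n_loci


strict_mean = mean_share_per_locus(strict_share, strict_loci)
relaxed_mean = mean_share_per_locus(relaxed_share, relaxed_loci)

# Average share of each additional locus gained by relaxing the threshold.
extra_mean = (relaxed_share - strict_share) / (relaxed_loci - strict_loci)

print(f"strict hits:     {strict_mean:.4%} of heritability per locus")   # 0.0287%
print(f"relaxed hits:    {relaxed_mean:.4%} of heritability per locus")  # 0.0063%
print(f"additional hits: {extra_mean:.4%} of heritability per locus")    # 0.0045%
```

On these numbers, each of the roughly 8,800 additional loci accounts on average for less than one part in 20,000 of the heritability, and those are shares of the heritability, not of the total variation in height.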

Lately we have seen a big spike in the number of GWAS-type studies that claim to ‘understand’ various aspects of the human brain. Dan Graur reported on one particularly bad study with his usual humorous touch.

GWAS Excrement Again: PNAS Paper Explains 0.02% of the Variation in an Ill-Defined Trait

The title states: “Common genetic variants associated with cognitive performance identified using the proxy-phenotype method.”

In the following I present the reader with an annotated abstract:

We identify common genetic variants associated with cognitive performance using a two-stage approach, which we call the proxy-phenotype method. [We performed a complicated GWAS meta-analysis on an ill defined trait that only the Illuminati know how to measure and whose reproducibility is nil.]

First, we conduct a genome-wide association study of educational attainment in a large sample (n = 106,736), which produces a set of 69 education-associated SNPs. Second, using independent samples (n = 24,189), we measure the association of these education-associated SNPs with cognitive performance. [We now brag about the amount of unoriginal data that we used]

Three SNPs (rs1487441, rs7923609, and rs2721173) are significantly associated with cognitive performance after correction for multiple hypothesis testing. [We found three SNPs that are correlated with whatever “cognitive performance” measures.]

In an independent sample of older Americans (n = 8,652), we also show that a polygenic score derived from the education-associated SNPs is associated with memory and absence of dementia. [We studied carefully Darrell Huff’s 1954 book “How to Lie with Statistics” and decided to follow his negative examples by combining apples and oranges]

Convergent evidence from a set of bioinformatics analyses implicates four specific genes (KNCMA1, NRXN1, POU2F3, and SCRT). All of these genes are associated with a particular neurotransmitter pathway involved in synaptic plasticity, the main cellular mechanism for learning and memory. [From other studies we picked 4 genes for which we could spin a “Just So Story,” and ignored the vast majority of data pointing to gene deserts.]

The penultimate sentence in the discussion states:
“In future work, the magnitude of explained variance will increase as researchers gain access to datasets with even larger first-stage samples.”

In other words, a sample of 106,736 genomes, which produced a set of 69 “education-associated SNPs” is not enough. With more data the magnitude of explained variation will increase. Unfortunately, the magnitude of explained variation in this study is nowhere to be found in the paper. One has to dig in the supplementary material, to find its value: 0.0002 to 0.0006.

I bettcha one can replace “cognitive performance” by “favorite color” and get similar results. Why do semi-respectable journals still publish such crap?

Continuing on his last sentence, it seems that semi-respectable journals (PNAS) publishing crap and previously reputed journals (Nature) pimping for bigger-sized crap have become the new industry standard.
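
For readers who want to check the headline number, the explained-variation values quoted above (0.0002 to 0.0006) are fractions, so they translate into the 0.02% of Graur's title as follows. This is only a unit conversion of the quoted figures; the variable names are ours.

```python
# Explained variation as quoted above from the supplementary material (fractions).
r2_low, r2_high = 0.0002, 0.0006


def as_percent(fraction):
    """Convert a fraction of variation into a percentage."""
    return fraction * 100


print(f"explained:   {as_percent(r2_low):.2f}% to {as_percent(r2_high):.2f}%")  # 0.02% to 0.06%

# Even at the upper end, almost all of the variation is left unexplained.
unexplained = 1 - r2_high
print(f"unexplained: {unexplained:.2%}")  # 99.94%
```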