Archives

Entropy of Never Born Protein Sequences

Why do some proteins get born and some do not? Grzegorz Szoniec and Maciej J Ogorzalek attempt to find out.

Existing and known proteins are only a small subset of all possible sequences. Why were only some proteins selected during evolution? The reason is not known, but two possible explanations are considered: deterministic and random. To investigate theoretical sequences of amino acids, the term Never Born Protein was introduced (Chiarabelli et al. 2006). Since 2006, only a few papers about them have been published. The most significant research has shown that 20% of them fold (i.e. reach a stable and functional 3D structure) in laboratory conditions.

This is a theoretical paper, in which the authors generated Never Born Proteins with the Random Blast tool and compared them with natural proteins from UniProt by measuring block entropy in both cases.
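For readers who have not met block entropy before, here is a minimal sketch of one common definition: the Shannon entropy of the overlapping length-k blocks of a sequence. The post does not say exactly how the authors defined block or relative entropy, so the function name `block_entropy`, the use of overlapping blocks and the choice of bits as the unit are all our assumptions, not the paper's method.

```python
import math
from collections import Counter


def block_entropy(sequence: str, k: int) -> float:
    """Shannon entropy (in bits) of the overlapping length-k blocks of a sequence.

    Assumption: blocks overlap and are weighted by their observed frequency.
    """
    blocks = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy inputs, not real proteins: every amino acid once vs. a single repeated residue.
uniform = "ACDEFGHIKLMNPQRSTVWY"
repeat = "A" * 20

if __name__ == "__main__":
    print(block_entropy(uniform, 1))
    print(block_entropy(repeat, 1))
```

A sequence that uses all 20 amino acids equally often reaches the maximum single-residue entropy of log2(20) bits, while a homopolymer has none. Comparing such values between generated and natural sequences is the kind of measurement the paper describes.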

Findings and conclusions:

Findings

Both block and relative entropies are similar, which means that both kinds of protein contain strongly random sequences.

An artificially generated Never Born Protein sequence is nearly as random as a natural one.

Conclusions

The information-theoretic approach suggests that protein selection during evolution was rather random/non-deterministic.

Natural proteins have no noticeable unique features in the information-theoretic sense.

Let’s Lend Ruibang a Helping Hand to Fix SOAPdenovo2

Several researchers are encountering difficulties with SOAPdenovo2 and say that the old SOAPdenovo used to run a lot faster.


Maybe we got lucky, but we never encountered the same problem. We are not sure whether Nick Loman’s problems were only with the latest update of SOAPdenovo2, made a few weeks back, or whether they also appeared in the version released about a year ago (r223).


In our case, the program always ran to completion at blazing speed (check a few test cases here) and gave us reasonably correct assemblies. In fact, the recently released SOAPdenovo-trans, a transcriptome assembler built on parts of SOAPdenovo2, helped us assemble a large library that we could not assemble with Trinity.

If you encounter problems with SOAPdenovo2, please add '1>/dev/stderr 2>log' to the command when you run the program and send the log file to Ruibang (email: rbluo at hku dot hk). Alternatively, you can post it here so that others can compare it with their own error notes to see whether a common pattern arises. In case you want to venture into the source code to track down the error by yourself, we discussed the code of SOAPdenovo2 here and in the wiki. The SOAPdenovo mailing list is also very active, and responses come within a day or two.

———————–

It looks like Nick will be sending his read files to Ruibang. We will check with Ruibang about any updates made to the code.


Physicist Ken Wilson Passes Away

Physics Nobel Laureate Kenneth Wilson Dies

————————

Mickey Atwal (@MickeyAtwal) said:

[Image: tweet screenshot]

and we fully agree. Physicists are very rarely awarded the Nobel Prize for mathematically beautiful theories; Wilson’s renormalization group is an exception. The renormalization group touches every branch of theoretical physics, because it connects the physics of the last few centuries (linear theories plus perturbation) with the physics of possibly the next few centuries (nonlinearity, self-similarity, fractals).

Yet when we search for Ken Wilson on Google, everyone from a sportscaster and a car dealer to a high-class thief shows up on the first page, but not the brilliant physicist. You do not find him until you use the full name, Kenneth Wilson. That is nearly as tragic as his passing away.


Government is the Mother of All Invention

You have all heard the old saying ‘Government is the mother of all invention’, right? You have not? There is a reason such a saying does not exist.

A few days back, we got into a Twitter brawl with Michael Eisen after writing DORA Pledge – YASI from Government-funded Scientists. In that article, we argued that centralized government funding is the primary reason for all other distortions, such as impact factor, hyping of ENCODE results, counting the number of papers to evaluate scientific contribution, the take-over of science by commercial journals and ten thousand other idiocies scientists are frustrated with. If that is true, the best way to solve the problem is to decentralize big grant agencies like NIH and NSF, and make sure the decentralized bodies cannot ‘coordinate policies’.

Dr. Eisen was not happy to hear that. At first, he claimed that before government started to fund science, science was a rich white person’s hobby.


We gave several examples of scientists who came from modest families and did not have rich parents to fund their scientific hobbies, starting with Robert Hooke. Sure, Hooke became wealthy during his lifetime, but much of that wealth came from work directly related to his scientific abilities.

His adult life comprised three distinct periods: as a scientific inquirer lacking money; achieving great wealth and standing through his reputation for hard work and scrupulous honesty following the great fire of 1666, but eventually becoming ill and party to jealous intellectual disputes.

Twitter is a lousy medium for intelligent discussion. Among the six or seven examples we gave of scientists who were not rich through family means before becoming interested in science, Dr. Eisen picked only Francis Crick, mentioned that his work was funded through LMB and claimed victory.


Tremendous Growth in Centralized Government Funding of Science since 1957

Overall, the above exchange established nothing about our original point that big centralized government funding agencies were the primary reason for distortions like impact factor. Even though government funded some level of scientific research at the time of Watson and Crick’s discovery, the real growth of government funding started in the late 1950s and never looked back. Check the growth of government funding between 1950 and 1970 in the following charts (MRC=UK).

Counting Animals

I heard this story from a Mexican friend, whose great-grandfather moved to Northern California long before the land was transferred to the United States. Prior to the discovery of gold in the hills of the Sierra and the arrival of the 49ers, California was sparsely populated, and two Spanish farmers lived in the area currently known as Menlo Park. Juan and Estevan raised avestruz (ostrich), cerdo (pig), ganso (goose) and toro (bull) on their huge farm.

Life was quite easy and a bit too boring at times. To add spice to their daily routine, Juan invented a new game. Every once in a while, he made all the animals go to the small pond to drink water, one at a time. When an animal was at the pond, he raised a flag: red for avestruz, green for cerdo, yellow for ganso and blue for toro. Estevan sat far away on top of a redwood tree and marked the colors of the flags on a piece of paper. Juan also kept his own notes, and after all the animals drank water, they sat together for lunch and compared their markings. On other days, Juan climbed the tree and Estevan took the animals to the pond. What a silly game, don’t you think? Let me remind you that they were farmers and not chess grandmasters. If they were happy with it, what can we say?


When I was in Mexico, my friend showed me those pieces of paper, which their family has saved for generations to remember the lives of their ancestors. It was an amazing feeling when I sat with the papers, closed my eyes and tried to imagine what life was like in a completely desolate ‘Silicon Valley’. I even attempted some ‘data science’ on their notes and found a puzzling pattern. The guy in the tree consistently got the correct animal from the flag signal about 85% of the time. As for the errors, about 11% of the time he noted the same flag twice, another 4% of the time he missed the flag altogether, and 1% of the time he wrote the wrong color.

Here is what I found rather odd. The same pattern continued again and again, no matter whether Juan or Estevan sat in the tree. I have no idea whether that observation says anything about human nature, Mexican farmers or farms built in redwood forests around Menlo Park. I wish I had a time machine to go back to the 1830s and watch those farmers.

Whitepaper: Accelerated BLAST Performance with Tera-BLAST™: a comparison of FPGA versus GPU and CPU BLAST implementations

Sponsored by Active Motif Inc (previously TimeLogic).

A number of technologies have emerged for accelerating similarity search algorithms in bioinformatics, including the use of field programmable gate arrays (FPGA), graphics processing units (GPU), and clusters of standard multicore CPUs. Here we present Tera-BLAST™, an FPGA-accelerated implementation of the BLAST algorithm, and compare the performance to GPU-accelerated BLAST and the industry standard NCBI BLAST+ on high performance computers. Our results show that Tera-BLAST, running on the TimeLogic J-series FPGA Similarity Search Engine, performs hundreds of times faster than BLAST running on generic NVIDIA Tesla M2090 GPU cards or standard Intel Xeon multi-core CPUs.

Those interested in downloading the whitepaper are requested to click here and get it as an email attachment from Active Motif Inc.

Another Bioinformatics Job Posted

Please feel free to check our job forum here from time to time. The new job is also from DuPont Pioneer, Johnston, IA.

Title

Senior Research Associate: Bioinformatics & Microbial Genomics

Job Description:
The Genome Analytics group within Trait Informatics at DuPont Pioneer focuses on the development and application of next-generation sequencing approaches for genetic discovery and product development. This group contributes to research programs and centers located across DuPont.

Details here.

DORA Pledge – YASI from Government-funded Scientists

Before continuing further,

DORA = Declaration on Research Assessment
YASI = Yet Another Stupid Idea

Rage against impact factor has become the new fashion. The following note has been making the rounds on Twitter and other places.

An ad hoc coalition of unlikely insurgents—scientists, journal editors and publishers, scholarly societies, and research funders across many scientific disciplines—today posted an international declaration calling on the world scientific community to eliminate the role of the journal impact factor (JIF) in evaluating research for funding, hiring, promotion, or institutional effectiveness.

The San Francisco Declaration on Research Assessment, or DORA, was framed by a group of journal editors, publishers, and others convened by the American Society for Cell Biology (ASCB) last December in San Francisco, during the Society’s Annual Meeting. The San Francisco group agreed that the JIF, which ranks scholarly journals by the average number of citations their articles attract in a set period, has become an obsession in world science. Impact factors warp the way that research is conducted, reported, and funded. Over five months of discussion, the San Francisco declaration group moved from an “insurrection,” in the words of one publisher, against the use of the prominent two-year JIF to a wider reconsideration of scientific assessment. The DORA statement posted today makes 18 recommendations for change in the scientific culture at all levels—individual scientists, publishers, institutions, funding agencies, and the bibliometric services themselves—to reduce the dominant role of the JIF in evaluating research and researchers and instead to focus on the content of primary research papers, regardless of publication venue. The DORA coalition calls on all individuals and organizations engaged in scientific research to sign the San Francisco declaration.

The stupidity of these central planners knows no bounds!! If signing a declaration could solve any problem, we would start collecting signatures for DONL (declaration on not lying) for politicians, DONBTG (declaration on not bribing the government) for big companies and DOTI for all DORA signers. Yes, DOTI stands for ‘declaration on thinking intelligently’, if you have not figured it out.

Impact factor is only a symptom; the real disease is large, centrally funded science. Before government took over science, scientists used to argue about nature, truth and plagiarism (defined as stealing of ideas, not paragraphs). There were few new ‘fields’ and few scientists, but most of them were solid scientists and not bull-shitters (BS-ers). Science thankfully did not have enough money to feed too many BS-ers, and those were the happy days.

After government took over and decided to pour in billions of dollars, it had to create a few simple measures to ‘evaluate’ productivity. At first, paper count became that simple measure. In the early to mid-90s, paper count was in vogue. Those who graduated in that period gamed the paper count heavily to get grants and positions. Eventually, central planners decided to do something new and brought in ‘citation count’ and ‘impact factor’. Some people still have not adjusted to the new mode, as you can see in this 2009 paper titled Estimates of global research productivity in gynecologic oncology:

Research production and international cooperative teamwork in the 2 main journals of gynecologic oncology increased within the 10 last years; 65.3% of all published articles dealt either with epithelial ovarian cancer, cervical cancer, or endometrial cancer. Endometrial cancer had the worst ratio number of publications to estimated national incidence (United States, 2007). The United States (41.15%) and Europe (29.72%) make up a striking 70.87% of the world’s research production in the field of gynecologic oncology. However, the highest rate of increase shows in Turkey (22.5), the People’s Republic of China (6.87), and South Korea (5.83). Adjusted to the national GDP per capita and population for the year 2006, research productivity seems best in Israel, Austria, and Turkey.

Other planners want to be cuter and measure both ‘quantity’ and ‘quality’. Check “Worldwide research productivity in critical care medicine”, for example.

Introduction
The number of publications and the impact factor of journals are accepted estimates of the quantity and quality of research productivity. The objective of the present study was to assess the worldwide scientific contribution in the field of critical care medicine.

Method
All research studies published between 1995 and 2003 in medical journals that were listed in the 2003 Science Citation Index (SCI®) of Journal Citation Reports under the subheading ‘critical care’ and also indexed in the PubMed database were reviewed in order to identify their geographical origin.

Results
Of 22,976 critical care publications in 14 medical journals, 17,630 originated from Western Europe and the USA (76.7%). A significant increase in the number of publications originated from Western European countries during the last 5 years of the study period was noticed. Scientific publications in critical care medicine increased significantly (25%) from 1995 to 2003, which was accompanied by an increase in the impact factor of the corresponding journals (47.4%). Canada and Japan had the better performance, based on the impact factor of journals.

Conclusion
Significant scientific progress in critical care research took place during the period of study (1995–2003). Leaders of research productivity (in terms of absolute numbers) were Western Europe and the USA. Publications originating from Western European countries increased significantly in quantity and quality over the study period. Articles originating from Canada, Japan, and the USA had the highest mean impact factor. Canada was the leader in productivity when adjustments for gross domestic product and population were made.

If central planners measure how well they ‘cured’ cancer by counting the number of papers or ‘high impact’ papers on cancer, then ‘paper count’ and ‘impact factor’ are what will be maximized, not cures. Therein lies the problem. Moreover, due to the simplistic nature of central planning, central planners will have to shift to a new measure to evaluate the efficiency of central planning in the post-impact-factor era. So the DORA declaration, even if adopted properly, will merely shift the problem to something else.

Support of large projects and ‘center grants’ is another bad effect of central planning. Central planners can juggle only a few balls in the air, and that is why they want huge ‘successful’ projects like ENCODE in their portfolio.

What is the solution then?

In 1961, John Cowperthwaite became the Financial Secretary of Hong Kong. He refused to collect all but the most basic government statistics. People of Hong Kong never lived better!!

Asked what is the key thing poor countries should do, Cowperthwaite once remarked: “They should abolish the Office of National Statistics.” In Hong Kong, he refused to collect all but the most superficial statistics, believing that statistics were dangerous: they would lead the state to fiddle about remedying perceived ills, simultaneously hindering the ability of the market economy to work. This caused consternation in Whitehall: a delegation of civil servants were sent to Hong Kong to find out why employment statistics were not being collected; Cowperthwaite literally sent them home on the next plane back.

We believe impact factor and other anomalies will go away if scientists shun central planning agencies (NSF, NIH, etc.) and the measures they create to gauge effectiveness. The next step should be the complete dismantling or decentralization of those central planning agencies. We do not see small software companies worrying about the ‘impact factor’ of their libraries before releasing new code. Why should scientists?

Web Software Section is Open

We created a separate section to discuss the web-related code needed to present bioinformatics data and information. We will continue to write on genome assembly, other algorithms and their hardware implementation here. Some of the links in the new forum need a bit of cleaning up, and we will do that when we write on the Biostar code in the next commentary.

Biostar I – Discussions with Supernova Istvan

Big Data and Buzzword-driven Science

—————-

WordPress, Drupal and Other Content Management Systems

Business Model and Licenses

MVC Frameworks – RoR, Django, Scala, CodeIgniter, Symfony and Laravel

HTML5 and SVG

Git and Github

Few Useful Commentaries on Front-end (CSS) Design

Javascript Checker Jsfiddle and annoying.js

DBAD Public License for Releasing Software by Phil Sturgeon

Many Good Genome Assembly Related Comments Coming from ‏@sebhtml Feed

Sébastien Boisvert, the author of the Ray assembler, is currently working on his PhD thesis. The process involves going through hundreds of other papers and documents and giving them proper credit. Thankfully, he is keeping us updated on what he is up to through his informative Twitter feed. Those interested in genome assembly probably already follow him, but if not, we would highly recommend it.

A few examples:

1. Collection of other PhD theses

Sébastien Boisvert ‏@sebhtml 12m
@dzerbino @jaredtsimpson @mjpchaisson @RayanChikhi Open access doctoral theses on de novo genome assembly http://dskernel.blogspot.ca/2013/06/open-access-doctoral-theses-on-de-novo.html … #bibtex

He has collected all the useful PhD theses on his webpage, and if you check their references, you will possibly find everything done on genome assembly over the last 200 years. Someone needs to check and tell us whether they were as thorough as Heng Li, who included Gingeras 1979 in his Fermi paper.

2. On Kautz graphs

Sébastien Boisvert ‏@sebhtml 1h
I am reading this for my thesis:
Efficient tilings of de Bruijn and Kautz graphs http://arxiv.org/abs/1101.1932 #graph #computing

Sébastien Boisvert ‏@sebhtml 1h
@homolog_us Maybe this is the return of the Kautz graphs. SiCortex used Kautz graphs for their networks http://en.wikipedia.org/wiki/SiCortex #graphs

3. On why Ray is so fast

Sébastien Boisvert ‏@sebhtml 1h
@homolog_us I would say it’s because everything is in RAM, and messages don’t touch storage neither. And RayPlatform is well designed too

Sébastien Boisvert ‏@sebhtml 1h
@homolog_us we will submit soon a manuscript about RayPlatform and on how to craft scalable tools to Big Data @edd @MaryAnnLiebertInc

4. On why US spy agencies should start tracking his activities

[Image: screenshot]

———————–

Here is Heng Li’s list of historically important papers on genome assembly. We believe it is incomplete without one of the SOAPdenovo/ALLPATHS papers, because Velvet is muchisimo malo for large assemblies.

Here are some historically important papers on the theory of de novo assembly (not in the practical aspect). It was edited from the note I took for writing the fermi paper. I was quite confused with the history and theory on de novo assembly at that time. The note is a little lousy but might be useful to someone in case.

Staden, 1979; Gingeras et al, 1979: first attempt to use a computer program to assist sequence assembly.

Peltola et al, 1984: first mathematical formulation of sequence assembly and the emergence of the overlap-layout-consensus (OLC) paradigm. OLC consists of three steps: 1) overlap: finding overlaps between all pairs of reads; 2) layout: arranging the reads based on overlaps; 3) consensus: deriving the final sequence. In that paper, the authors tried to find a layout that minimizes the assembly.
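The three OLC steps are easier to picture with a toy example. The sketch below is our own greedy illustration on error-free reads, not the formulation of Peltola et al. and not how any production assembler works. The function names `overlap_len` and `greedy_olc` and the `min_len` threshold are assumptions made for the example.

```python
from itertools import permutations


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_olc(reads, min_len: int = 3):
    """Toy OLC: repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        # Overlap step: score every ordered pair of distinct reads.
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(range(len(reads)), 2):
            n = overlap_len(reads[a], reads[b], min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # nothing overlaps any more; the remaining strings are separate contigs
        # Layout and consensus steps collapse into a simple merge for error-free reads.
        merged = reads[best_a] + reads[best_b][best_n:]
        reads = [r for i, r in enumerate(reads) if i not in (best_a, best_b)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_olc(["ATGGCGT", "GCGTACC", "TACCGAT"]))
```

With real data the overlap step must tolerate sequencing errors and the consensus step has to vote across many reads, which is where most of the work in an actual assembler goes.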

Myers, 1995 (a legendary figure in computational biology); Kececioglu and Myers, 1995: probably the first graph representation of sequence assembly: the overlap graph. In an overlap graph, a vertex represents a read and an edge represents an overlap. We can find the layout by finding the optimal path through each vertex, an NP-hard Hamilton problem. However, Myers et al. did not stop there. They proposed transitive reduction, a procedure to dramatically reduce the redundancy of overlaps. And after transitive reduction, solving sequence assembly is not a Hamilton problem any more.
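To make the idea of transitive reduction concrete, here is a minimal sketch on a tiny overlap graph. It drops an edge a -> c whenever a -> b and b -> c are both present, and it deliberately ignores overlap lengths, which a real implementation has to check. The dictionary representation and the read names `r1`, `r2`, `r3` are our assumptions for illustration.

```python
def transitive_reduction(edges):
    """Remove edges implied by a two-step path.

    edges: dict mapping each read to the set of reads it overlaps (a -> b).
    Simplification: only two-step paths are considered and overlap lengths are ignored.
    """
    reduced = {v: set(targets) for v, targets in edges.items()}
    for a, targets in edges.items():
        for b in targets:
            for c in edges.get(b, ()):
                # a -> c is already implied by a -> b -> c, so it is redundant.
                reduced[a].discard(c)
    return reduced


# Three reads tiling the same region: r1 overlaps r2 and r3, r2 overlaps r3.
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}

if __name__ == "__main__":
    print(transitive_reduction(overlaps))
```

After the reduction only the chain r1 -> r2 -> r3 is left, so the layout can be read off by following a simple path instead of searching for a Hamiltonian one.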

Idury and Waterman, 1995 (the Waterman in Smith-Waterman): the birth of the de Bruijn paradigm. I remember that a paper mentioned that both Myers and Waterman participated in the 4th DIMACS Implementation Challenge in 1994. They proposed two distinct graph representations of sequence assembly, which have such a fundamental influence even today.
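For comparison with the overlap graph, here is a minimal sketch of how a de Bruijn graph is commonly built from reads today: nodes are (k-1)-mers and every k-mer contributes one edge from its prefix to its suffix. This follows the usual textbook convention and is not necessarily the exact construction in the 1995 paper. The function name `de_bruijn_graph` and the toy reads are our assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k: int):
    """Build a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    for node, successors in de_bruijn_graph(["ATGGC", "GGCGT"], 3).items():
        print(node, "->", sorted(successors))
```

Note that the two reads are never compared with each other. They are joined only because they share the k-mer GGC, which is why no all-against-all overlap step is needed in this paradigm.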

Myers et al, 2000: publication of the Celera assembler, arguably the best assembler for shotgun data at that time. This assembler is largely a practical implementation of the theory in their 1995 papers.

Pevzner et al, 2000: publication of the Euler assembler, the first assembler based on the de Bruijn graph. It solved two major problems with the original theory published in 1995 with spectrum error correction and read threading, a procedure to re-impose the full-length read information to the de Bruijn graph. The paper unfortunately spread a wrong notion which can still be seen in a few reviews nowadays: OLC is an NP-hard Hamilton problem, but de Bruijn graph can be solved in linear time and thus is more advanced. However, in fact, Gene Myers with his Celera assembler never tried to reduce OLC to a Hamilton problem. I will come back to the time complexity of de Bruijn later.

Myers, 2005: string graph. String graph is largely a different representation of overlap graph.

Medvedev et al, 2007: de novo assembly is NP-hard, which is true for both the OLC and the de Bruijn paradigm. A catch is that although constructing the de Bruijn graph and finding the optimal path is linear, resolving the optimal read threading is NP-hard and without threading, we cannot fully use all the information in reads.

Zerbino et al, 2009: the publication of Velvet. This is a landmark paper on short-read assembly. Graph cleaning (e.g. tip trimming and bubble popping) was first introduced in this paper and is now widely used in essentially every short-read assembler.
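As an illustration of what tip trimming means, here is a minimal sketch that removes short dead-end branches hanging off a branch point in a directed graph. It uses branch length only. As far as we know, Velvet also takes coverage into account, which this sketch ignores, and it does not handle tips at the start of a path. The function name `trim_tips`, the `max_tip_len` parameter and the toy graph are our assumptions.

```python
def trim_tips(graph, max_tip_len: int = 2):
    """Remove dead-end branches of at most max_tip_len nodes that hang off a branch point.

    graph: dict mapping each node to the set of its successors.
    """
    succ = {v: set(s) for v, s in graph.items()}
    for targets in list(succ.values()):
        for t in targets:
            succ.setdefault(t, set())  # make sure every node has an entry
    pred = {v: set() for v in succ}
    for v, targets in succ.items():
        for t in targets:
            pred[t].add(v)

    to_remove = set()
    for end in [v for v in succ if not succ[v]]:
        chain, node = [end], end
        # Walk backwards from the dead end while the branch stays short and unbranched.
        while len(chain) <= max_tip_len and len(pred[node]) == 1:
            (parent,) = pred[node]
            if len(succ[parent]) > 1:
                to_remove.update(chain)  # reached a branch point, so the chain is a tip
                break
            chain.append(parent)
            node = parent

    return {v: {t for t in targets if t not in to_remove}
            for v, targets in succ.items() if v not in to_remove}


# Main path A..F with a one-node tip X hanging off B (e.g. caused by a read error).
toy = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}, "E": {"F"}}

if __name__ == "__main__":
    print(trim_tips(toy))
```

The length threshold matters: if the genuine path after a branch point is itself shorter than `max_tip_len`, this length-only rule would trim it too, which is one reason real assemblers add further criteria.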

Simpson and Durbin, 2010: the first application of FM-index in sequence assembly. This paper resolved a long-standing difficulty in applying OLC to short reads: finding the overlaps. After this paper, two other fast overlap finders were also published (Dinh and Rajasekaran, 2011; Gonnella and Kurtz, 2012), which are claimed to be much faster and more lightweight.
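For those who have not seen an FM-index, here is a minimal sketch of its core primitive, backward search, which counts the occurrences of a pattern using the Burrows-Wheeler transform. It only shows the building block. It is not the overlap-finding algorithm of Simpson and Durbin, and the naive suffix sorting and full occurrence table are far too slow and memory-hungry for real read sets. The function names `build_fm_index` and `count_occurrences` are our assumptions.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build a toy FM-index: the C array and a full occurrence table over the BWT."""
    text += "$"  # sentinel, assumed to sort before every other character
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in order)  # character preceding each sorted suffix
    counts = Counter(text)
    # C[c]: number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_occurrences(pattern: str, index) -> int:
    """Count occurrences of pattern by backward search over the suffix-array interval."""
    C, occ, n = index
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACATTA")
    print(count_occurrences("TTA", index))
```

Each pattern character costs a constant number of table lookups, independent of the size of the indexed text, which is what makes this kind of index attractive for searching large collections of short reads.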

Earl et al, 2011: Assemblathon 1. In my view, this is undoubtedly the best paper on the evaluation of de novo assemblers.