In the earlier posts of this series (here, here, here and here), we covered the mathematical and biological aspects of evo and evo2. One important topic that we have not covered yet is how the models were trained.
In this article, I will argue that Multi Parameter Statistics, or even better, Massively Parameterized Statistics (MPS) better describes the application of AI models in biology and medicine. Also, I will introduce you to a new preprint on DNA sequence modeling that claims to match evo.
In the last three posts of this series (here, here and here), we covered the mathematical aspect of evo and evo2. Let us now discuss the biological findings from these models. It will take multiple posts to go over these topics.
In the first two posts of this series (here and here), we covered the AI-related mathematical concepts applied to evo and evo2. Before moving on to the biological side, here is one last post on the model.
In the first post of this series, we covered the basic technical terms of the evo and evo2 papers. We also mentioned the key technological innovation that made their work possible. That led to the question - if they were using the fast Fourier transform (FFT), were they using a convolutional neural network (CNN)? The answer is no. The computer science work done by the Stanford group is quite groundbreaking. Let me go over that in detail.
Two recent papers applying AI-related large language models to DNA sequences are gaining a lot of attention and a bit of controversy.
The first paper, titled Sequence Modeling and Design from Molecular to Genome Scale with Evo, wrote -
Trained on 2.7M prokaryotic and phage genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods.
What are the rules of the genomes? What patterns do the genome sequences follow? What biochemical and evolutionary mechanisms are behind these patterns? Are newly published genomes and pangenomes displaying many exceptions to the rules, or do they all confirm the expected patterns?
Now that we are on the very last day of 2021, it is not too late to review the positives of the year. I picked four categories (humor, science, society, technology) and shortlisted a tiny subset from many deserving candidates.
In NGS experiments, when researchers encounter issues with genome assembly or analysis, they go back to the raw data composed of sequencing reads. In a recent preprint submitted to Zenodo, Steven C. Quay did exactly that for a seminal paper and concluded - “The alternative conclusion is that this sample was not a fecal specimen but was contrived. The data cannot, however, distinguish between a non-fecal specimen that came from true field work on the one hand and a specimen created de novo in the laboratory on the other hand.” This is no simple matter, because the entire world had been running around like a headless chicken for the last two years, relying on the genome assembly submitted in the paper.
In early 2020, Prashant Pradhan and collaborators posted a preprint titled “Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag” on bioRxiv. Based on the emails released by the NIH under FOIA, we now know that this article and its coverage in zerohedge upset Fauci so much that he immediately convened an urgent meeting of virologists and several health bureaucrats from the US, the UK and Europe. All details of this meeting have been redacted, but the virologists present in the meeting fast-tracked a Nature Medicine paper claiming the virus definitely came from animals, even though they described it as lab-engineered in their private emails. This paper was then used for over a year to censor all counter-arguments. In particular, bioRxiv retracted the preprint due to intense pressure and thus destroyed its reputation as a preprint server for good.
A newly leaked classified document revealed that scandal-ridden Francis Collins plans to change his name to continue running the NIH. He got the idea by observing Facebook CEO Mark Zuckerberg, who is rebranding himself to be a reptilian.
US establishment biologists are so tone-deaf that they gave Trevor Bedford both Howard Hughes and MacArthur awards. These same people also scream at the top of their lungs - “Trust the experts”. Here is what I got by trusting “experts” like Trevor Bedford.
Yesterday, an explosive set of leaked documents on the origin of the SARS-CoV-2 virus was released by DRASTIC. People following the topic are describing them as “worse than the Chernobyl in the biology field”. In my opinion, this release changed the entire understanding of the origin of the pandemic and exposed a group of people as extremely wicked, shockingly evil and vile (sorry to borrow the movie name). Let me explain why.