#Biodata14 Conference - Twitter Summary and Links

[Image: B1xL-BRCMAAWD0K]

The Biodata14 conference is taking place at CSHL (picture from @gladrandomgraph). It is a new conference focusing on bioinformatics and the ‘big data’ aspects of data analysis. Readers may follow the #biodata14 hashtag on Twitter to get a broad overview of the topics and discussions at the conference.

This link has a list of all talks and posters. Overall there were 43 talks and 75 posters this time. Our blog has covered the publications of many speakers and authors from the conference in the past. Below we give a list of talks, and we are in the process of updating each section with relevant links and any additional information we found on Twitter. The information below is partial, and we are still updating it.

For posters, if any author is interested in having theirs included in this blog post, please tweet the link to us at @homolog_us and we will include it here.

@lisafederer Fun fact: about 100,000 whole human genomes have been sequenced to date. #biodata14

Talks -

-————————————————-

Haussler, D.H. - Global exchange of human genetic data for medicine and research

The Haussler group at UCSC has been developing algorithms to analyze large numbers of human genomes together. We previously covered some of those algorithms in the following posts.

HAL: a Hierarchical Format for Storing and Analyzing Multiple Genome Alignments

Mapping to a Graph-style Reference Genome Arxiv Paper

One important point of his keynote talk is that the ‘reference genome’ is neither a single genome nor a linear entity. Each individual has a unique genome, and the ‘reference genome’ is just a mosaic of all of them. Therefore, his group is developing tools to represent genomes using graphs instead of a linear character array.
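To make the graph idea concrete, here is a minimal Python sketch of our own. It is not the Haussler group's data structure or file format. It assumes a toy `SequenceGraph` class (a hypothetical name) in which nodes carry sequence fragments and a single-base variant forms a two-way branch, so that every path through the graph spells one possible haplotype.

```python
from collections import defaultdict


class SequenceGraph:
    """Toy sequence graph: nodes hold DNA fragments, edges give allowed adjacencies."""

    def __init__(self):
        self.labels = {}                 # node id -> sequence fragment
        self.edges = defaultdict(list)   # node id -> list of successor node ids

    def add_node(self, node_id, sequence):
        self.labels[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def paths(self, start, end):
        """Yield the sequence spelled by every path from start to end (acyclic graphs only)."""
        stack = [(start, self.labels[start])]
        while stack:
            node, seq = stack.pop()
            if node == end:
                yield seq
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, seq + self.labels[nxt]))


def build_example():
    """Build a four-node graph with one A/G variant in the middle."""
    g = SequenceGraph()
    g.add_node(1, "ACGT")
    g.add_node(2, "A")    # one allele
    g.add_node(3, "G")    # the other allele
    g.add_node(4, "TTC")
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(src, dst)
    return g


haplotypes = sorted(build_example().paths(1, 4))
print(haplotypes)
```

A linear reference would keep only one of the two paths. The graph keeps both and records where they diverge and rejoin.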

The genome not being ‘linear’ was a recurring theme in many of the other talks as well.

-————————————————-

Church, D.M. - Analog reporting in a digital age

Deanna Church has been working on the human reference genome for a long time. Again, the theme that ‘an assembly is not a genome, it’s a MODEL of a genome’ comes up in her talk. Relevant tweets -

Lisa Federer @lisafederer Nov 6

Diagnostic exome sequence only solves the question 25-50% of the time #biodata14

Avinash @gatoravi Nov 6

DC: convention to shift indels farthest right in clinical data. discrepancy with vcf leads to duplicate records. #biodata14

Olga Botvinnik @olgabot Nov 6

.@deannachurch reminding us that genotype-phenotype associations started long before the genome project, with cDNAs #biodata14

Avinash @gatoravi Nov 6

great talk from DC. succinctly highlights many challenges with variant reporting that a lot of us face. #biodata14
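The indel-shifting discrepancy mentioned in the tweets above is easy to reproduce. The sketch below is our own illustration, not code from the talk. It assumes a deletion is described by a 0-based start and a length on a reference string, and the function names are hypothetical. Inside a repeat the same deletion can be written at several positions, so a tool that shifts right and a tool that shifts left (as VCF normalization conventionally does) will store two different records for one event.

```python
def shift_left(ref, start, length):
    """Move a deletion as far left as possible without changing the resulting sequence."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def shift_right(ref, start, length):
    """Move a deletion as far right as possible without changing the resulting sequence."""
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


def apply_deletion(ref, start, length):
    """Return the sequence obtained by deleting ref[start:start + length]."""
    return ref[:start] + ref[start + length:]


ref = "GCACACAT"           # contains a short CA repeat
start, length = 3, 2       # delete "CA" from the middle of the repeat

left = shift_left(ref, start, length)
right = shift_right(ref, start, length)
same_result = apply_deletion(ref, left, length) == apply_deletion(ref, right, length)
print(left, right, same_result)
```

The two records differ only in their coordinates, so a database that compares positions literally will treat them as two variants unless both are normalized to the same convention first.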

-————————————————-

Charlop-Powers, Z. - Creating a drug atlas: Applications of big data to natural product drug discovery

We have not covered much on this topic, but here is a media link.

Searching for drugs in dirt, researchers call on citizen scientists

Microbes are not only a rich source of disease, but also a rich source of medicines, and experts think many life-saving compounds produced by as-yet-unnamed bacteria are awaiting discovery. But they don’t always give up their secrets easily. Researchers must know where to look to find promising bacteria, and how to get them to grow in the lab, the traditional route to identifying potentially valuable molecules they produce.

Researchers in Sean Brady’s Laboratory of Genetically Encoded Small Molecules are working on a way around these roadblocks. By using genomic sequencing technology, they can investigate the organisms that live in habitats like soil without having to grow the microbes in the lab. They are using this information to map out the location of gene clusters they believe may encode novel antibiotics, and, with help from citizen scientists around the country, they are hoping to process soil samples from areas they would never be able to visit on their own.

-————————————————-

Chatterjee, S. - A large database informatics method for characterization of the human gut microbial proteome

Relevant tweets -

Rachel Melamed @rdmelamed 1 day ago

SC: phylogenetic diversity of microbiome is huge, but the protein function is conserved #biodata14

deannachurch @deannachurch 1 day ago

SC: microbial composition varies between individuals but gene fxn seems to be conserved. #biodata14

-————————————————-

Aghamirzaie, D. - An accurate support vector machine classifier for assessing coding potential of transcripts using several sequential and structural features

-————————————————-

Allen, J.E. - Designing new algorithms for emerging data-intensive computing architectures to improve the speed and accuracy of shotgun metagenomic analysis

-————————————————-

Carneiro, M.O. - Native GATK: why you should care about performance

GATK is immensely popular, as you can tell from the number of tweets (read them from bottom to top for context).

Matt Massie @matt_massie 2 hours ago

Kudos to the @broadinstitute for sharing the new GATK C++ engine w/an MIT license. A step in the right direction. @gatk_dev #biodata14

Avinash @gatoravi 2 hours ago

MC addresses questions from the floor on the closed/open sourcedness of GATK. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

Next on #biodata14, Andrew Warren with Pan-genome graphs for bacteria & the web

Olga Botvinnik @olgabot 2 hours ago

@markgerstein @mauricinho as was discussed, not truly “open source” bc it limits who is able to contribute to the development #biodata14

Mark Gerstein @markgerstein 2 hours ago

Q for @mauricinho: Why GATK went from open source to not? A: Broad lawyers were experimenting! Now back to open source #biodata14

Rob Patro @nomad421 2 hours ago

#biodata14 Great talk on GATK by Mauricio Carneiro. Yup, C++ is still way faster than Java, at least for what GATK does.

Wendy Demos @DemosWM 2 hours ago

MC faster GATK is coming! #biodata14

Dan Evans @DanEvans0 2 hours ago

#biodata14 New GATK I/O C++ library is called “gamgee”, ‘cause it’s Sam-wise. Get it?

Avinash @gatoravi 2 hours ago

MC: pairhmm package freely available to use in other packages. #biodata14

Han Fang @Han_Fang_ 2 hours ago

@mauricinho : In gamgee, reading bam files is 17X faster, and mark duplicates is 5X faster. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: Gamgee source code available: https://github.com/broadinstitute/gamgee #biodata14

Jason Pitt @JasonJPitt 2 hours ago

MC: New joint C++/Java implementation of GATK speeds up HaplotypeCaller by 9 fold #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: GATK C++ gamgee, Reading bam files is 17x faster, reading VCFs is 50x faster, calling varies by tech, from 9x to 720x faster #biodata14

Avinash @gatoravi 2 hours ago

MC: check out talk from cppcon on performance of gamgee. c++ gatk version worked on. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: Java cannot make use of modern hardware, no access to low-level concepts. Switched to C++ because it allows same. #biodata14

Avinash @gatoravi 2 hours ago

MC: more than 70% of instructions in GATK are memory access( like most tools) #biodata14

Richard Sever @cshperspectives 2 hours ago

The brown fat of the internet ;-) MT @morgantaschuk: MC: “data center..a machine whose only job is to turn electricity into heat” #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: Accessing memory from CPU takes 100x longer than from L1 cache, many wasted cycles #biodata14

Sara Ballouz @SaraBallouz 2 hours ago

Carneiro: software and hardware - what’s happened to the arms race? #biodata14

James Taylor @jxtx 2 hours ago

My slides from #biodata14: https://speakerdeck.com/jxtx/adventures-in-scaling-galaxy-at-biological-data-science-2014#

Morgan Taschuk @morgantaschuk 2 hours ago

MC: Modern CPUs are “too fast”, difficult to utilize all of the processing power. Need faster memory. #biodata14

Avinash @gatoravi 2 hours ago

MC slide of memory access times. ( things EEs memorize) #biodata14

Lisa Federer @lisafederer 2 hours ago

Fun fact: modern CPUs can handle 36 billion instructions per second. Software isn’t able to take advantage of that. #biodata14

Dan Evans @DanEvans0 2 hours ago

#biodata14 M. Carneiro shows pic of data centre at rest: “This is a machine whose only job is to turn electricity into heat.”

B. Boutros-Blather @boutrosblather 2 hours ago

why aren’t we using the water vapor emanating from data centers to turn turbines?! #biodata14

Avinash @gatoravi 2 hours ago

MC: data centers are just steam rooms if processing not done right. #biodata14

Sam Minot @sminot 2 hours ago

MC: “We’re not doing particle physics here” #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: “This is a data center. It’s a machine whose only job is to turn electricity into heat.” General applause. #biodata14

Avinash @gatoravi 2 hours ago

MC: majority of cpu time spent waiting. usually ‘write to disk’ time. #biodata14

Han Fang @Han_Fang_ 2 hours ago

@mauricinho : “Native GATK: why you should care about performance”. Talking about the C++14 library for GATK 4.0 (in progress) #biodata14

B. Boutros-Blather @boutrosblather 2 hours ago

GATK takes 2 days to process a single genome – which is a feature, not a bug #takeabreak #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: It takes 44 hours to process a single genome from alignment through pre-processing. cpu is not efficiently used #biodata14

Avinash @gatoravi 2 hours ago

MC: highly encourages adopting best practices pipeline. problem - speed.(44hrs before variant calling) #biodata14

Jason Pitt @JasonJPitt 2 hours ago

MC: Single-sample processing followed by joint (multi-sample) genotyping is crucial for scalability #biodata14

Avinash @gatoravi 2 hours ago

MC: best practices pipeline most important contribution. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: Broad has active outreach to teach people how to use the GATK properly, videos available on website. https://www.broadinstitute.org/gatk/ #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: “The whole world started using the GATK. We weren’t quite ready for that.” #biodata14

Avinash @gatoravi 2 hours ago

MC: -44000 exomes and ~2000 wgs in 2013 at broad. #biodata14

B. Boutros-Blather @boutrosblather 2 hours ago

carneiro: lean software, dense slides at #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MC: You should care about performance because everyone has stopped caring about performance. noone talks to the hardware folks. #biodata14

Jason Pitt @JasonJPitt 2 hours ago

MC: Hardware is getting faster while software is getting slower. As programmers, we’re getting lazy #biodata14

GATK is not without critics, however. The program has a monopoly. Do the critics perceive it as the next Microsoft? Well, the number of lawyers involved in tweaking its various licenses gives them such an impression.

James Taylor @jxtx 2 hours ago

Thanks @StevenSalzberg1 for asking the right question at #biodata14 GATK IS NOT OPEN. Don’t use it. Reject papers that do. Bad for science

-————————————————-

Chilton, J.M. - Rapidly bringing software to biologists with Galaxy and Docker

Not sure whether this is the same topic, but here are the slides from James Taylor on Galaxy -

[Image: Capture]

-———————————————–

Cox, A.J. - Compressed indexing of multiple human genomes: Practice and applications

Tony Cox from Illumina has been working on a BWT-based indexing scheme for large Illumina read files so that they can be searched rapidly. We have covered his work extensively here, along with the algorithms and other details. Here are the earliest and latest links.

Academic Bioinformaticians Uncomfortable with Illumina’s Publication of Variant Caller

Latest from Tony-Cox BEETL-fastq
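For readers new to BWT-based indexing, the following Python sketch shows the basic idea on a single short string. It is our own simplified illustration and is not BEETL's implementation, which works on large read collections. The naive rotation sort and the function names `bwt` and `count_occurrences` are assumptions made for clarity rather than speed.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end of the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count pattern occurrences using backward search on the BWT alone."""
    # first_row[ch]: row index of the first rotation that starts with ch
    first_row = {}
    for row, ch in enumerate(sorted(bwt_str)):
        first_row.setdefault(ch, row)

    def occ(ch, i):
        # number of times ch appears in bwt_str[:i] (slow, but easy to read)
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)          # half-open range of matching rows
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("GATTACAGATTACA")
print(index, count_occurrences(index, "ATTA"))
```

The search never looks at the original text, only at the transformed string, and the transformed string tends to contain long runs of identical characters that compress well. That combination is what makes compressed indexing attractive.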

Relevant tweets -

Morgan Taschuk @morgantaschuk 1 hour ago

AC: BEETL available open source http://beetl.github.io/BEETL #biodata14

Mark Gerstein @markgerstein 1 hour ago

Cox: shows benefits of compressed indexing the reads. Validates by rapidly finding deletion breakpoints on NA12878 #biodata14

Morgan Taschuk @morgantaschuk 1 hour ago

AC: Compress 152GB gzipped fastq with BEETL-fastq compressed indexing to ~100GB. #biodata14

-————————————————-

De La Vega, F.M. - Scaling up genomic data management, indexing, and analysis for a million genomes

-————————————————-

Dobin, A. - STARtools: Ultra-fast comprehensive RNA-seq analysis suite

When the STAR aligner came out, we wrote about it.

STAR: Really Kick-ass RNA-seq Aligner

Now the author has developed ‘STARtools’ on top of it to make it even more useful. You can access STAR at this GitHub page.

STARtools appears to do expression analysis, but we are not sure why the tweets discussed a comparison with RSEM. The right comparison should be with Bowtie-Tophat-Cufflinks, given that this is reference-based, right?

-————————————————-

Dowling, J. - The secure analysis and storage of genomic data using BiobankCloud

-————————————————-

Fang, H. - Classifying INDELs to reduce calling errors in whole-genome and exome sequencing data

-————————————————-

Felix, V.M. - Open Science Data Framework (OSDF): A system for organizing, accessing, and querying scientific data

-————————————————-

Gerstein, M.B. - A computational framework to prioritize regulatory variants from whole-genome sequencing in cancer

-————————————————-

Ghose, K. - A community-driven framework for scalable and reproducible informatics in the cloud

-————————————————-

Haake, A.R. - User-centered design approaches for visual information processing

-————————————————-

Hwang, T. - An integrative somatic mutation analysis to identify pathways linked with survival outcomes across 19 cancer types

-————————————————-

Kim, M. - Parallel compression of metagenomic sequences via extended Golomb codes

They are using Kraken and other tools to classify metagenomes. Read tweets from bottom to top -

Morgan Taschuk @morgantaschuk 2 hours ago

MK: MetaCRAM available on http://web.engr.illinois.edu/~mkim158 #biodata14

Avinash @gatoravi 2 hours ago

MK: 3-6 fold compression improvement over gzip. #biodata14

Sam Minot @sminot 2 hours ago

Notably, the only disadvantage to MetaCRAM is the compression time, potentially making it a good option for long-term storage #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: MetaCRAM is much slower than bzip2, gzip, because of blast to use reference based compression. Compression rates are 2x #biodata14

Sam Minot @sminot 2 hours ago

MK: MetaCRAM takes about 200X longer than gzip #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: for reads and start positions, power law was closest model. #biodata14

Sam Minot @sminot 2 hours ago

MK: MetaCRAM improves on gzip by 2-3 fold for HMP data #biodata14

Avinash @gatoravi 2 hours ago

which distribution do i model with ? #classicbioinfo #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: extended golomb encoding stores number of division operations between n and m and array of remainders, for power law dist #biodata14

Avinash @gatoravi 2 hours ago

MK: extended golomb encoding for ints with power law encoding. #biodata14

Avinash @gatoravi 2 hours ago

MK now gives a quick background about golomb encoding, useful for ints with geometric distribution. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: huffman encoding uses a priori probability distribution to use less bits for more frequently occurring symbols, can be large #biodata14

Avinash @gatoravi 2 hours ago

MK talks about huffman encoding, assign less bits to more frequent symbols based on pmf. #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: 1. classify reads into taxonomy, 2. align reads to “relevant reference” 3. assemble metagenome of other reads, 4. compress #biodata14

B. Boutros-Blather @boutrosblather 2 hours ago

#metacram talk at #biodata14 is very clear!

Avinash @gatoravi 2 hours ago

MK: taxonomy classification tools - kraken and metaphyler, picked kraken. #biodata14

Mark Gerstein @markgerstein 2 hours ago

Kim cites: Data compression for sequencing data http://www.almob.org/content/8/1/25 A nice review mentioning reference-based compression #biodata14

Han Fang @Han_Fang_ 2 hours ago

Minji Kim: MetaCRAM, assemble/compress simultaneously, in iterative, parallel manner. #biodata14

Sam Minot @sminot 2 hours ago

MK: MetaCRAM uses Kraken for taxonomic identification #biodata14

Morgan Taschuk @morgantaschuk 2 hours ago

MK: MetaCRAM is first do novo, parallel, CRAM-like software specialized for FASTA-format metagenomic read compression #biodata14
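Several of the tweets above mention Golomb encoding, which stores a quotient and a remainder for each integer. The Python sketch below shows the classical Golomb code only. It is our own illustration and does not attempt the extended variant presented in the talk or anything from MetaCRAM itself. The function names and the bit-string representation are assumptions chosen for readability.

```python
def _to_bits(value, width):
    """Binary representation of value, zero-padded to width bits (empty if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(n, m):
    """Encode a non-negative integer n with Golomb parameter m >= 1 as a bit string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"               # quotient in unary, terminated by a 0
    b = (m - 1).bit_length()           # ceil(log2(m))
    cutoff = (1 << b) - m              # remainders below cutoff need only b - 1 bits
    if r < cutoff:
        return bits + _to_bits(r, b - 1)
    return bits + _to_bits(r + cutoff, b)


def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codewords back into a list of integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":          # read the unary quotient
            q += 1
            i += 1
        i += 1                         # skip the terminating 0
        r = 0
        if b > 0:
            if b > 1:                  # read the first b - 1 remainder bits
                r = int(bits[i:i + b - 1], 2)
                i += b - 1
            if r >= cutoff:            # long codeword: one more bit follows
                r = ((r << 1) | int(bits[i])) - cutoff
                i += 1
        values.append(q * m + r)
    return values


stream = "".join(golomb_encode(v, 3) for v in [0, 1, 2, 7])
print(stream, golomb_decode(stream, 3))
```

Small values get short codewords, which is why the code suits integers that are mostly small, such as the geometrically distributed values mentioned in one of the tweets.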

-————————————————-

Knight, J. - RMS: Run My Samples

James Knight developed the de novo assembler for 454 reads and is now at Yale. We covered his work in the following blog post.

New Bioinformatics Blog to Keep an Eye on James Knight

-————————————————-

Kural, D. - Self-Learning algorithms for millions of genomes

-————————————————-

Langmead, B.T. - Scalable software for uniform analysis of many RNA-seq samples

Ben Langmead has been working on several cutting-edge RNAseq tools. Here are the relevant tweets from his talk -

Mark Gerstein @markgerstein 2 hours ago

.@BenLangmead: Cloud-scale RNAseq… w/ Myrna http://genomebiology.com/2010/11/8/r83 AWS run w. Geuvadis gives $.34/Gb, less than $1/sample #biodata14

deannachurch @deannachurch 2 hours ago

.@BenLangmead wants to make it easy for biologist to reanalyze large scale public data. #biodata14 #reproducibility

deannachurch @deannachurch 2 hours ago

.@BenLangmead starting the morning session on motivation for building RNA-seq analysis tools. #biodata14

-————————————————-

Lauter, K. - Homomorphic encryption as a tool to preserve privacy in genomic computation

-————————————————-

Lee, H. - Sugarcane genome de novo assembly challenge

-————————————————-

Lovci, M.T. - Flotilla: An open-source toolkit for single-cell RNA-seq data analysis

-————————————————-

Mainzer, L. - Profiling accuracy and performance of human variant calling workflows on BlueWaters

-————————————————-

Margolis, R. - Designing a data discovery index to find and cite data

-————————————————-

Massie, M. - Building fast, petabyte-scale biological data systems

-————————————————-

Piccolo, S.R. - Gene set omic analysis: A gene-set analysis approach that can be applied to many omic types

-————————————————-

Pitt, J.J. - Robust scaling of next-generation sequencing analyses using the modular SwiftSeq workflow

-————————————————-

Ratsch, G. - Automatic summarization of cancer clinical notes to understand patient trajectories and the effect of somatic mutations

-————————————————-

Rendon, A. - 100,000 genomes of patients with cancer and rare heritable diseases

-————————————————-

Russell, D.P. - The Open Microscopy Environment: Open image informatics for the biological sciences

-————————————————-

Sadedin, S.P. - Cpipe: A bioinformatics platform for the analysis of clinical sequencing data in a diagnostic setting

-————————————————-

Sakhanenko, N. - An information theory method for efficient discovery of multivariable dependencies

-————————————————-

Shimizu, K. - Privacy preserving similarity search in biomedical data by homomorphic encryption

-————————————————-

Stombaugh, J.I. - Power Decoder: A simulator for the evaluation of pooled shRNA screen performance

-————————————————-

Tan, J. - Learning high-level biological principles from Pseudomonas aeruginosa using denoising autoencoders

-————————————————-

Veeraraghavan, N. - Staging the largest genomic cloud compute: And living to tell about it

-————————————————-

Warren, A.S. - Bacterial spaghetti: Pan-genome graphs for the web

-————————————————-

Williams, J. - Unleash your inner data scientist: Enabling scalable data driven collaborations with iPlant Cyberinfrastructure

-————————————————-

Wu, T. - Designing genomic data structures for fast computation

Tony Cox @coxtonyj 41 minutes ago

Thomas Wu - use “discriminating character array” instead of LCP as part of enhanced suffix array #biodata14

Zamin Iqbal @ZaminIqbal 7 minutes ago

@coxtonyj ANy paper on this?

Tony Cox @coxtonyj 4 minutes ago

@ZaminIqbal Did not seem like it, but will ask him if I get the chance

Morgan Taschuk @morgantaschuk 58 minutes ago

TW: Speeding GMAP/GSNAP with genomic data representation: added compression, longer k-mers, vertical columns, suffix array #biodata14
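As background for the tweets above, here is a small Python sketch of the standard suffix array with its LCP (longest common prefix) array, the component Thomas Wu reportedly replaces. It is our own illustration. We have not seen a description of the “discriminating character array”, so it is not reproduced here, and the naive construction below is an assumption made for clarity, not how GMAP/GSNAP builds its index.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the common prefix of the suffixes at sa[k - 1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(sa, lcp)
```

The LCP values tell a search how many characters it can skip when moving between neighbouring suffixes, which is the role an alternative auxiliary array would have to fill.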

-————————————————-

Yates, A.D. - The Ensembl REST API: “Gone gamma”
