On big data in biology

by Nicolas Descostes

In recent years, big data has become a term that pervades the media and to which the public is regularly exposed. From finance to social networks, data are collected to infer trends and sometimes to manipulate opinions, as has been observed during recent elections. However, the public is less aware of the big data revolution occurring in biology. In this post, I would like to begin by explaining how big data is used in biology, and more specifically in genomics, and end by sharing some thoughts on how big data is currently shaping research.

In the early 2000s, J. Craig Venter and the International Human Genome Sequencing Consortium raced to publish the first sequence of the human genome. The race produced two articles, published in Science and Nature, but more importantly it started an intellectual revolution, giving new possibilities to explore the 3-billion-base-pair DNA sequence of the human genome in its entirety. Although sequencing had been in use long before, sequencing the human genome opened the door to sequencing the genomes of many other species.

Scientists rapidly realised that they would have to face a major problem of data storage. Right off the sequencer, the human genome data set consists of roughly 200 gigabytes. Very quickly, researchers reached the terabyte scale. Beyond obtaining the initial DNA sequence of a particular genome, the main work of researchers is to make sense of it in order to understand how our genome is organised. Analysing the data exponentially increased the storage needs. In 2017, the EBI (European Bioinformatics Institute) was holding 120 petabytes of data.

Afterwards, sequencing technologies flourished, enabling researchers to explore different aspects of genomes. They found that genomes are organised in active and inactive compartments, that different proteins bind to different locations, that genomes are folded in three dimensions, and that all of this can regulate gene expression or lead to diseases. Although many of these observations had been made locally in the 20th century, researchers can now study these phenomena at the scale of the entire genome. Facing the challenge of analysing such a volume of data, scientists organised consortia to coordinate research efforts. The ENCODE project aims to build a comprehensive parts list of functional elements in the human genome, the FANTOM project explores gene expression in different species, and the BLUEPRINT project explores the epigenome. These projects involve thousands of scientists from all over the world who generate tremendous amounts of data and make them available to the public and to the research community.

We are currently collecting millions of data points about organisms, and the question is: what is the aim of it all? Thousands of articles have already been published and repositories keep increasing in size. Currently, the Gene Expression Omnibus database contains almost three million samples. Some solutions, such as ChIP Atlas, aim at summarising all of this information. Nevertheless, the heterogeneity and complexity of the data doom scientists to cope with barely readable graphics, if they even try to make sense of all of them. Should we rather see big data in biology as a very large and unconnected ensemble? Knowing that biologists seek to understand life in a cross-species manner and to deduce the general mechanisms ruling molecular and biological systems, the answer is probably no. Another vision is to consider research as the antechamber of medicine: the more information we have, the more likely it is that we will be able to develop therapeutic solutions. The problem with this vision is that it eliminates the essence of science. Improving human life is indeed an important goal, but history tells us that understanding phenomena without an obvious direct (commercial) application is just as important. From space exploration to quantum physics, knowledge has been crucial, not only for us to understand the world that we live in but also to push beyond our intellectual concepts and creativity.

Society is currently swimming in reductionism. Where the human mind cannot handle such an amount of information, scientists work very hard at developing computing solutions, such as machine learning, to define simpler rules that could explain the whole. This enterprise is not new: in ancient Greece, philosophers had already started elaborating a theory of everything. While it may make sense to try to unify different theories in physics, I think that undertaking such a goal in biology is misleading. Firstly, because there are no theories per se in biology, and secondly, because we will have to accept the limitations of the human brain. I am convinced that the simplification and storytelling required by the publication system are not serving our purpose. We have to embrace the fact that the observational aspect of biology imposes contextual thinking. We tend to forget that we are looking at these data in light of *reference* genomes. With the advances of the 1000 Genomes Project, our understanding of biology will probably become more subtle and more organism and tissue dependent, requiring even more effort to develop intelligible bioinformatics. However, as technology becomes more and more sophisticated, we should not forget the limits of our minds, and we should think twice before opening new scientific horizons.
