17/02/2016 - 12:00 - Auditori PRBB

When Bigger is not Better ... The unfortunate tale of enormously large multiple sequence alignments

Scientific sessions, CRG Group Leader Seminars

Cédric Notredame

Comparative Bioinformatics Group, Bioinformatics and Genomics Programme, CRG


Cédric Notredame is a senior principal investigator at the Centre for Genomic Regulation (CRG) in Barcelona, which he joined in 2007. A molecular biologist by training, he graduated from Toulouse University and completed his PhD in Bioinformatics in 1998 at the EMBL-EBI (Cambridge, UK). After a postdoc at the NIMR-MRC in the lab of Willie Taylor, Cédric became a CNRS staff scientist (Marseille) and a group leader at the Swiss Institute of Bioinformatics (Lausanne).

His main interests lie in the development of novel algorithms for the multiple comparison of evolutionarily related sequences and subsequent phylogenetic or structural modelling. He is the lead author of T-Coffee, a popular multiple sequence alignment package able to combine structural information and sequences. Cédric is also well known for having co-authored with Jean-Michel Claverie the book “Bioinformatics for Dummies”, which has sold 40,000 copies worldwide. Over the last 20 years, Cédric has published more than 80 papers in peer-reviewed journals, which have received over 11,000 citations. Overall, Cédric knows of nearly 100 other useless metrics that could be used to describe his work.


Very large multiple sequence alignments (MSAs) are gradually becoming essential components of genomic analysis. In theory, these alignments could shed light on very ancient evolutionary processes. Unfortunately, the computation of high-quality MSAs is a complex process, hampered by many computational and biological difficulties. As a consequence, it relies on approximate progressive algorithms that are limited by computational complexity. Until recently, it was thought that this process would benefit from the extra information contained in larger datasets and would eventually deliver models whose accuracy increases with the number of sequences included.

I will show here that this is not the case and that the current algorithmic framework for MSA, even in its latest adaptations, is by nature unsuitable for both the computation of large MSAs and the estimation of the corresponding phylogenetic trees. I will conclude by proposing some alternative solutions to this complex and essential biological problem. The results presented here have been obtained through extensive computation. I will take the opportunity of this presentation to introduce Nextflow, our in-house tool for the deployment of complex computations across heterogeneous computational infrastructures. All the tools presented here are open source and available from