Colbourne et al. Science 331:555, 2011
Journal club by Antonio Marco
The paper in a sentence:
The crustacean Daphnia pulex has more than 30,000 genes, an expanded set that mostly originated through multiple tandem duplications; the maintenance of these duplicates is associated with functional diversification of paralogs and the co-expansion of genes from the same metabolic pathway.
Background:
Our knowledge of genes and genomes mostly comes from species of little relevance in real ecosystems. Colbourne et al. provide the genome sequence of Daphnia pulex, a keystone species in freshwater ecosystems. Moreover, this is the first crustacean genome to be fully sequenced, so it is key to understanding the origin and evolution of arthropod genes.
What the paper says:
The authors present the genome sequence of a Daphnia pulex strain called TCO (“the chosen one”), which has very low variability (nucleotide heterozygosity of approx. 0.14%). After sequencing and assembly, they covered about 80% of the nuclear genome at 8.7-fold coverage. They predicted 30,907 genes, which is a large number for an arthropod. They had additional evidence for 26,867 of these genes (ESTs, proteomic analyses, conservation…), suggesting that the initial set was reliable. Although Daphnia has a number of introns comparable to other invertebrates (excluding Drosophila), its introns have been reduced in size, as have its intergenic regions. This streamlining of the genome is in sharp contrast with the large increase in the number of protein-coding genes.
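To put the two gene counts above in proportion, here is a quick arithmetic check. It uses only the numbers quoted in this summary; the function and variable names are ours, for illustration.

```python
def supported_fraction(supported, predicted):
    """Fraction of predicted genes backed by additional evidence."""
    return supported / predicted


PREDICTED_GENES = 30907   # predicted gene models, as quoted above
SUPPORTED_GENES = 26867   # models with extra evidence, as quoted above

fraction = supported_fraction(SUPPORTED_GENES, PREDICTED_GENES)
unsupported = PREDICTED_GENES - SUPPORTED_GENES

print(f"Supported: {fraction:.1%} of predictions")
print(f"Without additional evidence: {unsupported} genes")
```

So close to 87% of the predictions have some independent support, and roughly 4,000 gene models rest on prediction alone.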
The evolutionary analyses of the protein-coding genes reveal that more than one third of them have no homologous counterparts in any other sequenced genome (Fig. 1A in the paper). These Daphnia-specific genes are often members of large multigene families (Fig. 1B), suggesting a Daphnia-specific gene expansion produced by high duplication rates. Figure 1C confirms this hypothesis. The number of synonymous substitutions among all pairs of duplicated genes in Daphnia suggests that no whole-genome duplication was involved in the process, and that the expansion was most likely produced by constant rates of gene birth and death (Fig. 1D). Since they observed Gene Conversion events in the hemoglobin genes (Fig. 2 in the paper), they corrected for them in their analyses, and the results remained unaltered. Gene Conversion (GC) is a non-reciprocal DNA recombination. When GC occurs extensively among multiple genes of the same family (non-allelic GC), their sequences are homogenized (that is, they all become very similar). Thus, GC is an important source of error in divergence time estimates.
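The logic behind reading Fig. 1D can be sketched with a toy example. With steady gene birth and death, the number of duplicate pairs should keep falling as synonymous divergence (Ks) increases, whereas a whole-genome duplication would leave a secondary peak of pairs with similar Ks. The code below is only an illustration of that reasoning, not the authors' method: the bin width, the peak criterion and the Ks values are all made up.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=1.0):
    """Count duplicate pairs per Ks bin (bin width is an arbitrary choice)."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def has_secondary_peak(counts):
    """Crude test: does any bin hold more pairs than the bin before it?"""
    return any(counts[i] > counts[i - 1] for i in range(1, len(counts)))


# Hypothetical Ks values, NOT data from the paper.
steady_turnover = [0.05] * 8 + [0.15] * 4 + [0.25] * 2 + [0.35]
with_old_burst = steady_turnover + [0.55] * 6  # e.g. one ancient large-scale event

print(ks_histogram(steady_turnover), has_secondary_peak(ks_histogram(steady_turnover)))
print(ks_histogram(with_old_burst), has_secondary_peak(ks_histogram(with_old_burst)))
```

Real data are noisy, so an actual analysis would need a statistical test rather than this bin-by-bin comparison.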
Expression analyses reveal that paralogs diversify their expression patterns soon after they emerge (Fig. 3A). In fact, newly duplicated genes already have (on average) an expression level twice as high as that of their paralog (Fig. 3B). The expansion of gene families is also associated with metabolic processes: they found 19 pancrustacean (insects and crustaceans) gene families whose expansion is overrepresented in certain metabolic subnetworks (Fig. 4).
What about the Daphnia-specific genes? It looks like these genes are more highly expressed under ecological conditions (that is, under exposure to biotic and abiotic stressors such as kairomones or cadmium), whereas conserved genes are more highly expressed in lab/standard conditions (Fig. 5A). However, a closer inspection of the genomic regions expressed under ecological conditions (using tiling arrays) showed that most transcripts come from intergenic regions (Fig. 5B). That means that more genes may remain to be discovered, and the authors proposed further analyses under ecological conditions to explore the functional part of the genome.
Putting all this together, the authors proposed a model for gene evolution that they call PBE (Preservation by Entrainment), described in Figure 6 of the paper. Under this model, two paralogs can have incompatible expression patterns after duplication, so one of them is lost. Alternatively, two duplicates can maintain the same expression pattern and, if the increase in dosage is beneficial for the species, both are retained. A third possibility is that one of the paralogs changes its expression pattern and interacts with a new partner, and this new association is beneficial for the host. Again, both genes are retained.
What we said about the paper:
This work involves several groups that systematically explored many aspects of the Daphnia genome. It was clearly a tour de force, not limited to obtaining the genomic sequence but also exploring the transcriptomic changes under different conditions and/or treatments. Thus, we would expect a paper of this kind to be published in a high-profile journal such as Science. However, most of our comments focused on the interpretation of the results rather than on their validity.
The first observation, that there are a lot of Daphnia-specific genes, is biased by the fact that this is the first crustacean genome to be sequenced. The sequencing of other crustaceans (outside the genus Daphnia) will undoubtedly decrease this proportion. It is true, however, that many of these genes have no known homolog in insects, and that may still indicate an expansion within the crustacean lineage.
One of the most controversial topics is the effect of Gene Conversion on the analyses. Lynch and Conery (2000) showed that the gene content of genomes is a consequence of a high, non-adaptive rate of gene turnover. However, Teshima and Innan (2004) explored how gene conversion affects studies that estimate the age of gene duplications. Consequently, the authors of the present paper corrected for Gene Conversion. However, Gene Conversion is often detected using conservative approaches to prevent false positives, so many true conversion events may not have been filtered out of the final datasets. We did not go deeper into the discussion because we did not know exactly how Gene Conversion was detected.
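Why undetected conversion matters for dating duplicates can be shown with a deliberately simplified sketch. It assumes a strict molecular clock (Ks = 2 × rate × time) and that a conversion event homogenizes the whole gene, resetting divergence to zero. Both are simplifying assumptions of ours, and the rate and ages are arbitrary illustrative numbers, not values from the paper.

```python
def observed_ks(duplication_age, rate, last_conversion_age=None):
    """Synonymous divergence between two paralogs under a strict clock.

    Ages are measured backwards from the present. A whole-gene conversion
    (assumed here) erases all divergence accumulated before it happened.
    """
    if last_conversion_age is None:
        diverging_time = duplication_age
    else:
        diverging_time = min(duplication_age, last_conversion_age)
    return 2.0 * rate * diverging_time


def estimated_age(ks, rate):
    """Duplication age inferred from Ks, ignoring gene conversion."""
    return ks / (2.0 * rate)


RATE = 0.001  # substitutions per synonymous site per time unit (arbitrary)

no_conversion = estimated_age(observed_ks(100.0, RATE), RATE)
converted = estimated_age(observed_ks(100.0, RATE, last_conversion_age=10.0), RATE)
print(no_conversion, converted)
```

In this toy case an old duplicate that was recently converted looks young, so unfiltered conversion events would inflate the apparent number of recent duplications.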
There was some discussion about the fact that all-against-all pairwise comparisons are used to calculate the age distribution of genes (Fig. 1D). Large, recently expanded gene families contribute many pairs each, so they may significantly reduce the average gene divergence and bias the results.
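The size of this effect is easy to illustrate: a family of n genes contributes n(n-1)/2 pairs to an all-against-all comparison, although it arose from only n-1 duplication events. The sketch below uses invented families in which, for simplicity, all pairs within a family share a single Ks value; none of these numbers come from the paper.

```python
from statistics import mean


def pairwise_comparisons(family_size):
    """Number of distinct gene pairs within one family."""
    return family_size * (family_size - 1) // 2


def pooled_mean_ks(families):
    """Mean Ks over all pairs, so each family is weighted by its pair count."""
    total_pairs = sum(pairwise_comparisons(size) for size, _ in families)
    weighted_sum = sum(pairwise_comparisons(size) * ks for size, ks in families)
    return weighted_sum / total_pairs


def per_family_mean_ks(families):
    """Mean Ks when every family counts once, whatever its size."""
    return mean(ks for _, ks in families)


# Hypothetical data: ten old two-gene families and one young 20-gene family.
toy_families = [(2, 0.5)] * 10 + [(20, 0.02)]

print(pairwise_comparisons(20))          # pairs from the single large family
print(pooled_mean_ks(toy_families))      # dominated by the large family
print(per_family_mean_ks(toy_families))  # each family counted once
```

Here the single young family supplies 190 of the 200 pairs, so the pooled average sits close to its Ks even though ten of the eleven families are old.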
A last comment concerns the interpretation that retained genes are enriched for certain metabolic pathways, meaning that genes co-duplicate in the same metabolic sub-network. However, this pattern is observed for fewer than 20 gene families, involved in 7 sub-networks. Whether these results can be generalized needs to be explored further.