Maximal matching, gap statistics, complete link and the reconstruction of ancient flowering plant genomes.
Speaker: David Sankoff – Ontario, Canada
Topic(s): Applied Computing
Abstract
Starting with the phylogeny, or family tree, of a plant order, and the linear ordering of the 20,000 to 50,000 genes on the chromosomes of some typical species in this order, we wish to infer gene content and gene ordering for each chromosome in their common ancestral genome as well as in the ancestor of each portion of the phylogeny.
We first identify each gene in each ancestral genome with a member of a restricted set of ancestral homologs, by way of a surjective orthology mapping, and rewrite each genome in terms of its images under this mapping. We identify a large number of gene adjacencies from all chromosomes in the set of input genomes. Since each gene has two distinct ends, the 5′ end and the 3′ end, each adjacency consists of two ends. For example, the three genes A, B, C ordered 5′A-3′A, 3′B-5′B, 5′C-3′C would generate two adjacencies, 3′A 3′B and 5′B 5′C, while 5′A and 3′C would be linked to ends of other genes.
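The adjacency bookkeeping can be made concrete with a minimal sketch. The representation is an assumption made for illustration, not something specified in the abstract: a chromosome is a list of (gene, strand) pairs, a gene end is a (gene, "5'") or (gene, "3'") tuple, and the function names `gene_ends`, `adjacencies` and `count_adjacencies` are hypothetical.

```python
from collections import Counter

def gene_ends(gene, strand):
    """Return (left_end, right_end) of a gene as read along the chromosome.

    Assumption: strand "+" means the gene is read 5' to 3', "-" the reverse.
    """
    five, three = (gene, "5'"), (gene, "3'")
    return (five, three) if strand == "+" else (three, five)

def adjacencies(chromosome):
    """List the adjacencies between consecutive genes on one chromosome."""
    result = []
    for (g1, s1), (g2, s2) in zip(chromosome, chromosome[1:]):
        _, right_of_first = gene_ends(g1, s1)
        left_of_second, _ = gene_ends(g2, s2)
        # An adjacency is an unordered pair of gene ends.
        result.append(frozenset([right_of_first, left_of_second]))
    return result

def count_adjacencies(genomes):
    """Count how often each adjacency occurs over all chromosomes of all genomes."""
    counts = Counter()
    for genome in genomes:
        for chromosome in genome:
            counts.update(adjacencies(chromosome))
    return counts

# The A, B, C example from the text: A forward, B reversed, C forward.
example = [("A", "+"), ("B", "-"), ("C", "+")]
for adjacency in adjacencies(example):
    print(sorted(adjacency))
```

Run on the example, the two adjacencies produced are 3′A 3′B and 5′B 5′C; the ends 5′A and 3′C stay free because nothing lies next to them on this chromosome.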
Because of the many genome rearrangement mutations during evolution, there are an order of magnitude more adjacencies than genes. To pick out a plausible subset of the adjacencies, we use a Maximum Matching algorithm, where the vertices are the adjacencies and edges join any two vertices that contain the 5′ end and the 3′ end of the same gene, respectively. The matching yields inferred linear ancestral contigs, each containing up to several hundred genes.
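The step from selected adjacencies to contigs can be sketched as follows. This is a stand-in under stated assumptions, not the algorithm used in the lecture: it treats gene ends as vertices and candidate adjacencies as weighted edges (a different but common way to set up the graph), assumes each adjacency carries a weight such as its observed frequency, and replaces a true maximum matching with a greedy pass that only guarantees a maximal matching. The names `greedy_matching` and `build_contigs` are hypothetical.

```python
def greedy_matching(weighted_adjacencies):
    """Greedily select adjacencies by descending weight.

    Each gene end is used at most once, and an adjacency that would close
    a circular contig is rejected. Greedy stand-in, not an exact matching.
    """
    parent = {}

    def find(gene):
        # Union-find over genes: genes in one set lie on the same contig.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    used, chosen = set(), []
    ordered = sorted(weighted_adjacencies.items(),
                     key=lambda item: (-item[1], sorted(item[0])))
    for adjacency, _weight in ordered:
        if len(adjacency) != 2:
            continue
        end1, end2 = sorted(adjacency)
        if end1 in used or end2 in used:
            continue
        root1, root2 = find(end1[0]), find(end2[0])
        if root1 == root2:
            continue  # both ends already on one contig: would form a circle
        parent[root1] = root2
        used.update((end1, end2))
        chosen.append((end1, end2))
    return chosen

def build_contigs(genes, chosen):
    """Chain the chosen adjacencies into linear contigs of (gene, strand)."""
    partner = {}
    for end1, end2 in chosen:
        partner[end1], partner[end2] = end2, end1
    other = {"5'": "3'", "3'": "5'"}
    contigs, seen = [], set()
    for gene in sorted(genes):
        for end in ("5'", "3'"):
            if gene in seen or (gene, end) in partner:
                continue
            # (gene, end) is a free end, so a contig starts here.
            contig, current, entry = [], gene, end
            while True:
                seen.add(current)
                contig.append((current, "+" if entry == "5'" else "-"))
                exit_end = (current, other[entry])
                if exit_end not in partner:
                    break
                current, entry = partner[exit_end]
            contigs.append(contig)
    return contigs

# Hypothetical weights: three candidate adjacencies competing for gene ends.
weights = {
    frozenset({("A", "3'"), ("B", "3'")}): 3,
    frozenset({("B", "5'"), ("C", "5'")}): 2,
    frozenset({("A", "3'"), ("C", "5'")}): 1,
}
chosen = greedy_matching(weights)
contigs = build_contigs(["A", "B", "C", "D"], chosen)
print(contigs)
```

In this toy input the weight-1 adjacency is dropped because 3′A is already taken, leaving one contig A, B (reversed), C and the singleton D. An exact maximum weight matching could select a different subset on harder inputs, which is the main limitation of the greedy shortcut.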
To group the contigs into clusters reflecting ancient chromosomes, we match each one against the chromosomes of the given genomes and count the number of times any two contigs match the same chromosome. The resulting co-occurrence matrix, smoothed by a correlation analysis of pairs of contigs, is then submitted to a complete-link clustering analysis to collect the contigs, and hence the gene content, appropriate to each hypothetical ancestral chromosome. Uncertainty about the number of clusters, i.e., the number of ancestral chromosomes, is resolved by adapting gap statistics to determine when adding more clusters leads to a significant improvement and when it simply reflects overfitting.
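The clustering pipeline described here can be sketched in a few lines of plain Python. Several details are assumptions made for illustration: the input `matches` maps each contig to the set of (genome, chromosome) pairs it matches, the smoothing is taken to be the Pearson correlation between rows of the co-occurrence matrix, the distance is 1 minus that correlation, and the number of clusters `k` is supplied by hand rather than chosen by a gap statistic. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cooccurrence(matches):
    """Count, for every pair of contigs, the chromosomes they both match."""
    names = sorted(matches)
    matrix = [[len(matches[a] & matches[b]) for b in names] for a in names]
    return names, matrix

def pearson(x, y):
    """Pearson correlation of two equal-length rows (0.0 if a row is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def complete_link(names, matrix, k):
    """Agglomerative complete-link clustering down to k clusters."""
    dist = {frozenset((i, j)): 1.0 - pearson(matrix[i], matrix[j])
            for i, j in combinations(range(len(names)), 2)}

    def linkage(pair):
        # Complete link: cluster distance is the largest pairwise distance.
        a, b = pair
        return max(dist[frozenset((i, j))] for i in a for j in b)

    clusters = [[i] for i in range(len(names))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(names[i] for i in cluster) for cluster in clusters)

# Hypothetical matches of four contigs against two input genomes.
matches = {
    "c1": {("g1", "chr1"), ("g2", "chr3")},
    "c2": {("g1", "chr1"), ("g2", "chr3")},
    "c3": {("g1", "chr2"), ("g2", "chr4")},
    "c4": {("g1", "chr2"), ("g2", "chr4")},
}
names, matrix = cooccurrence(matches)
clusters = complete_link(names, matrix, k=2)
print(clusters)
```

Complete link is the conservative choice among linkage rules: two groups of contigs merge only if every contig in one is close to every contig in the other, which tends to produce compact clusters.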
Once the contig content of each chromosome is posited, the data on the relative order of each pair of contigs on a chromosome is submitted to a Linear Ordering Problem routine to locate them along the chromosome.
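As an illustration of what a Linear Ordering Problem asks for, the sketch below scores every permutation of a handful of contigs and keeps the best one. The input format is an assumption: `precedes[(a, b)]` counts how often contig a is observed before contig b in the input genomes. Exhaustive search is only feasible for a few contigs, so this is a toy solver rather than the routine used in the lecture, and it ignores contig orientation. The name `linear_order` is hypothetical.

```python
from itertools import permutations

def linear_order(items, precedes):
    """Return the ordering that maximizes the total agreeing pairwise evidence."""
    def score(order):
        # Sum the evidence for every pair placed in the order (a before b).
        return sum(precedes.get((a, b), 0)
                   for i, a in enumerate(order)
                   for b in order[i + 1:])
    return max(permutations(sorted(items)), key=score)

# Hypothetical pairwise order counts for three contigs on one chromosome.
precedes = {
    ("c2", "c1"): 3, ("c1", "c2"): 1,
    ("c1", "c3"): 4, ("c2", "c3"): 2,
}
best = linear_order(["c1", "c2", "c3"], precedes)
print(best)
```

Here the ordering c2, c1, c3 agrees with 9 of the recorded observations, more than any other permutation. For realistic numbers of contigs the search space grows factorially, so heuristic or integer programming methods would be needed.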
Applying this methodology to the genomes of seventy species, independently in eleven major plant orders in the Rosid and Asterid groups of flowering plants, reveals that the ancestors of all of them had highly similar genomes in terms of the basic chromosome complement.
About this Lecture
Number of Slides: 25
Duration: 45 - 50 minutes
Languages Available: English, French
Last Updated: