|
Figure 1. The structure of the SpTrf gene loci. (A) A representative map of the SpTrf loci. Locus 1 has allelic regions with unequal numbers of genes. Although Clusters 3 and 4 in Locus 2 appear identical, the different sizes of the flanking STR islands indicate that these two clusters are allelic. The colored polygons indicate the SpTrf genes located in the clusters, with the pointed end of the polygon indicating the transcription direction. GA STRs (green triangles) and GAT STRs (black triangles) flank each gene, and large GA STR islands flank Clusters 3 and 4 in Locus 2. The black horizontal line indicates the DNA extending from the 5ʹ and 3ʹ ends of the clusters. The colored angle-arrows in Locus 2 indicate the regions amplified by PCR and correspond to the colored bars over lanes in the DNA gels shown in (B, C). (B) Clusters 3 and 4 in Locus 2 are the same size. The BAC templates for PCR are indicated above the lanes as M18 (BAC 10M18; Cluster 3) or P4 (BAC 3104P4; Cluster 4). Amplicons of Clusters 3 (BAC 10M18) and 4 (BAC 3104P4) are of identical size. (C) Clusters 3 and 4 in Locus 2 have varying sizes of large GA STR islands. PCR was carried out for the P4 and M18 BAC clones to amplify the GA STR islands. M indicates the all-purpose Hi-Lo DNA marker (BioNexus), and sizes of the relevant bands are indicated to the left in (B, C).
|
|
Figure 2. The IGRs of Clusters 1 and 2 in Locus 1 include non-matching sequences. (A, B) Dot plots show the comparison between Cluster 1 and Cluster 2. The dot plot in (A) employed a preset e-value threshold of 10 whereas the dot plot in (B) employed an e-value threshold of 10,000. The central diagonal in the dot plots indicates the mostly identical sequence of the allelic regions, while lines offset from the central diagonal indicate repeats that are highly similar in either a tandem (green) or inverted (red) orientation. Highlighted, colored vertical bars in (A) indicate the locations of mismatched sequences between the two clusters. The blue and red bars show the locations of sequences in Cluster 2 that are absent from Cluster 1 and the green box indicates a region of complete dissimilarity. The arrow between the red bar and green box in (A) indicates a region of similarity that is located between the two regions of dissimilarity. The black boxes in (A, B) are expanded in (C) to show details. YASS8 was used to generate dot plots with standard parameters (scoring matrix = +5, -4, -3 -4: composition bias correction: gap costs = -16, -4: X-drop threshold = 30). (C) The IGRs located between the A2/a and B8/a genes in Clusters 1 and 2 are a mixture of similar and dissimilar sequences. The red and orange polygons indicate the SpTrf genes, A2/a and B8/a, that flank these IGRs. GA STRs (green triangles) surround each gene. The horizontal black line indicates the DNA that extends from the 5ʹ and 3ʹ ends of each gene. The lengths of the IGRs between the A2/a and B8/a genes are indicated by upper and lower brackets. The sizes of the areas within the IGRs are indicated by colors that are coordinated when similar. The red and white striped region is a sequence that is only present in Cluster 2 and corresponds to the red bar in (A). The yellow region is a short area of similarity, and the area of complete dissimilarity is shown as a polygon of a red and purple gradient. This figure is not drawn to scale.
|
|
Figure 3. All SpTrf genes are in frame, have identifiable TATA and Inr sites, have one or more stop codons, and most can be aligned with the previously established repeat-based and cDNA alignments. (A) A representative map of the genes shows the 5ʹ UTR, exons, intron, and 3ʹ UTR. The 5ʹ and 3ʹ UTRs are indicated by white rectangles, the two exons are indicated as striped rectangles, and the intron is indicated by a solid black line. The range in lengths of the 5ʹ UTR among genes is indicated. The four colored boxes in the 5ʹ UTR indicate putative 5ʹ regulatory elements and their locations + or – of the conserved +1 A of the transcription start site (red). The TATA box (yellow), the Inr (blue), and the ATG translation start (green) are indicated. The 3ʹ UTR is variable in length among genes and is indicated by colored brackets showing the four possible locations of stop codons, which are labeled in lowercase ‘a’-‘d’. (B) The cDNA alignment of genes from the four clusters. The manual alignment was done in BioEdit by adding the genes in the clusters to a pre-aligned set of cDNAs and genes according to previous publications (12, 19). All possible elements are numbered at the top and the four possible stop codons are indicated in element 26. The leader (L), the intron (Int), elements (colored rectangles), and gaps (horizontal lines) are indicated for each gene. Intron type and subtype of element 15 are labeled within each respective rectangle. (C) The repeat-based alignment of the genes from the three clusters. The manual alignment was done as in (B) according to Buckley and Smith (19). All possible elements are numbered at the top and the four stop codons are indicated in element 27. The leader, intron, intron type, elements, subtype of element 10, and gaps are indicated as in (B). The six types of repeats in the gene sequence are indicated by rectangles of identical color at the bottom.
|
|
Figure 4. Sequence similarities among the SpTrf genes and their putative evolutionary relationships are revealed by similar structures of maximum likelihood trees. Alignments were performed with PRANK, and phylogenetic analysis was completed in MEGA7. Phylogenetic trees were generated using three approaches: neighbor joining, maximum parsimony (see Supplementary Figures S1–S4), and maximum likelihood (shown), all of which resulted in similar tree structures. Colored boxes shown in the legend indicate the cluster in which the gene is located. Bootstrap values from 500 iterations are indicated for each tree. (A) The intron tree. The intron types (indicated by α-ϵ labels for separate clades) for each gene were identified using a previously published alignment of introns (19) with the introns from HeTrf genes defined as the outgroup. (B) The exon 2 tree. Exon 2 from each gene was aligned and the exon 2 sequences from HeTrf genes were defined as the outgroup. (C) The 5ʹFR tree. The 5ʹFR for each gene was selected using GenePalette and corresponded to 400 nt upstream of the start codon. The 5ʹFR of the LvTrf gene was used as the outgroup. (D) The 3ʹFR tree. The 3ʹFR for each gene was selected using GenePalette and corresponded to 400 nt downstream of the stop codon. The 3ʹFR of the LvTrf gene was used as the outgroup. (E) The concatenated 5ʹ-3ʹFR tree. The 5ʹFRs and 3ʹFRs used in (C, D) were aligned and then concatenated prior to phylogenetic analysis. The concatenated 5ʹ-3ʹFR of the LvTrf gene was used as the outgroup.
|
|
Figure 5. Relatedness among the genes can be inferred from percent mismatch scores for the 5ʹFR, exon 1, the intron, exon 2, and the 3ʹFR. Pairwise comparisons among all genes are shown for the FRs, exons, and the intron. The X axis in the graphs indicates the calculated percent mismatch score for each pair of genes and the Y axis indicates the region in the gene. Solid lines, dashed lines, and dotted lines are used to identify the pairs of genes compared. The color of the line corresponds to the color of the genes shown in Figure 1A with the exception of the D1 genes, which are all shown as green lines. Below each graph is a table that gives the percent mismatch scores for each region graphed above. (A) The A2 genes vs. other SpTrf genes. (B) The average percent mismatch of D1 genes vs. other SpTrf genes. (C) The B8 genes vs. other SpTrf genes. (D) The C4 genes vs. other SpTrf genes. (E) The E2 genes vs. other SpTrf genes. (F) The 01 genes vs. other SpTrf genes. Percent mismatch [pairwise distance/Ln2] was calculated from the pairwise distance matrix scores generated with MEGA7 using the PRANK codon alignment.
|
|
Figure 6. Representative dot plots of Cluster 3 compared to other SpTrf clusters indicate that the GA STRs are the likely edges of the segmental duplications in both loci. (A) A portion of Cluster 1 shows the region with the segmental duplications and the D1 genes. The previously reported D1 segmental duplications are indicated at the top of the figure [red dotted brackets (30)] and the revision to the proposed segmental duplications is indicated on the bottom (black brackets). The arrow indicates the directional shift of the proposed edges of the segmental duplications. (B–D) Representative images of a gene cluster or portion of a gene cluster are located to the left and top of each dot plot. Colored polygons indicate genes and transcriptional direction. Green triangles represent GA STRs and black triangles represent GAT STRs. The central diagonal in (A) shows the main alignment of Cluster 3 against itself, while lines that are offset from the central diagonal in all dot plots indicate the locations of repeats or highly similar regions. Diagonal dark green lines indicate similar regions in the same orientation, whereas dark red solid lines indicate regions of inverse orientation. The highlighted horizontal and vertical lines of multiple colors (matching the genes at the top or side) are added to the dot plots to illustrate the location of matched sequences. Dark green areas indicate the locations of GA STRs and dark gray areas indicate the locations of the GAT STRs. (B) Cluster 3 vs. Cluster 3. (C) Cluster 3 vs. a subset of genes in Cluster 1. (D) Cluster 3 vs. a subset of genes in Cluster 2. YASS8 was used to generate dot plots with standard parameters (scoring matrix = +5, -4, -3 -4: composition bias correction: gap costs = -16, -4: e-value threshold = 10: X-drop threshold = 30).
|
|
Figure 7. The percent mismatch between regions of the SpTrf genes suggests that the segmental duplications include the B8 and C4 genes with the D1 genes. Alignments of the B8/a::D1y/d-IGRs, the D1 IGRs, the D1b/e::E2/a-IGRs, and the C4/a::D1h/f-IGRs, plus the alignment of the 5ʹFR, exon 1, the intron, exon 2, and the 3ʹFR for all SpTrf genes were done with PRANK. (A) The pairwise percent mismatch scores for IGRs indicate the level of sequence similarity. Percent mismatches were calculated from pairwise diversity scores in MEGA7 and are indicated with the color gradient legend. There are no mismatch scores between 30 and 40%. (B) A graphical representation shows levels of sequence similarity among genes and IGRs based on percent mismatch scores against the D1f/h genes. All genes are oriented in the same direction as indicated by the pointed polygon labeled with the gene name. From left to right across the figure are blocks that represent the IGRs, GA and/or GAT STRs, the 5ʹFR, exon 1, the intron (narrow region), exon 2 with the gene name, and the 3ʹFR. The thin dotted lines indicate how the sequences are linked together in their respective clusters and do not indicate sequence. The double bars in some IGRs indicate sequence that was not analyzed and is not shown. Percent mismatches for all blocks are color coded based on the gradient key. (C) The percent mismatch values for the regions of all SpTrf genes compared to the D1f/h genes indicate regions of similarity and dissimilarity. Results are color coded according to the gradient key.
|
|
Figure 8. Comparisons of the IGRs between the A2, E2, 01, and D1f/h genes identify short regions of similarity. (A) The D1f/h::GA-IGRs and E2/a::E2b/01-IGRs are compared to each other, and both are compared to the 5ʹ and 3ʹ ends of the A2/a genes (indicated by red brackets). (B) The D1f/h::GA-IGRs, D1d/e::E2/a-IGRs, and the E2/a::E2b/01-IGRs are compared. Genes are indicated by polygons labeled with the gene name, are colored according to Figure 1A, and are flanked by UTRs (open boxes). The genomic DNA is indicated by horizontal black lines that pass behind the genes and include the IGRs and flanking regions. Genes without color were not included in the analysis and are shown for orientation and comparison to Figure 1A. GA (green triangles) and GAT (black triangles) STRs are indicated. The colored boxes above and below the black horizontal line indicate regions of similarity as identified from dot plots from the YASS genomic similarity search tool set to a threshold of e-20. Areas of shared sequence among IGRs are numbered for clarity; see text for detailed description. Dotted lines connect the regions of similarity between IGRs including regions in the same (black lines) and inverted (red lines) orientation. This figure is drawn to scale. * indicates regions of similarity among all three alignments in (A).
|
|
Figure 9. A model for the theoretical evolutionary history of the SpTrf gene clusters in the sequenced genome based on gene duplications, ectopic insertions, and deletions. Each step in this theoretical evolutionary history of the gene clusters is indicated on the right with numbers and labeled on the left (A–G). Genes and their direction are indicated by colored polygons and labeled with the gene name. The prime (ʹ) associated with a gene name indicates a hypothetical LCA version of the gene. Gene polygons without color indicate genes that are proposed to exist but whose element pattern cannot be determined. Variations in or changes to gene colors indicate internal point mutations/insertions/deletions during the lineage of a particular gene. GA (green triangles) and GAT (black triangles) STRs are shown. The horizontal gray line indicates the IGRs that flank the genes. Open boxes surrounding the genes in (A, B) indicate edges of proposed duplication regions. Curved green arrows indicate duplications and ectopic insertions, dotted black arrows indicate the transition from an ancestral SpTrf gene to specific SpTrf gene lineages, and straight black arrows indicate duplications of genes over time. The shaded box shows the ectopically inserted region in Locus 1. Brackets connected with double-ended arrows indicate a recent duplication event. Large black Xs indicate gene deletions. This figure is not drawn to scale.
|