Using this criterion, we identified 33 rarely used codons and found that the relative occurrence decreased for 26 of these 33 rare codons when the old and new codon usage tables were compared. Six of the other seven rare codons were used at the same frequency and one, UAU, was used slightly more frequently. Conversely, the relative occurrence of 16 of the 28 more commonly used codons increased and seven others stayed the same. In the remaining five commonly used codons, the frequency went down but the frequency of the most common codon of that codon family increased in each case. This reduction in the use of common but second-choice codons would be consistent with the changes expected from the removal of noncoding regions from the annotation. In the remaining case, the frequency of both glutamate codons decreased, indicating that glutamate codons were over-represented in the regions that are no longer considered coding regions. Thus, these results are consistent with the idea that the two methods of identifying protein-coding genes allowed us to remove noncoding regions that had atypical codon usage patterns.
In this study, we demonstrated that combining a manual inspection with an automated evaluation of the C. crescentus genome annotation using MICheck resulted in the identification of more than 200 errors in the existing annotation. Each evaluation method found annotation errors that were not identified by the other method. Therefore, it appears that our manual approach checks for patterns based on third position GC content that are not assessed by MICheck. However, MICheck was able to identify annotation errors that our manual approach should have detected but that escaped our attention. This problem with the manual analysis could be corrected by automating our manual pattern recognition approach.
The program would first calculate the third position GC content for each of the six possible reading frames, excluding regions with low overall GC content. It would then compare the positions of the regions of high third position GC content to the positions of the annotated coding regions and generate a file of regions where a one-to-one correspondence was absent. If no annotated coding region was detected opposite a high third position GC peak, the open reading frames in the region would be examined for an appropriate match. If a matching ORF was identified, the corresponding amino acid sequence would be compared to the NCBI database using BLAST, and the presence of significant matches in the database would verify that the ORF coded for a protein. Similarly, if no high third position GC peak was present for a particular annotated coding region and the flanking genes did have high third position GC peaks, the corresponding amino acid sequence would be compared to the NCBI database using BLAST, and the absence of significant matches in the database would suggest that the ORF was unlikely to code for a protein.
Where a high third position GC peak began downstream from the beginning of the annotated coding region, the corresponding amino acid sequence would also be compared to the NCBI database using BLAST.
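The first two steps of the proposed program (calculating third position GC content in all six reading frames, then listing regions that lack a one-to-one correspondence with the annotation) can be illustrated with a minimal Python sketch. It is not an existing implementation. The function names, the 30-codon sliding window, and the two thresholds (gc3_min, overall_gc_min) are assumptions chosen for illustration, and genes are represented only as hypothetical (start, end) coordinate pairs. Because complementing a base does not change whether it is G or C, each reverse-strand frame mirrors the values of one forward frame, so the sketch matches peaks to annotations by position only. The ORF examination and BLAST comparisons are left out.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # half-open [start, end) in forward-strand coordinates
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(bases: str) -> float:
    """Fraction of G or C in a string of bases (0.0 for an empty string)."""
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def overlaps(a: Region, b: Region) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def merge(regions: List[Region]) -> List[Region]:
    """Merge overlapping or touching regions."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def high_gc3_regions(seq: str, frame: int, window_codons: int = 30,
                     gc3_min: float = 0.85,
                     overall_gc_min: float = 0.5) -> List[Region]:
    """Regions of one frame (0, 1 or 2) with high third position GC content.

    Windows with low overall GC content are skipped. All three default
    values are illustrative assumptions, not values from the study.
    """
    width = window_codons * 3
    hits: List[Region] = []
    for start in range(frame, len(seq) - width + 1, 3):
        window = seq[start:start + width]
        if gc_fraction(window) < overall_gc_min:
            continue
        if gc_fraction(window[2::3]) >= gc3_min:  # third codon positions
            hits.append((start, start + width))
    return merge(hits)


def six_frame_regions(seq: str, **kw) -> Dict[Tuple[str, int], List[Region]]:
    """High GC3 regions for the three forward and three reverse frames."""
    n = len(seq)
    rc = seq.translate(COMPLEMENT)[::-1]
    out: Dict[Tuple[str, int], List[Region]] = {}
    for frame in range(3):
        out[("+", frame)] = high_gc3_regions(seq, frame, **kw)
        # Map reverse-strand coordinates back onto the forward strand.
        out[("-", frame)] = sorted(
            (n - e, n - s) for s, e in high_gc3_regions(rc, frame, **kw))
    return out


def compare_to_annotation(seq: str, genes: List[Region],
                          **kw) -> Tuple[List[Region], List[Region]]:
    """Return (peaks with no annotated gene, genes with no peak)."""
    by_frame = six_frame_regions(seq, **kw)
    peaks = merge([r for regions in by_frame.values() for r in regions])
    unannotated = [p for p in peaks
                   if not any(overlaps(p, g) for g in genes)]
    unsupported = [g for g in genes
                   if not any(overlaps(p, g) for p in peaks)]
    return unannotated, unsupported
```

In this sketch, the two lists returned by compare_to_annotation correspond to the two kinds of mismatch described above: high third position GC peaks with no annotated coding region, whose ORFs would be candidates for a BLAST search, and annotated coding regions with no peak, which would be candidates for removal if BLAST found no significant matches.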