Genome sequencing projects have provided a foundation for a new biology centered on the molecular representation of genes and proteins as sequences and structures in computers. The parallel development of genome science, bioinformatics, the Internet and desktop "supercomputers" has helped bring this revolution to academic, industry and government labs worldwide. Unfortunately, the adverse impact of database errors on experimental science is easily demonstrated. The broad, long-term objective of this proposal is to continue to improve the reliability of the genome and proteome sequences of the model organism Escherichia coli K-12, leading to a Gold Standard Reference Strain for prokaryotic organisms, especially Gram-negative pathogens. This proposal focuses on improving the accuracy of the E. coli genome and proteome. The specific aims are: (1) to ensure the continued maintenance, improvement and expansion of EcoGene, a primary data repository for the continually revised E. coli genome and proteome sequences and their annotations; EcoGene also serves as the systematic ORF nomenclature registry for E. coli K-12 and is part of an annotation-sharing collaboration among the Coli Genetic Stock Center at Yale, the Colibri database at the Pasteur Institute, and SWISS-PROT; (2) to establish two Indexer positions for expert surveillance of electronic and legacy journals, to ensure that newly published and pre-released functional data about E. coli genes are entered promptly and accurately into EcoGene, then released to the public and to partner databases; (3) to augment electronic data collection with bioinformatics analysis to (a) discover new evolutionary relationships, thus improving functional predictions, and (b) detect and report internal and external database errors, including DNA and protein sequence errors, which are often detected and resolved during analysis-anomaly-refinement-reanalysis (AARR) cycles; and (4) to use laboratory studies to (a) resolve remaining DNA frameshift errors in the E. coli K-12 genome by re-sequencing, (b) verify ambiguous protein starts, and (c) verify the secreted (periplasmic and outer membrane) proteome. Accurate annotation of the E. coli genome is necessary in its own right, because E. coli is the best-understood cellular organism, and because it provides the foundation for the analysis of bacterial genomes whose characterization will be crucial for the development of biological and chemical defenses against bacterial bioterrorism.