The first human genome sequence was published in 2001, yet eight years later major questions remain, such as how many genes the genome encodes and how many functional products those genes yield through phenomena like alternative splicing. The Encyclopedia of DNA Elements (ENCODE) project has been coordinated by the National Human Genome Research Institute (NHGRI) to answer these questions by comprehensively classifying functional elements in the human genome. The pilot phase of the project studied 1% of the genome in detail, revealing extensive transcription well beyond that predicted by classical gene models. The biological function of a significant portion of the discovered transcripts is unclear. The ENCODE project is now scaling up to examine the whole human genome. The results will likely echo those of the pilot project, revealing extensive transcription, a significant fraction of which has no known function. Proteomic technologies can be applied, in a process called proteogenomic mapping, to determine which of the myriad transcripts encode proteins. This approach has been used to reveal new genes, new alternative splice variants, new start sites, and upstream open reading frames (ORFs). While substantive progress has been made in developing proteogenomic mapping technologies, a significant hurdle in using proteogenomics to assist the ENCODE project is the lack of proteomic data sets that are coordinated with the ENCODE transcription mapping efforts. Here we propose to generate large-scale proteomic data sets directly from the same tier I ENCODE cell lines studied by the transcription efforts, and to coordinate the results with those efforts to determine which of the pervasive transcripts are translated.
Our specific aims are to: 1) produce large-scale proteomic data sets on ENCODE cell lines using the most advanced mass spectrometry methods, 2) use our database technologies to store, manage, and make accessible to the community all results of the project, and 3) use our software pipeline to map the results to the latest human genome drafts and present them as a UCSC (University of California, Santa Cruz) genome browser track. We believe the result will be a significant advance in knowledge about our genomes and the functional products they encode. PUBLIC HEALTH RELEVANCE: The human genome is the blueprint for human life and human health, but we do not yet understand its language, the language of genes. The ENCODE project is deciphering that language systematically, and the goal of this proposal is to accelerate that effort by revealing which parts of the blueprint contain instructions to build proteins.