The specific aim of this proposal is to annotate all evidence-based gene features on the human genome reference sequence with high accuracy. This includes identifying all protein-coding loci with their associated alternative variants, non-coding loci that have transcript evidence available in the public nucleotide databases (NCBI/EMBL/DDBJ), and pseudogenes. To achieve this goal we will integrate computational approaches (including recent comparative methods), expert manual annotation (which is able to incorporate information from the literature), and targeted experimental approaches. Based on the exhaustive experimental and computational investigation of our initial GENCODE annotation of the ENCODE regions, we are confident that we can deliver a gene set with high specificity and sensitivity that will provide critical information to other biologists and other ENCODE groups. As part of this process we will label all apparent gene loci clearly, classifying them according to their likely current functional status, so that users are informed where regions that appear gene-like are most likely pseudogenes, or where transcript evidence is most likely artefactual. A number of motivated groups are working on defining protein-coding genes for the human genome. This proposal includes most of these groups and coordinates with other key groups. Critically, all the groups bring extensive experience in data integration and evaluation, which allows annotation discrepancies to be resolved by multiple approaches. This gives us confidence that through this integrated project we will be able to eliminate many of the remaining uncertainties about the precise locations of genes, their component exons, and their transcript structures in the human genome. Genome-wide, highly accurate transcript definition will be of enormous value to the many researchers working on the human genome.
It will both yield large cost savings worldwide, through increased specificity of reagent design, and provide a more complete view of human genes, in particular those associated with disease. This foundation will support more accurate descriptions of the genetic causes of disease.