DESCRIPTION: The modENCODE project is a key sequel to the sequencing of the fly and worm genomes, and will have an enormous impact on our understanding of biological processes in all higher eukaryotes, including humans. In order to manage the diverse, large-scale datasets that will be produced by modENCODE, we propose to create a data coordinating center (DCC) to track the data, integrate it with other information sources, and make it available to the research community in a timely and open fashion. This proposal brings together four groups with highly relevant backgrounds: The Micklem group, through its work on the InterMine system and FlyMine database, has extensive experience in integrating diverse types of data into high-performance data mining systems. The Stein and Lewis groups bring to the project an intimate familiarity with the C. elegans and D. melanogaster genomes, their reagents and research communities, and are well-positioned by their work with the WormBase and FlyBase databases to liaise with those MODs. The Kent group is responsible for the DCC for the Human ENCODE pilot project, and has extensive practical knowledge of developing and managing projects of this sort. We will assemble a team of three data managers, stationed at CSHL and at Berkeley, who have a background in the bioinformatics of C. elegans and/or D. melanogaster. The managers will liaise with their contacts at the data provider sites to determine data file formats, milestones and quality control procedures for their datasets. They will also liaise with representatives from NCBI to coordinate modENCODE activities with the primary data repositories at GenBank and GEO. Data providers will upload their datasets to a staging server where they will be able to preview their data on an instance of the GBrowse genome browser. The data managers will perform quality control (QC) on the data before approving its transfer to the production database.
Data will be integrated in the production database using InterMine, and from there released to the public on a monthly schedule. Researchers will be able to access the data via the GBrowse genome browser, via bulk downloads, and via complex queries and reports mediated by InterMine and the BioMart data warehousing system. All major software systems used by the proposed DCC will be based on open source tools from the Generic Model Organism Database (GMOD), human ENCODE, and other sources. Throughout the project, Lewis and Stein will work closely with FlyBase and/or WormBase to ensure that data collected by modENCODE becomes an integral part of the relevant model organism database. In addition, we will dedicate a significant part of a data manager's effort to transferring data from modENCODE into the MODs during the last year of the project.