The Computational Genomics Unit uses bioinformatic techniques to understand the molecular innovations that drove the surge of diversity in early animal evolution. Using phylogenetic and comparative genomic approaches, the Unit studies developmental proteins that play a fundamental role in the specification of body plan, pattern formation, and cell fate determination during metazoan development. The Unit also leads international efforts focused on the sequencing and analysis of the genomes of selected invertebrate species to better understand the relationship between genomic and morphological complexity, as well as the molecular basis for the evolution of novel cell types. Current genome sequencing efforts concentrate on species that can serve as models for the study of key questions related to regeneration, allorecognition, and stem cell biology, work that will be significantly advanced by the availability of high-quality whole-genome sequencing data from these organisms. The focus of our work during the past fiscal year has been on the sequencing and annotation of the cnidarian Hydractinia. Cnidarians such as Hydractinia have been extensively studied by researchers with an interest in fundamental biological processes such as whole-body regeneration, stem cell biology, aging, early development, body plan evolution, immunity, symbiosis, and allorecognition. What makes these simple systems particularly attractive for study is our observation that the genomes of cnidarians encode more homologs to human disease genes than do classic invertebrate models such as Drosophila and C. elegans (Maxwell et al., 2014), an observation that strongly positions cnidarians as powerful model systems for the study of biological phenomena such as pluripotency and lineage commitment. In addition, Hydractinia is well suited as an emerging model organism because it possesses interstitial cells (called 'i-cells') that are pluripotent.
There is a single i-cell population that maintains this pluripotency throughout its life cycle, and these cells are easily identified by their distinctive morphology. Finally, Hydractinia is colonial and possesses an allorecognition system, a feature of interest from the standpoint of understanding host-graft rejection. Given the experimental potential of these organisms, having sequence data in hand will allow for the use of comparative genomic approaches to examine orthology relationships between genes implicated in the maintenance of pluripotency, stem cell function, and regeneration, potentially leading to functional studies aimed at determining factors conserved between animal models for regeneration and humans. We are leading an international consortium that is currently generating whole-genome and transcriptomic data from two Hydractinia species. For H. echinata, we are sequencing a female wild-type strain. For H. symbiolongicarpus, we are sequencing two genomes using both PacBio and Dovetail sequencing technologies: one, a male wild-type strain, and the second, a female inbred strain currently being used by several groups in the context of allorecognition studies. Assemblies are continually being generated from these data as new sequencing data become available. In parallel, total RNA was extracted from multiple H. echinata developmental stages and polyp types. RNA-seq libraries were subsequently constructed, indexed, and pooled; they were then sequenced on multiple lanes of a HiSeq 2500 using version 4 chemistry to produce approximately 40 million paired-end reads. These RNA-seq transcript fragments will be mapped to the genome assembly using GSNAP and used for transcript annotation. A gene model prediction pipeline has been developed to identify candidate gene models consistent with available sequence data once sequencing and assembly are completed.
As part of our active collaboration with Matt Nicotra's group at the University of Pittsburgh focused on questions related to allorecognition, for which Hydractinia is a long-standing model system, we are sequencing and annotating a reference haplotype of the allorecognition complex (ARC). Currently, we have identified more than 4 Mb of ARC sequence, including scaffolds containing Alr1 and Alr2, two genes known to be responsible for colony fusion. We have also identified 16 additional Alr1-like genes, but no Alr2-like genes have been found. Going forward, we will use the full gene content of the ARC to search for potential synteny with other animal genomes. Traversing highly repetitive regions is one of the most challenging aspects of any whole-genome sequencing effort. Large tandem repeats such as ribosomal genes, segmental duplications, and telomeric repeats are found within the short arms of acrocentric chromosomes, and these sequences are usually missing from genome reference sequences. Interestingly, these repeat regions play a key role in cellular processes such as genome replication, cell proliferation, and the maintenance of genome integrity, so a thorough characterization of these regions is important to our understanding of cellular function. Here, we are characterizing the rDNA repeats found within the H. echinata and H. symbiolongicarpus genomes, making use of a novel methodology developed by a member of this group (Sofia Barreira) that was successfully used to extend the sequence content adjacent to the last human ribosomal gene (rDNA) cluster. Given that the overall repeat and AT content of Hydractinia is quite high, the application and refinement of these strategies, coupled with newly available long-read sequencing data generated in the course of the Hydractinia genome project, will allow for a much more comprehensive characterization of the centromeric, telomeric, and rDNA regions found within these de novo sequence assemblies.
This new methodology also has the potential to be more widely applicable in the context of other whole-genome sequencing efforts. From an algorithmic standpoint, we have developed a new methodology to identify groups of putative orthologs (i.e., orthogroups). The algorithms in current use fall into two distinct categories: tree-based and graph-based clustering methods. Tree-based methods may ultimately become the preferred solution when high-quality genomic assemblies are available for most organisms. However, for the time being, graph-based methods are faster, less sensitive to missing data, and do not require a high-quality species tree (which can be difficult to acquire). Of the different applications of graph theory that have been used for orthogroup classification, Markov clustering (MCL) has emerged as the dominant approach. Briefly, the MCL algorithm takes an all-by-all similarity graph of the sequences under study and then iterates over a series of simple matrix operations that assign sequences to clusters of highest similarity. In the current study, we have increased the overall resolving power of de novo MCL-based orthogroup assignment with a number of novel enhancements. These enhancements include refining the pairwise similarity metric, using a supervised heuristic to dynamically select MCL parameters, recursively subdividing orthogroups, and testing putative orthogroups for best-hit cliques to maximize resolution. We have chosen the name Recursive Dynamic Markov clustering (RD-MCL) for the method and the associated open-source software project to highlight the incorporation of these refinements. RD-MCL has been validated on simulated data and on manually curated protein families, including 1000 sequences from the family of metazoan gap junction forming proteins called pannexins. We observe substantial improvement over current graph-based methods (OrthoMCL and OMA) and have built a command-line user interface that is easy to install and operate.
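To make the basic MCL iteration described above more concrete, the following is a minimal Python sketch of standard Markov clustering on a small all-by-all similarity matrix. It illustrates only the generic expansion and inflation steps and is not the RD-MCL implementation: none of the RD-MCL refinements (similarity-metric refinement, dynamic parameter selection, recursive subdivision, best-hit clique testing) are shown. The function names, the unit self-loops, the inflation value of 2.0, the convergence tolerance, and the toy similarity values are all assumptions chosen for illustration.

```python
# Illustrative sketch of plain MCL (not RD-MCL). All names and default
# parameter values here are hypothetical choices, not taken from the project.

def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix, spreading flow along two-step walks."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation: raise each entry to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster the nodes of a symmetric similarity matrix with basic MCL."""
    n = len(similarity)
    # Assumption: unit self-loops are added, a common MCL convention.
    m = [[1.0 if i == j else float(similarity[i][j]) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:  # matrix has stopped changing
            break
    # Each row that retains flow is an attractor; the columns it draws
    # flow from form one cluster. Identical clusters are deduplicated.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


if __name__ == "__main__":
    # Toy data: two tight groups of three sequences joined by one weak hit.
    toy = [
        [0.00, 0.90, 0.80, 0.00, 0.00, 0.00],
        [0.90, 0.00, 0.85, 0.05, 0.00, 0.00],
        [0.80, 0.85, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.05, 0.00, 0.00, 0.90, 0.80],
        [0.00, 0.00, 0.00, 0.90, 0.00, 0.95],
        [0.00, 0.00, 0.00, 0.80, 0.95, 0.00],
    ]
    print(mcl(toy))
```

In this sketch, a larger inflation value tends to produce smaller, tighter clusters. This granularity parameter is the kind of setting that a dynamic parameter-selection heuristic, such as the one mentioned above, would be expected to tune.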