The Computational Genomics Unit uses bioinformatic techniques to understand the molecular innovations that drove the surge of diversity in early animal evolution. Current genome sequencing efforts concentrate on species that can serve as models for studying key questions in regeneration, allorecognition, and stem cell biology, work that will be significantly advanced by the availability of high-quality whole-genome sequencing data from these organisms. From a broader perspective, this work is intended to establish new model organisms with the potential to inform important questions in human biology and human health.

Given their great potential as new animal models for human disease, we are actively sequencing and annotating the genomes of two cnidarian species: Hydractinia echinata, which is used to study regeneration and stem cell biology, and Hydractinia symbiolongicarpus, which is used in allorecognition studies. What makes Hydractinia particularly well suited as an 'emerging model organism' is that it possesses a specific type of interstitial cell (or 'i-cell') that is pluripotent and provides the basis for tissue regeneration, expressing genes whose bilaterian homologs are known to be involved in stem cell biology. Hydractinia is also colonial, possessing an allorecognition system that may provide insights into important questions related to host-graft rejection. Given this experimental potential, having high-quality, long-range sequence data in hand will allow us to use comparative genomic approaches to examine orthology relationships between genes implicated in the maintenance of pluripotency, stem cell function, and regeneration, potentially leading to functional studies aimed at determining factors conserved between animal models for regeneration and humans.

Preliminary sequencing data indicate an estimated genome size of 774 Mb for H. echinata (84x coverage) and 514 Mb for H. symbiolongicarpus (94x coverage); both genomes are AT-rich (>65%) and highly repetitive (at least 47%). The vast majority of a set of evolutionarily conserved single-copy orthologs can be readily identified in these preliminary assemblies, and these large-scale whole-genome sequencing data are already providing a strong foundation for current genomic and functional studies that have the potential to identify new targets for therapies in regenerative medicine.

As part of our active collaboration with Matt Nicotra's group at the University of Pittsburgh on questions related to allorecognition, for which Hydractinia is a long-standing model system, we are sequencing and annotating a reference haplotype of the allorecognition complex (ARC). To date, we have a nearly complete sequence of this genomic region controlling Hydractinia allorecognition, a repeat-rich region encompassing 11 Mb and containing at least 30 Alr-like sequences; this region also includes Alr1 and Alr2, two previously identified genes known to be responsible for colony fusion or rejection. We hypothesize that several of these newly identified Alr-like sequences may serve as additional allorecognition genes in Hydractinia, revealing a previously unappreciated genomic complexity underlying allorecognition.

We have also analyzed the histone complement found within Hydractinia. Alongside core and other replication-independent histone variants, we found several replication-dependent histone variants, including a rare replication-dependent H3.3, a female germ cell-specific H2A.X, and an unusual set of five H2B variants, four of which are male germ cell-specific. Interestingly, protamines are completely absent from Hydractinia, with these H2B variants being used instead for DNA compaction within its sperm; we have also confirmed the presence of canonical nucleosome core particles.
These studies provide additional insight into the evolution of spermatogenesis and, more importantly, provide a framework for future studies on the role of histones (and the post-translational modification of these basic proteins) in cnidarian epigenetics.

Using the reference transcriptome we have generated for H. echinata, we have begun performing RNAseq experiments to identify specific genes involved in the regenerative process. This set of experiments involves decapitating Hydractinia polyps, then harvesting tissue samples at key time points during head regeneration for RNA isolation and analysis. The RNAseq data were subsequently used to identify transcripts that are differentially expressed during regeneration. We are currently clustering and annotating these transcripts, with the goal of identifying specific pathways and genes involved in regeneration in Hydractinia.

Traversing highly repetitive regions is one of the most challenging aspects of any whole-genome sequencing effort. Large tandem repeats, such as ribosomal genes, segmental duplications, and telomeric repeats are found within the short arms of acrocentric chromosomes, and these sequences are usually missing from genome reference sequences. Interestingly, these repeat regions play a key role in cellular processes such as genome replication, cell proliferation, and the maintenance of genome integrity, so a thorough characterization of these regions is important to our understanding of cellular function. Here, we are characterizing the rDNA repeats found within the two Hydractinia genomes, making use of a novel methodology developed by a member of this group that was successfully used to extend the sequence content adjacent to the last human ribosomal gene (rDNA) cluster.
Given that the overall repeat and AT content of Hydractinia is quite high, the application and refinement of these strategies, coupled with newly available long-read sequencing data generated in the course of the Hydractinia genome project, will allow for a much more comprehensive characterization of the centromeric, telomeric, and rDNA regions found within these de novo sequence assemblies. This new methodology also has the potential to be more widely applicable in the context of other whole-genome sequencing efforts.

From an algorithmic standpoint, we have developed a new methodology for identifying groups of putative orthologs (i.e., orthogroups), called RD-MCL (for Recursive Dynamic Markov Clustering). In essence, RD-MCL is an extension of conventional Markov clustering-based orthogroup prediction algorithms like OrthoMCL, with three key differences: (1) The similarity metric used to describe the relatedness of sequences is based on multiple sequence alignments, not pair-wise sequence alignments or BLAST. This significantly improves the quality of the information available to the clustering algorithm. (2) The appropriate granularity of the Markov clustering algorithm, which is controlled by the 'inflation factor' and 'edge similarity threshold', is determined on the fly. This is in contrast to almost all other methods, where default parameters are selected at the outset and imposed indiscriminately on all datasets. (3) Differences in evolutionary rates among orthologous groups of sequences are accounted for by recursive rounds of clustering. This methodology, implemented as an open-source Python project, is currently being tested on data from a number of manually curated protein families for final refinement prior to publication. To date, we observe substantial improvement over current widely used graph-based methods.
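
For readers unfamiliar with the conventional Markov clustering step that RD-MCL extends, the following is a minimal sketch of that baseline only. It is not the RD-MCL implementation: the inflation factor and edge similarity threshold are fixed by hand here, whereas RD-MCL is described above as determining them on the fly and clustering recursively. The function names, the toy similarity values, and the convergence settings are all illustrative assumptions.

```python
# Minimal sketch of conventional Markov clustering (MCL) in pure Python.
# NOT RD-MCL: parameters are fixed, and there is no recursion.
# All names and the toy similarity scores below are hypothetical.

def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation: raise each entry to a power, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, edge_threshold=0.0,
                   max_iter=100, tol=1e-9):
    """Cluster items given a symmetric matrix of pairwise similarities."""
    n = len(similarity)
    # Drop edges weaker than the threshold, then add self-loops.
    m = [[similarity[i][j] if similarity[i][j] >= edge_threshold else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    m = normalize_columns(m)
    # Alternate expansion and inflation until the matrix stops changing.
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that retain mass are attractors; the columns each attractor
    # reaches form one cluster. Identical clusters are deduplicated.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy example: sequences 0-2 and 3-4 are similar within each group,
# with one weak cross-group edge (0.1) that the threshold removes.
similarity = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.85, 0.0, 0.0],
    [0.8, 0.85, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
clusters = markov_cluster(similarity, inflation=2.0, edge_threshold=0.2)
print(clusters)
```

In this sketch, a larger inflation value tends to produce finer clusters and a higher edge threshold removes more weak links before clustering. These are the two settings that conventional methods fix up front.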