We propose developing an algorithm and user-friendly software to better identify treatments using Medicare claims data. We will validate our approach using procedures listed in the Surveillance, Epidemiology, and End Results (SEER) database as a gold standard. In this way, we hope to better match procedures identified using Medicare claims data with SEER-listed procedures. The focus of this research is observational (i.e., non-randomized) data. Well-run randomized clinical trials can provide the best level of evidence of treatment effects. However, randomized trials in the United States have suffered from poor accrual for many interventions. Although well-designed randomized clinical trials should remain the gold standard, well-designed observational studies might be the only means of obtaining inferences concerning comparative effectiveness for some cancer interventions. In cancer research, one of the most commonly used databases for observational research is the linked SEER-Medicare database. SEER-Medicare data have provided useful measurements of the effectiveness of a number of cancer therapies. Algorithms for identifying relevant treatment and diagnosis codes using Medicare data are often based on clinical reasoning and scientific evidence. One group of researchers, for example, developed an algorithm for identifying laparoscopic surgery among kidney cancer cases before claims codes for laparoscopic surgery were well developed. While such algorithms are useful for others pursuing similar investigations, there may still be substantial mismatch between treatment identified by the SEER cancer registry and treatment identified through Medicare claims. In this work, we propose developing a rigorous machine learning algorithm that can help researchers better identify treatments in Medicare claims data.
Specifically, we will design a neural language modeling algorithm and implement a software system that finds vector representations of diagnosis and procedure codes. We plan to use the neural language modeling algorithm to learn vector representations from SEER-Medicare claims data in which related procedure and diagnosis codes are neighbors (i.e., lie close together in the vector space). We will investigate whether the codes we identify within neighborhoods correspond to the procedure codes used in published SEER-Medicare studies. We will then design a software assistant interface that will allow an investigator to explore which codes are related to a given seed set of diagnosis or procedure codes. We will investigate the sensitivity and specificity of the algorithm by comparing procedures identified using Medicare claims with procedures listed in the SEER database. Finally, we will replicate analyses from a published SEER-Medicare paper to investigate whether estimated treatment effects differ when using our novel algorithm compared to using the algorithm in the published paper.
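As an illustration only, the neighbor lookup and the validation step described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the proposed system: it does not train a neural language model but assumes code vectors have already been learned; the code labels (PROC_A, DX_X, and so on), the two-dimensional vectors, and the function names are hypothetical placeholders; and cosine similarity to the mean seed vector is an assumed choice of neighborhood measure that the proposal itself does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(embeddings, seeds, k=3):
    """Rank non-seed codes by cosine similarity to the mean of the seed vectors."""
    dim = len(next(iter(embeddings.values())))
    centroid = [sum(embeddings[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    scored = [
        (code, cosine(vec, centroid))
        for code, vec in embeddings.items()
        if code not in seeds
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def sensitivity_specificity(claims_flags, seer_flags):
    """Compare per-patient treatment flags from claims against SEER (gold standard)."""
    pairs = list(zip(claims_flags, seer_flags))
    tp = sum(1 for c, s in pairs if c and s)          # claims and SEER both positive
    tn = sum(1 for c, s in pairs if not c and not s)  # both negative
    fp = sum(1 for c, s in pairs if c and not s)      # claims positive, SEER negative
    fn = sum(1 for c, s in pairs if not c and s)      # claims negative, SEER positive
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy vectors; real vectors would come from the trained model.
toy_embeddings = {
    "PROC_A": [1.0, 0.0],
    "PROC_B": [0.9, 0.1],
    "DX_X": [0.8, 0.3],
    "PROC_C": [0.0, 1.0],
}

# Codes most related to the seed set {PROC_A}.
related = neighbors(toy_embeddings, seeds=["PROC_A"], k=2)

# Hypothetical per-patient flags: was the procedure found in claims / in SEER?
claims = [True, True, False, False, True, False]
seer = [True, False, False, False, True, True]
sens, spec = sensitivity_specificity(claims, seer)
```

In this sketch an investigator would pass a seed set of codes to `neighbors` and review the returned candidates, while `sensitivity_specificity` treats the SEER-listed procedure as the reference label for each patient.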