Abstract
Speech carries rich linguistic information over a large range of temporal scales: the average durations of phonemes, syllables, words, and sentences range from tens of milliseconds to multiple seconds. Thus, to achieve successful speech perception, the acoustic speech signal needs to be analyzed over appropriate temporal scales so that each scale can interface with its respective linguistic representation. Where and how this acousto-linguistic mapping of temporal speech properties occurs is still not fully explained in current speech/language models. Here, we show how cortical processing of acoustic temporal structure in speech is modulated by higher-level linguistic analysis. This requires two essential features: (1) control over the temporal scale at which analysis occurs; (2) control over the linguistic content of the information. For (1), we use a novel sound-quilting algorithm that controls the temporal structure in speech at different temporal scales by shuffling and then stitching together speech segments of a certain length; this approach yields new "speech quilt" signals that preserve the natural temporal structure of the original source signal only up to the set segment length, but not beyond. The segment lengths (30, 120, 480, and 960 ms) are chosen to span the typical temporal range of phonemes, syllables, and words. For (2), we manipulate speech familiarity by using recordings of bilingual speakers, reading from a book in English and Korean, as the source signal to create speech quilts in two languages. This approach ensures that any changes at the signal-acoustics level affect both languages identically, while manipulating the linguistic percept differently. Thus, neural responses that vary as a function of segment length but are shared or similar across the two languages will suggest analysis at the signal-acoustics level, whereas neural responses that differ based on language familiarity will imply the presence of linguistic processes.
In Aim 1, we use fMRI to argue that temporal acoustic structure in speech is extracted in the superior temporal sulcus (STS) for both languages; linguistic processes, originating in the inferior frontal gyrus (IFG), become engaged only in a familiar language and in turn modulate these signal-acoustics level analyses in anterior and posterior STS. In Aim 2, we capitalize on the high temporal resolution of EEG to suggest one potential neural mechanism for the results in Aim 1: neurons phase-lock more strongly to the speech quilt signal as its natural temporal structure increases (longer segment lengths), an effect that is in turn modulated and enhanced by speech familiarity. The results will have a significant impact on speech/language models that need to account for where and how specific temporal scales in speech interface with their linguistic representations, while also informing approaches towards clinical populations such as children struggling to decode critical temporal speech units, as in dyslexia or auditory processing disorder (APD).
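
The shuffle-and-stitch idea behind the speech quilts can be illustrated with a minimal Python sketch. It is a simplified illustration, not the quilting algorithm used in this work: the function name `make_quilt`, the uniform random segment order, and the hard concatenation without any smoothing at segment boundaries are all assumptions made here for clarity.

```python
import numpy as np


def make_quilt(signal, sample_rate, segment_ms, seed=0):
    """Shuffle fixed-length segments of a 1-D signal (hypothetical helper).

    Structure inside each segment is left untouched; structure across
    segment boundaries is destroyed by the random reordering.
    """
    seg_len = int(round(sample_rate * segment_ms / 1000))
    n_segments = len(signal) // seg_len
    # Drop the trailing samples that do not fill a whole segment.
    segments = signal[: n_segments * seg_len].reshape(n_segments, seg_len)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_segments)  # assumption: uniform random order
    # Assumption: plain concatenation, no boundary smoothing.
    return segments[order].reshape(-1), order


if __name__ == "__main__":
    sample_rate = 16000  # Hz, assumed for this example
    source = np.random.default_rng(1).standard_normal(10 * sample_rate)
    for segment_ms in (30, 120, 480, 960):  # segment lengths from the text
        quilt, order = make_quilt(source, sample_rate, segment_ms)
        print(segment_ms, "ms:", len(order), "segments,", len(quilt), "samples")
```

With a longer `segment_ms`, more of the source signal's original temporal structure survives inside each segment, which is the property the four segment lengths are meant to vary.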