SP Modules Review Contents (2)

Module 5 TTS front-end

We want to generate speech that is:

Intelligible: you can clearly perceive what words are being said
Natural: sounds like human speech
Appropriate: conveys the right meaning in a specific context
Front-end: analyzes the text and derives a linguistic specification of what to generate
- Front-end purpose: the linguistic specification must include all the information needed to generate speech
Back-end: generates the waveform from the linguistic specification

The linguistic specification guides what we generate. It covers several levels of linguistic structure:

Phones
Syllables
Words
Phrases
Utterances
Discourses
Pronunciation Dictionaries: Use pre-existing pronunciation dictionaries to map words to phonetic transcriptions
- CMUDict: 1 big text file of words and their pronunciations
  - CMUDict is dialect-specific: its pronunciations are for North American English, so it does not cover other accents such as Scottish English.
- Unilex is an ‘accent-independent’ lexicon based on the Unisyn database
  - Classifies phones by keyword (a word that stands for a class of sounds), e.g. FOOT vs STRUT
    - ‘Put’ → FOOT class
    - ‘Putt’ → STRUT class
  - Use this to describe phonemic variation in English dialects/accents
  - A single lexicon encodes different accents: run the base lexicon through accent-specific rules to produce accent-specific lexica
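
As a rough illustration of dictionary lookup, the sketch below parses a few lines written in the CMUDict plain-text layout (a word followed by ARPAbet phones, with stress digits on vowels). The sample entries and the `parse_cmudict` helper are assumptions made for this example; the marker used for alternative pronunciations can differ between CMUDict releases, so the parser simply strips any parenthesised suffix.

```python
from collections import defaultdict

# Sample lines in the CMUDict layout (illustrative subset, not the real file).
SAMPLE = """\
;;; comment lines start with three semicolons
PUT  P UH1 T
PUTT  P AH1 T
READ  R EH1 D
READ(1)  R IY1 D
"""

def parse_cmudict(text):
    """Return {word: [pronunciation, ...]}; each pronunciation is a list of phones."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        word = head.split("(")[0].lower()  # "READ(1)" -> "read"
        lexicon[word].append(phones)
    return dict(lexicon)

lexicon = parse_cmudict(SAMPLE)
print(lexicon["put"])        # [['P', 'UH1', 'T']]
print(len(lexicon["read"]))  # 2 (two alternative pronunciations)
```

Words with more than one entry, such as "read" here, are why the front-end still needs context to pick a pronunciation.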
Phoneset choice
- Unilex is more generalizable than CMUDict
- Unilex is more compact: one base lexicon plus rules
  - But we need to define rules to convert from one accent to another. This leads us to revisit the concept of the phoneme
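
The "one base lexicon + rules" idea can be sketched as follows. This is a minimal illustration, not the actual Unisyn rule format: the base lexicon stores keyword classes (FOOT, STRUT) instead of concrete vowels, and a hypothetical per-accent table maps each keyword to a phone. The accent names and the IPA symbols chosen are assumptions for the example.

```python
# Base lexicon: vowels are written as keyword classes, not concrete phones.
BASE_LEXICON = {
    "put":  ["p", "FOOT", "t"],
    "putt": ["p", "STRUT", "t"],
}

# Hypothetical accent rules: how each keyword class is realised in an accent.
ACCENT_RULES = {
    # An accent that distinguishes the two classes.
    "accent_with_split": {"FOOT": "ʊ", "STRUT": "ʌ"},
    # An accent that merges them (as in many northern English accents).
    "accent_without_split": {"FOOT": "ʊ", "STRUT": "ʊ"},
}

def derive_lexicon(base, rules):
    """Rewrite every keyword in the base lexicon using one accent's rules."""
    return {
        word: [rules.get(symbol, symbol) for symbol in symbols]
        for word, symbols in base.items()
    }

split = derive_lexicon(BASE_LEXICON, ACCENT_RULES["accent_with_split"])
merged = derive_lexicon(BASE_LEXICON, ACCENT_RULES["accent_without_split"])
print(split["put"], split["putt"])    # different vowels
print(merged["put"], merged["putt"])  # same vowel
```

In the merged accent "put" and "putt" come out identical, which is exactly the kind of accent difference a single fixed phone set cannot express.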

We use a decision tree when we want to learn rules from data.
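
To make this concrete, here is a small sketch of learning a rule with a decision tree. The task (predicting whether the letter "c" is pronounced /k/ or /s/ from its neighbouring letters), the toy training set, and all function names are assumptions chosen for illustration; the tree learner is a minimal entropy-based one, not a production implementation.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_feature(rows, features):
    """Pick the feature whose split leaves the lowest weighted entropy."""
    def split_entropy(feature):
        groups = {}
        for feats, label in rows:
            groups.setdefault(feats[feature], []).append(label)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(features, key=split_entropy)

def build_tree(rows, features):
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a predicted phone
    feature = best_feature(rows, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for feats, label in rows:
        branches.setdefault(feats[feature], []).append((feats, label))
    return {
        "feature": feature,
        "default": majority,  # used for feature values never seen in training
        "branches": {v: build_tree(sub, remaining) for v, sub in branches.items()},
    }

def predict(tree, feats):
    while isinstance(tree, dict):
        tree = tree["branches"].get(feats[tree["feature"]], tree["default"])
    return tree

# Toy data: context of the letter "c" ("#" marks a word boundary) -> phone.
TRAIN = [
    ({"prev": "#", "next": "a"}, "k"),  # cat
    ({"prev": "#", "next": "e"}, "s"),  # cell
    ({"prev": "#", "next": "i"}, "s"),  # city
    ({"prev": "#", "next": "o"}, "k"),  # cot
    ({"prev": "#", "next": "u"}, "k"),  # cut
    ({"prev": "#", "next": "y"}, "s"),  # cycle
    ({"prev": "#", "next": "l"}, "k"),  # clap
    ({"prev": "i", "next": "e"}, "s"),  # ice
    ({"prev": "a", "next": "t"}, "k"),  # act
]

tree = build_tree(TRAIN, ["prev", "next"])
print(tree["feature"])                             # next
print(predict(tree, {"prev": "a", "next": "e"}))   # s (as in "ace")
```

On this data the learner discovers that the following letter is the informative question, which matches the familiar hand-written rule.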

Module 6 Waveform generation

Diphone database requirements

Clean, clear recordings of a single speaker
Recordings of every possible diphone in the language
Phone segmentation (timings) to calculate where diphones start and end
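
The role of the phone segmentation can be sketched as below. The example assumes each diphone is cut from the midpoint of one phone to the midpoint of the next, which is a simplification (a real system may pick a more stable cut point); the segment times and the function name are made up for illustration.

```python
def diphones_from_segmentation(segments):
    """segments: list of (phone, start_sec, end_sec) in time order.

    Returns (diphone_name, cut_start, cut_end) tuples, cutting each
    diphone from the midpoint of one phone to the midpoint of the next.
    """
    units = []
    for (p1, s1, e1), (p2, s2, e2) in zip(segments, segments[1:]):
        units.append((f"{p1}-{p2}", (s1 + e1) / 2, (s2 + e2) / 2))
    return units

# Hypothetical segmentation of a recording of "cat" with surrounding silence.
SEGMENTS = [
    ("sil", 0.00, 0.10),
    ("k",   0.10, 0.18),
    ("ae",  0.18, 0.30),
    ("t",   0.30, 0.38),
    ("sil", 0.38, 0.50),
]

units = diphones_from_segmentation(SEGMENTS)
for name, start, end in units:
    print(name, start, end)
```

Each diphone ends exactly where the next one begins, so the units tile the recording without gaps or overlaps.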

The most common use of lexical stress marking is for determining which syllable in a word a pitch accent will be placed on if that word is made prosodically prominent.
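
As a small illustration, the sketch below finds the syllable that would receive the pitch accent, assuming CMUDict-style transcriptions in which each vowel carries a stress digit (1 for primary stress) and each vowel counts as one syllable. The function name and the example transcription are assumptions for this example.

```python
def primary_stress_syllable(phones):
    """Return the 0-based index of the syllable with primary stress, or None."""
    syllable = 0
    for phone in phones:
        if phone[-1].isdigit():   # only vowels carry a stress digit
            if phone[-1] == "1":  # 1 marks primary stress
                return syllable
            syllable += 1
    return None

# Illustrative transcription of "banana": stress on the second syllable.
print(primary_stress_syllable("B AH0 N AE1 N AH0".split()))  # 1
```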

The Tone and Break Indices (ToBI) model of prosody aims to capture prosodic prominence (pitch accents), boundary tones, and the extent of prosodic breaks (break indices). It doesn’t try to capture the pragmatic or affective content of speech, such as speech acts or emotions.

Spectral smoothing, as the name suggests, reduces spectral discontinuity by making the change in the spectrum smoother across a join.
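
One simple way to picture this is to spread the mismatch at the join linearly over a few frames on each side. The sketch below is an assumption-laden illustration, not a specific published method: frames are plain lists of spectral parameters, and `smooth_join` and its `width` parameter are hypothetical names.

```python
def smooth_join(left, right, width=3):
    """Smooth the spectral mismatch between two units at a join.

    left, right: lists of frames (each a list of floats) on either side of
    the join. Returns smoothed copies in which the gap between the last
    left frame and the first right frame is spread linearly over up to
    `width` frames on each side, so both sides meet at the midpoint.
    """
    gap = [r - l for l, r in zip(left[-1], right[0])]
    new_left = [frame[:] for frame in left]
    new_right = [frame[:] for frame in right]

    wl = min(width, len(left))
    for k in range(wl):
        weight = (k + 1) / wl  # ramps up to 1 at the join
        idx = len(left) - wl + k
        new_left[idx] = [v + weight * g / 2 for v, g in zip(left[idx], gap)]

    wr = min(width, len(right))
    for k in range(wr):
        weight = (wr - k) / wr  # 1 at the join, ramps down afterwards
        new_right[k] = [v - weight * g / 2 for v, g in zip(right[k], gap)]

    return new_left, new_right

# Two units with flat but different spectra (toy 2-dimensional frames).
left = [[1.0, 2.0] for _ in range(4)]
right = [[3.0, 6.0] for _ in range(4)]
new_left, new_right = smooth_join(left, right, width=3)
print(new_left[-1], new_right[0])  # both sides now meet at the join
```

Frames further than `width` from the join are left untouched, so the smoothing only alters the units near the concatenation point.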
