Text-to-speech (TTS) output has relatively flat affect; that is, it's mostly free of emotion. TTS engines account somewhat for the ends of sentences by changing the inflection of the last word before a period or question mark and pausing after it. They also pause slightly at other punctuation within a sentence, but for the most part TTS does little to interpret what it is reading.
I'd love to be able to separate text from its presentation, in the same way Cascading Style Sheets let web designers separate written content from how it looks. I'd like CSS for TTS. I'd like to control the speed, pitch, intensity and stress of synthesized speech by tagging the text, and then writing style sheets that recognize the tags and control the TTS output accordingly. It would be a big step towards being able to define the persona of an agent implemented fully in TTS.
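To make the idea a little more concrete, here is a rough Python sketch of what tagging plus a style sheet could look like. Everything in it is made up for illustration: the `[tag]...[/tag]` syntax, the `VoiceStyle` fields, the style names and the `apply_styles` function are assumptions, not part of any real TTS engine. It only splits the text into segments and looks up the settings for each one; actually handing those settings to a speech engine is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    """Hypothetical voice settings for one stretch of text."""
    rate: float = 1.0    # speaking speed multiplier
    pitch: float = 0.0   # pitch shift, in semitones
    volume: float = 1.0  # intensity multiplier


# Hypothetical "style sheet": tag name -> voice settings.
STYLE_SHEET = {
    "default": VoiceStyle(),
    "excited": VoiceStyle(rate=1.2, pitch=2.0, volume=1.3),
    "aside": VoiceStyle(rate=0.9, pitch=-1.0, volume=0.7),
}

# Matches [tag]some text[/tag]; nesting is not handled in this sketch.
TAG_PATTERN = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def apply_styles(text, sheet=STYLE_SHEET):
    """Split tagged text into (plain_text, VoiceStyle) segments."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Untagged text before this tag gets the default style.
        if match.start() > pos:
            segments.append((text[pos:match.start()], sheet["default"]))
        tag, body = match.group(1), match.group(2)
        # Tags missing from the sheet fall back to the default style.
        segments.append((body, sheet.get(tag, sheet["default"])))
        pos = match.end()
    # Any trailing untagged text also gets the default style.
    if pos < len(text):
        segments.append((text[pos:], sheet["default"]))
    return segments


if __name__ == "__main__":
    for chunk, style in apply_styles("Hello. [excited]We won![/excited] Bye."):
        print(repr(chunk), style)
```

The point of the split is the same as with CSS: the tagged text stays the same, and swapping in a different `STYLE_SHEET` changes how the agent sounds.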
This should be perfectly feasible. I'm surprised it hasn't been done. If anyone with a technical background wants to work on this and needs some direction, drop me a line.
2 comments:
The W3C has a TTS CSS, called SSML. The specification is here: http://www.w3.org/TR/speech-synthesis/
Displaying one's ignorance in public is usually a bad thing, but I displayed ignorance of TTS CSS and someone helpfully sent me a link about SSML. Thanks, Anonymous, for letting me know about that.