Happy Talk
BelfastJS, 10/12/14
Peter Gasston
@stopsatgreen
broken-links.com
We are born to talk. Typing is a barrier to communication. Moviemakers know this.
Amazon : ‘Alexa’
Apple : Siri
Google : Voice Search
Microsoft : Cortana
Google is the clear winner, according to https://www.stonetemple.com/great-knowledge-box-showdown/
55% of teens
41% of adults
use voice search every day*
*maybe
From Google research, but sources not provided. Could mean 'of teens who use voice search, 55% use it every day'. http://googleblog.blogspot.co.uk/2014/10/omg-mobile-voice-survey-reveals-teens.html
10% of Baidu
search queries are by voice
That’s ~500m per day
Character input is hard, plus high rural illiteracy. http://blogs.wsj.com/digits/2014/11/21/baidus-andrew-ng-on-deep-learning-and-innovation-in-silicon-valley/
http://iaminchina.wordpress.com/2010/04/13/crowded-street-in-xian/
Synthesis
Long history of replicating the voice with sound (brazen heads back to ~12th C.), but the first computer systems emerged in the 1960s. In 1961 a Bell Labs computer sang 'Daisy Bell'; coincidentally, Arthur C. Clarke was visiting. Today Stephen Hawking still uses a system with his old voice because it's 'his'.
Chrome/Safari
var txt = 'Hello world',
    say = new SpeechSynthesisUtterance(txt);
window.speechSynthesis.speak(say);
Play
SSU Attributes
var txt = 'Hello world',
    say = new SpeechSynthesisUtterance(txt);
say.lang = 'en-GB';   // BCP 47 language tag
say.pitch = 0.75;     // 0 to 2, default 1
say.rate = 1.5;       // 0.1 to 10, default 1
say.volume = 0.5;     // 0 to 1, default 1
window.speechSynthesis.speak(say);
Play
SpeechSynthesis Methods
var txt = 'Hello world',
    say = new SpeechSynthesisUtterance(txt);
window.speechSynthesis.speak(say);
window.speechSynthesis.pause();   // pause(), resume() and cancel() take no arguments
window.speechSynthesis.resume();
window.speechSynthesis.cancel();
Play (Safari)
SpeechSynthesis Attributes
var txt = 'Hello world',
    say = new SpeechSynthesisUtterance(txt);
window.speechSynthesis.speak(say); // speak() returns nothing; state lives on speechSynthesis itself
if (window.speechSynthesis.pending) {}
if (window.speechSynthesis.speaking) {}
if (window.speechSynthesis.paused) {}
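Usage sketch (not from the slides): the paused and speaking flags can drive a simple pause/resume toggle. pauseBtn is a placeholder button element.
// Toggle pause/resume based on the current synthesis state
pauseBtn.onclick = function () {
    if (window.speechSynthesis.paused) {
        window.speechSynthesis.resume();
    } else if (window.speechSynthesis.speaking) {
        window.speechSynthesis.pause();
    }
};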
SSU Events
var txt = 'Hello world',
    say = new SpeechSynthesisUtterance(txt);
say.onstart = function () {};
say.onpause = function () {};
say.onresume = function () {};
say.onmark = function () {};     // there's no cancel event; mark and boundary fire as speech progresses
say.onboundary = function () {};
say.onerror = function () {};
say.onend = function () {};
window.speechSynthesis.speak(say);
Play (Safari)
Synthesis As A Service
http://developer.att.com/apis/speech
https://ws.neospeech.com/
https://www.cereproc.com/en/products/cloud
http://www.ivona.com/en/for-business/speech-cloud/
Neospeech
https://tts.neospeech.com/rest_1_1.php?method=ConvertSimple&email=mail@example.com&accountId=abcd1234&loginKey=LoginKey&loginPassword=123abc45de&voice=TTS_PAUL_DB&outputFormat=FORMAT_WAV&sampleRate=16&text=Hello+Belfast+JS
<response conversionNumber="28" resultCode="0" resultString="success" status="Queued" statusCode="1"/>
https://tts.neospeech.com/rest_1_1.php?method=GetConversionStatus&email=mail@example.com&accountId=abcd1234&conversionNumber=28
<response statusCode="1" downloadUrl="https://tts.neospeech.com/audio/a.php/23841309/d44caf624653/result_26.wav" resultCode="0" resultString="success" status="Queued"/>
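Rough sketch (not from the slides) of driving those two requests with XHR. convertUrl and statusUrl stand for the ConvertSimple and GetConversionStatus URLs above (the latter without the conversionNumber); audioEl is a placeholder <audio> element.
// Fetch a URL and hand back the parsed XML response
function getXML(url, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.onload = function () {
        callback(new DOMParser().parseFromString(xhr.responseText, 'text/xml'));
    };
    xhr.send();
}
// Queue the conversion, then ask for its status and play the audio
getXML(convertUrl, function (doc) {
    var conversion = doc.documentElement.getAttribute('conversionNumber');
    // In practice you'd poll until the status is no longer 'Queued'
    getXML(statusUrl + '&conversionNumber=' + conversion, function (status) {
        audioEl.src = status.documentElement.getAttribute('downloadUrl');
    });
});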
SSML
<speak version="1.0" etc>
  <p>
    <s>Hello Belfast.</s>
    <s>This is <prosody rate="-20%">SSML</prosody>.</s>
  </p>
</speak>
Recognition
Developed by Bell Labs in 1952: it could recognise digits spoken by a single person. [Get screengrab / find picture]. In the 1970s Carnegie Mellon's HARPY could recognise ~1,000 words. 1980s: hidden Markov models [Teddy Ruxpin], which chop the waveform into phonemes and attempt to form words.
Challenges
Accents
Multiple users
Multiple languages
Scottish + Siri
Web Speech API
var recog = new SpeechRecognition();
x-browser
var speechRecognition = (
    window.SpeechRecognition ||
    window.webkitSpeechRecognition
);
var recog = new speechRecognition();
SpeechRecognition Methods
var recog = new SpeechRecognition();
recog.start();
recog.stop();
recog.abort();
SpeechRecognition Events
var recog = new SpeechRecognition();
recog.onresult = function () {};
recog.onnomatch = function () {};
recog.onerror = function () {};
SpeechRecognitionError interface for reporting errors.
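Sketch of handling those (not from the slides): the error event's error attribute carries a code such as 'no-speech', 'audio-capture' or 'not-allowed', plus a message.
var recog = new SpeechRecognition();
recog.onerror = function (event) {
    // event.error is the SpeechRecognitionError code
    console.log('Recognition error:', event.error, event.message);
};
recog.onnomatch = function () {
    console.log('Speech was recognised, but with no confident match');
};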
MVS
var recog = new SpeechRecognition();
recog.onresult = function (result) {
    output.textContent = result.results[0][0].transcript;
};
btn.onclick = function () {
    recog.start();
};
SpeechRecognitionEvent, results list
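For example (a sketch, not from the slides): each entry in the results list is itself a list of alternatives, each with a transcript and a confidence score; maxAlternatives asks the recogniser for more than one guess.
var recog = new SpeechRecognition();
recog.maxAlternatives = 3; // up to three guesses per result
recog.onresult = function (event) {
    var result = event.results[0]; // a SpeechRecognitionResult
    for (var i = 0; i < result.length; i++) {
        console.log(result[i].transcript, result[i].confidence);
    }
};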
SpeechRecognition Events
start
audiostart
soundstart
speechstart
speechend
soundend
audioend
end
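A sketch (not from the slides) of using the lifecycle events above to drive a listening indicator; indicator is a placeholder element.
var recog = new SpeechRecognition();
recog.onstart = function () {
    indicator.textContent = 'Listening…';
};
recog.onspeechstart = function () {
    indicator.textContent = 'Hearing speech…';
};
recog.onend = function () {
    indicator.textContent = 'Stopped';
};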
Interim Results
var recog = new SpeechRecognition();
recog.interimResults = true;
recog.onresult = function (result) {
    var thisResult = result.results[0],
        transcript = thisResult[0].transcript;
    if (thisResult.isFinal) {
        finalOutput.textContent = transcript;
    } else {
        interimOutput.textContent = transcript;
    }
};
btn.onclick = function () {
    recog.start();
};
Continuous
var recog = new SpeechRecognition(),
    listening = false;
recog.continuous = true;
recog.onresult = function (result) {
    output.textContent = result.results[0][0].transcript;
};
btn.onclick = function () {
    if (listening) {
        recog.stop();
    } else {
        recog.start();
    }
    listening = !listening;
};
SpeechRTC +
Web Speech API
Recognition runs on a Node server when online and in Web Workers when offline. https://wiki.mozilla.org/SpeechRTC_-_Speech_enabling_the_open_web
(Picard demo)
Basically just matching a string / regex
JuliusJS
var recog = new Julius();
recog.onrecognition = function (result) {
    console.log(result);
};
Wit.ai : Node API
Web Speech API + text
Direct speech : getUserMedia (gUM), Web Audio API
http://blog.groupbuddies.com/posts/39-tutorial-html-audio-capture-streaming-to-node-js-no-browser-extensions
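For the 'Web Speech API + text' route, a rough sketch (assuming Wit.ai's GET /message endpoint; WIT_TOKEN is a placeholder access token, and the response shape depends on how your Wit.ai app is trained):
var recog = new SpeechRecognition();
recog.onresult = function (event) {
    var query = encodeURIComponent(event.results[0][0].transcript),
        xhr = new XMLHttpRequest();
    xhr.open('GET', 'https://api.wit.ai/message?q=' + query);
    xhr.setRequestHeader('Authorization', 'Bearer ' + WIT_TOKEN);
    xhr.onload = function () {
        console.log(JSON.parse(xhr.responseText)); // intent + entities
    };
    xhr.send();
};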
Wit.ai : Microphone.js
Built on WebRTC. Opinionated: gives you a handful of methods and events, with no fine control. http://localhost/~petergasston/prototypes/mucking-about/wit/
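From memory of the library's docs, usage looks roughly like this; the constructor, handler names and connect() signature here are assumptions, so check the Wit.ai Microphone documentation.
var mic = new Wit.Microphone(document.getElementById('microphone'));
mic.onresult = function (intent, entities) { // assumed handler signature
    console.log(intent, entities);
};
mic.connect('CLIENT_ACCESS_TOKEN'); // placeholder token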