
Welcome to the regular update from the Internet Research & Future Services team in BBC R&D, making new things on, for and with the internet.
End-to-End Speech to Text Research
Within the data team we have been working on a speech-to-text system built on top of an open source project called Kaldi. We have had success with this system, as it has allowed us to generate accurate transcripts for BBC output such as news. However, the current system has drawbacks, so within the team we have been investigating how to build a speech-to-text system based on end-to-end machine learning, as we think this approach offers these potential benefits:
- Less of a barrier to entry in training acoustic models, as there is no need for a phonetic dictionary or for time-aligned training data.
- Ability to support different languages, as the training data needed is less complex.
- Quicker and less resource-intensive than our current speech-to-text tool.
- A smaller codebase which will be more manageable than the Kaldi-based one.
We have been using the paper Towards End-to-End Speech Recognition with Recurrent Neural Networks as an architectural guideline. Using TensorFlow, we have been able to train a neural network which translates audio data into English text.
In the simplest terms you can view our end-to-end speech-to-text project as having three stages: first, preprocessing the audio to extract features; second, training a neural network model to recognise speech from these features; and third, decoding the output of the neural network model. A sketch of the first stage is shown below, and the rest of this post delves a little deeper into the final step of the process.
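To make the first stage concrete, here is a minimal sketch of feature extraction, assuming the librosa library and some made-up parameter choices; our actual preprocessing may use different features and settings.

```python
import librosa
import numpy as np

def extract_features(wav_path, n_mels=40):
    """Load audio and compute log-mel spectrogram features.

    Illustrative only: the real pipeline may use different features
    and parameters.
    """
    audio, sample_rate = librosa.load(wav_path, sr=16000)
    # Mel-scaled spectrogram: one feature vector per time step
    mel = librosa.feature.melspectrogram(y=audio, sr=sample_rate,
                                         n_fft=400, hop_length=160,
                                         n_mels=n_mels)
    # Log compression and per-feature normalisation
    log_mel = np.log(mel + 1e-6)
    log_mel = (log_mel - log_mel.mean(axis=1, keepdims=True)) / \
              (log_mel.std(axis=1, keepdims=True) + 1e-6)
    return log_mel.T  # shape: (time_steps, n_mels)
```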
Decoding
Our neural network has learned to map sounds to English characters. Its output is an acoustic model which gives, for each time step, the probability of each character, as well as the probability of a blank: a special character used to deal with consecutive repeated characters and non-speech events such as pauses. You can think of decoding as trying to find the most probable output transcription given this acoustic model. A simple decoding method is to look at the predictions at each time step, pick the element with the highest probability (this is called a Greedy Decoder), feed that to the next step and repeat. An example of the output we got using this method is below:
ACTUAL: IT WASN'T A GIVEAWAY
GREEDY DECODER: IT WASN'T AGIVE AWIY
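For illustration, a greedy (best path) decoder over this kind of per-time-step character distribution can be sketched in a few lines of Python with NumPy. This is a simplified stand-in rather than our actual decoding code, and the alphabet below is a made-up example.

```python
import numpy as np

# Hypothetical alphabet: index 0 is the special blank character
ALPHABET = ["<blank>", " "] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ'")
BLANK = 0

def greedy_decode(probs):
    """Greedy (best path) decoding of a (time_steps, vocab) probability matrix.

    Picks the most likely character at every time step, merges consecutive
    duplicates, then removes blanks.
    """
    best_path = np.argmax(probs, axis=1)      # most likely index per time step
    collapsed = []
    previous = None
    for index in best_path:
        if index != previous:                 # merge repeated characters
            collapsed.append(index)
        previous = index
    return "".join(ALPHABET[i] for i in collapsed if i != BLANK)
```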
A Beam Search Decoder provides more accurate results. At each time step it selects several candidate characters; think of these as branches. You then have multiple branches that you can continue predicting from at every step. You compute the total probability of all the characters generated so far and keep only the most likely candidate sequences at each time step.
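A simplified sketch of that idea is below. It keeps the beam_width highest-scoring hypotheses by accumulated log probability but, unlike a full CTC beam search, it does not merge hypotheses that collapse to the same text; the alphabet convention (blank at index 0) is assumed to match the greedy example above.

```python
import numpy as np

def beam_search_decode(probs, alphabet, beam_width=8):
    """Simplified beam search over a (time_steps, vocab) probability matrix.

    At every time step each surviving hypothesis is extended by every
    character, scored by accumulated log probability, and only the
    beam_width best hypotheses are kept.
    """
    beams = [((), 0.0)]                       # (sequence of indices, log probability)
    for step_probs in np.log(probs + 1e-12):
        candidates = []
        for sequence, score in beams:
            for index, log_p in enumerate(step_probs):
                candidates.append((sequence + (index,), score + log_p))
        # Keep only the most probable beam_width hypotheses
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    best_sequence, _ = beams[0]
    # Collapse repeats and strip blanks (index 0), as in the greedy decoder
    output, previous = [], None
    for index in best_sequence:
        if index != previous and index != 0:
            output.append(alphabet[index])
        previous = index
    return "".join(output)
```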
As you can see from the output, the model doesn't really have an understanding of English dictionary words, as "AWIY" is not a word. The acoustic model did not have enough training data to learn complicated spelling and grammatical structure. To overcome this we can add a language model to the decoding step, which then has two inputs: the acoustic model and the language model. Adding a language model means that at each time step characters that can still become dictionary words are favoured.
So, returning to our example, after adding the language model the output is:
ACTUAL: IT WASN'T A GIVEAWAY
BEAM SEARCH DECODER + LANGUAGE MODEL: IT WASN'T A GIVE AWAY
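One common way of combining the two inputs, sketched below under assumed details, is to add a weighted language model score to the acoustic score when ranking candidate transcriptions. The language_model object, lm_weight and word_bonus here are hypothetical placeholders rather than our actual implementation.

```python
def combined_score(acoustic_log_prob, text, language_model,
                   lm_weight=0.8, word_bonus=1.0):
    """Score a candidate transcription with both models.

    acoustic_log_prob: accumulated log probability from the neural network.
    language_model:    any object with a log_prob(text) method (hypothetical).
    lm_weight:         how strongly the language model influences the ranking.
    word_bonus:        small reward per word to counter the language model's
                       preference for short outputs.
    """
    words = text.split()
    return (acoustic_log_prob
            + lm_weight * language_model.log_prob(text)
            + word_bonus * len(words))
```

With a weighting like this, hypotheses whose characters can still form dictionary words keep competitive scores, which is why "GIVE AWAY" beats "AWIY" in the example above.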
We hope to open source our End-to-End Speech to Text project on the BBC's GitHub account in the coming months.
And now some updates from the rest of the section.
Atomised News
Anthony, Barbara, Frankie, Sean and Thomas worked on adapting the Newsbeat Explains format for the General Election Q&A page. Changes to the code and the UI were made to make it compliant with requirements and robust enough to handle the substantial traffic spikes expected in the run-up to the election.
Tristan continued to ramp up the “Reinventing news articles” project, doing some interviews with industry experts, writing a taxonomy of news formats and attending a data journalism conference.
Natural Language Processing
One of our aspirations in the Natural Language Processing project is to extract higher level information from news articles. One way to do this is through Quote Identification and Attribution. Chris Newell has been looking at some sophisticated techniques on the subject, as reported by the Institute for Language, Cognition and Computation (ILCC) at the University of Edinburgh and Priberam, one of the BBC's partners in the SUMMA project. Chris has started building a classifier to identify cue verbs (e.g. said, claimed, etc.). A statistical approach, trained on examples, is reported to give better results than the more obvious approach of using a dictionary of terms. For training we're using the Penn Attribution Relations Corpus (PARC), kindly provided by Silvia Pareti via the ILCC.
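As a purely illustrative sketch of the statistical approach (not the actual classifier or features), a cue verb classifier could be trained with scikit-learn on labelled verb occurrences; the features and examples below are made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def verb_features(token, sentence):
    """Toy features for one verb occurrence (illustrative only)."""
    index = sentence.index(token)
    return {
        "lemma": token.lower(),
        "prev_word": sentence[index - 1].lower() if index > 0 else "<s>",
        "next_word": sentence[index + 1].lower() if index < len(sentence) - 1 else "</s>",
        "near_quote_mark": any('"' in word for word in sentence),
    }

# Hypothetical training data: (verb token, tokenised sentence, is_cue_verb label)
examples = [
    ("said", ['"', 'No', '"', ',', 'she', 'said', '.'], True),
    ("ran", ["He", "ran", "home", "."], False),
]

X = [verb_features(token, sentence) for token, sentence, _ in examples]
y = [label for _, _, label in examples]

classifier = make_pipeline(DictVectorizer(), LogisticRegression())
classifier.fit(X, y)
```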
Speech to Text
Ben has evaluated his prototype for near-realtime speech to text. He has now started working on chunking audio into one-minute segments and processing the chunks with BBC Kaldi, to compare the results against the near-realtime prototype.
Chrissy added the TED-LIUM dataset to the end-to-end speech-to-text project and has started to train a model on it. This will allow comparisons with other open source projects such as Mozilla DeepSpeech, which publishes its results.
Face Recognition
Ben, Jana and Chrissy worked on a demo using face recognition to detect different party leaders on TV. Jana had to train models for the current party leaders. They then used COMMA to run a face recognition algorithm across a selection of news programmes and created a front end to display the results.
Jo and Andrew have been preparing for user testing the Children's prototypes during half-term week, writing test scripts and recruiting users. Henry and Andrew rebuilt the Go Jetters interactive story from a new script written by Jo and Andrew in collaboration with Mark from the Children's team. Tom and Anthony have been continuing work on the VUI Story Engine, refining features of the story builder and deploying development and testing branches of the engine on Cosmos.
Content Analysis Pipeline
Olivier and Tim spent this sprint doing maintenance work on the Content Analysis Pipeline. They spent a lot of time tweaking various AWS tools (SWF, CloudWatch, Beanstalk, EC2 and autoscaling) in order to fix a few bugs.
Other
Chris has been updating our ViSTA-TV live dashboard application to run on the BBC's AWS Cosmos infrastructure, ahead of this year's Glastonbury Festival. Chris has also started work on a React example for our Peaks.js interactive audio waveform library. Tim spent an enjoyable two days at the Advances in Data Science conference, presenting the science of pop work and listening to lots of interesting talks on all aspects of data science. Oliver attended the OpenTech conference, which he found to be great, challenging and inspiring.
Lastly, we would like to say goodbye to Craig and Kristian, who have finished their trainee projects in IRFS, and wish them all the best as they continue their traineeships in Salford.