The article explores the architecture and training processes of the RNN-T model in ASR.
In this blog post, we will briefly discuss a popular transducer-based model in speech recognition: RNN-T. The post does not follow any single paper; it reflects my understanding gathered from multiple sources, including Lugosch, 2020's blog on the same topic. Lately, these models have gained popularity in the industry due to their natural streaming ability.
The RNN-T model comprises three modules: an encoder, a predictor (also called the decoder), and a joiner network. Notably, the predictor operates in an autoregressive manner: the previous non-blank output produced by the joiner is fed back into the predictor to help predict the next output.
The joiner network combines the outputs of the encoder and the predictor, generating a probability distribution over all labels plus a null (blank) output, usually written ∅.
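To make this structure concrete, here is a minimal PyTorch sketch of the three modules. The layer choices (single LSTMs, 256-dimensional hidden states) and the way the joiner combines the two streams are illustrative assumptions, not the configuration of any particular paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps acoustic frames (batch, T, feat_dim) to hidden states (batch, T, enc_dim)."""
    def __init__(self, feat_dim=80, enc_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, enc_dim, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out


class Predictor(nn.Module):
    """Autoregressive network over previously emitted (non-blank) labels."""
    def __init__(self, vocab_size, pred_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, pred_dim)  # +1 for the blank/<sos> id
        self.lstm = nn.LSTM(pred_dim, pred_dim, batch_first=True)

    def forward(self, labels, state=None):
        out, state = self.lstm(self.embed(labels), state)
        return out, state


class Joiner(nn.Module):
    """Combines one encoder state and one predictor state into logits
    over vocab_size labels plus one blank symbol."""
    def __init__(self, enc_dim=256, pred_dim=256, vocab_size=28):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, vocab_size + 1)

    def forward(self, enc_state, pred_state):
        return self.proj(torch.tanh(torch.cat([enc_state, pred_state], dim=-1)))
```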
During the inference phase, RNN-T typically employs a greedy search algorithm. At each step, the joiner produces a distribution over the labels and the blank symbol, and the most probable one is chosen. Starting with the first encoder frame and a start-of-sequence token fed to the predictor, decoding proceeds as follows: if the chosen symbol is an actual label, it is appended to the hypothesis and fed back into the predictor; if it is blank, the model moves on to the next encoder frame. Decoding stops once all encoder frames have been consumed.
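Below is a rough sketch of that greedy decoding loop, reusing the Encoder/Predictor/Joiner modules sketched above. The blank id of 0, the reuse of blank as the start token, and the max_symbols cap per frame are assumptions made for illustration.

```python
import torch

def greedy_decode(encoder_out, predictor, joiner, blank_id=0, max_symbols=10):
    """Greedy RNN-T decoding for a single utterance.

    encoder_out: (T, enc_dim) tensor of encoder states.
    blank_id=0 (also reused as the start token) and the max_symbols
    cap per frame are illustrative choices.
    """
    hyp = []
    state = None
    pred_out, state = predictor(torch.tensor([[blank_id]]), state)

    for t in range(encoder_out.size(0)):
        emitted = 0
        while emitted < max_symbols:
            logits = joiner(encoder_out[t], pred_out[0, -1])
            label = int(logits.argmax(dim=-1))
            if label == blank_id:
                break                      # blank: move to the next encoder frame
            hyp.append(label)              # actual label: emit it ...
            pred_out, state = predictor(   # ... and feed it back into the predictor
                torch.tensor([[label]]), state)
            emitted += 1
    return hyp
```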
Training RNN-T sums over all alignments on a lattice in which horizontal transitions emit blank labels and vertical transitions emit actual labels. The forward variable α(t, u) denotes the total probability of having emitted the first u labels of the target after consuming the first t encoder frames. It can be computed recursively:

α(t, u) = α(t−1, u) · ∅(t−1, u) + α(t, u−1) · y(t, u−1)

where ∅(t, u) is the probability of emitting a blank at lattice node (t, u) and y(t, u) is the probability of emitting the next target label there. The above equation represents the computation of the forward variable by dynamic programming; the total probability of the target sequence is obtained from the final lattice node followed by one last blank.
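The recursion above can be implemented as a small dynamic program over the (t, u) lattice. The sketch below uses plain probabilities and NumPy for readability; a real training implementation would work in log space and in batches.

```python
import numpy as np

def transducer_forward(blank_prob, label_prob):
    """Forward variable alpha(t, u) over the RNN-T lattice.

    blank_prob[t, u]: probability of emitting blank at node (t, u)
                      (horizontal transition, move to frame t+1).
    label_prob[t, u]: probability of emitting target label u+1 at node (t, u)
                      (vertical transition, move to output position u+1).
    Shapes: blank_prob is (T, U+1), label_prob is (T, U).
    """
    T, U_plus_1 = blank_prob.shape
    U = U_plus_1 - 1
    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            from_left = alpha[t - 1, u] * blank_prob[t - 1, u] if t > 0 else 0.0
            from_below = alpha[t, u - 1] * label_prob[t, u - 1] if u > 0 else 0.0
            alpha[t, u] = from_left + from_below
    # Probability of the full target: reach the last node, then emit one final blank.
    return alpha, alpha[T - 1, U] * blank_prob[T - 1, U]
```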
A noteworthy advancement is introduced in the paper ‘‘Pruned RNN-T for fast, memory-efficient ASR training’’ (link): a stateless prediction network as an alternative to the traditional RNN decoder. This predictor behaves like a bi-gram language model, simplifying the architecture by conditioning only on the last output symbol and eliminating the need for recurrent layers. Its main job is to help the model decide whether to output an actual label or a blank label.
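A stateless predictor can be as simple as an embedding lookup on the last emitted symbol. The sketch below is a minimal illustration of the idea, not the exact network from the paper; the single-symbol context, names, and dimensions are assumptions.

```python
import torch.nn as nn

class StatelessPredictor(nn.Module):
    """Prediction network that conditions only on the last emitted symbol,
    much like a bi-gram language model; it keeps no recurrent state."""
    def __init__(self, vocab_size, pred_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, pred_dim)  # +1 for blank/<sos>

    def forward(self, last_label):
        # last_label: (batch, 1) tensor holding the previous non-blank label
        # returns (batch, 1, pred_dim), with no hidden state to carry around
        return self.embed(last_label)
```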
The transducer in RNN-T defines a set of possible monotonic alignments between the input sequence x and the output label sequence y, where each alignment interleaves the target labels with blank symbols that mark the consumption of input frames.
(This example along with the equation has been taken from Lugosch, 2020)
We can calculate the probability of any one of these alignments by multiplying together the values of each edge along its path through the lattice.
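As a small illustration, the sketch below walks one such alignment (expressed as a sequence of blank/label moves) through the lattice and multiplies the corresponding edge probabilities; representing an alignment as a list of moves is my own convention for this example.

```python
def alignment_probability(moves, blank_prob, label_prob):
    """Probability of one monotonic alignment through the lattice.

    moves: a sequence of 'blank' / 'label' steps, e.g. for a 2-frame,
           2-label target: ['label', 'blank', 'label', 'blank'].
    blank_prob, label_prob: the same (T, U+1) and (T, U) tables as in
    the forward-variable sketch above.
    """
    t, u, prob = 0, 0, 1.0
    for move in moves:
        if move == 'blank':
            prob *= blank_prob[t, u]   # horizontal edge: consume frame t
            t += 1
        else:
            prob *= label_prob[t, u]   # vertical edge: emit target label u+1
            u += 1
    return prob
```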
As we conclude the post, you should now have at least a basic understanding of RNN-T models and the working principle behind them in the context of Automatic Speech Recognition (ASR). The way the encoder, predictor, and joiner networks work together, along with simplifications such as a stateless prediction network, makes RNN-T easier to train and well suited to streaming recognition.