
Automatic Speech Recognition with Transformer

Description: Training a sequence-to-sequence Transformer for automatic speech recognition.

Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.

For this demonstration, we will use the LJSpeech dataset from the LibriVox project, which consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model will be similar to the original Transformer (both encoder and decoder) as proposed in the paper, "Attention is All You Need".

References:

- Attention is All You Need
- Very Deep Self-Attention Networks for End-to-End Speech Recognition

We begin with the imports used throughout:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```

First, the two input-embedding layers. TokenEmbedding embeds past target characters and adds a learned position embedding, while SpeechFeatureEmbedding downsamples the audio feature sequence with strided 1D convolutions.

```python
class TokenEmbedding(layers.Layer):
    def __init__(self, num_vocab=1000, maxlen=100, num_hid=64):
        super().__init__()
        self.emb = layers.Embedding(input_dim=num_vocab, output_dim=num_hid)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        x = self.emb(x)
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        return x + positions


class SpeechFeatureEmbedding(layers.Layer):
    def __init__(self, num_hid=64, maxlen=100):
        super().__init__()
        # Three stride-2 convolutions shorten the time axis by a factor of 8.
        self.conv1 = layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv2 = layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv3 = layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return self.conv3(x)
```
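The following is a quick, illustrative shape check, not part of the original text; the batch size, sequence lengths, and vocabulary size are arbitrary:

```python
# Hypothetical smoke test: push random inputs through both embedding layers.
audio = tf.random.normal((2, 200, 80))  # (batch, time frames, feature bins)
print(SpeechFeatureEmbedding(num_hid=64)(audio).shape)  # (2, 25, 64): 200 -> 100 -> 50 -> 25

tokens = tf.random.uniform((2, 10), maxval=34, dtype=tf.int32)  # character IDs
print(TokenEmbedding(num_vocab=34, maxlen=100, num_hid=64)(tokens).shape)  # (2, 10, 64)
```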

Next comes the Transformer decoder layer. It runs causal self-attention over the target embeddings, attends to the encoder output with cross-attention, and finishes with a feed-forward block; each sub-block uses a residual connection followed by layer normalization.

```python
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, dropout_rate=0.1):
        super().__init__()
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.self_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.enc_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.self_dropout = layers.Dropout(0.5)  # heavier dropout on the self-attention branch
        self.enc_dropout = layers.Dropout(dropout_rate)
        self.ffn_dropout = layers.Dropout(dropout_rate)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )

    def causal_attention_mask(self, batch_size, n_dest, n_src, dtype):
        """Masks the upper half of the dot product matrix in self attention.

        This prevents the flow of information from future tokens to the
        current token. The mask has 1's in the lower triangle, counting
        from the lower right corner.
        """
        i = tf.range(n_dest)[:, None]
        j = tf.range(n_src)
        m = i >= j - n_src + n_dest
        mask = tf.cast(m, dtype)
        mask = tf.reshape(mask, [1, n_dest, n_src])
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
        )
        return tf.tile(mask, mult)

    def call(self, enc_out, target):
        input_shape = tf.shape(target)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = self.causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        # Causal self-attention over the target sequence.
        target_att = self.self_att(target, target, attention_mask=causal_mask)
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        # Cross-attention: target queries attend to the encoder output.
        enc_out = self.enc_att(target_norm, enc_out)
        enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)
        # Position-wise feed-forward block.
        ffn_out = self.ffn(enc_out_norm)
        ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))
        return ffn_out_norm
```
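To make the masking concrete, here is a small illustrative check (the layer sizes are arbitrary, not from the original text) that evaluates causal_attention_mask for a tiny sequence:

```python
# Illustrative check: the causal mask for batch size 1 and 4 tokens.
dec = TransformerDecoder(embed_dim=64, num_heads=2, feed_forward_dim=128)
mask = dec.causal_attention_mask(tf.constant(1), 4, 4, tf.int32)
print(mask[0].numpy())
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Row t of the mask has 1's only at positions 0..t, so token t can attend to itself and earlier tokens but never to later ones.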


Finally, we can complete the Transformer model. Our model takes audio spectrograms as inputs and predicts a sequence of characters. During training, we give the decoder the target character sequence, shifted to the left, as its input. During inference, the decoder uses its own past predictions to predict the next token.
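To make the shift concrete, here is a minimal sketch; the variable names and token IDs are hypothetical, not taken from the original code:

```python
# Hypothetical teacher-forcing slices for one batch of tokenized targets.
target = tf.constant([[2, 10, 11, 12, 3]])  # <start>, three character IDs, <end>
dec_input = target[:, :-1]   # decoder input: every token except the last
dec_target = target[:, 1:]   # labels: every token except the first
# At position t the decoder can only attend to dec_input[:, :t+1] (thanks to
# the causal mask) and is trained to predict dec_target[:, t], the next token.
```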
