This work focuses on monosyllabic speech recognition, where the goal is to accurately recognize a set of predefined words from short audio clips. It uses a speech commands data set consisting of 64,000 one-second utterances of 30 short words, from which we learn to classify 10 words, along with two additional classes for "unknown" words and "silence". We use a convolutional neural network (CNN) with one-dimensional convolutions applied directly to the raw audio signal to classify the samples. The results show that the model predicts samples of words it saw during training with high accuracy, but it struggles somewhat to generalize to words outside the training data and to extremely noisy samples.
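As a rough illustration of the setup described above, the following minimal sketch shows how the 30 word labels could be collapsed into 12 classes and how a single one-dimensional convolution slides over a raw waveform. It is not the model used in this work: the choice of target words, the 16 kHz sample rate, the kernel values, the stride, and the helper names `to_class_label` and `conv1d` are all assumptions made for the example.

```python
from typing import List, Sequence


def to_class_label(word: str, target_words: Sequence[str]) -> str:
    """Collapse a data set label into a target word, 'unknown', or 'silence'."""
    if word == "silence":
        return "silence"
    return word if word in target_words else "unknown"


def conv1d(signal: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """Valid (unpadded) 1D cross-correlation, the core operation of a 1D CNN layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


# Assumption: an illustrative choice of 10 target words; the source does not list them.
TARGET_WORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
CLASSES = TARGET_WORDS + ["unknown", "silence"]  # 12 classes in total

# Assumption: 16 kHz sampling, so a one-second clip has 16,000 samples.
SAMPLE_RATE = 16000
waveform = [0.0] * SAMPLE_RATE  # stand-in for a one-second raw audio clip

# One hand-picked smoothing kernel; a trained CNN would learn many such kernels.
features = conv1d(waveform, kernel=[0.25, 0.5, 0.25], stride=4)
```

In a full model, several such convolutional layers would be stacked, each followed by a nonlinearity and usually pooling, before a final layer that outputs one score per entry in `CLASSES`.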
This work is licensed under a Creative Commons Attribution 4.0 International License.