Conclusion
Various deep learning architectures were explored for a Speech Emotion Recognition (SER) task. The experiments conducted illustrate how feed-forward and recurrent neural network architectures, together with their variants, can be employed for paralinguistic speech recognition, and for emotion recognition in particular. Convolutional Neural Networks (ConvNets) demonstrated better discriminative performance than the other architectures. Building on this exploration, the proposed SER system, which relies on minimal speech processing and end-to-end deep learning in a frame-based formulation, yields state-of-the-art results on the IEMOCAP database for speaker-independent SER.

Future work can be pursued in several directions. The proposed SER system could be integrated with automatic speech recognition, exploiting joint knowledge of the linguistic and paralinguistic components of speech to achieve a unified model of speech processing. More generally, the observations made in this work while exploring the various architectures may inform further architectural innovations in deep learning that exploit the advantages of current models while addressing their limitations.
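To make the frame-based, end-to-end formulation concrete, the following is a minimal NumPy sketch, not the paper's actual model: the raw waveform is sliced into overlapping frames, a 1-D convolutional layer with max-pooling extracts per-frame features, features are mean-pooled over frames, and a linear layer produces emotion-class probabilities. All layer sizes, the frame/hop lengths, and the four-class output are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames (e.g. 25 ms / 10 ms at 16 kHz)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class TinyConvSER:
    """Toy frame-based ConvNet: convolve within each frame, pool over frames.

    Hypothetical illustration only; weights are random and untrained.
    """
    def __init__(self, n_filters=8, kernel=11, n_classes=4):
        self.w_conv = rng.standard_normal((n_filters, kernel)) * 0.01
        self.w_out = rng.standard_normal((n_filters, n_classes)) * 0.01

    def forward(self, waveform):
        frames = frame_signal(waveform)          # (T, frame_len)
        # 1-D valid convolution of every filter with every frame
        conv = np.stack(
            [[np.convolve(f, w, mode="valid") for w in self.w_conv]
             for f in frames]
        )                                        # (T, n_filters, L')
        feat = relu(conv).max(axis=2)            # max-pool within each frame
        utter = feat.mean(axis=0)                # mean-pool over all frames
        return softmax(utter @ self.w_out)       # emotion-class probabilities

model = TinyConvSER()
probs = model.forward(rng.standard_normal(16000))  # one second of "speech"
print(probs.shape)
```

Pooling over frames is what makes the classifier utterance-level while the convolution itself operates directly on raw samples, which is the sense in which the pipeline needs only minimal speech processing.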