Opinions and sentiments are essential to human activities
and have a wide variety of applications. As many decision makers
turn to social media due to the large volume of opinion data available,
efficient and accurate sentiment analysis is necessary to extract opinions from those
data. Hence, text sentiment analysis has recently become a popular
field and has attracted many researchers. However, extracting
sentiments from audio speech remains a challenge. This project
explored the possibility of applying supervised machine learning to
recognize sentiments in English utterances at the sentence level. In
addition, the project aimed to examine the effect of combining
acoustic and linguistic features on classification accuracy. Six audio
tracks were randomly selected as training data from 40 YouTube
monologue videos with a strong presence of sentiment. Speakers
expressed sentiments towards products, films, or political events.
These sentiments were manually labelled as negative or positive
based on the independent judgement of three experimenters. A wide range of
acoustic and linguistic features were then analyzed and extracted
using sound editing and text mining tools, respectively. A novel
approach was proposed, which used a simplified sentiment score to
integrate linguistic features and estimate sentiment valence. This
approach improved negation analysis and hence increased overall
accuracy. Results showed that when both linguistic and acoustic
features were used, the accuracy of sentiment recognition improved
significantly, and that excellent prediction was achieved with each of
the four classifiers trained separately, namely kNN, SVM, Neural
Network, and Naïve Bayes. Possible sources of error and inherent
challenges of audio sentiment analysis were discussed to provide
potential directions for future research.
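
As an illustration only, the idea of a simplified sentiment score with negation handling can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the project: the word lists, the two-token negation window, and the function names sentiment_score and valence are all hypothetical.

```python
import re

# Hypothetical toy lexicons; the word lists used in the project are not given.
POSITIVE = {"good", "great", "love", "like", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "boring"}
NEGATORS = {"not", "no", "never", "don't", "didn't", "isn't", "wasn't"}


def sentiment_score(sentence, window=2):
    """Sum word polarities, flipping any polarity word that falls
    within `window` tokens after a negator (an assumed heuristic)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    negate_left = 0  # how many upcoming tokens are still negated
    for tok in tokens:
        if tok in NEGATORS:
            negate_left = window
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity and negate_left > 0:
            polarity = -polarity
        score += polarity
        negate_left = max(negate_left - 1, 0)
    return score


def valence(sentence):
    """Map the score to a coarse valence label."""
    score = sentiment_score(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under these assumptions, a score of this kind could serve as a single linguistic feature placed alongside acoustic features in the input vector of a classifier.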