Sentiment Analysis of US Airline Twitter Data
This project processes short-text documents with the objective of assigning each one a sentiment class: negative, neutral or positive. The data consists of more than 14,000 tweets directed at 6 US airlines over a 7-day period. The tweets have been manually annotated, and the data can be downloaded here. The distribution of tweets looks as follows:
Various pre-processing techniques were used: normalisation (reducing the same word in different forms to a common representation), feature reduction (removing tokens that carry no syntactic or semantic information, e.g. HTML tags) and lemmatization (reducing inflected forms of words to their base form).
Multiple corpus representations were used: unigram (one word equals one feature), bigram (two consecutive words equal one feature), unigram+bigram (a combination of the two previous representations) and word embeddings (a learned representation of words in which semantically related words are close to one another in an n-dimensional space). Feature values are simple term frequencies. Term frequency-inverse document frequency (TF-IDF) was tested as well, but showed poor results.
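To make the unigram, bigram and unigram+bigram representations concrete, here is a minimal sketch in plain Python. It is not the project's own code: the function names (tokenize, ngrams, term_frequencies, build_vocabulary), the tokenisation regex and the example sentence are assumptions made for illustration, and lemmatization is left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, drop HTML tags, and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # feature reduction: remove HTML tags
    return re.findall(r"[a-z0-9']+", text.lower())  # crude normalisation (assumed rule)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequencies(text, representation="unigram+bigram"):
    """Count how often each feature occurs in one document."""
    tokens = tokenize(text)
    features = []
    for part in representation.split("+"):
        n = {"unigram": 1, "bigram": 2}[part]
        features.extend(ngrams(tokens, n))
    return Counter(features)

def build_vocabulary(texts, representation="unigram+bigram", max_features=None):
    """Keep the max_features most frequent features across the corpus."""
    totals = Counter()
    for text in texts:
        totals.update(term_frequencies(text, representation))
    return [term for term, _ in totals.most_common(max_features)]

# Made-up example sentence, not taken from the dataset.
example = "<b>Flight</b> was late, late again"
print(term_frequencies(example, "unigram"))
print(term_frequencies(example, "bigram"))
```

Each document becomes a bag of counts, so word order is lost beyond what the bigrams capture. The optional max_features argument shows one way a vocabulary could be capped in size.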
The algorithms used were logistic regression, support vector machines, Naive Bayes and neural networks of different forms: recurrent (sequential) networks, including LSTM, and 1D convolutional networks. Results of the first three baseline models with the various representations are reported below.
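As an illustration of how one of the baseline models works on term-frequency features, the following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written in plain Python. It is not the project's implementation: the class name, the alpha parameter and the six toy sentences are assumptions invented for this example and are not drawn from the dataset.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed default)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.total_terms = {c: sum(tc.values()) for c, tc in self.term_counts.items()}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)      # log prior
            denom = self.total_terms[label] + self.alpha * len(self.vocab)
            for t in tokens:
                if t in self.vocab:               # ignore unseen words
                    score += math.log((self.term_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration only.
train = [
    ("flight delayed again terrible service", "negative"),
    ("lost my bag and no help", "negative"),
    ("thanks for the great flight", "positive"),
    ("great crew and smooth landing", "positive"),
    ("what time does boarding start", "neutral"),
    ("is the flight on time", "neutral"),
]
model = MultinomialNaiveBayes().fit([text.split() for text, _ in train],
                                    [label for _, label in train])
print(model.predict("terrible delayed bag".split()))
print(model.predict("great crew thanks".split()))
```

The classifier sums log probabilities rather than multiplying raw probabilities, which avoids numerical underflow on longer documents. Passing bigram tokens instead of single words would give the bigram variant of the same baseline.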
The features here are limited to 10,000. The neural networks had insufficient training data to converge in a desirable way, which is evident when plotting the training and validation loss. Multiple techniques exist to avoid this behaviour, and we were later able to push the accuracy above 84% by using the ULMFiT approach, which requires minimal pre-processing and preparation.
The program for running these algorithms can be accessed below.