Master Thesis: Opinion Mining in an Industrial Context

In this post I will go through some of the things I did in my master's thesis project in the field of Natural Language Processing. Unfortunately, much of it is confidential due to the collaboration with my thesis company, as they of course have an obligation to keep their customers’ data safe and away from the public.

The company I worked with possesses a lot of unstructured textual data from a variety of sources, including Trustpilot, Facebook and some internal data platforms. They want to make better use of this data, so that they can offer their customers better service and improve customer satisfaction. By applying different language processing methods, they hope to gain insights that previously went unrecognised.

Specifically, my thesis partner and I narrowed the scope down to state-of-the-art topic modelling and sentiment analysis. Topic modelling would help the company understand the themes and subjects pertinent to their customers, and sentiment analysis would help them understand their customers’ satisfaction in the different areas. The two techniques work individually, but they become more powerful when combined.

All the data at hand was in Danish, so we trained a Danish lemmatizer and POS-tagger with spaCy to handle the pre-processing of the text. We produced multiple pre-processed sets, both to allow flexibility in the data representation and to be able to test the models on several variants. This step is really important to many NLP algorithms, as the models cannot intuitively tell whether a word like det ("it") is as important as ubehagelig ("unpleasant"). The pre-processing steps were: lower-casing, tokenisation, removal of punctuation and special characters, a shared representation for numbers, exclusion of stop-words, lemmatisation and finally POS-tagging for nouns. Numericalisation was postponed until just before model fitting, to allow vocabularies to be combined.
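A minimal sketch of the first few pre-processing steps, using only the Python standard library. This is not the thesis pipeline: the stop-word list is a tiny illustrative sample, the `<num>` placeholder is a hypothetical choice for the shared number representation, and lemmatisation and POS-tagging are left out because they require a trained Danish model.

```python
import re

# Tiny illustrative stop-word list; a real Danish list would be far longer.
STOP_WORDS = {"det", "er", "og", "en", "at", "jeg", "var"}
NUMBER_TOKEN = "<num>"  # hypothetical shared placeholder for all numbers


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, tokenise, strip punctuation, unify numbers, drop stop-words."""
    text = text.lower()
    # \w+ keeps runs of letters and digits (including æ, ø, å) and drops punctuation.
    tokens = re.findall(r"\w+", text)
    # Every purely numeric token is mapped to the same placeholder.
    tokens = [NUMBER_TOKEN if tok.isdigit() else tok for tok in tokens]
    return [tok for tok in tokens if tok not in stop_words]


example = "Det var en ubehagelig oplevelse, jeg ventede 45 minutter!"
print(preprocess(example))
# ['ubehagelig', 'oplevelse', 'ventede', '<num>', 'minutter']
```

Keeping the output as a list of string tokens, rather than integer ids, matches the idea of postponing numericalisation until the vocabularies are known.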

Topic Modelling
The goal of topic modelling is to let algorithms reveal the structure of the data, such that similar words, or words that often occur together, are clustered together in a topic. A topic then consists of a number of top words that define it. Depending on the algorithm, each word has either a probability or a weight of belonging to a specific topic. At the document level, this can be aggregated to assign a single topic, or to show the contribution from multiple topics.

It is notoriously difficult to let a model know when it has produced a good topic. Normally, this is measured by a coherence score; we used the coherence measure proposed by Mimno to define a topic’s coherence. We also put weight on a qualitative assessment of the topics’ top words, because if we do not understand them, what good are they?
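To make the coherence idea concrete, here is a sketch of the document co-occurrence coherence usually attributed to Mimno et al. (often called UMass coherence): for each pair of top words, it adds log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents containing the given words. The toy corpus is made up, and the function assumes every top word occurs in at least one document.

```python
import math


def umass_coherence(top_words, documents):
    """Sum over ordered pairs of top words of log((D(w_i, w_j) + 1) / D(w_j)).

    top_words must be ordered from most to least important for the topic.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co_occurrences = doc_freq(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[j]))
    return score


# Made-up toy corpus of already pre-processed documents.
docs = [
    ["levering", "pakke", "forsinket"],
    ["pakke", "levering"],
    ["pris", "rabat"],
    ["pris", "levering"],
]
coherent = umass_coherence(["levering", "pakke"], docs)  # words that co-occur
mixed = umass_coherence(["levering", "rabat"], docs)     # words that never co-occur
print(coherent > mixed)  # True
```

Scores are at most slightly above zero and typically negative; values closer to zero indicate top words that tend to appear in the same documents.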
Topic modelling on short text, like what we mostly see on Facebook, is especially difficult, as the algorithms rely on patterns and inferable words, of which there are few in a Facebook reply compared to a news article. We decided to use two classical topic modelling algorithms that have proven their worth on longer documents as baselines: Latent Dirichlet Allocation (LDA), a probabilistic model that assigns each word a probability of belonging to a topic, and non-negative matrix factorisation (NMF), a non-probabilistic decomposition technique that produces two matrices, where the first explains the topic-to-document contribution and the second explains the word-to-topic contribution. In addition, we used two models explicitly developed for modelling topics on short text: the Biterm Topic Model, an extension of LDA that utilises the biterms (co-occurring word pairs) of the whole corpus, and the Word2Vec-Gaussian Mixture Model (W2V-GMM), a newer approach that combines rich word embeddings with Gaussian distributions to define the topics in a learned embedding space.
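The biterm idea is easy to illustrate. The sketch below only shows how biterms are extracted and pooled over a corpus, not the topic model itself; treating the whole document as the co-occurrence window is an assumption that suits short texts, and the toy corpus is made up.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """All unordered word pairs from one short document."""
    # Sorting each pair makes ("a", "b") and ("b", "a") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Made-up toy corpus of pre-processed short documents.
corpus = [["pakke", "forsinket", "levering"], ["levering", "pakke"]]

# The pairs are pooled over the whole corpus, which is what compensates
# for the sparsity of each individual short document.
biterm_counts = Counter(b for doc in corpus for b in extract_biterms(doc))
print(biterm_counts.most_common(1))
# [(('levering', 'pakke'), 2)]
```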

Above is a depiction of the experimental approach we took in the topic modelling track. All the pre-processed sets were tested on all algorithms. NMF needed additional preparation in the form of weighting the input matrix with TF-IDF, and W2V-GMM needed a word embedding trained on the corpus in order to fit the Gaussian distributions.
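As an illustration of the TF-IDF weighting step, the sketch below builds a small document-term matrix by hand. The exact formula (raw term count times log(N / df)) is an assumption; libraries commonly apply smoothed or normalised variants, and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, rows) where rows[d][j] = tf * log(N / df)."""
    vocab = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append([counts[tok] * math.log(n_docs / doc_freq[tok]) for tok in vocab])
    return vocab, rows


docs = [["pakke", "levering"], ["pakke", "pris"], ["pakke", "rabat", "rabat"]]
vocab, rows = tfidf_matrix(docs)
for row in rows:
    print([round(value, 2) for value in row])
```

A word that appears in every document, like "pakke" here, gets weight zero, while rarer words are emphasised. All entries stay non-negative, which is what NMF requires of its input matrix.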

Surprisingly, NMF produced the most coherent topics, outperforming even the two supposedly state-of-the-art models for topic modelling of short text. We selected the best model and applied a hard classification, followed by visualisation of the classified documents. At this stage we were able to investigate whether the classification was a good approximation of what each document contained. It produced isolated clusters of documents with similar intent, but also one big cluster of ambiguity, which contained documents of very different intent. We therefore isolated this cluster with k-means, keeping one cluster of mixed documents whose topic cannot be inferred. Unfortunately, I cannot share any of these results or visualisations, as it would violate the signed NDA.
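Hard classification here simply means giving each document the single topic with the highest weight. A minimal sketch, assuming the document-topic weights are available as one row per document (the numbers are made up):

```python
def hard_assign(doc_topic_weights):
    """Index of the highest-weighted topic for each document."""
    # max() returns the first index on ties, so ties go to the lowest topic id.
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_weights]


# Made-up document-topic weights: three documents, three topics.
weights = [
    [0.9, 0.1, 0.0],  # clearly topic 0
    [0.2, 0.1, 0.7],  # clearly topic 2
    [0.3, 0.4, 0.3],  # nearly uniform, but still forced into topic 1
]
print(hard_assign(weights))  # [0, 2, 1]
```

The third row shows why a hard assignment can mislead: a document with no dominant topic still receives a label, which is one way a cluster of ambiguous documents can arise.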

Sentiment Analysis
For the sentiment analysis track, we already had the various data sets prepared to test with the chosen algorithms. Here we used two baseline models, logistic regression and a support vector machine, and one state-of-the-art model, ULMFiT (Universal Language Model Fine-tuning for Text Classification). The latter utilises a range of cool methods to achieve its remarkable results, including slanted triangular learning rates, discriminative fine-tuning, gradual unfreezing and weight dropout.
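Two of these methods can be written down in a few lines. The sketch below follows the formulas in the ULMFiT paper by Howard and Ruder; the default values (cut_frac=0.1, ratio=32, lr_max=0.01 and the factor 2.6) are the ones suggested there, not necessarily the settings used in our experiments, and the function names are my own.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay.

    Assumes total_steps * cut_frac >= 1 so that the warm-up is non-empty.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the peak is reached
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Per-layer learning rates, lowest layer first; each layer gets the
    rate of the layer above it divided by `factor`."""
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]


for step in (0, 50, 100, 500, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
print(discriminative_lrs(0.01, n_layers=3))
```

The schedule peaks after the first 10% of the steps and then decays slowly, while the per-layer rates let the lower, more general layers change less than the upper ones.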

The setup is universal in the sense that the same architecture is used throughout all the steps. This architecture, AWD-LSTM, is a former state-of-the-art language model first proposed by Merity. It is a three-layer LSTM network regularised with techniques such as the weight dropout mentioned above. First, the network learns the general language from a rich data source like Wikipedia. Fortunately, Møllerhøj had already trained an AWD-LSTM on the Danish Wikipedia, which we could utilise. The network’s parameters are then fine-tuned on a domain-specific corpus, so that the model converges to the way language is used in the target domain. Here, we were able to converge quickly to the domain language by utilising both Trustpilot and Facebook data. In the last step, an additional pooling layer is added, and the network is trained to classify sentiment instead of only predicting the next word of the sequence.

Domain specific training of language model

The amount of data we had available from Trustpilot greatly exceeded that from Facebook, and classifying sentiment on Trustpilot reviews does not make sense, since the customer has already assigned a star rating to the review. Since we did not initially know what sentiment was associated with the Facebook documents, we had 50 independent subjects annotate sentiment on 40 examples each. Each example was annotated by two subjects, the inter-annotator agreement was determined, and the examples above an acceptable level of agreement were accepted into a validation set. The validation set ended up at about 800 examples, which is quite small if it also had to be split into a training set. Fortunately, we had a lot more data from Trustpilot, where the star rating can be interpreted as a sentiment. We used 1-star reviews as negative training examples and 5-star reviews as positive training examples. With this approach we reached a validation accuracy of 94.3%, where the best baseline model scored 81.9%.
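The labelling scheme can be sketched as follows. The star mapping follows the description above; the agreement rule (keep an example only when both annotators chose the same label) is an assumption made for illustration, since the actual agreement measure and threshold are not described here, and all example texts are made up.

```python
def stars_to_sentiment(reviews):
    """Keep only 1-star and 5-star reviews as negative and positive examples."""
    label_for = {1: "negative", 5: "positive"}
    return [(text, label_for[stars]) for text, stars in reviews if stars in label_for]


def agreed_examples(annotations):
    """Keep examples where both annotators chose the same label (assumed rule)."""
    return [(text, first) for text, first, second in annotations if first == second]


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up examples: (text, star rating) and (text, annotator 1, annotator 2).
reviews = [("Great service", 5), ("Okay I guess", 3), ("Never again", 1)]
annotations = [
    ("Thanks for the quick reply", "positive", "positive"),
    ("Still waiting...", "negative", "positive"),
]

train = stars_to_sentiment(reviews)
validation = agreed_examples(annotations)
print(train)       # [('Great service', 'positive'), ('Never again', 'negative')]
print(validation)  # [('Thanks for the quick reply', 'positive')]
```

Dropping the 2 to 4 star reviews trades training data for cleaner labels, since mid-range ratings are harder to map to a clear sentiment.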

We can therefore conclude that Trustpilot reviews are a great approximation of sentiment in Facebook documents. We combined the pre-processing, topic modelling and sentiment analysis in a single dashboard that showcases the topics with their associated sentiment. With this dashboard, the business intelligence department is able to view trends within specific topics, as well as the incoming messages and their associated topic and sentiment. The dashboard tracks changes in sentiment over a 24-hour and a 30-day period. By lowering the update rate, we are able to implement sophisticated alerts that, based on earlier events, can detect whether similar patterns appear.
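The two tracking windows can be illustrated with a small sketch. This is not the dashboard code: the +1/-1 sentiment encoding, the function name and the messages are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def mean_sentiment(messages, now, window):
    """Mean sentiment of messages with a timestamp within `window` before `now`.

    Returns None when no message falls inside the window.
    """
    scores = [score for ts, score in messages if now - window <= ts <= now]
    return sum(scores) / len(scores) if scores else None


now = datetime(2020, 1, 31, 12, 0)
# Made-up (timestamp, sentiment) pairs: +1 is positive, -1 is negative.
messages = [
    (now - timedelta(hours=2), -1),
    (now - timedelta(hours=20), -1),
    (now - timedelta(days=5), 1),
    (now - timedelta(days=10), 1),
    (now - timedelta(days=40), -1),  # outside both windows
]

last_day = mean_sentiment(messages, now, timedelta(hours=24))
last_month = mean_sentiment(messages, now, timedelta(days=30))
print(last_day, last_month)  # -1.0 0.0
```

Comparing the short window against the long one gives a simple signal: here the last 24 hours are clearly more negative than the 30-day average, which is the kind of change an alert could react to.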