Sentiment Analysis
Supervised Sentiment
Overview
This is a set of example supervised models for sentiment analysis. The larger idea behind these models is to enable ensemble learning, combining them with other supervised or unsupervised models.
Files
- examples/supervised_sentiment/supervised_sentiment.py: Sentiment analysis models - currently an LSTM and a one-hot CNN
- examples/supervised_sentiment/amazon_reviews.py: Code which will download and process the Amazon datasets described below
- examples/supervised_sentiment/ensembler.py: Contains the ensemble learning algorithm(s)
- examples/supervised_sentiment/example_ensemble.py: An example of how the sentiment models can be trained and ensembled.
- examples/supervised_sentiment/optimize_example.py: An example of using a hyperparameter optimizer with the simple LSTM model.
Models
Two models are shown as classification examples. Additional models can be added as desired.
Bi-directional LSTM
A simple bidirectional LSTM with one fully connected layer. The vocabulary size, dense output size, and document input length should be determined during data preprocessing. The user can then adjust the size of the LSTM hidden layer and the recurrent dropout rate.
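The sketch below shows what such a model can look like in Keras; the layer sizes and defaults are illustrative placeholders, not the exact values used in supervised_sentiment.py.

from tensorflow.keras import layers, models

def simple_bilstm(vocab_size, dense_out, doc_length,
                  embedding_dim=128, lstm_hidden=64, recurrent_dropout=0.2):
    # Embed the integer-encoded document, run it through a bidirectional
    # LSTM, and classify with a single fully connected layer.
    inputs = layers.Input(shape=(doc_length,))
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    x = layers.Bidirectional(
        layers.LSTM(lstm_hidden, recurrent_dropout=recurrent_dropout))(x)
    outputs = layers.Dense(dense_out, activation='softmax')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model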
Temporal CNN
As defined in “Text Understanding from Scratch” (Zhang and LeCun, 2015; https://arxiv.org/pdf/1502.01710v4.pdf), this model is a series of 1D convolutional layers followed by max pooling and fully connected layers. The frame size may be either large or small.
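A condensed Keras sketch of this architecture is shown below; the paper's full model uses six convolutional layers over one-hot character input of length 1014, so the depth and sizes here are illustrative only.

from tensorflow.keras import layers, models

def temporal_cnn(doc_length, alphabet_size, n_classes, frame=256):
    # frame=256 corresponds to the paper's small model, frame=1024 to the
    # large one. Input is a one-hot character matrix.
    inputs = layers.Input(shape=(doc_length, alphabet_size))
    x = layers.Conv1D(frame, 7, activation='relu')(inputs)
    x = layers.MaxPooling1D(3)(x)
    x = layers.Conv1D(frame, 7, activation='relu')(x)
    x = layers.MaxPooling1D(3)(x)
    x = layers.Conv1D(frame, 3, activation='relu')(x)
    x = layers.MaxPooling1D(3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)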
Datasets
The dataset in this example is the Amazon Reviews dataset, though other datasets can be easily substituted.
The Amazon review dataset(s) should be downloaded from http://jmcauley.ucsd.edu/data/amazon/. These are *.json.gz files which should be unzipped before use. The terms and conditions of the dataset license apply; Intel does not grant any rights to the data files.
For best results, choose a medium-sized dataset, though the algorithms will work on larger and smaller datasets as well. For experimentation, the Movies and TV reviews were used.
Only the “overall”, “reviewText”, and “summary” columns of the review dataset are retained. “overall” is the star rating; it is mapped to a sentiment label where currently 4-5 stars is a positive review, 3 stars is neutral, and 1-2 stars is a negative review.
The “summary” or title of the review is concatenated with the review text and subsequently cleaned.
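A minimal pandas sketch of this labeling and concatenation step (column names follow the Amazon review JSON schema; the exact cleaning in amazon_reviews.py may differ):

import pandas as pd

def label_and_merge(df: pd.DataFrame) -> pd.DataFrame:
    df = df[['overall', 'reviewText', 'summary']].copy()
    # Map star ratings to sentiment: 1-2 negative, 3 neutral, 4-5 positive
    df['sentiment'] = pd.cut(df['overall'], bins=[0, 2, 3, 5],
                             labels=['negative', 'neutral', 'positive'])
    # Prepend the review title ("summary") to the review body, then clean
    df['text'] = (df['summary'].fillna('') + ' ' +
                  df['reviewText'].fillna('')).str.strip().str.lower()
    return df[['text', 'sentiment']]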
The Amazon Review Dataset was published in the following papers:
- Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. R. He, J. McAuley. WWW, 2016. http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf
- Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015. http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf
Running Modalities
Ensemble Train/Test
Install extra packages for running the model:
pip install -r examples/requirements.txt
Currently, the pipeline shows a full train/test/ensemble cycle. The main pipeline can be run with the following command:
python examples/supervised_sentiment/example_ensemble.py --file_path ./reviews_Movies_and_TV.json/
At the conclusion of training, a final confusion matrix is displayed.
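The ensembling itself lives in ensembler.py; as a rough illustration of the idea, a simple probability-averaging ensemble of two or more trained Keras models could look like the following hypothetical helper (not the repository's actual algorithm):

import numpy as np

def average_ensemble(models, x):
    # Average each model's predicted class probabilities, then take the
    # highest-scoring class per sample.
    probs = np.mean([m.predict(x) for m in models], axis=0)
    return probs.argmax(axis=1)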
Hyperparameter optimization
An example of hyperparameter optimization is given using the Python package hyperopt, which applies a Tree-structured Parzen Estimator (TPE) to optimize the simple bi-LSTM model. To run this example, use the following command:
python examples/supervised_sentiment/optimize_example.py \
--file_path ./reviews_Movies_and_TV.json/ \
--new_trials 50 --output_file ./data/optimize_output.pkl
The script writes the result of each trial to the specified pickle file.
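A minimal hyperopt sketch of such a TPE search over two bi-LSTM hyperparameters is given below; train_and_eval is a hypothetical stand-in for the train/evaluate logic in optimize_example.py.

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

space = {
    'lstm_hidden': hp.choice('lstm_hidden', [32, 64, 128]),
    'recurrent_dropout': hp.uniform('recurrent_dropout', 0.0, 0.5),
}

def objective(params):
    # train_and_eval (hypothetical) builds the bi-LSTM with `params`,
    # trains it, and returns validation accuracy.
    acc = train_and_eval(params)
    return {'loss': -acc, 'status': STATUS_OK}  # hyperopt minimizes loss

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)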
Aspect Based Sentiment Analysis (ABSA)
Overview
Aspect Based Sentiment Analysis is the task of co-extracting opinion terms and aspect terms (opinion targets) and the relations between them in a given corpus.
Algorithm Overview
Training: the training phase takes training data as input and outputs an opinion lexicon and an aspect lexicon. The training flow consists of the following four main steps:
1. The first training step is text pre-processing, performed by spaCy. This step includes tokenization, part-of-speech tagging, and sentence breaking.
2. The second training step applies a dependency parser to the training data. For this purpose we used the parser described in [1]. For more details regarding steps 1 and 2, see the BIST dependency parser.
3. The third step applies a bootstrap lexicon acquisition algorithm described in [2]; the algorithm uses a generic lexicon introduced by [3] as the seed for the bootstrap process (see the sketch after this list).
4. The last step applies an MLP-based opinion term re-ranking and polarity estimation algorithm. This step uses the word embedding similarities between each acquired term and a set of generic opinion terms as features. A pre-trained re-ranking model is provided.
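To make the bootstrap step concrete, here is a highly simplified sketch of one double-propagation pass in the spirit of [2], using spaCy; the actual implementation uses the BIST parser and a much richer set of dependency rules.

import spacy

nlp = spacy.load('en_core_web_sm')

def propagate_once(text, opinion_lex, aspect_lex):
    # Rule sketch: an adjective that modifies a noun (amod) links an
    # opinion term to an aspect term, so knowing one side of the pair
    # lets us acquire the other.
    for tok in nlp(text):
        if tok.dep_ == 'amod' and tok.head.pos_ == 'NOUN':
            if tok.lemma_ in opinion_lex:
                aspect_lex.add(tok.head.lemma_)   # known opinion -> new aspect
            elif tok.head.lemma_ in aspect_lex:
                opinion_lex.add(tok.lemma_)       # known aspect -> new opinion
    return opinion_lex, aspect_lex

# Seeded with a generic opinion lexicon ([3]); repeated passes grow both.
opinions, aspects = propagate_once(
    "The fresh food arrived and the friendly staff helped.",
    {'fresh', 'friendly'}, set())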
Inference: the inference phase takes as input inference data along with the opinion lexicon and aspect lexicon generated by the training phase. Its output is a list of aspect-opinion pairs (along with their polarity and score) extracted from the inference data. The inference approach is based on detecting syntactically related aspect-opinion pairs.
Flow
Training
Full code example is available at examples/absa/train/train.py.
There are two training modes:
1. Providing training data in a raw text format. In this case the training flow will apply the dependency parser to the data:
python3 examples/absa/train/train.py --data=TRAINING_DATASET
Arguments:
--data=TRAINING_DATASET
- path to the input training dataset. Should point to a single raw text file with documents separated by newlines, to a single csv file containing one doc per line, or to a directory containing one raw text file per document.
Optional arguments:
--rerank-model=RERANK_MODEL
- path to the re-rank model. By default, when running the training for the first time, this model will be downloaded to ~/nlp-architect/cache/absa/train/reranking_model
Notes:
- The generated opinion and aspect lexicons are written as csv files to ~/nlp-architect/cache/absa/train/lexicons/generated_opinion_lex_reranked.csv and to ~/nlp-architect/cache/absa/train/lexicons/generated_aspect_lex.csv
- In this mode the parsed data (jsons of ParsedDocument objects) is written to ~/nlp-architect/cache/absa/train/parsed
- When running the training for the first time, the system will download the GloVe word embedding model (the user will be prompted for authorization) to ~/nlp-architect/cache/absa/train/word_emb_unzipped (this may take a while)
- For demonstration purposes we provide a sample of tripadvisor.co.uk restaurant reviews under the Creative Commons Attribution-Share-Alike 3.0 License (Copyright 2018 Wikimedia Foundation). The dataset can be found at datasets/absa/tripadvisor_co_uk-travel_restaurant_reviews_sample_2000_train.csv
2. Providing parsed training data. In this case the training flow skips the parsing step:
python3 examples/absa/train/train.py --parsed-data=PARSED_TRAINING_DATASET
Arguments:
--parsed-data=PARSED_TRAINING_DATASET
- path to the parsed format (jsons of ParsedDocument objects) of the training dataset.
Inference
Full code example is available at examples/absa/inference/inference.py.
There are two inference modes:
1. Providing inference data in a raw text format.
from nlp_architect.models.absa.inference.inference import SentimentInference

inference = SentimentInference(ASPECT_LEX, OPINION_LEX)
sentiment_doc = inference.run(doc="The food was wonderful and fresh. Staff were friendly.")
Arguments:
ASPECT_LEX
- path to aspect lexicon (csv file) that was produced by the training phase.
aspect.csv may be manually edited for grouping alias aspect names (e.g. ‘drinks’ and ‘beverages’) together: simply copy all alias names to the same line in the csv file.
OPINION_LEX
- path to opinion lexicon (csv file) that was produced by the training phase.
doc
- input sentence.
2. Providing parsed inference data (ParsedDocument format). In this case the parsing step is skipped:
import json

from nlp_architect.common.core_nlp_doc import CoreNLPDoc
from nlp_architect.models.absa.inference.inference import SentimentInference

inference = SentimentInference(ASPECT_LEX, OPINION_LEX, parse=False)
doc_parsed = json.load(open('/path/to/parsed_doc.json'), object_hook=CoreNLPDoc.decoder)
sentiment_doc = inference.run(parsed_doc=doc_parsed)
Inference - interactive mode
The provided file examples/absa/inference/interactive.py enables using the generated lexicons in interactive mode:
python3 interactive.py --aspects=ASPECT_LEX --opinions=OPINION_LEX
Arguments:
--aspects=ASPECT_LEX
- path to aspect lexicon (csv file format)
--opinions=OPINION_LEX
- path to opinion lexicon (csv file format)
References
[1] Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics, 4:313–327.
[2] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion Word Expansion and Target Extraction through Double Propagation. Computational Linguistics, 37(1): 9–27.
[3] Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177.