Topic and Trend Analysis Solution

Overview

Topic Analysis is a Natural Language Processing (NLP) task of extracting salient terms (or topics) from a textual corpus. Trend Analysis task measures the change of the most prominent topics between two time points.

The solution is based on Noun Phrase (NP) Extraction from the given corpora. Each NP (topic) is assigned a proprietary importance score that represents the significance of the noun phrase in the corpora (document appearances, phrase-ness and completeness).

Flow

The first stage is to extract the topics from the two textual corpora:

A target corpus (e.g., current month’s financial reports)
A reference corpus (e.g., last month’s financial reports).

The analysis is done by running the two corpora through the Topic Extraction pipeline: Normalization -> Noun Phrase extraction -> Refinement -> Scoring. In this stage, the algorithm will also train a W2V model on the joint corpora to be used for the clustering report (this step can be skipped). In the second stage the topic lists are being compared and analyzed. Finally the UI reads the analysis data and generates automatic reports for extracted topics, “Hot” and “Cold” trends, and topic clustering in 2D space.

The noun phrase extraction module is using a pre-trained model which is available under the Apache 2.0 license.

Flow diagram

Reports

Top Topics: highest scored topics from each corpora
Hot Trends: topics with highest positive change in scores
Cold Trends: topics with highest negative change in scores
Trend Clustering: scatter graph showing trends clusters
Topic Clustering: scatter graph showing topic clusters for each corpus
Custom Trends: topics selected by the user to monitor (see section: Filter Phrases and Custom Trends)

Usage

Requirements

Install solution extra packages:

pip install -r solutions/trend_analysis/requirements.txt

First stage

usage: python solutions/trend_analysis/topic_extraction.py [-h] [--notrain] [--url] [--single_thread]
                           target_corpus ref_corpus

positional arguments:
  target_corpus    a path to a folder containing text files
  ref_corpus       a path to a folder containing text files

optional arguments:
  -h, --help       show this help message and exit
  --no_train        skip the creation of w2v model
  --url            corpus is provided as csv file with urls
  --single_thread  analyze corpora sequentially

The topic lists will be saved to csv files, which are the input of the second stage. When using the –url flag, both target_corpus and ref_corpus should be a csv file containing url links to analyze (a single url per row). To use the trend analysis step (step below) it is required to run the topic extraction above without --no_train option.

Second stage

usage: python solutions/trend_analysis/trend_analysis.py [-h] [--top_n TOP_N] [--top_vectors TOP_VECTORS]
                     target_topics ref_topics

positional arguments:
  target_topics         a path to a csv topic-list extracted from the target
                        corpus
  ref_topics            a path to a csv topic-list extracted from the
                        reference corpus

optional arguments:
  -h, --help            show this help message and exit
  --top_n TOP_N         compare only top N topics (default: 10000)
  --top_vectors TOP_VECTORS
                        include only top N vectors in the scatter graph
                        (default: 500)

The input to the second stage is the output lists from the first stage (topic extraction). The analysis results will be saved into the data folder and will be used by the UI at the last stage.

UI stage

In order to visualize the analysis results run:

python solutions/start_ui.py --solution trend_analysis

You can also load the UI as a server using –address and –port, for example:

python solutions/start_ui.py --solution trend_analysis --address=12.13.14.15 --port=1010

and then access it through a browser: http://12.13.14.15:1010/ui

Filter Phrases and Custom Trends

By default, all topics will be analyzed (according to the top N threshold, if provided), and the Custom Trends graph will be empty. The user can filter phrases he wants to omit from the results (post analysis) by selecting the “Filter” radio button, click on the “Filter Topics” tab, and de-select the unwanted topics (currently de-selection is done by holding the Ctrl button and click on a cell). Similarly, in order to select custom trends to be presented in the Custom Trends graph, click on the “Custom Trends” tab and select the phrases to show.

For a permanent custom/filtering, edit the ‘valid’/’custom’ column in the file: data/filter_phrases.csv (assign 1 to show a phrase and 0 otherwise), save the file and refresh the reports web page.