Topic and Trend Analysis Solution

Overview

Topic Analysis is a Natural Language Processing (NLP) task that extracts salient terms (or topics) from a textual corpus. Trend Analysis measures the change in the most prominent topics between two time points.

The solution is based on Noun Phrase (NP) extraction from the given corpora. Each NP (topic) is assigned a proprietary importance score that represents its significance in the corpora, based on document appearances, phrase-ness and completeness.

Flow

The first stage is to extract the topics from the two textual corpora:

  • A target corpus (e.g., current month’s financial reports)
  • A reference corpus (e.g., last month’s financial reports).

Both corpora are run through the Topic Extraction pipeline: Normalization -> Noun Phrase extraction -> Refinement -> Scoring. In this stage, the algorithm also trains a Word2Vec (W2V) model on the joint corpora, which is used for the clustering report (this step can be skipped). In the second stage, the two topic lists are compared and analyzed. Finally, the UI reads the analysis data and generates automatic reports for extracted topics, “Hot” and “Cold” trends, and topic clustering in 2D space.

The noun phrase extraction module uses a pre-trained model, which is available under the Apache 2.0 license.

Flow diagram

_images/ta_flow.png

Reports

  • Top Topics: highest scored topics from each corpus
  • Hot Trends: topics with highest positive change in scores
  • Cold Trends: topics with highest negative change in scores
  • Trend Clustering: scatter graph showing trend clusters
  • Topic Clustering: scatter graph showing topic clusters for each corpus
  • Custom Trends: topics selected by the user to monitor (see section: Filter Phrases and Custom Trends)
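
To make the Hot Trends and Cold Trends reports more concrete, here is a minimal Python sketch of ranking topics by score change between two corpora. It is an illustration only, not the solution's actual implementation: the function names, the sample scores, and the choice to treat a topic missing from one corpus as having score 0 are all assumptions.

```python
def score_changes(target_scores, ref_scores):
    """Return {topic: target score - reference score} for every topic seen.

    Assumption for this sketch: a topic absent from one corpus scores 0 there.
    """
    topics = set(target_scores) | set(ref_scores)
    return {t: target_scores.get(t, 0.0) - ref_scores.get(t, 0.0) for t in topics}


def hot_and_cold(target_scores, ref_scores, k=3):
    """Return the k largest positive and k largest negative score changes."""
    changes = score_changes(target_scores, ref_scores)
    # Sort by change, descending; break ties alphabetically for stable output.
    ranked = sorted(changes.items(), key=lambda item: (-item[1], item[0]))
    hot = [(t, d) for t, d in ranked if d > 0][:k]
    cold = [(t, d) for t, d in reversed(ranked) if d < 0][:k]
    return hot, cold


# Hypothetical scores for a target (current) and reference (previous) corpus.
target = {"interest rates": 0.75, "inflation": 0.5, "bond yields": 0.25}
ref = {"interest rates": 0.25, "inflation": 0.625, "trade deficit": 0.5}

hot, cold = hot_and_cold(target, ref)
print("Hot trends:", hot)
print("Cold trends:", cold)
```

In this toy data, "interest rates" gains the most and "trade deficit" loses the most, so they lead the hot and cold lists respectively.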

Usage

Requirements

Install solution extra packages:

pip install -r solutions/trend_analysis/requirements.txt

First stage

usage: python solutions/trend_analysis/topic_extraction.py [-h] [--no_train] [--url] [--single_thread]
                           target_corpus ref_corpus

positional arguments:
  target_corpus    a path to a folder containing text files
  ref_corpus       a path to a folder containing text files

optional arguments:
  -h, --help       show this help message and exit
  --no_train       skip the creation of w2v model
  --url            corpus is provided as csv file with urls
  --single_thread  analyze corpora sequentially

The topic lists are saved to csv files, which are the input of the second stage. When using the --url flag, target_corpus and ref_corpus should each be a csv file containing the url links to analyze (a single url per row). The trend analysis step (the second stage, below) requires that the topic extraction above be run without the --no_train option.
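
As an illustration of the input expected with the --url flag, the following Python sketch writes a csv file with a single url per row and reads it back. The helper names, file name and example.com urls are placeholders, and the sketch assumes that no header row is wanted, taking "a single url per row" literally.

```python
import csv
import os
import tempfile


def write_url_corpus(path, urls):
    """Write one url per row, with no header row (an assumption)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])


def read_url_corpus(path):
    """Read the urls back, skipping any empty rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]


# Placeholder urls; replace them with the pages that make up the corpus.
corpus_dir = tempfile.mkdtemp()
target_csv = os.path.join(corpus_dir, "target_urls.csv")
write_url_corpus(
    target_csv,
    [
        "https://example.com/reports/february",
        "https://example.com/reports/february-summary",
    ],
)
print(read_url_corpus(target_csv))
```

A second file built the same way would serve as ref_corpus.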

Second stage

usage: python solutions/trend_analysis/trend_analysis.py [-h] [--top_n TOP_N] [--top_vectors TOP_VECTORS]
                     target_topics ref_topics

positional arguments:
  target_topics         a path to a csv topic-list extracted from the target
                        corpus
  ref_topics            a path to a csv topic-list extracted from the
                        reference corpus

optional arguments:
  -h, --help            show this help message and exit
  --top_n TOP_N         compare only top N topics (default: 10000)
  --top_vectors TOP_VECTORS
                        include only top N vectors in the scatter graph
                        (default: 500)

The input to the second stage is the pair of topic lists produced by the first stage (topic extraction). The analysis results are saved into the data folder and are used by the UI in the last stage.
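
The two stages can be chained from a small driver script. The sketch below only builds the command lines, using the script paths and options listed in the usage text above. The corpus folders and topic csv names are placeholders, since the names of the csv files written by the first stage are not stated here.

```python
import shlex
import sys

TOPIC_EXTRACTION = "solutions/trend_analysis/topic_extraction.py"
TREND_ANALYSIS = "solutions/trend_analysis/trend_analysis.py"


def build_commands(target_corpus, ref_corpus, target_topics, ref_topics,
                   top_n=10000, top_vectors=500):
    """Return the argument lists for the first and second stage, in order."""
    extract = [sys.executable, TOPIC_EXTRACTION, target_corpus, ref_corpus]
    analyze = [
        sys.executable, TREND_ANALYSIS,
        "--top_n", str(top_n),
        "--top_vectors", str(top_vectors),
        target_topics, ref_topics,
    ]
    return [extract, analyze]


# Placeholder paths; substitute the real corpus folders and the csv files
# that the first stage actually writes.
commands = build_commands(
    "corpora/current_month", "corpora/last_month",
    "target_topics.csv", "ref_topics.csv",
    top_n=5000,
)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the stages, each list can be passed to subprocess.run(cmd, check=True) from the repository root, so that a failure in the first stage stops the second one from running.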

UI stage

To visualize the analysis results, run:

python solutions/start_ui.py --solution trend_analysis

You can also run the UI as a server using --address and --port, for example:

python solutions/start_ui.py --solution trend_analysis --address=12.13.14.15 --port=1010

and then access it through a browser: http://12.13.14.15:1010/ui