Local Topic Discovery via Boosted Ensemble of Nonnegative Matrix Factorization


Sangho Suh1, Jaegul Choo1,
Joonseok Lee2, Chandan K. Reddy3

1Korea University, 2Google Research, 3Virginia Tech
August 19-25, 2017 @ Melbourne, Australia

*Invited to Sister Conference Best Paper Track as the ICDM'16 Best Student Paper

Motivation

Topic Modeling
Global Topic Discovery
Global vs. Local Topics
Motivation Example

Second Motivation Example

Sampled topics from papers containing keywords, 'dimension' or 'reduction'

Problem

Existing topic modeling algorithms provide users with
global topics that give general, redundant information

Proposed Idea

Local topic discovery to extract
more specific, informative topics

Proposed Method
(Intuition)


For local topic discovery,
1) Iterative Topic Modeling on Residual Matrix
-> Ensemble
2) Boost & Suppress -> Local weighting scheme

Localized Ensemble of
Nonnegative Matrix Factorization (L-EnsNMF)

Proposed Method
(Details)

Overview of L-EnsNMF

1) NMF Topic Modeling
-> Find a set of topics

2) Residual Update
-> Identify unexplained parts (e.g., Egyptian cat)

3) Anchor Sampling & Local Weighting
-> Reveal unexplained parts and suppress explained parts

Nonnegative Matrix Factorization (NMF)
Topic Modeling
where [·]+ converts every negative element to zero

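A minimal sketch of NMF for topic modeling, assuming the standard Frobenius-norm objective (approximate a nonnegative keyword-by-document matrix X by W H with W, H >= 0) and Lee-Seung multiplicative updates. The slides do not state which solver is used, so the solver, the function name `nmf`, and the parameters `n_iter`, `eps`, and `seed` are illustrative assumptions.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (keywords x documents) as W @ H.

    Uses multiplicative updates for the Frobenius objective; this solver is
    an assumption, not necessarily the one behind the slides.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps   # keyword-by-topic weights
    H = rng.random((k, n)) + eps   # topic-by-document weights
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy keyword-by-document matrix with two obvious keyword groups.
X = np.array([[3.0, 2.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
W, H = nmf(X, k=2)
# Indices of the two highest-weighted keywords for each topic.
top_keywords = np.argsort(-W, axis=0)[:2].T
```

Each column of W can be read as a topic (a weighting over keywords), and each column of H as the topic mixture of one document.
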
Residual Update

Identify unexplained parts using residual matrix, R

Set 'overexplained' parts (i.e., negative values) to zero

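The residual update can be sketched as follows, assuming it takes the form R_next = [R - W H]+, which matches the description on this slide. The function name `update_residual` and the toy matrices are illustrative.

```python
import numpy as np

def update_residual(R, W, H):
    """Subtract the explained part W @ H from R and clip negatives to zero."""
    return np.maximum(R - W @ H, 0.0)

R = np.array([[2.0, 0.0],
              [1.0, 3.0]])
W = np.array([[1.0],
              [1.0]])
H = np.array([[1.0, 1.0]])
# R - W @ H = [[1, -1], [0, 2]]; the -1 entry is "overexplained" and becomes 0.
R_next = update_residual(R, W, H)
```
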
Anchor Sampling

Sample an unexplained document (column) and keyword (row)

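One plausible way to sample anchors is sketched below. It assumes that rows and columns are drawn with probability proportional to their total mass in the residual matrix, so that poorly explained keywords and documents are picked more often. The slides do not give the exact distribution; this choice and the name `sample_anchors` are assumptions.

```python
import numpy as np

def sample_anchors(R, rng):
    """Sample one keyword (row) and one document (column) index from R.

    Assumption: indices are drawn with probability proportional to the row
    and column sums of the nonnegative residual R. R must contain at least
    one positive entry.
    """
    row_mass = R.sum(axis=1)
    col_mass = R.sum(axis=0)
    row = rng.choice(R.shape[0], p=row_mass / row_mass.sum())
    col = rng.choice(R.shape[1], p=col_mass / col_mass.sum())
    return int(row), int(col)

rng = np.random.default_rng(0)
# Only row 1 and columns 1 and 2 still carry unexplained mass.
R = np.array([[0.0, 0.0, 0.0],
              [0.0, 5.0, 1.0],
              [0.0, 0.0, 0.0]])
anchor_row, anchor_col = sample_anchors(R, rng)
```
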
Local Weighting

Reveal local topics and suppress global topics
by localizing the residual matrix, R, to obtain RL

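A sketch of local weighting under a stated assumption: every row and column of R is scaled by its cosine similarity to the anchor keyword and anchor document, measured on the original matrix X. This gives RL (written `R_L` in the code) as a row-and-column reweighting of R. The similarity measure and the names `cosine_to` and `localize` are illustrative, not taken from the slides.

```python
import numpy as np

def cosine_to(vectors, anchor, eps=1e-12):
    """Cosine similarity between each row of `vectors` and `anchor`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(anchor)
    return (vectors @ anchor) / (norms + eps)

def localize(R, X, anchor_row, anchor_col):
    """Weight R so that parts similar to the anchors are kept and the
    remaining parts are suppressed."""
    row_w = cosine_to(X, X[anchor_row])         # one weight per keyword
    col_w = cosine_to(X.T, X[:, anchor_col])    # one weight per document
    return row_w[:, None] * R * col_w[None, :]

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])
# In the first stage the residual equals X itself.
R_L = localize(X, X, anchor_row=2, anchor_col=2)
```

In this toy case the dominant block shared by keywords 0 and 1 is zeroed out, and only the entry around the anchor survives.
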
Ensemble

Use RL as an input to NMF topic modeling in the next stage
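The stages can be chained as in the sketch below. It is a simplified illustration, not the authors' implementation: it assumes multiplicative-update NMF, anchor sampling proportional to residual mass, cosine-based local weighting, and two topics per stage. The name `lens_nmf_sketch` and all parameters are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=100, eps=1e-10, seed=0):
    """Multiplicative-update NMF (assumed solver)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cosine_to(V, a, eps=1e-12):
    """Cosine similarity between each row of V and the vector a."""
    return (V @ a) / (np.linalg.norm(V, axis=1) * np.linalg.norm(a) + eps)

def lens_nmf_sketch(X, n_stages, k_per_stage=2, seed=0):
    """Run NMF stage by stage on localized residuals and stack the topics."""
    rng = np.random.default_rng(seed)
    R = X.copy()
    Ws, Hs = [], []
    for stage in range(n_stages):
        total = R.sum()
        if total <= 0:          # nothing left to explain
            break
        # Anchor sampling: favor rows/columns with large unexplained mass.
        row = rng.choice(R.shape[0], p=R.sum(axis=1) / total)
        col = rng.choice(R.shape[1], p=R.sum(axis=0) / total)
        # Local weighting: keep parts similar to the anchors.
        R_L = cosine_to(X, X[row])[:, None] * R * cosine_to(X.T, X[:, col])[None, :]
        # NMF on the localized residual yields this stage's topics.
        W, H = nmf(R_L, k_per_stage, seed=seed + stage)
        Ws.append(W)
        Hs.append(H)
        # Residual update: remove what this stage explained.
        R = np.maximum(R - W @ H, 0.0)
    return np.hstack(Ws), np.vstack(Hs)

X_demo = np.random.default_rng(1).random((6, 5))
W_all, H_all = lens_nmf_sketch(X_demo, n_stages=3)
```

The columns of `W_all` are the topics collected over all stages, which is what makes the result an ensemble.
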

Why NMF on Residual Matrices (1)

Synthetic Example

Why NMF on Residual Matrices (2)

Synthetic Example

A deflation-based method helps to reveal
highly non-redundant, diverse topics

Fast rank-2 NMF (Speed)

Exhaustive search for an
optimal active/passive set partitioning
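With only two topics, each nonnegative least-squares subproblem has two unknowns, so every active/passive partition can be checked directly. The sketch below illustrates that idea for a single right-hand side. The name `nnls_rank2` is hypothetical, and the code is a plain enumeration rather than the optimized routine the slide refers to.

```python
import numpy as np

def nnls_rank2(B, y):
    """Solve min_{g >= 0} ||B g - y||_2 for a matrix B with exactly 2 columns
    by trying every active/passive partition of the two variables.

    Assumes both columns of B are nonzero.
    """
    b1, b2 = B[:, 0], B[:, 1]
    candidates = [np.zeros(2)]                                        # both fixed at 0
    candidates.append(np.array([max(b1 @ y / (b1 @ b1), 0.0), 0.0]))  # only g1 free
    candidates.append(np.array([0.0, max(b2 @ y / (b2 @ b2), 0.0)]))  # only g2 free
    g_free, *_ = np.linalg.lstsq(B, y, rcond=None)                    # both free
    if np.all(g_free >= 0):
        candidates.append(g_free)
    # All candidates are feasible; keep the one with the smallest residual.
    return min(candidates, key=lambda g: np.linalg.norm(B @ g - y))

B = np.array([[1.0, 1.0],
              [0.0, 1.0]])
g_interior = nnls_rank2(B, np.array([2.0, 1.0]))  # unconstrained solution is feasible
g_boundary = nnls_rank2(B, np.array([0.0, 1.0]))  # unconstrained solution has g1 < 0
```

Because the optimum of a nonnegative least-squares problem is the unconstrained solution on its passive set, it is always among these four candidates.
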

Evaluation

Datasets

  1. Reuters: Articles from the Reuters newswire in 1987
  2. 20 Newsgroups (20News): Newsgroup documents from
    Usenet newsgroups
  3. Enron: 2,000 randomly sampled emails from
    Enron Corporation
  4. IEEE-Vis (VisPub): Academic papers published in
    IEEE Visualization conferences
  5. Twitter: 2,000 randomly selected tweets generated from
    New York City in June 2013

Baseline Methods

  1. Standard NMF (StdNMF)
  2. Sparse NMF (SprsNMF)
  3. Orthogonal NMF (OrthNMF)
  4. Latent Dirichlet Allocation (LDA)

Quantitative
Evaluation

Topic Coherence

L-EnsNMF generates high-quality topics regardless of
the number of topics and the dataset

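For readers unfamiliar with the measure, here is a sketch of one common topic coherence score: the average pointwise mutual information (PMI) over keyword pairs, using document co-occurrence. The slides do not spell out the formula, so the PMI form, the `eps` smoothing, the name `pmi_coherence`, and the toy documents are all assumptions.

```python
import math
from itertools import combinations

def pmi_coherence(topic_keywords, documents, eps=1e-12):
    """Average PMI over all keyword pairs of one topic.

    `documents` is a list of sets of words; probabilities are document
    frequencies. The eps smoothing is an illustrative choice.
    """
    n_docs = len(documents)

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in documents) / n_docs

    scores = [math.log((p(a, b) + eps) / (p(a) * p(b) + eps))
              for a, b in combinations(topic_keywords, 2)]
    return sum(scores) / len(scores)

# Toy corpus, for illustration only.
docs = [{"hurricane", "sandy", "storm"},
        {"hurricane", "sandy"},
        {"football", "ireland"},
        {"football", "storm"}]
coherent = pmi_coherence(["hurricane", "sandy"], docs)
incoherent = pmi_coherence(["hurricane", "football"], docs)
```

Keywords that tend to appear in the same documents score higher than keywords that never co-occur.
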
Total Document Coverage

Topics from L-EnsNMF become increasingly diverse
as the number of topics increases

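A sketch of a document coverage measure, assuming coverage means the fraction of documents that contain at least `min_hits` top keywords of at least one topic. The exact definition and thresholds are not given on the slide; the name `total_document_coverage`, the parameter `min_hits`, and the toy data are illustrative.

```python
def total_document_coverage(topics, documents, min_hits=2):
    """Fraction of documents covered by at least one topic.

    A document counts as covered when it contains at least `min_hits`
    of a topic's keywords (assumed definition).
    """
    covered = sum(
        any(len(doc & set(topic)) >= min_hits for topic in topics)
        for doc in documents
    )
    return covered / len(documents)

# Toy corpus, for illustration only.
docs = [{"hurricane", "sandy", "storm"},
        {"football", "ireland", "team"},
        {"coffee", "morning"}]
one_topic = total_document_coverage([["hurricane", "sandy"]], docs)
two_topics = total_document_coverage(
    [["hurricane", "sandy"], ["football", "ireland"]], docs)
```

Adding a second, non-redundant topic raises coverage, whereas a near-duplicate of the first topic would leave it unchanged.
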
Computing Time

L-EnsNMF is the fastest and most scalable of the compared methods

Qualitative
Topic Example

Local Topic Discovery

We generated 100 topics (10 keywords each) with each method, but only L-EnsNMF extracted local, specific keywords,
e.g., ‘hurrican’, ‘sandi’, and ‘ireland’.

Qualitative Experiment

Dataset: Twitter (New York City in June 2013)

Local Topic Discovery

The Ireland football team visited New York City in June 2013
to boost a community hit by Hurricane Sandy in 2012

Example for Qualitative Experiment

User-Specified Anchor Selection

Select a user-specified document (column) and keyword (row)

User-Driven Local Weighting

Reveal user-specified documents/topics

iL-EnsNMF: User-Driven Topic Discovery


Future Work

Interactive Topic Discovery System

UTOPIAN
Steer the local weighting process to reflect
the user's subjective interests and task goals

Conclusion

Summary

  • L-EnsNMF discovers local, focused topics of interest to users
  • Compared to existing topic modeling algorithms, it generates topics of higher quality and higher document coverage, and does so faster

Thank you

Questions?

E-mail: jchoo@korea.ac.kr

Code: https://github.com/sanghosuh/lens_nmf-matlab
