L-EnsNMF

Local Topic Discovery via Boosted Ensemble of Nonnegative Matrix Factorization

Sangho Suh¹, Jaegul Choo¹,
Joonseok Lee², Chandan K. Reddy³

¹Korea University, ²Google Research, ³Virginia Tech
August 19-25, 2017 @ Melbourne, Australia

*Invited to Sister Conference Best Paper Track as the ICDM'16 Best Student Paper

Motivation

Topic Modeling: Global Topic Modeling

Global Topic Discovery: Global Topic Modeling

Global vs. Local Topic

Motivation Example

Second Motivation Example

Sampled topics from papers containing keywords, 'dimension' or 'reduction'

Problem

Existing topic modeling algorithms provide users with
global topics that give general, redundant information

Proposed Idea

Local topic discovery to extract
more specific, informative topics

Proposed Method
(Intuition)

For local topic discovery,
1) Iterative topic modeling on residual matrix -> Ensemble
2) Boost & suppress -> Local weighting scheme

Localized Ensemble of
Nonnegative Matrix Factorization (L-EnsNMF)

Proposed Method
(Details)

Overview of L-EnsNMF

1) NMF Topic Modeling
-> Find a set of topics
2) Residual Update
-> Identify unexplained parts (e.g., Egyptian cat)
3) Anchor Sampling & Local Weighting
-> Reveal unexplained parts and suppress explained parts

Nonnegative Matrix Factorization (NMF)
Topic Modeling
NMF for topic modeling

where [·]₊ converts every negative element to zero
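
A minimal sketch of NMF for topic modeling, assuming the usual Frobenius-norm objective (approximate a keyword-by-document matrix X by W H with W, H >= 0) and Lee-Seung multiplicative updates. That solver is a generic stand-in chosen for brevity, not the fast rank-2 solver used in the talk; the function name and toy matrix are illustrative.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate X (keywords x documents) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))   # keyword-by-topic weights
    H = rng.random((k, n))   # topic-by-document weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy corpus: keywords 0-1 co-occur in documents 0-1, keywords 2-3 in documents 2-3.
X = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 4., 1.],
              [0., 0., 1., 4.]])
W, H = nmf(X, k=2)
```

Each column of W can be read as a topic: its largest entries point to the topic's top keywords.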

Residual Update

Identify unexplained parts using residual matrix, R

'overexplained' parts (i.e., negative values) => zero
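
The residual update can be sketched as follows, assuming the new residual is the previous residual minus the product W H of the current stage's factors, with negative ('overexplained') entries clipped to zero. The function name and the small matrices are illustrative.

```python
import numpy as np

def residual_update(R, W, H):
    """Subtract the explained part W @ H and set negative entries to zero."""
    return np.maximum(R - W @ H, 0.0)

R = np.array([[2., 1.],
              [0., 3.]])
W = np.array([[1.],
              [0.]])
H = np.array([[3., 1.]])

# R - W @ H = [[-1, 0], [0, 3]]; the -1 is 'overexplained' and becomes 0.
R_next = residual_update(R, W, H)
```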

Anchor Sampling

Sample an unexplained document (column) and keyword (row)
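
One way to sketch anchor sampling, assuming a keyword row and a document column are each drawn with probability proportional to the residual mass they still carry. The slide does not state the sampling distribution, so this choice and the function name are assumptions.

```python
import numpy as np

def sample_anchors(R, rng):
    """Draw an anchor row and column, favoring those with large residual mass."""
    row_mass = R.sum(axis=1)
    col_mass = R.sum(axis=0)
    total = R.sum()
    if total <= 0:
        raise ValueError("residual is empty; nothing left to explain")
    a_r = rng.choice(R.shape[0], p=row_mass / total)
    a_c = rng.choice(R.shape[1], p=col_mass / total)
    return int(a_r), int(a_c)

# Row 0 and column 0 are fully explained, so they can never be sampled.
R = np.array([[0., 0., 0.],
              [0., 5., 1.],
              [0., 2., 0.]])
a_r, a_c = sample_anchors(R, np.random.default_rng(0))
```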

Local Weighting

Reveal local topics & suppress global topics

by localizing the residual matrix R to obtain R_L
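
A sketch of local weighting, assuming each row and column of R is scaled by its cosine similarity to the anchor row and anchor column, so that entries related to the anchors are kept and unrelated ones are suppressed. Using cosine similarity computed on the original matrix X is an assumption here, and the function names are illustrative.

```python
import numpy as np

def cosine_to_anchor(M, idx, eps=1e-12):
    """Cosine similarity between every row of M and row idx."""
    norms = np.linalg.norm(M, axis=1) + eps
    return (M @ M[idx]) / (norms * norms[idx])

def local_weighting(R, X, a_r, a_c):
    """Return R_L: R with rows and columns weighted by similarity to the anchors."""
    d_r = cosine_to_anchor(X, a_r)     # one weight per keyword (row)
    d_c = cosine_to_anchor(X.T, a_c)   # one weight per document (column)
    return d_r[:, None] * R * d_c[None, :]

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [0., 0., 2.]])
# Anchors in the first block: the unrelated third keyword/document is zeroed out.
R_L = local_weighting(X, X, a_r=0, a_c=0)
```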

Ensemble

Use R_L as an input to NMF topic modeling in the next stage
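
Putting the stages together, the ensemble loop can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: multiplicative-update NMF stands in for the fast rank-2 solver, anchors are sampled in proportion to residual mass, weights are cosine similarities on X, and two topics are extracted per stage. All function and parameter names are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=100, seed=0, eps=1e-10):
    """Generic multiplicative-update NMF (stand-in solver)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cosine_to_anchor(M, idx, eps=1e-12):
    """Cosine similarity between every row of M and row idx."""
    norms = np.linalg.norm(M, axis=1) + eps
    return (M @ M[idx]) / (norms * norms[idx])

def l_ens_nmf_sketch(X, n_stages, k_per_stage=2, seed=0):
    rng = np.random.default_rng(seed)
    R = X.astype(float).copy()          # stage 1 starts from the full matrix
    Ws, Hs = [], []
    for stage in range(n_stages):
        total = R.sum()
        if total <= 0:                  # everything is explained
            break
        # Anchor sampling: favor rows/columns with remaining residual mass.
        a_r = rng.choice(R.shape[0], p=R.sum(axis=1) / total)
        a_c = rng.choice(R.shape[1], p=R.sum(axis=0) / total)
        # Local weighting: boost entries similar to the anchors, suppress the rest.
        R_L = (cosine_to_anchor(X, a_r)[:, None] * R
               * cosine_to_anchor(X.T, a_c)[None, :])
        # NMF topic modeling on the localized residual.
        W, H = nmf(R_L, k_per_stage, seed=seed + stage)
        Ws.append(W)
        Hs.append(H)
        # Residual update: remove what this stage explained, clip negatives.
        R = np.maximum(R - W @ H, 0.0)
    return np.hstack(Ws), np.vstack(Hs), R

X = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 4., 1.],
              [0., 0., 1., 4.]])
W, H, R = l_ens_nmf_sketch(X, n_stages=2)
```

The topics from all stages are concatenated, so the final topic count is the number of stages times the topics per stage.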

Why NMF on Residual Matrices (1)

Synthetic Example

Why NMF on Residual Matrices (2)

Synthetic Example

Deflation-based method helps to reveal
highly non-redundant, diverse topics

Fast rank-2 NMF (Speed)

Exhaustive search for an
optimal active/passive set partitioning
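
With only two topics per stage, each nonnegative least-squares subproblem has two unknowns, so every active/passive partition can be checked directly. The sketch below solves that subproblem for a single column x. It assumes this is the idea behind the slide; the function name is hypothetical and the vectorized solver used in practice is not reproduced.

```python
import numpy as np

def nnls_rank2(W, x):
    """Solve min_{h >= 0} ||W h - x||_2 for a two-column W by exhaustive search."""
    G = W.T @ W                  # 2x2 Gram matrix
    b = W.T @ x
    # Case 1: both variables passive (free). Optimal if the solution is feasible.
    det = G[0, 0] * G[1, 1] - G[0, 1] ** 2
    if det > 1e-12:
        h = np.linalg.solve(G, b)
        if np.all(h >= 0):
            return h
    # Cases 2-4: one variable active (fixed at zero), or both.
    candidates = [np.zeros(2)]
    for j in (0, 1):
        if G[j, j] > 0:
            h = np.zeros(2)
            h[j] = max(b[j] / G[j, j], 0.0)
            candidates.append(h)
    return min(candidates, key=lambda h: np.linalg.norm(W @ h - x))

W2 = np.array([[1., 1.],
               [0., 1.]])
# Unconstrained solution is [2, -1], which is infeasible; the search falls back to [1, 0].
h_clipped = nnls_rank2(W2, np.array([1., -1.]))
# Unconstrained solution [1, 2] is already nonnegative and is returned as is.
h_free = nnls_rank2(W2, np.array([3., 2.]))
```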

Evaluation

Datasets

Reuters: Articles from the Reuters newswire in 1987
20 Newsgroups (20News): Newsgroup documents from
Usenet newsgroups
Enron: 2,000 randomly sampled emails from
Enron Corporation
IEEE-Vis (VisPub): Academic papers published in
IEEE Visualization conferences
Twitter: 2,000 randomly selected tweets generated from
New York City in June 2013

Baseline Methods

Standard NMF (StdNMF)
Sparse NMF (SprsNMF)
Orthogonal NMF (OrthNMF)
Latent Dirichlet Allocation (LDA)

Quantitative
Evaluation

Topic Coherence

L-EnsNMF generates high-quality topics regardless of
the number of topics and the dataset

Total Document Coverage

Topics by L-EnsNMF become more and more diverse
as the number of topics increases
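
A sketch of how a document coverage score could be computed, assuming a document counts as covered when it contains at least `min_hits` of the `top_k` highest-weighted keywords of some topic. The slide does not give the definition, so the thresholds and the function name are assumptions.

```python
import numpy as np

def total_document_coverage(X, W, top_k=2, min_hits=2):
    """Fraction of documents covered by at least one topic's top keywords."""
    present = X > 0                                  # keywords x documents
    covered = np.zeros(X.shape[1], dtype=bool)
    for t in range(W.shape[1]):
        top = np.argsort(W[:, t])[::-1][:top_k]      # top keywords of topic t
        hits = present[top].sum(axis=0)              # top keywords found per document
        covered |= hits >= min_hits
    return covered.mean()

X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
W_one = np.array([[0.9], [0.8], [0.1], [0.0]])
W_two = np.array([[0.9, 0.0], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]])

cov_one = total_document_coverage(X, W_one)   # only document 0 is covered
cov_two = total_document_coverage(X, W_two)   # the second topic also covers document 1
```

In this toy case, adding a topic with different top keywords raises coverage, which is the behavior the slide attributes to diverse topics.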

Computing Time

L-EnsNMF is the fastest and most scalable of the compared methods

Qualitative
Topic Example

Local Topic Discovery

We generated 100 topics (10 keywords each) with each method, but only L-EnsNMF extracted local, specific keywords,
e.g., ‘hurrican’, ‘sandi’, and ‘ireland’.

Dataset: Twitter (New York City in June 2013)

Local Topic Discovery

The Ireland football team visited New York City in June 2013
to boost a community hit by Hurricane Sandy in 2012

User-Specified Anchor Selection

Select a user-specified document (column) and keyword (row)

User-Driven Local Weighting

Reveal user-specified documents/topics

iL-EnsNMF: User-Driven Topic Discovery

Future Work

Interactive Topic Discovery System

UTOPIAN

Steer local weighting process to reflect
user's subjective interest and task goals

Conclusion

Summary

L-EnsNMF discovers local, focused topics of interest to users
Compared to existing topic modeling algorithms, it generates topics of higher quality and higher document coverage, and it runs faster

Thank you

Questions?

E-mail: jchoo@korea.ac.kr

Code: https://github.com/sanghosuh/lens_nmf-matlab
