KDD'20 Tutorial: Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis

Yu Meng, Jiaxin Huang, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: Aug 23, 2020 8:00 AM - 12:00 PM (PST)

Abstract

People nowadays are immersed in a wealth of text data, ranging from news articles and social media to academic publications, advertisements, and economic reports. A grand challenge of data mining is to develop effective, scalable, and weakly-supervised methods for extracting actionable structures and knowledge from massive text data. Without requiring extensive, corpus-specific human annotations, these methods will serve people’s diverse applications and needs for comprehending and making good use of large-scale corpora.

In this tutorial, we will introduce recent advances in text embeddings and their applications to a wide range of text mining tasks that facilitate multi-dimensional analysis of massive text corpora. Specifically, we first overview a set of recently developed unsupervised and weakly-supervised text embedding methods including state-of-the-art context-free embeddings and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several embedding-driven text mining techniques that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge, in the form of multi-dimensional topics and multi-faceted taxonomies, from large-scale text corpora. We finally show that the topics and taxonomies so discovered will naturally form a multi-dimensional TextCube structure, which greatly enhances text exploration and analysis for various important applications, including text classification, retrieval and summarization. We will demonstrate on the most recent real-world datasets (including political news articles as well as scientific publications related to the coronavirus) how multi-dimensional analysis of massive text corpora can be conducted with the introduced embedding-driven text mining techniques.

Slides

Introduction [Slides]
Part I: Overview of Text Embedding Methods [Slides]
Part II: Multi-faceted Taxonomy Construction [Slides]
Part III: Multi-Dimensional Topic Mining [Slides]
Part IV: Embedding-Driven Multi-Dimensional Text Analysis [Slides]
Summary: Overview of Text Embedding Methods [Slides]

Presenters

Yu Meng, Ph.D. student, Computer Science, UIUC. His research focuses on mining structured knowledge from massive text corpora with minimal human supervision.

Jiaxin Huang, Ph.D. student, Computer Science, UIUC. Her research focuses on mining structured knowledge from massive text corpora. She is a recipient of the Chirag Foundation Graduate Fellowship in Computer Science.

Jiawei Han, Michael Aiken Chair Professor, Computer Science, UIUC. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 800 research publications. He is a Fellow of ACM and a Fellow of IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials or keynote speeches.
