KDD'21 Tutorial: On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining

Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: Aug 14, 2021, 9:00 AM - 12:00 PM (SGT) / Aug 13, 2021, 6:00 PM - 9:00 PM (PDT)

Abstract

Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding approaches represent each word as a fixed low-dimensional vector that captures its semantics; the learned embeddings are then used as input features for task-specific models. More recently, pre-trained language models (PLMs), which learn universal language representations by pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications because they only need to be fine-tuned on the target corpus instead of being trained from scratch.
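To make the fine-tuning paradigm concrete, below is a minimal sketch (not part of the tutorial materials) that adapts a pre-trained BERT encoder to a toy binary classification task with the HuggingFace Transformers library; the model name, data, and hyperparameters are illustrative placeholders, not the methods presented in the tutorial.

    # Minimal sketch: fine-tune a pre-trained language model for text classification
    # (illustrative only; model, data, and hyperparameters are placeholders).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a pre-trained encoder and attach a fresh classification head.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Toy labeled examples standing in for the target corpus.
    texts = ["the movie was great", "the plot was a mess"]
    labels = torch.tensor([1, 0])

    # Tokenize once and fine-tune the whole model end-to-end for a few steps.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for _ in range(3):
        outputs = model(**batch, labels=labels)  # cross-entropy loss on the pooled [CLS] representation
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Because all Transformer layers start from pre-trained weights, only a small amount of task-specific supervision is needed, in contrast to training a task-specific model from scratch.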

In this tutorial, we will introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first give an overview of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the foundation for downstream tasks. We then present several new methods, built on pre-trained text embeddings and language models, for text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. We will demonstrate on real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analysis.

Slides

  • Introduction [Slides]
  • Part I: Text Embedding and Language Models [Slides]
  • Part II: Revisiting Text Mining Fundamentals with Pre-Trained Language Models [Slides]
  • Part III: Embedding-Driven Topic Discovery [Slides]
  • Part IV: Weakly-Supervised Text Classification: Embeddings with Less Human Effort [Slides]
  • Part V: Advanced Text Mining Applications Empowered by Embeddings [Slides]

Presenters

Yu Meng, Ph.D. student, Computer Science, UIUC. His research focuses on mining structured knowledge from massive text corpora with minimal human supervision. He received the Google PhD Fellowship (2021) in Structured Data and Database Management.

Jiaxin Huang, Ph.D. student, Computer Science, UIUC. Her research focuses on mining structured knowledge from massive text corpora. She received the Microsoft Research PhD Fellowship (2021) and the Chirag Foundation Graduate Fellowship (2018) in Computer Science, UIUC.

Yu Zhang, Ph.D. student, Computer Science, UIUC. His research focuses on weakly supervised text mining with structural information. He received the WWW’18 Best Poster Award Honorable Mention.

Jiawei Han, Michael Aiken Chair Professor, Computer Science, UIUC. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 800 research publications. He is a Fellow of the ACM and a Fellow of the IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials and keynote speeches.