WWW'23 Tutorial: Turning Web-Scale Texts to Knowledge: Transferring Pretrained Representations to Text Mining Applications

Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: April 30, 2023 11:00 AM - 12:30 PM (CT)


Textual data are ubiquitous and massive on the web: news reports, social media posts, Wikipedia articles, etc. are being created and updated online every day. While they contain rich information and knowledge, it has remained an open challenge to effectively leverage them in text-intensive applications. Recent developments in pretrained language models (PLMs) have revolutionized text mining and processing: By pretraining neural architectures on large-scale text corpora obtained from the web and then transferring their representations to task-specific data, the knowledge encoded in the web-scale corpora can be effectively leveraged to significantly enhance downstream task performance. The most common approach to adapting PLMs is the pretrain-finetune paradigm, where the PLMs are further trained on labeled data from the downstream task. However, the major challenge of this paradigm is that fully-supervised fine-tuning of PLMs usually requires abundant human annotations, which can be expensive to acquire in practice.

In this tutorial, we will introduce recent advances in pretrained text representations learned from web-scale corpora, as well as their applications to a wide range of text mining tasks. We focus on weakly-supervised approaches that do not require massive human annotations, including (1) pretrained language models that serve as the foundation for downstream tasks, (2) unsupervised and seed-guided methods for topic discovery from massive text corpora, and (3) weakly-supervised methods for text classification and advanced text mining tasks.


  • Introduction [Slides]
  • Part I: Pretrained Language Models [Slides]
  • Part II: Embedding-Driven Topic Discovery [Slides]
  • Part III: Weakly-Supervised Text Classification [Slides]


Yu Meng, Ph.D. student, Computer Science, UIUC. His research focuses on mining structured knowledge from massive text corpora with minimum human supervision. He received the Google PhD Fellowship (2021) in Structured Data and Database Management.

Jiaxin Huang, Ph.D. student, Computer Science, UIUC. Her research focuses on mining structured knowledge from massive text corpora. She received the Microsoft Research PhD Fellowship (2021) and the Chirag Foundation Graduate Fellowship (2018) in Computer Science, UIUC.

Yu Zhang, Ph.D. student, Computer Science, UIUC. His research focuses on weakly supervised text mining with structural information. He received the Yunni and Maxine Pao Memorial Fellowship (2022) and WWW Best Poster Award Honorable Mention (2018).

Jiawei Han, Michael Aiken Chair Professor, Computer Science, UIUC. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 800 research publications. He is a Fellow of ACM and a Fellow of IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials or keynote speeches.