AAAI'22 Tutorial: Pre-Trained Language Representations for Text Mining

Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: Feb 23, 2022 2:00 PM – 6:00 PM (PST)

Abstract

This tutorial introduces recent advances in pre-trained text embeddings and language models (e.g., BERT and GPT), as well as their applications to a wide range of text mining tasks. Attendees will be given a systematic introduction to (1) the development of pre-trained text representation learning, (2) how pre-trained models effectively empower fundamental text mining applications, and (3) new techniques and approaches to tame pre-trained text representations for text mining tasks with few human annotations. The target audience includes researchers and practitioners interested in artificial intelligence (AI) and machine learning (ML) technologies for natural language and data mining applications using state-of-the-art pre-trained language models. Attendees will learn not only the background and history of text representation learning and text mining, but also the most recent models and methods along with their applications. Our tutorial places a special focus on weakly-supervised text mining methods that require minimal human effort for model learning. We will also demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analysis.
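
As a concrete illustration of what these pre-trained models offer out of the box, the minimal sketch below (ours, not part of the tutorial materials) queries an off-the-shelf BERT masked language model through the Hugging Face transformers library; the model name and prompt are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the tutorial): querying a
# pre-trained masked language model. Assumes the Hugging Face
# `transformers` library is installed; the model name and example
# sentence are our own choices.
from transformers import pipeline

# The masked-language-modeling head comes directly from pre-training;
# no task-specific fine-tuning happens here.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The top predictions for [MASK] reflect contextual knowledge the model
# acquired during pre-training on large unlabeled corpora.
for pred in fill_mask("Text mining extracts [MASK] from large corpora."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```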

The target audience includes AI practitioners who may have a high-level idea of pre-trained language models but are not necessarily aware of the many challenging facets of applying them to text mining. The novelty of the tutorial lies in translating paradigms from different research communities into a common AI framing, so that the broader ML/AI community can benefit. The tutorial is self-contained and expects no prerequisites; an audience with basic knowledge of AI/ML should be able to follow most of the material.
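
To make the weakly-supervised theme concrete before diving into the materials: with zero labeled documents, a pre-trained masked language model can already score candidate class label words in a cloze-style prompt. The toy sketch below is our own illustration of this general idea, with hypothetical label words and prompt; it is not the presenters' method.

```python
# Toy sketch (our assumptions, not the presenters' method): classifying a
# document given only class label words and a pre-trained masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

label_words = ["sports", "politics", "business"]  # hypothetical classes
document = "The team clinched the championship with a last-minute goal."

# Prompt the model so the class name should appear at the [MASK] slot.
text = f"{document} This article is about [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Restrict attention to the label words and compare their relative scores;
# the highest-scoring word is taken as the predicted class.
label_ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words]
probs = logits[label_ids].softmax(dim=-1)
for word, p in zip(label_words, probs.tolist()):
    print(f"{word:>10}: {p:.3f}")
```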

Tutorial Recording

A recording of our tutorial is available on Google Drive and Dropbox.

Slides

  • Introduction [Slides]
  • Part I: Pre-Trained Language Models [Slides]
  • Part II: Revisiting Text Mining Fundamentals with Pre-Trained Language Models [Slides]
  • Part III: Embedding-Driven Topic Discovery [Slides]
  • Part IV: Weakly-Supervised Text Classification: Embeddings with Less Human Effort [Slides]
  • Part V: Advanced Text Mining Applications Empowered by Pre-Trained Language Models [Slides]

Presenters

Yu Meng, Ph.D. student, Computer Science, UIUC. His research focuses on mining structured knowledge from massive text corpora with minimal human supervision. He received the Google PhD Fellowship (2021) in Structured Data and Database Management.

Jiaxin Huang, Ph.D. student, Computer Science, UIUC. Her research focuses on mining structured knowledge from massive text corpora. She received the Microsoft Research PhD Fellowship (2021) and the Chirag Foundation Graduate Fellowship (2018) in Computer Science, UIUC.

Yu Zhang, Ph.D. student, Computer Science, UIUC. His research focuses on weakly supervised text mining with structural information. He received the WWW’18 Best Poster Award Honorable Mention.

Jiawei Han, Michael Aiken Chair Professor, Computer Science, UIUC. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 800 research publications. He is a Fellow of ACM and a Fellow of IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials and keynote speeches.