GraRep: Boosting Text Mining with Graphs

Half-day tutorial at CIKM 2018, Friday, 26 October 2018

Michalis Vazirgiannis

LIX, Ecole Polytechnique, France & AUEB, Greece

Michalis Vazirgiannis is a Professor at LIX, Ecole Polytechnique in France. He has conducted research at GMD-IPSI and Max Planck MPI (Germany) and at INRIA/FUTURS (Paris). He has taught at AUEB (Greece), Ecole Polytechnique, Telecom-Paristech, ENS (France), Tsinghua (China) and Deusto University (Spain). His current research interests are in machine learning and combinatorial methods for graph analysis (including community detection, graph clustering and embeddings, and influence maximization), and in text mining, including Graph of Words and word embeddings, with applications to web advertising and marketing, event detection and summarization. His interests also include distributed machine learning algorithms, distributed dimensionality reduction, and distributed resource management. He actively cooperates with industrial partners in the area of data analytics and machine learning for large-scale data repositories in application domains such as Web advertising and recommendations, social networks, medical data, aircraft logs, and insurance data. He has supervised fourteen completed PhD theses, several of them supported by industrial funding, including from Google and Airbus. On the publication front, he has contributed chapters to books and encyclopedias, and has published three books and more than a hundred and forty papers in international refereed journals and conferences. He is also co-author of three patents. He has received the ERCIM and the Marie Curie EU fellowships and led a DIGITEO chair grant. Since 2015, M. Vazirgiannis has led the AXA Data Science chair.

Giannis Nikolentzos

LIX, Ecole Polytechnique, France

Giannis Nikolentzos is a postdoctoral researcher at the Data Science and Mining (DaSciM) group, Computer Science Laboratory (LIX), École Polytechnique, France. He received his PhD in Graph Mining from AUEB in 2017. Before that, he obtained an MSc in Artificial Intelligence from the University of Southampton and a Diploma in Electrical and Computer Engineering from the University of Patras. His current research interests are in the fields of learning on graphs and graph-based information extraction. He is a recipient of the distinguished paper award of IJCAI 2018.

Fragkiskos D. Malliaros

University of Paris-Saclay, France

Fragkiskos D. Malliaros is an Assistant Professor at CentraleSupélec, University of Paris-Saclay. Immediately before that, he was a data science postdoctoral fellow at UC San Diego and a postdoctoral researcher at École Polytechnique, where he also received his Ph.D. degree in 2015. He obtained his Diploma and his M.Sc. degree from the University of Patras, Greece, in 2009 and 2011 respectively. He is the recipient of the 2012 Google European Doctoral Fellowship in Graph Mining and the 2015 Thesis Prize by École Polytechnique. During the summer of 2014, he was a research intern at Palo Alto Research Center (PARC) in Palo Alto, CA, working on anomaly detection in social networks. His research interests span the broad area of data science, with a focus on graph mining, machine learning, social network analysis, and natural language processing. He has co-chaired the 2017 Information Theory and Applications Workshop (ITA ’17) and the 2nd ICDM International Workshop on Data Science for Social Media and Risk (SoMeRiS ’16).

Abstract

Graphs or networks have been widely used as modelling tools in Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Traditionally, the unigram bag-of-words representation is applied; that way, a document is represented as a multiset of its terms, disregarding dependencies between the terms. Although several variants and extensions of this modelling approach have been proposed (e.g., the $n$-gram model), the main weakness comes from the underlying term independence assumption. The order of the terms within a document is completely disregarded and no relationship between terms is taken into account in the final task (e.g., text categorization). As the heterogeneity of text collections is increasing (especially with respect to document length and vocabulary), the research community has started exploring different document representations that aim to capture more fine-grained contexts of co-occurrence between different terms, challenging the well-established unigram bag-of-words model. In this direction, graphs constitute a well-developed model that has been adopted for text representation. The goal of this tutorial is to offer a comprehensive presentation of recent methods that rely on graph-based text representations to deal with various tasks in Web Mining, NLP and IR. We will describe basic as well as novel graph-theoretic concepts and we will examine how they can be applied in a wide range of text-related application domains.
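To make the contrast in the abstract concrete, the following minimal Python sketch compares a bag-of-words with a simple co-occurrence graph for two toy documents that contain the same terms in a different order. The function names, the sliding-window size of 2 and the example sentences are illustrative assumptions, not material from the tutorial.

```python
from collections import Counter


def bag_of_words(tokens):
    """Multiset of terms: word order and co-occurrence are discarded."""
    return Counter(tokens)


def graph_of_words(tokens, window=2):
    """Undirected co-occurrence edges between distinct terms that fall
    inside the same sliding window (window size is an assumption)."""
    edges = set()
    for i, term in enumerate(tokens):
        # Terms that follow `term` inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                edges.add(frozenset((term, other)))
    return edges


# Two toy documents with identical terms but a different order.
doc_a = "graphs improve text mining".split()
doc_b = "mining text graphs improve".split()

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
same_graph = graph_of_words(doc_a) == graph_of_words(doc_b)
print(same_bag, same_graph)
```

Under these assumptions the two documents are indistinguishable as bags of words, while their co-occurrence graphs differ, which is the extra information a graph-based representation can preserve.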

Detailed Outline

Section

Topics

Introduction

Basics of IR and NLP
Highlights on graph-based document representations
Overview of the topics that will be covered in the tutorial
What the tutorial is not about

Graph-Theoretic Concepts

Basic graph definitions
Node centrality criteria (e.g., closeness, betweenness) and community structure
PageRank and HITS
Graph degeneracy (K-core and K-truss decompositions)
Frequent subgraph mining
Basics on graph kernels
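As an illustration of the graph degeneracy item above, here is a minimal sketch of K-core decomposition by repeatedly peeling a node of minimum remaining degree. It is a simple quadratic-time version written for readability; the function name and the toy graph are assumptions made for this example only.

```python
def core_numbers(adjacency):
    """Return the core number of every node of an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    degree = {node: len(neigh) for node, neigh in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel a node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        # Core numbers never decrease along the peeling order.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neigh in adjacency[node]:
            if neigh in remaining:
                degree[neigh] -= 1
    return core


# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
core = core_numbers(toy_graph)
max_core = max(core.values())
main_core = {node for node, k in core.items() if k == max_core}
print(core, main_core)
```

The nodes with the highest core number form the main core, the densest nested subgraph in this decomposition.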

Graph-Based Text Representations

How to construct a graph from a single document or a collection of documents
Graph-of-words concept
Semantics of nodes and edges
Edge directionality and edge weight
Graph construction trade-offs

Information Retrieval

Graph-based term weighting in IR
TW and TW-IDF weighting functions
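The sketch below illustrates one common way graph-based term weighting is formulated. It assumes that the term weight (TW) is the in-degree of a term in a directed graph-of-words, and that TW-IDF combines it with a pivoted document-length normalization and an IDF factor. The exact definitions presented in the tutorial may differ; the window size, the slope `b`, the helper names and the collection statistics are all illustrative assumptions.

```python
import math


def in_degree_weights(tokens, window=4):
    """TW sketch: number of distinct terms that precede a term
    within the sliding window (in-degree in a directed graph-of-words)."""
    in_neighbours = {term: set() for term in tokens}
    for i, term in enumerate(tokens):
        # Add an edge from `term` to each following term in the window.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                in_neighbours[other].add(term)
    return {term: len(preds) for term, preds in in_neighbours.items()}


def tw_idf(tw, doc_len, avg_doc_len, doc_freq, n_docs, b=0.003):
    """Assumed TW-IDF form: length-normalized term weight times IDF."""
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return tw / length_norm * math.log((n_docs + 1) / doc_freq)


tokens = "graph mining for text mining".split()
weights = in_degree_weights(tokens, window=3)

# Hypothetical collection statistics, chosen only for illustration.
score = tw_idf(weights["mining"], doc_len=len(tokens), avg_doc_len=5.0,
               doc_freq=1, n_docs=9)
print(weights, round(score, 3))
```

Unlike raw term frequency, the weight here depends on how many distinct terms a word co-occurs with, so a repeated word gains weight only when it appears in new contexts.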

Keyword - Keyphrase Extraction and Text Summarization

Clustering-based methods
TextRank and PageRank-based approaches for single-topic keyword extraction
HITS algorithm for keyword extraction
Node centrality criteria for keyword and keyphrase extraction
Graph degeneracy-based methods
Combining graph degeneracy and submodularity for unsupervised extractive summarization
Keyphrase annotation
Software demonstration

Novelty and Event Detection in Text Streams

Degeneracy-based sub-event detection in Twitter streams
A graph optimization approach for sub-event detection and summarization in Twitter

Text Categorization (TC)

Graph-based term weighting for TC
Frequent subgraphs as categorization features
Term graph models for TC
Graph matching approaches
Graph-based regularization for TC
Graph representation learning methods

Link to External Resources

Tutorial resources