Michalis Vazirgiannis is a Professor at LIX, École Polytechnique, France. He has conducted research at GMD-IPSI and Max Planck MPI (Germany) and at INRIA/FUTURS (Paris). He has taught at AUEB (Greece), École Polytechnique, Telecom-Paristech and ENS (France), Tsinghua (China) and Deusto University (Spain). His current research interests are machine learning and combinatorial methods for graph analysis (including community detection, graph clustering and embeddings, and influence maximization) and text mining (including graph-of-words and word embeddings), with applications to web advertising and marketing, event detection and summarization. He also works on distributed machine learning algorithms, distributed dimensionality reduction and distributed resource management. He actively cooperates with industrial partners in the area of data analytics and machine learning for large-scale data repositories in application domains such as web advertising and recommendations, social networks, medical data, aircraft logs and insurance data. He has supervised fourteen completed PhD theses, several of them supported by industrial funding, including from Google and Airbus. On the publication front, he has contributed chapters to books and encyclopedias and has published three books and more than a hundred and forty papers in international refereed journals and conferences. He is also a co-author of three patents. He has received the ERCIM and the Marie Curie EU fellowships and led a DIGITEO chair grant. Since 2015, M. Vazirgiannis has led the AXA Data Science chair.

He is a postdoctoral researcher at the Data Science and Mining (DaSciM) group, Computer Science Laboratory (LIX), École Polytechnique, France. He received his PhD in Graph Mining from AUEB in 2017. Before that, he obtained an MSc in Artificial Intelligence from the University of Southampton and a Diploma in Electrical and Computer Engineering from the University of Patras. His current research interests are in the fields of learning on graphs and graph-based information extraction. He is the recipient of the distinguished paper award of IJCAI 2018.

Fragkiskos D. Malliaros is an Assistant Professor at CentraleSupélec, University of Paris-Saclay. Immediately before that, he was a data science postdoctoral fellow at UC San Diego and a postdoctoral researcher at École Polytechnique, from which he also received his Ph.D. degree in 2015. He obtained his Diploma and his M.Sc. degree from the University of Patras, Greece, in 2009 and 2011 respectively. He is the recipient of the 2012 Google European Doctoral Fellowship in Graph Mining and the 2015 Thesis Prize by École Polytechnique. During the summer of 2014, he was a research intern at Palo Alto Research Center (PARC) in Palo Alto, CA, working on anomaly detection in social networks. His research interests span the broad area of data science, with a focus on graph mining, machine learning, social network analysis, and natural language processing. He has served as co-chair of the 2017 Information Theory and Applications Workshop (ITA ’17) and the 2nd ICDM International Workshop on Data Science for Social Media and Risk (SoMeRiS ’16).

Graphs or networks have been widely used as modelling tools in Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Traditionally, the unigram bag-of-words representation is applied: a document is represented as a multiset of its terms, disregarding dependencies between the terms. Although several variants and extensions of this modelling approach have been proposed (e.g., the $n$-gram model), the main weakness comes from the underlying term independence assumption. The order of the terms within a document is completely disregarded, and relationships between terms are not taken into account in the final task (e.g., text categorization). As the heterogeneity of text collections increases (especially with respect to document length and vocabulary), the research community has started exploring different document representations that aim to capture more fine-grained contexts of co-occurrence between terms, challenging the well-established unigram bag-of-words model. In this direction, graphs constitute a well-developed model that has been adopted for text representation. The goal of this tutorial is to offer a comprehensive presentation of recent methods that rely on graph-based text representations to deal with various tasks in Web Mining, NLP and IR. We will describe basic as well as novel graph-theoretic concepts and examine how they can be applied in a wide range of text-related application domains.
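To make the contrast concrete, here is a minimal Python sketch, not taken from the tutorial material, that builds both representations for two toy documents. The function names, the sliding-window size and the choice of undirected, count-weighted edges are illustrative assumptions; the tutorial itself discusses several alternatives for edge directionality and weighting.

```python
from collections import Counter


def bag_of_words(tokens):
    """Multiset of terms: word order and co-occurrence are discarded."""
    return Counter(tokens)


def graph_of_words(tokens, window=3):
    """Undirected, weighted co-occurrence graph (illustrative assumption).

    Two distinct terms are linked when they appear within `window`
    consecutive tokens; the edge weight counts how often that happens.
    Edges are stored as frozensets so that (a, b) and (b, a) coincide.
    """
    edges = Counter()
    for i, term in enumerate(tokens):
        # Look at the next (window - 1) tokens after position i.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                edges[frozenset((term, other))] += 1
    return edges


# Two toy documents with the same terms in a different order.
doc_a = "information retrieval and text mining".split()
doc_b = "text retrieval and information mining".split()

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
same_graph = graph_of_words(doc_a) == graph_of_words(doc_b)
print("identical bag-of-words:", same_bag)
print("identical graph-of-words:", same_graph)
```

Under these assumptions the two documents are indistinguishable as bags of words, while their co-occurrence graphs differ, which is the kind of local context the graph-based representations aim to retain.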

Section

Topics

Introduction

Basics of IR and NLP

Highlights on graph-based document representations

Overview of the topics that will be covered in the tutorial

What the tutorial is not about

Graph-Theoretic Concepts

Basic graph definitions

Node centrality criteria (e.g., closeness, betweenness) and community structure

PageRank and HITS

Graph degeneracy (K-core and K-truss decompositions)

Frequent subgraph mining

Basics on graph kernels

Graph-Based Text Representations

How to construct a graph from a single document or a collection of documents

Graph-of-words concept

Semantics of nodes and edges

Edge directionality and edge weight

Graph construction trade-offs

Information Retrieval

Graph-based term weighting in IR

TW and TW-IDF weighting functions

Keyword - Keyphrase Extraction and Text Summarization

Clustering-based methods

TextRank and PageRank-based approaches for single topic keyword extraction

HITS algorithm for keyword extraction

Node centrality criteria for keyword and keyphrase extraction

Graph degeneracy-based methods

Combining graph degeneracy and submodularity for unsupervised extractive summarization

Keyphrase annotation

Software demonstration

Novelty and Event Detection in Text Streams

Degeneracy-based sub-event detection in Twitter streams

A graph optimization approach for sub-event detection and summarization in Twitter

Text Categorization (TC)

Graph-based term weighting for TC

Frequent subgraphs as categorization features

Term graph models for TC

Graph matching approaches

Graph-based regularization for TC

Graph representation learning methods
