My picture

Benedek Rozemberczki

Currently I am enjoying the third year of my PhD at the Centre for Doctoral Training in Data Science, University of Edinburgh. I am supervised by Rik Sarkar.

I am mainly interested in graph representation learning, large-scale network science, and network analytics. Previously, I have worked on computational social science and combinatorial game theory. I studied Economic Policy at Central European University and Applied Economics at Corvinus University of Budapest.

CV

Studies

PhD in Data Science, University of Edinburgh, 2017-Now.

MSc by Research in Data Science, University of Edinburgh, 2016-2017.
with Distinction.

MA in Economic Policy, Central European University, 2014-2016.
with Distinction. Won the Stanislav Vidovic MA Thesis Award.

BSc in Applied Economics, Corvinus University of Budapest, 2011-2014.
with Distinction.

Work

Research intern, Google AI, 2019.
I mainly worked on experimental machine learnig research. I was part of the Graph Mining team.

Data science intern, GE Digital, 2016.
I mainly worked on network analytics, NLP and OCR. I was part of the Predix team.

Teaching assistant, University of Edinburgh, 2017.
I've tutored and marked the following course: Social & Technological Networks.

Teaching assistant, Corvinus University of Budapest, 2012-2014.
I've tutored and/or marked the following courses: Programming for Mathematical Economics; Probability Theory; Calculus; Internatiomal Economics; Macroeconomics.

Little Ball of Fur: A Python Library for Graph Sampling

Diff2Vec

Sampling graphs is an important task in data mining. In this paper, we describe Little Ball of Fur a Python library that includes more than twenty graph sampling algorithms. Our goal is to make node, edge, and exploration-based network sampling techniques accessible to a large number of professionals, researchers, and students in a single streamlined framework. We created this framework with a focus on a coherent application public interface which has a convenient design, generic input data requirements, and reasonable baseline settings of algorithms. Here we overview these design foundations of the framework in detail with illustrative code snippets. We show the practical usability of the library by estimating various global statistics of social networks and web graphs. Experiments demonstrate that Little Ball of Fur can speed up node and whole graph embedding techniques considerably with mildly deteriorating the predictive value of distilled features.

For more details you can have a look at the paper. Code is available here.

Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models

Diff2Vec

In this paper, we propose a flexible notion of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales. We introduce FEATHER, a computationally efficient algorithm to calculate a specific variant of these characteristic functions where the probability weights of the characteristic function are defined as the transition probabilities of random walks. We argue that features extracted by this procedure are useful for node level machine learning tasks. We discuss the pooling of these node representations, resulting in compact descriptors of graphs that can serve as features for graph classification algorithms. We analytically prove that FEATHER describes isomorphic graphs with the same representation and exhibits robustness to data corruption. Using the node feature characteristic functions we define parametric models where evaluation points of the functions are learned parameters of supervised classifiers.

For more details you can have a look at the paper. Code is available here.

Karate Club: An API Oriented Open-Source Python Framework for Unsupervised Learning on Graphs

Diff2Vec

We present Karate Club a Python framework combining more than 30 state-of-the-art graph mining algorithms which can solve unsupervised machine learning tasks. The primary goal of the package is to make community detection, node and whole graph embedding available to a wide audience of machine learning researchers and practitioners. We designed Karate Club with an emphasis on a consistent application interface, scalability, ease of use, sensible out of the box model behaviour, standardized dataset ingestion, and output generation. This paper discusses the design principles behind this framework with practical examples. We show Karate Club's efficiency with respect to learning performance on a wide range of real world clustering problems, classification tasks and support evidence with regards to its competitive speed.

For more details you can have a look at the paper. The repository is available here.

Multi-scale Attributed Node Embedding

Diff2Vec

We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social, web and citation network datasets.

For more details you can have a look at the paper. Code that generates the embeddings can be found here.

GEMSEC: Graph Embedding with Self Clustering

Diff2Vec

Modern graph embedding procedures can efficiently process graphs with millions of nodes. In this paper, we propose GEMSEC -- a graph embedding algorithm which learns a clustering of the nodes simultaneously with computing their embedding. GEMSEC is a general extension of earlier work in the domain of sequence-based graph embedding. GEMSEC places nodes in an abstract feature space where the vertex features minimize the negative log-likelihood of preserving sampled vertex neighborhoods, and it incorporates known social network properties through a machine learning regularization. We present two new social network datasets and show that by simultaneously considering the embedding and clustering problems with respect to social properties, GEMSEC extracts high-quality clusters competitive with or superior to other community detection algorithms. In experiments, the method is found to be computationally efficient and robust to the choice of hyperparameters.

For more details you can have a look at the paper. Code that generates the embeddings can be found here.

Fast Sequence Based Embedding with Diffusion Graphs

Diff2Vec

A graph embedding is a representation of the vertices of a graph in a low dimensional space, which approximately preserves properties such as distances between nodes. Vertex sequence based embedding procedures use features extracted from linear sequences of vertices to create embeddings using a neural network. In this paper, we propose diffusion graphs as a method to rapidly generate vertex sequences for network embedding. Its computational efficiency is superior to previous methods due to simpler sequence generation, and it produces more accurate results. In experiments, we found that the performance relative to other methods improves with increasing edge density in the graph. In a community detection task, clustering nodes in the embedding space produces better results compared to other sequence based embedding methods.

For more details you can have a look at the paper. Code that generates the embeddings can be found here.

Papers

B. Rozemberczki, O. Kiss, R. Sarkar Little Ball of Fur: A Python Library for Graph Sampling, CIKM 2020.
pdf code

B. Rozemberczki, R. Sarkar Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models, CIKM 2020
pdf code

B. Rozemberczki, O. Kiss, R. Sarkar Karate Club: An API Oriented Open-Source Python Framework for Unsupervised Learning on Graphs, CIKM 2020.
pdf code

A. Bojchevski, J. Klicpera, B. Perozzi, A. Kapoor, M. Blais, B. Rozemberczki, M. Lukasik, S. G√ľnnemann Scaling Graph Neural Networks with Approximate PageRank, KDD 2020
pdf code

B. Rozemberczki, R. Davies, R. Sarkar, C. Sutton GEMSEC: Graph Embedding with Self Clustering, ASONAM 2019.
pdf code

A. Ghosh, B. Rozemberczki, S. Ramamoorthy, R. Sarkar Topological Signatures For Fast Mobility Analysis, SIGSPATIAL 2018.
pdf

B. Rozemberczki, R. Sarkar Fast Sequence Based Embedding with Diffusion Graphs, COMPLENET 2018.
pdf code

Preprints

L. Watson, B. Rozemberczki, R. Sarkar Stability Enhanced Privacy and Applications in Private Stochastic Gradient Descent, ArXiV 2020.
pdf

B. Rozemberczki, C. Allen, R. Sarkar Multi-Scale Attributed Node Embedding, ArXiV 2019.
pdf code

Theses

B. Rozemberczki. Diffusion to Vector - Representation Learning of Graphs. MSc by Research Thesis, Centre for Doctoral Training in Data Science, University of Edinburgh. 2017.
pdf code

B. Rozemberczki. Homophily Rearrangement Algorithms and Similarity Based Diffusion on Networks. MA Thesis, Department of Economics, Central European University. 2016.
pdf slides

B. Rozemberczki. Regression Games and Applications. BSc Thesis, Faculty of Economics, Corvinus University of Budapest. 2014.
pdf slides

ENS Challenges 2016-2017 - Societe General Oil Extraction Prediction Task - 1st/2428.
competition code

ENS Challenges 2016-2017 - Regaind Photo Quality Assessment Task - 1st/2428.
competition code

CDMC 2016 - International Data Mining Competition - 2nd/189.
competition code

Driven Data - Data Mining the Water Table - 3rd/1917.
competition code

Email
benedek.rozemberczki[at]inf.ed.ac.uk
Work Address
Room 3.50, Informatics Forum
University of Edinburgh
10 Crichton Street, EH8 9AB
Edinburgh, UK