I am mainly interested in graph representation learning, large-scale network science, and network analytics. Previously, I have worked on computational social science and combinatorial game theory. I studied Economic Policy at Central European University and Applied Economics at Corvinus University of Budapest.
PhD in Data Science, University of Edinburgh, 2017-Now.
MSc by Research in Data Science, University of Edinburgh, 2016-2017.
MA in Economic Policy, Central European University, 2014-2016.
with Distinction. Won the Stanislav Vidovic MA Thesis Award.
BSc in Applied Economics, Corvinus University of Budapest, 2011-2014.
Teaching assistant, Corvinus University of Budapest, 2012-2014.
I've tutored and/or marked the following courses: Programming for Mathematical Economics; Probability Theory; Calculus; Internatiomal Economics; Macroeconomics.
Sampling graphs is an important task in data mining. In this paper, we describe Little Ball of Fur a Python library that includes more than twenty graph sampling algorithms. Our goal is to make node, edge, and exploration-based network sampling techniques accessible to a large number of professionals, researchers, and students in a single streamlined framework. We created this framework with a focus on a coherent application public interface which has a convenient design, generic input data requirements, and reasonable baseline settings of algorithms. Here we overview these design foundations of the framework in detail with illustrative code snippets. We show the practical usability of the library by estimating various global statistics of social networks and web graphs. Experiments demonstrate that Little Ball of Fur can speed up node and whole graph embedding techniques considerably with mildly deteriorating the predictive value of distilled features.
In this paper, we propose a flexible notion of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales. We introduce FEATHER, a computationally efficient algorithm to calculate a specific variant of these characteristic functions where the probability weights of the characteristic function are defined as the transition probabilities of random walks. We argue that features extracted by this procedure are useful for node level machine learning tasks. We discuss the pooling of these node representations, resulting in compact descriptors of graphs that can serve as features for graph classification algorithms. We analytically prove that FEATHER describes isomorphic graphs with the same representation and exhibits robustness to data corruption. Using the node feature characteristic functions we define parametric models where evaluation points of the functions are learned parameters of supervised classifiers.
We present Karate Club a Python framework combining more than 30 state-of-the-art graph mining algorithms which can solve unsupervised machine learning tasks. The primary goal of the package is to make community detection, node and whole graph embedding available to a wide audience of machine learning researchers and practitioners. We designed Karate Club with an emphasis on a consistent application interface, scalability, ease of use, sensible out of the box model behaviour, standardized dataset ingestion, and output generation. This paper discusses the design principles behind this framework with practical examples. We show Karate Club's efficiency with respect to learning performance on a wide range of real world clustering problems, classification tasks and support evidence with regards to its competitive speed.
We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social, web and citation network datasets.
Modern graph embedding procedures can efficiently process graphs with millions of nodes. In this paper, we propose GEMSEC -- a graph embedding algorithm which learns a clustering of the nodes simultaneously with computing their embedding. GEMSEC is a general extension of earlier work in the domain of sequence-based graph embedding. GEMSEC places nodes in an abstract feature space where the vertex features minimize the negative log-likelihood of preserving sampled vertex neighborhoods, and it incorporates known social network properties through a machine learning regularization. We present two new social network datasets and show that by simultaneously considering the embedding and clustering problems with respect to social properties, GEMSEC extracts high-quality clusters competitive with or superior to other community detection algorithms. In experiments, the method is found to be computationally efficient and robust to the choice of hyperparameters.
A graph embedding is a representation of the vertices of a graph in a low dimensional space, which approximately preserves properties such as distances between nodes. Vertex sequence based embedding procedures use features extracted from linear sequences of vertices to create embeddings using a neural network. In this paper, we propose diffusion graphs as a method to rapidly generate vertex sequences for network embedding. Its computational efficiency is superior to previous methods due to simpler sequence generation, and it produces more accurate results. In experiments, we found that the performance relative to other methods improves with increasing edge density in the graph. In a community detection task, clustering nodes in the embedding space produces better results compared to other sequence based embedding methods.