Heiko Müller
The University of Edinburgh, School of Informatics
Informatics Forum, 10 Crichton Street
Edinburgh, EH8 9AB
Phone: +44 (0) 131 651 38 35
Fax: +44 (0) 131 651 14 26
Archiving Scientific Data

Scientific and reference databases on the Web are a primary source of information. However, the data is subject to continuous change and often only the most recent versions are preserved. For many database providers it has become common practice to overwrite existing database states when changes occur and regularly publish new releases of the data on the Web. Failure to archive earlier states of the data may lead to the loss of scientific evidence, and the basis of findings may no longer be verifiable.

Our archiving tool XArch is an archive management system for maintaining, populating, and querying archives of multiple database versions. XArch is based on a nested merge approach that efficiently stores multiple database versions in a compact archive. The system allows one to create new archives, to merge new versions of data into existing archives, and to execute both snapshot and temporal queries using a declarative query language.
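The idea of merging versions into one archive can be illustrated with a minimal Python sketch. It is not XArch's implementation or storage format: the `Node` class and the `merge`, `snapshot`, and `history` functions are hypothetical names, the data is assumed to be nested dictionaries with unique keys, and the sample records are made up. Each element is stored once and annotated with the set of versions in which it occurs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Archive element annotated with the versions in which it exists."""
    versions: set = field(default_factory=set)
    children: dict = field(default_factory=dict)  # key -> Node
    values: dict = field(default_factory=dict)    # leaf value -> set of versions


def merge(node, data, version):
    """Merge one version of keyed, nested data into the archive."""
    node.versions.add(version)
    if isinstance(data, dict):
        for key, child in data.items():
            merge(node.children.setdefault(key, Node()), child, version)
    else:
        node.values.setdefault(data, set()).add(version)


def snapshot(node, version):
    """Snapshot query: rebuild the data as it was in one version."""
    if version not in node.versions:
        return None
    for value, versions in node.values.items():
        if version in versions:
            return value
    return {key: snapshot(child, version)
            for key, child in node.children.items()
            if version in child.versions}


def history(node, path):
    """Temporal query: map each value at `path` to the versions holding it."""
    for key in path:
        node = node.children[key]
    return {value: sorted(versions) for value, versions in node.values.items()}


# Made-up sample data: two versions of a tiny reference database.
archive = Node()
merge(archive, {"UK": {"capital": "London", "population": 60}}, 1)
merge(archive, {"UK": {"capital": "London", "population": 61},
                "FR": {"capital": "Paris"}}, 2)

print(snapshot(archive, 1))
print(history(archive, ["UK", "population"]))
```

Unchanged elements such as the capital of "UK" are stored only once, which is what keeps the archive compact in this sketch; a real system would also compress the version annotations, for example by inheriting them from parent elements.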

XArch - The XML Archiver project

Data Cleansing and Data Quality

High costs and loss of reputation caused by poor-quality data have made quality assurance and data cleansing hot topics in the business world. Recently, data quality has been gaining attention in the scientific community as well. Within this project, we review existing data cleansing methods. We classify the data deficiencies that diminish the quality of existing data sources and the quality criteria that are affected by these deficiencies. Based on these classifications, we show which cleansing approaches are capable of handling which data deficiencies and quality criteria. We show why existing data cleansing techniques fall short for the domain of genome data and argue that merging overlapping data has an outstanding ability to increase data accuracy.

DBIS: Data Cleansing of Genome Data @ Humboldt-Universität zu Berlin

Data Integration

Information integration often faces the problem that different data sources represent the same set of real-world objects but give conflicting values for specific properties of these objects. The main objective of this project is to provide methods that help the developer of an integrated system over overlapping but contradictory sources to resolve value conflicts. In many cases, conflicts between contradictory sources do not occur by chance but have a systematic cause. Our goal is to identify such systematic conflicts and to outline regular data patterns that occur in conjunction with them. Evaluated by an expert user, the discovered regularities provide insights into possible causes of conflicts and help assess the quality of inconsistent values.
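As a rough illustration of looking for regularities that accompany conflicts, the following Python sketch counts how often each attribute value co-occurs with a conflict between two sources. This is a naive frequency count, not the pattern mining method developed in the project; the function `conflict_patterns`, its parameters, and the sample records are all hypothetical.

```python
from collections import Counter


def conflict_patterns(source_a, source_b, target, min_support=2):
    """Rank (attribute, value) pairs of source_a by their conflict rate.

    The conflict rate of a pair is the share of shared records carrying it
    for which the two sources disagree on the `target` property.
    """
    seen, conflicting = Counter(), Counter()
    for key in sorted(source_a.keys() & source_b.keys()):
        record_a, record_b = source_a[key], source_b[key]
        conflict = record_a[target] != record_b[target]
        for attr, value in record_a.items():
            if attr == target:
                continue
            seen[(attr, value)] += 1
            if conflict:
                conflicting[(attr, value)] += 1
    # Keep only pairs that occur often enough to be meaningful.
    rates = {pair: conflicting[pair] / count
             for pair, count in seen.items() if count >= min_support}
    return sorted(rates.items(), key=lambda item: -item[1])


# Made-up sample data: two sources reporting a mass for the same objects.
source_a = {
    1: {"lab": "X", "unit": "kDa", "mass": 10},
    2: {"lab": "X", "unit": "kDa", "mass": 12},
    3: {"lab": "Y", "unit": "kDa", "mass": 15},
    4: {"lab": "Y", "unit": "kDa", "mass": 20},
}
source_b = {
    1: {"mass": 10},
    2: {"mass": 12},
    3: {"mass": 15000},
    4: {"mass": 20000},
}

for pair, rate in conflict_patterns(source_a, source_b, "mass"):
    print(pair, rate)
```

In this made-up example, every record from lab "Y" conflicts while none from lab "X" does, which would suggest to an expert that the conflicts stem from how that lab reports the value rather than from chance.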

DBIS: Conflicts in Data Integration @ Humboldt-Universität zu Berlin

Applied Databases (Tutorials), 2008
Applied Databases (Tutorials), 2009
Conferences, Workshops & Journals (Peer-reviewed)
  • Wenfei Fan, Floris Geerts, Shuai Ma, Heiko Müller
    Detecting Inconsistencies in Distributed Data
    26th IEEE International Conference on Data Engineering, March 1-6, 2010, Long Beach, California, USA
  • Peter Buneman, Heiko Müller, Chris Rusbridge
    Curating the CIA World Factbook
    The International Journal of Digital Curation, Issue 3, Volume 4, 2009 (pdf)
  • Heiko Müller, Peter Buneman, Ioannis Koltsidas
    XArch: archiving scientific and reference data
    ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, 2008, 1295-1298 (poster)
  • Ioannis Koltsidas, Heiko Müller, Stratis Viglas
    Sorting hierarchical data in external memory for archiving
    Proceedings of the VLDB Endowment, Volume 1, Issue 1, August 2008
  • Heiko Müller, Johann-Christoph Freytag, Ulf Leser
    Describing Differences between Databases
    ACM 15th Conference on Information and Knowledge Management (CIKM), Arlington, VA, USA, 2006 (pdf)
  • Heiko Müller, Ulf Leser, Johann-Christoph Freytag
    Classification of Contradiction Patterns
    30th Annual Conference of the German Classification Society (GfKl), Berlin, 2006 (pdf)
  • Heiko Müller, Melanie Weis, Jens Bleiholder, Ulf Leser
    Erkennen und Bereinigen von Datenfehlern in naturwissenschaftlichen Daten
    Datenbank-Spektrum, Heft 15, November 2005, 26-35 (in German) (pdf)
  • Michael Mielke, Heiko Müller, Felix Naumann
    Ein Data-Quality-Wettbewerb
    Datenbank-Spektrum, Heft 14, August 2005, 34-37 (in German)
  • Kristian Rother, Silke Trißl, Heiko Müller, Thomas Steinke, Ina Koch, Robert Preissner, Cornelius Frömmel, Ulf Leser
    Columba: An Integrated Database of Proteins, Structures, and Annotations
    BMC Bioinformatics, March 2005, 6(1):81 (pdf) (link)
  • P. Rieger, S. Heymann, H. Müller
    Datenbankgestützte Wissensakquisition in den Lebenswissenschaften
    Datenbank-Spektrum, Heft 10, August 2004, 14-21 (in German) (pdf)
  • Heiko Müller, Peter Rieger, Katja Tham, Johann-Christoph Freytag
    Dynamic information fusion for genome annotation
    Informatik 2004 Workshop über Dynamische Informationsfusion, Ulm, Germany, 2004 (pdf)
  • Heiko Müller, Ulf Leser, Johann-Christoph Freytag
    Mining for Patterns in Contradictory Data
    Proceedings of the SIGMOD International Workshop on Information Quality for Information Systems (IQIS'04), Paris, France, 2004 (pdf)
  • Kristian Rother, Heiko Müller, Silke Trissl, Ina Koch, Thomas Steinke, Robert Preissner, Cornelius Frömmel, Ulf Leser
    COLUMBA: Multidimensional Data Integration of Protein Annotations
    International Workshop on Data Integration in Life Sciences (DILS 2004), Leipzig, Germany, 2004 (pdf)
  • Heiko Müller
    Semantic Data Cleansing in Genome Databases
    Proc. of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003
  • Heiko Müller, Felix Naumann, Johann-Christoph Freytag
    Data Quality in Genome Databases
    Proceedings of the Conference on Information Quality (IQ 03), Boston (pdf)
Other publications
  • Heiko Müller
    Describing differences between overlapping databases
    Ph.D. Thesis, Humboldt-Universität zu Berlin, 2009 (link)
  • Heiko Müller
    Archiving and Maintaining Curated Databases
    Grundlagen von Datenbanken, 2009, 135-139 (pdf)
  • Heiko Müller
    XArch - An Archive Management System
    3rd International Digital Curation Conference, Washington, DC, 2007 (poster)
  • Ioannis Koltsidas, Heiko Müller, Stratis Viglas
    Sorting Hierarchical Data in External Memory
    Technical Report, EDI-INF-RR-1217 (pdf)
  • Heiko Müller, Ulf Leser, Johann-Christoph Freytag
    On the Distance of Databases
    HUB-IB-199, Humboldt University Berlin, March 2006 (pdf)
  • Heiko Müller, Johann-Christoph Freytag
    Problems, Methods, and Challenges in Comprehensive Data Cleansing
    HUB-IB-164, Humboldt University Berlin, 2003 (pdf)
  • Heiko Müller
    Realisierung eines einheitlichen Zugriffs auf molekularbiologische Genomkarten mit Hilfe von CORBA
    Diploma Thesis, Technische Universität Berlin, 2000 (in German)
  • Heiko Müller, Ulf Leser
    Integration durch Standards: Erfahrungen mit CORBA in Life Science Research
    Föderierte Datenbanken, Berlin, 1999, 89-102 (in German)