Research

Archiving Scientific Data

Scientific and reference databases on the Web are a primary source of information.
However, the data is subject to continuous change and often only the most recent
versions are preserved. For many database providers it has become common practice
to overwrite existing database states when changes occur and regularly publish new
releases of the data on the Web. Failure to archive earlier states of the data may
lead to the loss of scientific evidence, and the basis of findings may no longer be
verifiable.
Our archiving tool XArch is an archive management system that allows one to
maintain, populate, and query archives of multiple database versions. XArch is based
on a nested merge approach that efficiently stores multiple database versions in a
compact archive. The system allows one to create new archives, to merge new versions
of data into existing archives, and to execute both snapshot and temporal queries using
a declarative query language.
XArch - The XML Archiver project
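
The idea of merging successive versions into one nested archive can be illustrated with a minimal Python sketch. It is not XArch's implementation or its query language: the `Node` class and the `merge`, `snapshot`, and `history` functions are hypothetical names, versions are plain integers, and each database version is assumed to be a nested dictionary whose keys identify elements and whose leaves are scalar values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One keyed element of the archive (hypothetical structure, for illustration)."""
    versions: set = field(default_factory=set)    # versions in which the element exists
    children: dict = field(default_factory=dict)  # key -> Node
    values: dict = field(default_factory=dict)    # leaf value -> versions holding it


def merge(node, data, version):
    """Merge one database version into the archive, matching elements by key."""
    node.versions.add(version)
    if isinstance(data, dict):
        for key, child in data.items():
            merge(node.children.setdefault(key, Node()), child, version)
    else:
        node.values.setdefault(data, set()).add(version)


def snapshot(node, version):
    """Snapshot query: rebuild the database state as of one version."""
    for value, versions in node.values.items():
        if version in versions:
            return value
    return {key: snapshot(child, version)
            for key, child in node.children.items()
            if version in child.versions}


def history(node, path):
    """Temporal query: each value seen at `path`, with the versions that held it."""
    for key in path:
        node = node.children.get(key)
        if node is None:
            return {}
    return {value: sorted(versions) for value, versions in node.values.items()}


# Made-up example data: two versions of a tiny reference database.
archive = Node()
merge(archive, {"DE": {"capital": "Berlin", "population": 82}}, 1)
merge(archive, {"DE": {"capital": "Berlin", "population": 83},
                "FR": {"capital": "Paris"}}, 2)

print(snapshot(archive, 1))                    # {'DE': {'capital': 'Berlin', 'population': 82}}
print(history(archive, ["DE", "population"]))  # {82: [1], 83: [2]}
```

In this sketch an unchanged element such as the capital is stored once and only annotated with the versions in which it occurs, which is why the merged archive stays smaller than a set of full copies.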

Data Cleansing and Data Quality

High costs and loss of reputation caused by poor-quality data have
made quality assurance and data cleansing hot topics in the business
world. Data quality has recently been gaining attention in the scientific
community as well. Within this project, we review existing data cleansing
methods. We classify the data deficiencies that diminish the quality of existing
data sources and the quality criteria that these deficiencies affect.
Based on these classifications, we show which cleansing approaches can
handle which data deficiencies and quality criteria. We show why existing
data cleansing techniques fall short for genome data and argue
that merging overlapping data is especially effective at increasing data accuracy.
DBIS: Data Cleansing of Genome Data @ Humboldt-Universität zu Berlin

Data Integration

Information integration often faces the problem that different data sources
represent the same set of real-world objects but give conflicting values for
specific properties of these objects. The main objective of this project is to provide
methods that help the developer of an integrated system over overlapping but
contradictory sources to resolve value conflicts. In many cases,
conflicts between contradictory sources do not occur by chance but have a
systematic cause. Our goal is to identify such systematic conflicts and to describe
the regular data patterns that occur in conjunction with them. Evaluated by an expert
user, the discovered regularities provide insight into possible causes of conflicts
and help assess the quality of inconsistent values.
DBIS: Conflicts in Data Integration @ Humboldt-Universität zu Berlin
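
To make the notion of a systematic conflict concrete, the following Python sketch counts how often each attribute-value pattern of one source coincides with a conflict on a chosen attribute. It is a simple co-occurrence count under stated assumptions, not the mining method developed in the project: the function name `conflict_patterns`, the `min_support` threshold, and the sample records are all hypothetical, and objects are assumed to be matched across sources by a shared key.

```python
from collections import Counter


def conflict_patterns(source_a, source_b, attribute, min_support=2):
    """Rank (attribute, value) patterns of source_a by their conflict rate.

    A conflict is a differing value for `attribute` on an object that both
    sources describe. The rate is the share of objects matching a pattern
    that are in conflict; patterns seen fewer than `min_support` times are dropped.
    """
    seen, conflicting = Counter(), Counter()
    for key in source_a.keys() & source_b.keys():
        rec_a, rec_b = source_a[key], source_b[key]
        in_conflict = rec_a.get(attribute) != rec_b.get(attribute)
        for pattern in rec_a.items():
            if pattern[0] == attribute:
                continue  # the conflicting attribute itself is not a pattern
            seen[pattern] += 1
            conflicting[pattern] += in_conflict
    rates = {p: conflicting[p] / n for p, n in seen.items() if n >= min_support}
    return sorted(rates.items(), key=lambda item: (-item[1], str(item[0])))


# Made-up records: two sources annotating the same four genes.
source_a = {
    "g1": {"function": "kinase", "method": "predicted"},
    "g2": {"function": "ligase", "method": "predicted"},
    "g3": {"function": "kinase", "method": "experimental"},
    "g4": {"function": "lyase", "method": "experimental"},
}
source_b = {
    "g1": {"function": "phosphatase"},
    "g2": {"function": "helicase"},
    "g3": {"function": "kinase"},
    "g4": {"function": "lyase"},
}

for pattern, rate in conflict_patterns(source_a, source_b, "function"):
    print(pattern, rate)
```

In the sample data every conflict falls on records whose method is "predicted", which is the kind of regularity an expert user could then judge as a plausible cause of the conflicts.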

Publications

Conferences, Workshops & Journals (Peer-reviewed)

2010

- Wenfei Fan, Floris Geerts, Shuai Ma, Heiko Müller
  Detecting Inconsistencies in Distributed Data. 26th IEEE International Conference on Data Engineering, March 1-6, 2010, Long Beach, California, USA

2009

- Peter Buneman, Heiko Müller, Chris Rusbridge
  Curating the CIA World Factbook. The International Journal of Digital Curation, Issue 3, Volume 4, 2009

2008

- Heiko Müller, Peter Buneman, Ioannis Koltsidas
  XArch: archiving scientific and reference data. ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, 2008, 1295-1298 (poster)
- Ioannis Koltsidas, Heiko Müller, Stratis Viglas
  Sorting hierarchical data in external memory for archiving. Proceedings of the VLDB Endowment, Volume 1, Issue 1, August 2008

2006

- Heiko Müller, Johann-Christoph Freytag, Ulf Leser
  Describing Differences between Databases. ACM 15th Conference on Information and Knowledge Management (CIKM), Arlington, VA, USA, 2006
- Heiko Müller, Ulf Leser, Johann-Christoph Freytag
  Classification of Contradiction Patterns. 30th Annual Conference of the German Classification Society (GfKl), Berlin, 2006

2005

- Heiko Müller, Melanie Weis, Jens Bleiholder, Ulf Leser
  Erkennen und Bereinigen von Datenfehlern in naturwissenschaftlichen Daten. Datenbank-Spektrum, Issue 15, November 2005, 26-35 (in German)
- Michael Mielke, Heiko Müller, Felix Naumann
  Ein Data-Quality-Wettbewerb. Datenbank-Spektrum, Issue 14, August 2005, 34-37 (in German)
- Kristian Rother, Silke Trißl, Heiko Müller, Thomas Steinke, Ina Koch, Robert Preissner, Cornelius Frömmel, Ulf Leser
  Columba: An Integrated Database of Proteins, Structures, and Annotations. BMC Bioinformatics, March 2005, 6(1):81

2004

- P. Rieger, S. Heymann, H. Müller
  Datenbankgestützte Wissensakquisition in den Lebenswissenschaften. Datenbank-Spektrum, Issue 10, August 2004, 14-21 (in German)
- Heiko Müller, Peter Rieger, Katja Tham, Johann-Christoph Freytag
  Dynamic information fusion for genome annotation. Informatik 2004 Workshop über Dynamische Informationsfusion, Ulm, Germany, 2004
- Heiko Müller, Ulf Leser, Johann-Christoph Freytag
  Mining for Patterns in Contradictory Data. Proceedings of the SIGMOD International Workshop on Information Quality for Information Systems (IQIS'04), Paris, France, 2004
- Kristian Rother, Heiko Müller, Silke Trissl, Ina Koch, Thomas Steinke, Robert Preissner, Cornelius Frömmel, Ulf Leser
  COLUMBA: Multidimensional Data Integration of Protein Annotations. International Workshop on Data Integration in Life Sciences (DILS 2004), Leipzig, Germany, 2004

2003

- Heiko Müller
  Semantic Data Cleansing in Genome Databases. Proc. of the VLDB 2003 PhD Workshop, Berlin, Germany, September 12-13, 2003
- Heiko Müller, Felix Naumann, Johann-Christoph Freytag
  Data Quality in Genome Databases. Proceedings of the Conference on Information Quality (IQ 03), Boston

Other publications

2009

- Heiko Müller
  Describing differences between overlapping databases. Ph.D. Thesis, Humboldt-Universität zu Berlin, 2009
- Heiko Müller
  Archiving and Maintaining Curated Databases. Grundlagen von Datenbanken, 2009, 135-139

2007

- Heiko Müller
  XArch - An Archive Management System. 3rd International Digital Curation Conference, Washington, DC, 2007 (poster)
- Ioannis Koltsidas, Heiko Müller, Stratis Viglas
  Sorting Hierarchical Data in External Memory. Technical Report, EDI-INF-RR-1217

2006

- Heiko Müller, Ulf Leser, Johann-Christoph Freytag
  On the Distance of Databases. HUB-IB-199, Humboldt University Berlin, March 2006

2003

- Heiko Müller, Johann-Christoph Freytag
  Problems, Methods, and Challenges in Comprehensive Data Cleansing. HUB-IB-164, Humboldt University Berlin, 2003

2000

- Heiko Müller
  Realisierung eines einheitlichen Zugriffs auf molekularbiologische Genomkarten mit Hilfe von CORBA. Diploma Thesis, Technische Universität Berlin, 2000 (in German)

1999

- Heiko Müller, Ulf Leser
  Integration durch Standards: Erfahrungen mit CORBA in Life Science Research. Föderierte Datenbanken, Berlin, 1999, 89-102 (in German)