# XML and Data Binding: A one-day workshop

## Fri 20 June 2003, Avaya Labs

With a little care, it should be possible to generate all of the following from any one of them:

• A Java or C++ or C# class
• A Haskell or SML or O'Caml data type
• An XML schema
• An SQL relational schema
• An LDAP schema
• An ASN.1 description
• ... and so on ...

One can automatically generate code to convert data between any of these formats, or automatically create an interface to a query. For instance, an SQL query would be accessed with the Java/C++/C# class derived from the SQL relational schema, as compared to current practice with JDBC and ODBC, which uses the type Row and supports little or no static checking. Several projects address all or part of this area, including Castor, Harmony, Hibernate, JAX-B, J2EE Entity Beans, PADS, SQL-J, Zephyr, and Henry Thompson's data binding work. But no current approach seems poised to take over the field. The time seems ripe to do something useful, and a coalition (involving, say, AT&T, Avaya, Columbia, IBM, Microsoft, and Penn) might have more impact than a lone effort.
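The contrast between a schema-derived class and an untyped row can be sketched as follows. This is only an illustration in Python, assuming a hypothetical `employee` table; the names `Employee` and `fetch_employees` are invented here and do not come from any of the projects listed above.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    """Hypothetical class that a binding tool might derive from the table below."""
    id: int
    name: str
    salary: float


def fetch_employees(conn: sqlite3.Connection) -> list[Employee]:
    """Run a fixed query and return typed objects instead of raw rows."""
    rows = conn.execute("SELECT id, name, salary FROM employee ORDER BY id")
    return [Employee(id=r[0], name=r[1], salary=r[2]) for r in rows]


# Assumed example schema and data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', 100.0)")
employees = fetch_employees(conn)
```

Client code then refers to `employees[0].name` rather than to a column index or a string key, so a type checker can flag a misspelled field before the query is ever run.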

The proposed format is a one-day workshop. We will each present relevant portions of our own work in the morning, and brainstorm in the afternoon.

Participants: If you know of someone else who should be invited, please let me know.

Alan Schmitt, U Penn <Alan.Schmitt *at* inria.fr>
Benjamin Pierce, U Penn <bcpierce *at* cis.upenn.edu>
Eijiro Sumii, U Penn <sumii *at* yl.is.s.u-tokyo.ac.jp>
Erik Meijer, Microsoft <emeijer *at* microsoft.com>
Gail Kaiser, Columbia <kaiser *at* cs.columbia.edu>
Jerome Simeon, Lucent <simeon *at* research.bell-labs.com>
Kathleen Fisher, AT&T Labs <kfisher *at* research.att.com>
Mary Fernandez, AT&T Labs <mff *at* research.att.com>
Mukund Raghavachari, IBM <raghavac *at* us.ibm.com>
Oded Shmueli, IBM <oshmueli *at* us.ibm.com>
Phil Gross, Columbia <phil *at* cs.columbia.edu>
Ricardo Medel, AT&T Labs <rmedel *at* research.att.com>
Stephen Tse, U Penn <stse *at* cis.upenn.edu>
Vladimir Gapeyev, Avaya Labs <vgapeyev *at* seas.upenn.edu>
Vivek Sarkar, IBM <vsarkar *at* us.ibm.com>

## Presentations

### SilkRoute: A Framework for Publishing Relational Data in XML

#### Slides in ppt

XML is the "lingua franca" for data exchange between inter-enterprise applications, making it possible for data to be exchanged regardless of the platform on which the source data is stored or the data model in which it is represented. In this talk, I will describe SilkRoute, a general, selective, and efficient framework for publishing relational data in XML. In SilkRoute, relational data is published in three steps. First, the relational tables are presented to the database administrator in a canonical XML view. This step requires only the relational schema as input and is fully automated. Second, the database administrator defines a public XML view over the canonical XML view in XQuery. The public view is kept virtual, i.e., is not materialized until needed. Third, an application formulates an XQuery query over the public view and submits it to SilkRoute. SilkRoute composes the application query with the public view query, translates the result into SQL, executes this on the relational engine, and assembles the resulting tuple streams into an XML document.
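The first step, a canonical XML view derived from the relational schema alone, might look roughly like the sketch below. It is an assumption-laden illustration, not SilkRoute's actual format: the element names (`database`, `row`) and the function `canonical_view` are invented, and the view is materialized here for simplicity, whereas the abstract describes the views as virtual.

```python
import sqlite3
import xml.etree.ElementTree as ET


def canonical_view(conn: sqlite3.Connection, tables: list[str]) -> ET.Element:
    """Build a generic XML view: one element per table, row, and column."""
    root = ET.Element("database")
    for table in tables:
        table_el = ET.SubElement(root, table)
        # Table names are assumed to be trusted identifiers in this sketch.
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            row_el = ET.SubElement(table_el, "row")
            for col, value in zip(columns, row):
                ET.SubElement(row_el, col).text = "" if value is None else str(value)
    return root


# Assumed example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (sid INTEGER, name TEXT)")
conn.execute("INSERT INTO supplier VALUES (1, 'Acme')")
xml_text = ET.tostring(canonical_view(conn, ["supplier"]), encoding="unicode")
```

Only the table names and column metadata drive the shape of the output, which is why this step can be fully automated.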

SilkRoute makes two key technical contributions to XML query processing. First, it describes an algorithm that translates XQuery expressions into SQL. The algorithm applies to any XQuery expression, except those containing recursive functions or features that depend on the order of the XML data. The idea of the translation is to represent an XQuery expression in a way that separates the _structure_ of the output XML document from the _computation_ that produces the document's content. The latter is expressed in SQL. We call this representation of an XQuery expression a _view forest_. The second contribution addresses a specific optimization problem in XML publishing: how to decompose an XML view over a relational database into an optimal set of SQL queries to be executed by a relational engine. We call such a set of SQL queries a _plan_. We formally define the optimization problem, describe the search space, and propose a greedy, cost-based optimization algorithm. The algorithm obtains cost estimates for SQL queries from the relational engine. Experiments confirm that the plan-selection algorithm produces queries that are nearly optimal.

SilkRoute is joint work with Dan Suciu, Wang-Chiew Tan, Atsuyuki Morishima, and Yana Kadiyska.

### Storage Techniques and Mapping Schemas for XML

#### Slides in pdf

Reliable storage systems are needed to make XML data persistent and to process it efficiently. We give a brief presentation of XML storage techniques developed in research. We then focus on commercial solutions that provide declarative mapping schemas to express XML-to-Relational mappings. The benefit of using a declarative mapping specification is to make mappings transparent and thus facilitate modifying and combining mappings. However, existing mapping schemas lack flexibility, hard-code multiple defaults, and can be used for only one storage backend. We present preliminary work that addresses these issues and makes mapping information accessible to applications through a simple interface.

If mapping information is made accessible, optimizing and reusing applications on top of XML stores will be easier. In particular, XML data exchange would greatly benefit from the knowledge of how XML data is stored in the underlying systems.

### LegoDB: An XML to relational cost-based storage design tool

#### Slides in ppt

There has been significant interest from the database community in storing XML data in relational databases. Several approaches have been proposed, including various forms of generic binary tables, DTD-driven generation of relational tables, and user-specified mappings. In this talk, I will argue that finding the "right" relational storage for a given set of XML documents and schema cannot be done without taking the application needs into account. I will briefly present LegoDB, an XML to relational storage design tool which explores a large number of possible storage approaches and uses a cost model to pick the most efficient one for the target application. The cost model takes the following parameters into account: target query workload and data statistics.
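The idea of choosing a storage mapping by workload cost can be sketched as below. This is a minimal illustration, not LegoDB's search procedure: the mapping names, query names, and cost figures are all placeholders invented for the example, and the exhaustive `min` over two candidates stands in for a search over a much larger space.

```python
from typing import Callable

# A cost function estimates the cost of one query under one storage mapping.
CostFn = Callable[[str, str], float]


def pick_mapping(mappings: list[str], workload: dict[str, float], cost: CostFn) -> str:
    """Return the mapping with the lowest frequency-weighted workload cost."""
    def total(mapping: str) -> float:
        return sum(weight * cost(mapping, query) for query, weight in workload.items())
    return min(mappings, key=total)


# Placeholder cost estimates for two hypothetical mappings and two queries.
ESTIMATES = {
    ("inlined", "lookup_title"): 1.0, ("inlined", "list_reviews"): 9.0,
    ("split", "lookup_title"): 3.0, ("split", "list_reviews"): 2.0,
}


def estimate(mapping: str, query: str) -> float:
    return ESTIMATES[(mapping, query)]


# The same candidates, ranked under two different workloads.
lookup_heavy = pick_mapping(["inlined", "split"],
                            {"lookup_title": 0.8, "list_reviews": 0.2}, estimate)
review_heavy = pick_mapping(["inlined", "split"],
                            {"lookup_title": 0.2, "list_reviews": 0.8}, estimate)
```

With these placeholder numbers the two workloads select different mappings, which is the point of the argument above: the "right" storage depends on the application.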

### PADS: Processing Arbitrary Data Streams

#### Slides in ppt

Transactional data streams, such as sequences of stock-market buy/sell orders, credit-card purchase records, web server entries, and electronic fund transfer orders, can be mined very profitably. As an example, researchers at AT&T have built customer profiles from streams of call-detail records to significant financial effect.

Often such streams are high-volume: AT&T's call-detail stream contains roughly 300 million calls per day, requiring approximately 7 GB of storage space. Typically, such stream data arrives "as is" in ad hoc formats with poor documentation. In addition, the data frequently contains errors. The appropriate response to such errors is application-specific. Some applications can simply discard unexpected or erroneous values and continue processing. For other applications, however, errors in the data can be the most interesting part of the data.

Understanding a new data source and producing a suitable parser are crucial first steps in any use of such data. Unfortunately, writing parsers for this kind of data is a difficult task, both tedious and error-prone. It is complicated by lack of documentation, convoluted encodings designed to save space, the need to handle errors robustly, and the need to produce efficient code to cope with the scale of the stream. Often, the hard-won understanding of the data ends up embedded in parsing code, making long-term maintenance difficult for the original writer and sharing the knowledge with others nearly impossible.

The goal of the PADS project is to provide languages and tools for simplifying data processing. We have a preliminary design of a declarative data-description language, PADSL, expressive enough to describe the data feeds we see at AT&T in practice, including ASCII, binary, EBCDIC, Cobol, and mixed data formats. From PADSL we generate a tunable C library with functions for parsing, manipulating, and summarizing the data. In joint work with Mary Fernandez and Ricardo Medel, we are working to integrate PADS and XQuery to support declarative querying of data sources with PADS descriptions.
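As a rough illustration of the declarative approach, the sketch below treats a record description as data and derives a parser from it that reports bad fields rather than aborting. It is written in Python rather than PADSL, and the record layout, the `|` separator, and the names `Parsed`, `parse_record`, and `CALL` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# A description is data: an ordered list of (field name, converter) pairs.
Description = list[tuple[str, Callable[[str], object]]]


@dataclass
class Parsed:
    values: dict[str, object] = field(default_factory=dict)
    errors: dict[str, str] = field(default_factory=dict)  # field name -> raw bad text


def parse_record(line: str, description: Description, sep: str = "|") -> Parsed:
    """Parse one delimited record, recording bad fields instead of aborting."""
    result = Parsed()
    parts = line.rstrip("\n").split(sep)
    for (name, convert), raw in zip(description, parts):
        try:
            result.values[name] = convert(raw)
        except ValueError:
            result.errors[name] = raw
    # Fields missing from a short record are reported as errors too.
    for name, _ in description[len(parts):]:
        result.errors[name] = ""
    return result


# Hypothetical call-detail layout: caller, callee, duration in seconds.
CALL: Description = [("caller", str), ("callee", str), ("seconds", int)]
good = parse_record("5551234|5559876|42", CALL)
bad = parse_record("5551234|5559876|4x", CALL)
```

Because errors are returned alongside the values, the caller decides whether to discard a damaged record or to study it, in line with the observation that the appropriate response is application-specific.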

### Xtatic: Native XML Processing in a Statically Typed Language

#### Benjamin Pierce, University of Pennsylvania

The recent rush to adopt XML can be attributed in part to the hope that the static typing provided by DTDs (or more sophisticated mechanisms such as XML-Schema) will improve the robustness of data exchange and processing. However, although XML documents can be checked for conformance with DTDs, current XML processing languages offer no way of verifying that programs operating on XML structures will always produce conforming outputs.

In previous work, we have designed and implemented a domain-specific language for XML processing, called XDuce. The main novelties of XDuce are:

• A type system based on REGULAR EXPRESSION TYPES. Regular expression types are a natural generalization of DTDs, describing structures in XML documents using regular expression operators (*, ?, |, etc.) and supporting a powerful form of subtyping.
• A corresponding mechanism for REGULAR EXPRESSION PATTERN MATCHING, which supports concise "grep-style" patterns for extracting information from inside structured sequences.

The lessons learned from XDuce are now being incorporated in a new language, called Xtatic, whose design focuses on smooth integration of these novel XML-processing features into mainstream, object-oriented languages such as C#. The current vision is that Xtatic will be engineered as a lightweight extension to C#, offering native support for regular expression types and patterns and completely interoperable at the binary level with ordinary C# programs and APIs.
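To give a feel for what a regular expression type describes, the sketch below checks the sequence of an element's children against a regular expression over tag names. It is only a run-time approximation in Python, whereas XDuce and Xtatic check such types statically; the `PERSON` content model and the function `children_match` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def children_match(element: ET.Element, content_model: str) -> bool:
    """Check the sequence of child tags against a regular expression over tag names."""
    # Encode the child sequence as "name,tel,tel," so ordinary regexes apply.
    encoded = "".join(child.tag + "," for child in element)
    return re.fullmatch(content_model, encoded) is not None


# Hypothetical content model: a name, an optional email, then any number of tel elements.
PERSON = r"name,(email,)?(tel,)*"

ok = children_match(ET.fromstring("<person><name/><tel/><tel/></person>"), PERSON)
bad = children_match(ET.fromstring("<person><tel/><name/></person>"), PERSON)
```

The operators `?` and `*` play the same role here as in a DTD content model; what this sketch cannot show is the subtyping relation between such types.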

### Principles of Synchronization

#### Slides in pdf

Increased mobility -- of programs between computers, computers between locations, and computers between users -- leads to increased replication, which leads to inconsistency, which leads to a broad (and growing) range of synchronization technologies. These technologies are not only a fact of life; they are fascinating, and they raise a host of challenging scientific questions.

The goal of the Harmony project is to develop formal foundations and an implementation architecture for a universal synchronizer -- a generic tool that can synchronize a wide variety of structured data with only very modest "programming" required for each new type of data. (For example, it should be able to do a good job of synchronizing many kinds of XML databases, given only a standard schema plus a description of the locations of key fields.)
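The role of key fields in synchronization can be illustrated with a generic three-way merge over keyed records. This sketch is a textbook-style technique written for this page, not Harmony's algorithm; the function `synchronize` and the phone-book data are assumptions.

```python
def synchronize(original: dict, a: dict, b: dict) -> tuple[dict, set]:
    """Three-way merge of keyed records; returns (merged, conflicting keys)."""
    merged, conflicts = {}, set()
    missing = object()  # sentinel meaning "key absent in this replica"
    for key in original.keys() | a.keys() | b.keys():
        o, x, y = (d.get(key, missing) for d in (original, a, b))
        if x == y:          # both replicas agree (including both deleted)
            value = x
        elif x == o:        # only replica b changed this key
            value = y
        elif y == o:        # only replica a changed this key
            value = x
        else:               # both changed it differently: keep original, report
            value = o
            conflicts.add(key)
        if value is not missing:
            merged[key] = value
    return merged, conflicts


# Hypothetical phone book, keyed by name, edited independently in two replicas.
original = {"alice": "555-1000", "bob": "555-2000", "carol": "555-3000"}
replica_a = {"alice": "555-1111", "bob": "555-2000", "carol": "555-3333"}
replica_b = {"alice": "555-1000", "carol": "555-4444", "dave": "555-5000"}
merged, conflicts = synchronize(original, replica_a, replica_b)
```

Non-conflicting updates, deletions, and additions propagate automatically, while a key changed differently in both replicas is left at its original value and reported for the user to resolve.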

### Data-Object Modeling and Optimization (DOMO)

#### Slides in ppt

Current and future models for distributed applications such as web services, dynamic e-business and on-demand computing present new challenges to programmers and application developers. There is an increasing demand for building internet-scale distributed applications that are heterogeneous in their use of a wide variety of languages and systems for application components and individual processes. This requirement underscores the need for a universal interchange format among processes and services, and XML has emerged as the de facto standard for this purpose. However, current programming models have little support for XML, both in terms of integrating XML data into an application and in terms of efficient algorithms for processing XML. The DOMO (Data-Object Modeling and Optimization) project at the T.J. Watson Research Center is focused on developing a programming model that simplifies the integration of XML data into applications, on application-level checking of XML integrity constraints, and on efficient algorithms for XML processing by applications.

### Overview of JAXB

#### Slides in ppt

JAXB (Java Architecture for XML Binding) is both a specification and its reference implementation from Sun. The stated goal of JAXB is to be a tool to transform an instance of W3C XML Schema to a Java package whose classes form a _customized_ object model for XML documents conforming to the schema. This talk will give an overview of the Schema-to-Java mapping styles adopted by JAXB for mapping various Schema components. I will try to highlight difficulties faced by Schema-to-Java binding and list possible research opportunities.