Data Integration Can be a Hoot with OWL
The movement to standards-based computing that XML and Web Services herald is eerily analogous to the work done in the first half of the twentieth century to establish international long distance telephone standards. Finally, you could dial direct to China! The problem was, well, if you didn’t speak Chinese, you still couldn’t have a conversation. The very same situation now plagues IT integration. One of the great advantages of XML, Web Services, and Service-Oriented Architecture is the increased emphasis on the loose coupling of systems, the adoption of interoperable data formats, and the general lowering of the barrier for accessing information in all its disparate types and locations. However, as we discussed earlier in our Loosely Coupling the Meaning of Data ZapFlash, what good is accessing all these data sources when the applications can’t understand or interpret the information they can access?
In fact, all that Web Services and SOA do is eliminate one barrier to integration while ignoring another: namely the need to process information rather than just presenting it to humans. So, with our ever-advancing movement to greater levels of abstraction for building applications and accessing systems, is integration itself just a shell game, shifting the cost and complexity from one location to another, or is there some solution to the integration morass by finding an approach to solving our data-level, semantic integration issues? In other words, now that the phones interoperate, is there any hope of understanding the person on the other end of the line?
The Semantic Integration Challenge: Understanding the Meaning of Data
This ZapFlash explores the practical side of semantic integration. Enterprise architects must eventually deal with data-level issues, and as a result, it pays to focus on some new developments that are making some of the more intractable problems of semantic integration more easily digestible. The primary challenge of integrating disparate systems is that any two systems simply don’t have the same understanding of any particular set of data or its representation. For example, two systems that require knowledge of customer data might represent the very same customer name, address, and purchase data in different ways, causing problems when it comes time to integrate those systems. In fact, in many ways, this data “impedance mismatch” limits most companies’ ability to reduce their cost of integration.
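To make the mismatch concrete, here is a minimal Python sketch of the customer example above. The two record layouts, their field names, and the "common" representation are all hypothetical, invented purely for illustration; real systems would differ in their own ways.

```python
# Hypothetical records: the same customer as two different systems might store it.
# All field names and values here are invented for illustration.
crm_record = {
    "cust_name": "Ada Lovelace",
    "addr": "12 Example St, London",
    "total_spend": "1250.00",  # amount stored as a string
}
billing_record = {
    "firstName": "Ada",
    "lastName": "Lovelace",
    "address": {"street": "12 Example St", "city": "London"},
    "purchases": [1000.0, 250.0],  # individual purchase amounts
}


def naive_match(a, b):
    """Compare two records on the field names they share, with no notion of meaning."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[key] == b[key] for key in shared)


def crm_to_common(record):
    """Translate the CRM layout into an assumed common representation."""
    return {
        "name": record["cust_name"],
        "address": record["addr"],
        "total_purchases": float(record["total_spend"]),
    }


def billing_to_common(record):
    """Translate the billing layout into the same assumed common representation."""
    address = record["address"]
    return {
        "name": f"{record['firstName']} {record['lastName']}",
        "address": f"{address['street']}, {address['city']}",
        "total_purchases": sum(record["purchases"]),
    }


if __name__ == "__main__":
    # The records share no field names, so a purely structural comparison fails.
    print(naive_match(crm_record, billing_record))
    # Once each is translated by hand into the common form, they agree.
    print(crm_to_common(crm_record) == billing_to_common(billing_record))
```

The two translation functions are exactly the kind of hand-written, point-to-point glue that drives up the cost of integration: every new system with its own layout needs another one.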
It is a given that most people won’t be agreeing on any single representation for data any time soon — to the great consternation of the folks championing cross-industry data representation and standardization efforts. So, how can we successfully move to systems that are truly loosely coupled if we ignore the elephant in the room that the semantic integration challenge represents? One of the emerging approaches for solving these data integration challenges is the movement towards representing both data and the relationships among various informational elements in a machine-understandable, interoperable manner. At the core of this movement is the notion of ontologies. An ontology is metadata that describe how a system should interpret the meaning of the data that it receives.
This potent concept of ontology introduces the notion of a hierarchy, or taxonomy, of information that relates various data constructs to others in a well-defined system. An ontology provides an exact description of a particular data element and its relationships to other data elements. For example, an ontology that defines an airport might define the airport name, its latitude and longitude, its elevation, various servicing information, and the relationships among those pieces of information. Ontologies are everywhere, from the medical field, where information taxonomies help physicians identify ailments and their treatments, to online resources such as Yahoo! and Amazon.com, which regularly classify their information to make it easier to locate products you can buy.
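As a rough illustration of the airport example, the following Python sketch treats an ontology as plain metadata that tells a program how to interpret raw values. The property names, units, and the "hasPart" relationship label are assumptions made for this sketch and do not come from any published airport ontology.

```python
# A toy "ontology" for the airport example: which properties an Airport has,
# what type each value takes, and how some properties relate to one another.
# Property names, units, and relationship labels are illustrative assumptions.
AIRPORT_ONTOLOGY = {
    "class": "Airport",
    "properties": {
        "name": {"type": str},
        "latitude": {"type": float, "unit": "degrees"},
        "longitude": {"type": float, "unit": "degrees"},
        "elevation": {"type": float, "unit": "meters"},
    },
    "relationships": [
        ("location", "hasPart", "latitude"),
        ("location", "hasPart", "longitude"),
    ],
}


def interpret(raw, ontology):
    """Convert raw string values into the typed values the ontology describes."""
    typed = {}
    for prop, spec in ontology["properties"].items():
        if prop not in raw:
            raise ValueError(f"{ontology['class']} is missing property: {prop}")
        typed[prop] = spec["type"](raw[prop])
    return typed


def parts_of(concept, ontology):
    """List the properties that the ontology says make up a given concept."""
    return [
        obj
        for subj, pred, obj in ontology["relationships"]
        if subj == concept and pred == "hasPart"
    ]


if __name__ == "__main__":
    raw_record = {
        "name": "Example Field",
        "latitude": "51.5",
        "longitude": "-0.1",
        "elevation": "25",
    }
    print(interpret(raw_record, AIRPORT_ONTOLOGY))
    print(parts_of("location", AIRPORT_ONTOLOGY))
```

The point of the design is that the interpretation rules live in the metadata rather than in the code: changing a unit or adding a property means editing the ontology, not the program that consumes it.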
So, how is the concept of ontology useful for solving the semantic integration challenge? Once systems can successfully apply meaning to data in a machine-interpretable manner, users can share a common understanding of the structure of information among systems, enable reuse of domain-specific knowledge, separate the knowledge of data and relationships among data from the representation of those data, and analyze how the enterprise defines, uses, and consumes information. In essence, by abstracting what data mean from how to represent them, we can solve information integration challenges in much the same way that separating how to access and compose Services from how to implement them helps to solve the application integration challenge. The ontology concept is therefore the application of the principles of Service Orientation to the information domain.
How, then, is it possible that producing an abstract definition of information and its interrelationships will solve information integration challenges? By abstracting the relationships and meaning of data from how databases, file systems, and applications actually store and represent them, users can change the meaning and relationships of data elements without having to change the data themselves. They can expose data elements and allow any arbitrary third-party to consume them without having to first come to some agreement about data typing, schema, or the specifics of data representation.
Furthermore, businesses that leverage an abstract ontology layer can aggregate information from across a wide range of disparate, heterogeneous data sources without having to know anything about where or how that information is actually implemented. This means that companies will not only be able to share data internally, but also exchange data with third parties or across an entire industry, where industry standards organizations can define the ontologies for their industries. Finally, through ontologies, companies can distribute the task of defining data relationships among all the constituencies in their organization, and later analyze how the organization defines, shares, and consumes those data relationships so that they can reuse them for new integration tasks.
Introducing the Web Ontology Language (OWL)
Just as Web Services refer to a stack of specifications that define how to describe and exchange Services and publish them in a registry, ontology specifications are similarly defined in a stack that provides increasing levels of data meaning. At the base of this stack, XML provides a mechanism for annotating information with metadata to structure the information. However, the metadata does not provide any real semantic constraints or meaning to the information. XML Schema adds a level of specificity to XML by restricting the otherwise arbitrary structure of XML documents and specifying how to type and constrain individual elements. The Resource Description Framework (RDF) builds on XML by providing a simple mechanism for relating individual items of information (known as “resources”) to other such resources through subject-predicate-object statements, allowing systems to integrate information from different XML documents with different XML schemas via rules-based assertions.
For example, through RDF, a system can assert that a hard drive is a component of a computer, and thus any two schemas that relate to computers can share this information. RDF schemas further extend RDF by providing a specific vocabulary to describe properties and groupings (known as “classes”) of RDF resources, as well as defining hierarchies of such properties and classes. Finally, the Web Ontology Language (OWL) further adds specificity by providing a vocabulary that describes relationships among various classes, their cardinality, the characteristics of specific data properties, rich typing of those properties, and other features. By implementing this complete stack, users finally have enough specificity to process the meaning of information exchanged between systems, rather than having to rely on humans to interpret it.
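The hard drive example can be sketched in a few lines of plain Python. This is only a toy: the class names, the "componentOf" and "subClassOf" labels, and the inheritance rule in is_component_of are simplifications chosen for illustration, not the actual RDF Schema or OWL vocabulary and entailment rules, which are considerably richer.

```python
# Statements as (subject, predicate, object) triples, the basic shape of an
# RDF assertion. The names and predicates below are invented for illustration.
TRIPLES = {
    ("HardDrive", "componentOf", "Computer"),
    ("SolidStateDrive", "subClassOf", "HardDrive"),
    ("Laptop", "subClassOf", "Computer"),
    ("Server", "subClassOf", "Computer"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable by following subClassOf links."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == "subClassOf" and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


def is_component_of(part, whole, triples):
    """Toy rule: a componentOf assertion made about general classes also
    applies to their more specific subclasses."""
    part_classes = superclasses(part, triples)
    whole_classes = superclasses(whole, triples)
    return any(
        pred == "componentOf" and subj in part_classes and obj in whole_classes
        for subj, pred, obj in triples
    )


if __name__ == "__main__":
    # Nobody asserted this directly; it follows from the class hierarchy.
    print(is_component_of("SolidStateDrive", "Laptop", TRIPLES))
    # The relationship is directional, so the reverse does not hold.
    print(is_component_of("Laptop", "HardDrive", TRIPLES))
```

Even in this reduced form, the sketch shows why the class hierarchy matters: a single assertion about hard drives and computers becomes usable by any system that talks about laptops, servers, or solid-state drives, without a person restating it for each case.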
The ZapThink Take
Ontologies, however, are not a panacea. The main problem with the above stack is the amount of advance work that a company must put into defining data and the relationships among data elements before it can derive any value from the system. Moreover, even after all the work to implement a rich and expressive ontology for all the critical data a company cares about, these ontologies will still not eliminate the human from the information integration equation. By definition, ontology design is a process that requires a human to define the relationships among data elements, so not only is there no single correct ontology for any domain, but the very same data sets can result in completely different ontology representations. As such, the only way that ontologies can solve information integration challenges is if companies adopt rigorous methodologies, cross-industry information sharing, and central ontology repositories to avoid duplication of effort.
In many ways, however, the vision of using ontologies as an abstraction layer for enabling the automated exchange of information is analogous to the use of Service contracts to abstract the implementation of Service providers from consumers. In both cases, the power of loose coupling significantly reduces, but does not eliminate, the cost of integration while simultaneously enabling reuse through composition of application and data elements. While SOA solves the problem of application composition, the use of semantic integration technologies like OWL can solve the problem of data composition. As such, ZapThink envisions that successful enterprise architects will at some point have to embrace the concepts of both SOA and OWL to truly enable IT to meet changing business needs. However, just as SOA requires an advance investment in architecture, the creation of ontologies is quite time-consuming, and requires a leap of faith by implementers before they can realize any significant value.