The Power Behind the Throne: Information Integration in the Age of Data-Driven Discovery

Speaker:  Laura M Haas – San Jose, CA, United States
Topic(s):  Information Systems, Search, Information Retrieval, Database Systems, Data Mining, Data Science

Abstract

Integrating data has always been a challenge. The information management community has made great progress in tackling this challenge, in both theory and practice.  But in the last ten years, the world has changed dramatically.  New platforms, devices and applications have made huge volumes of heterogeneous data available at speeds never contemplated before, while the quality of the available data has, if anything, degraded.  Unstructured and semi-structured formats and NoSQL data stores undercut the old reliable tools of schema, forcing applications to deal with data at the instance level.  Deep expertise in the data and domain, in the tools and systems for integration and analysis, and in mathematics, computer science, and business is needed to discover insights from data, but rarely are all of these skills found in a single individual or even a single team.  Meanwhile, the availability of all these data has raised expectations for rapid breakthroughs in many sciences, for quick solutions to business problems, and for ever more sophisticated applications that combine and analyze information to meet our daily needs.

These expectations raise the bar for integration technology, while opening the door for it to play a broader role.  Integration has always been a key player in handling data variety, for example, but now, more than ever, it must deal with scale (in the number of types as well as in the volume and speed of data).  While data cleansing has been one step of an integration pipeline, this technology must now be applied throughout data integration, so that the integration process is better able to deal with the uncertainty in data, offering means to eliminate or reduce it, or to elucidate it by linking important contextual information, such as provenance and usage.  The complexity of today’s data-driven challenges in fact suggests that the integration process should be context-aware, so that data sets may be combined differently depending on the proposed usage.

In the Accelerated Discovery Lab, we support data scientists working with a broad range of data as they try to find the insights to solve problems of business or societal importance.  Clearly, integration is essential to insight.  However, integration has to span more than just datasets and schemas, and it has to be done more dynamically and flexibly than the standard tools allow.  It is needed at multiple levels: (1) to build rich (but flexible) collections of diverse data, (2) to tightly bind individual data points into entities, allowing deeper explorations, and (3) to bring together data and context to enable re-use by users with differing expertise.  We think of the environment we are building as an integration hub for data, people and applications.  It allows users to import, explore and create data and knowledge, inspired by the work of others, while it captures the patterns of decision-making and the provenance of decisions.  I will describe the environment we are creating, the advances in the field that enable it, and the challenges that remain.

About this Lecture

Number of Slides:  20
Duration:  60 minutes
Languages Available:  English
Last Updated: 

Request this Lecture

To request this particular lecture, please complete this online form.

Request a Tour

To request a tour with this speaker, please complete this online form.

All requests will be sent to ACM headquarters for review.