
The Sagence Connection

How Gmail Killed the Enterprise Data Warehouse

Posted by Sagence

1/13/14 5:53 AM

Data itself isn’t bigger than ever; it’s just that storage allows us to keep it all. The introduction of Gmail was a milestone that changed user expectations. Users now expect to have easy access to everything recorded ever for all time. This expectation is effectively overwhelming traditional data warehousing practices and continues to drive implications for the future of enterprise data management.

Remember the enterprise data warehouse? Remember the “Kimball vs. Inmon” debates? As data warehousing emerged, beginning in earnest in the 1990s and expanding into the first few years of the 21st century, traditional business intelligence (BI) practices were developed to maximize use of limited storage by carefully normalizing, summarizing, and purging data, because keeping all of the originally recorded data was prohibitively expensive. Data management practices emerged to carefully preserve the most vital information in normalized relational models and star schemas, and to methodically discard the raw, untransformed representations of transactions. It became critical to determine the most durable, flexible, valuable views, and the most likely inquiry use cases, when designing data warehouse schemas and ETL processes.

When Google launched Gmail in 2004, it changed expectations for almost all connected technology users in two significant ways. First, it moved users to a web interface. Second (and more importantly), it taught users that they never had to throw anything away. As per-user storage grew from the initial 1 GB to today’s 15+ GB, users learned that they could stop worrying about managing storage (no need to delete!) and that old emails could be retrieved at any time.

This trend, or rather this expectation, of persistent data is one of the most subtle but fundamental underpinnings of Big Data. While it is irrefutable that new [eco]systems are producing data at a faster pace with more variability than ever (Twitter, GPS, high-frequency trading, and mobile apps haven’t always been with us!), the real shift has been that we’ve taught ourselves to expect to have access to all versions of everything recorded ever for all time: emails, documents, purchase histories, daily trade files, usage logs, transcripts, exception logs, pay stubs, instrumentation, “as-of date” prices, inventory counts, work orders, trouble tickets.

The old enterprise data warehouse models don’t provide access to everything recorded about your company for all time—just the facts. Imagine if Gmail only gave you stats about your email use last quarter. Would you use it?
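To make the contrast concrete, here is a minimal Python sketch of the summarize-and-purge approach next to keeping the raw records. Everything in it is a hypothetical illustration rather than any particular warehouse design: the sample transactions, the field names, and the helper functions `quarter_of`, `summarize`, and `customers_on` are all assumptions invented for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions (illustrative data, not from a real system).
RAW_TRANSACTIONS = [
    {"day": date(2013, 10, 3), "customer": "acme", "product": "widget", "amount": 120.0},
    {"day": date(2013, 11, 18), "customer": "birch", "product": "widget", "amount": 80.0},
    {"day": date(2013, 12, 2), "customer": "acme", "product": "gadget", "amount": 200.0},
]


def quarter_of(day):
    """Return a label such as '2013-Q4' for a date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def summarize(transactions):
    """Roll raw rows up to a total amount per (quarter, product)."""
    totals = defaultdict(float)
    for row in transactions:
        totals[(quarter_of(row["day"]), row["product"])] += row["amount"]
    return dict(totals)


def customers_on(transactions, day):
    """An unplanned question that only the raw rows can answer."""
    return sorted({row["customer"] for row in transactions if row["day"] == day})


# The summary answers the question it was designed for...
summary = summarize(RAW_TRANSACTIONS)
print(summary[("2013-Q4", "widget")])

# ...but customer and exact date are gone from it, so this question
# needs the raw rows that a purge would have discarded.
print(customers_on(RAW_TRANSACTIONS, date(2013, 12, 2)))
```

The summary is compact and answers the quarterly question quickly, but it has no customer or day left in it. Any inquiry nobody anticipated at design time depends on the raw rows still being around.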

Of course, there will always be some discussion over whether limited storage was the only reason to discard (not persist) recorded data. There were legitimate conversations, as online travel booking emerged, for example, about why we would bother keeping records of all airline ticket search queries. Application logs were and are purged bi-weekly. But as time has progressed, and as the availability of storage has continued to increase, the prevailing trend has been to keep all data for not-yet-determined inquiry and use.

Ultimately, these user expectations—to access nearly all of our data for all time—are the irresistible drivers of Big Data. Why don’t data warehouses work the same way? Why aren’t they as easy to use as Gmail? These questions are forcing data managers—CDOs and CTOs—to change the face of enterprise data architecture, data governance, and data management. 


Contributed by Jace Frey.

Sagence is a management consulting firm advising clients in information-intensive industries. We specialize in data management and analytics and in the acquisition, evaluation, and development of critical data assets.

Topics: Data Management