What is a data warehouse?

Data Science   |   
Published August 22, 2019   |   

As any good tradesperson knows, you need the right tool for the job. Just ask someone who’s used a screwdriver to assemble furniture how much they would pay for a power drill.
The same holds true for working with data.
Today, many companies dive into machine learning, advanced analytics, or other buzzword-heavy projects with the goal of getting ahead of competitors. But without a solid understanding of what data can do for their organization, how to effectively store and harness that data, and a strategic, deliberate approach to such endeavors, their efforts may cause more harm than good.
So, what is the right tool for a company looking to draw insights from its data? A data warehouse, which is the most robust and sustainable tool to support reporting, analytics, and other advanced use cases.

What is a data warehouse? A basic definition and key concepts

A data warehouse, also called an enterprise data warehouse (EDW), is simply a system designed to support data analysis and reporting. An EDW serves as an organization’s central storehouse for integrated data (i.e., data combined from multiple sources so end users may easily source a single, comprehensible, usable view of important company data).
Data warehouses are intended to house an organization’s complete and unified repository of information. From one, you can pull key business data to make informed decisions. Data warehouses do nothing less than modernize corporate data usage through robust and scalable infrastructure.
Data warehouse content may come from a company’s operational systems (ERPs, Historian, PI System, etc.), financial systems, transactional systems, relational databases, and a variety of other sources.
This data usually includes both up-to-date and historical data, all of which internal users may access to build analytical reports. Company decision-makers, analysts, and data experts in particular require swift and efficient access to consistently reliable corporate data—which is precisely what data warehouses are designed to offer.

How data warehouses work

Let’s delve into a little more technical detail about the function of data warehouses, to illuminate the profound business possibilities they represent.
Data warehouses retain copies of all original or source data. This is crucial because it allows organizations to:

  • Gather data from multiple places and preserve it a single database and data model;
  • Improve transaction processing system efficacy by protecting customer-facing databases from large, long-running, analysis queries;
  • Give stakeholders a centralized view of your data by ensuring data from disparate sources are consolidated;
  • Maximize the quality and usability of organizational data through consistent coding and descriptions, by correcting problematic data, and by sorting and reducing duplicated data;
  • Maintain complete data history, even if data is purged from source transaction systems;
  • Format or restructure data so it’s more easily used and to improve query performance, no matter how complex, without undermining operational system effectiveness; and,
  • Improve operational business application quality, especially customer relationship management systems (CRMs) by maintaining a single, accurate, and up-to-date source of data truth.

The typical architectural components of a data warehouse

Data warehouses usually operate on an extract, transform, load (ETL) basis and typically employ staging, data integration, and access layers during this process. Key layers include:

  • A staging layer (or staging database), which houses raw data drawn from all organizational source data systems;
  • An integration layer, where multiple data sets are combined; this integrated data may then be moved to an operational data store (ODS) database;
  • A data warehouse database, where integrated data is put into hierarchical groups (or dimensions), facts, and aggregate facts; and,
  • An access layer where hierarchical groups are placed together.

Once data has been integrated and catalogued, designated business users can mine it to support a wide variety of analysis, research projects, and decision-making and strategic planning.
Part of what makes data warehouses so reliably accurate is that the data they contain cannot be altered. This ensures users can accurately track data changes over time; it also makes creating and maintaining an accurate data dictionary (a complete list of database files) possible. And a correct and up-to-date data dictionary is one of data warehousing’s key means for extracting, analyzing, transforming, and loading data.
This outline of data warehouse architecture leads us to a more complete definition of data warehousing. A robust, strategically focused data warehouse comprises not only tools to extract, transform, and load data, layers to transform it, and a dictionary to organize it; it also includes tools to manage and retrieve metadata—and business intelligence tools that both support and reflect your organization’s unique requirements.

How to build your data warehouse

Data warehouse architectures can vary widely in complexity, according to the needs of each organization. All data warehouses, however, must be built using these steps:

  • Find your data. You must collect massive amounts of data, from every corner of your company—and from any relevant external sources.
  • Cleanse your data. Scan data for mistakes, omissions, and duplication; make the appropriate corrections and deletions.
  • Convert data from a database to a warehouse format, so it becomes read-only.
  • Sort, consolidate, and summarize your data. This will ensure not only that it’s accurate and complete, but perhaps most important, easy to delve into and use.

This process will be repeated anytime you add more data, or if any of your data sources are modified.

Popular data warehouse architectures

There are three main data warehouse forms; which architectural approach an organization takes reflects variables like size, line(s) of business, and current corporate data setup.
Basic data warehouse. This simple format lets users run simple or straightforward queries like “August sales” or “new customers added in Q2”. How quickly such queries can be completed—also known as access, latency, or online analytical processing (OLAP)—is paramount in this case.
Data warehouse with staging area(s). This is key for data warehouses consolidating large quantities of important but varied business data sources; staging areas make data cleansing easier, and integrating or consolidating data from myriad sources more accurate.
Data warehouse with staging area(s) and data mart(s). This is the future—but a future you can build now. Data marts give different groups in an organization access to the specific information they need, in a way that will benefit both their particular focus (e.g., sales or OpEx) and the larger organization simultaneously.
Thus, the larger and more complex a company is, the more it would benefit from building a data warehouse with both staging areas and data marts. All data warehouses answer data queries, so smaller organizations or those with a single data source would also benefit from adopting a data warehousing approach. But what, precisely, is a data mart?

What about data marts, data lakes and databases? How are they different?

There are a lot of data sorting, storage, and accessing options available. Which will benefit your business most depends on what you use your data for.
Data mart. As already indicated, a data mart is part of a data warehouse, generally geared towards giving a group, team, or line of business and the specific information they require. Also called mini-data warehouses, they both improve response time within the already low-latency data warehouse and ensure queries are sufficiently focused to be useful to end users.
Data lake. Data lakes are simply repositories filled with unorganized, unclassified data; they’re generally helpful for collecting data the value of which isn’t yet known. Data lake data may not be cleansed, corrected, or deduplicated; useful for applications like machine learning, data lake analytics queries can produce poor results for users looking for usable, trustworthy business insights.
Database. Databases log frequent transactions and provide quick access to specific, repetitive business transactions. While designed to be good at receiving data, databases simply aren’t built to be sources from which to pull insights.

Use cases: data warehouse vs. database

Since data marts are sub-sets of, and reside within, data warehouses; and as data lakes, like databases, don’t sift, organize, cleanse, or integrate data, we’ll consider use cases only for databases and data warehouses now.
Data drawn from databases for analysis is generally used for simple, daily transactions, such as:

  • Displaying tons processed per day from a machine;
  • Hospital staff admitting or registering patients; and,
  • Recording hours worked at a job site.

Databases are relatively basic tools. The primary function of a database is to accurately and efficiently record data, sometimes very large amounts thereof—and that’s all. To maintain this simple functionality, architectural trade-offs are often required—trade-offs that can limit an organization’s ability to access, use, and analyze its own data.
In contrast, data warehouses are made to support more sophisticated activities, such as:

  • high-level reporting and analysis aimed at coming to intelligent, informed business decisions;
  • mining data for present or future needs from many, perhaps large and/or disconnected, databases;
  • closely analyzing large volumes of data in-depth for market research purposes;
  • analyzing user behaviors, both internal and customer, to tweak sales and marketing strategies, or improve processes and collaboration; or,
  • gaining usable, unique insights, whether via reports, ad hoc queries, or automated decision-making.

Organizations may pull data from multiple databases into a single data warehouse to make it easier to access insights.
Consider Wonderware Historian, a database that captures a wide variety of sensor data from multiple industrial sources. A mining company could use a data warehouse to consolidate multiple data sources (including, among many others, Historian); analysts could query this newly populated data warehouse with specific requests, such as a list of each day the mining company’s primary crusher processed less than 3,000 tons.
Not only can data warehouses do more, from a business perspective, than databases alone, they can also be connected to other, more business-focused tools to bring organizations even more competitive value. For example, specific requests such as a mining company’s measurable crusher productivity can be aggregated into other tools like dashboards or more complex models.

This is why you need a data warehouse

We hope it’s clear by now that running a data-driven organization in a competitive and quickly changing business landscape, data warehouses—with data marts as appropriate—are the surest way to win.
Data warehouses offer the most reliable and accurate way for businesses to store and access structured data; this in turn improves cross-organizational data access via reports, dashboards, and analytics tools. These help businesses better monitor performance and improve decision-making because they know their data is trustworthy. This begins a virtuous cycle:
Consider this hypothetical situation (sadly common in the real world): Organization X owns a lot of data but it’s difficult to access. It takes a long time to generate reports and end users can’t necessarily trust results; perhaps quarter-end reports don’t align or some piece of the company’s larger data puzzle is clearly missing. Such reporting insufficiencies only increase friction—data, social, collaborative, workflow, to name a few—throughout the organization.
Things don’t need to be this way. Consider the opposite scenario; Company Z has set up a data warehouse and everyone knows they can access data accurately, easily, and whenever required. What happens at companies like this is, users will read automatically generated reports and, their interest piqued, eventually start asking for more. They will realize how easy it is to get useable company and/or departmental data and will start to run experiments—to improve operations, provide better customer insight, or save money, to name just a few possibilities. This would encourage bolder activity—and ultimately transform the business.
Organizations with committed data warehousing teams can plan and move well ahead of their less data-savvy competitors in every way—from product development, marketing, pricing, production processes, and historical analysis, to forecasting, employee organization, and customer satisfaction. They can, in short, thrive where others will fail.
This article originally appeared on the 3AG blog.