Data warehouse: concepts and testing
Data warehouse: concepts and testing

Data warehouse: concepts and testing

Data Warehouse is where data from different source systems are integrated, processed and stored. It is a non-production data, which is mainly used for analyzing and reporting, in order for management team to make important business decisions. Non-production data is represented  by data that is not used by a certain sector of a business, but is a necessary data for the entire business to grow and to adapt to the new conditions. For example, a retailer might be interested in customer buying habits, so he will need data about the purchases the customer makes.

This for sure will include data that will not be used in production, but it is essential data for the retailer to make business decision in order to adapt to the customer needs. Storing this mass information, makes the topic of the Data Warehousing. Data Warehouse collects large volume of data from various sources, with many data formats. The process that handles and transforms the data into a more consistent, standard formatted data is the ETL.

ETL is the acronym for Extraction, Transformation and Loading.  The ETL process is done using an ETL tool – most used tools are Data Stage, Informatica and Oracle Warehouse Builder.

An interesting definition of Data Warehouse was written by Bill Inmon: “A Data Warehouse is a historical, subject-orientated, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”.

But what does this mean?

  1. Historical means that the data is continuously collected from sources and loaded in the warehouse, previously data being saved for a long period of time.

  2. Subject-Orientated means that the data is grouped into a particular business area, instead of the entire business.

  3. Integrated means that the warehouse is collecting and merging data from various sources.

  4. Time-variant means that all data in the data warehouse is identified with a particular time period.

  5. Non-volatile means that data loaded in the warehouse is based on business transactions in the past, hence it is not expected to change over time.

Architectural Overview

A Data Warehouse can be can designed in many different ways, depending on the business specific needs. However, the most popular model in many organizations is “hub-and-spoke”. Below is presented the functionality of the system:

  1. Data is copied from databases used in operational systems into a data warehouse staging area

  2. Form staging, the data is moved into a data warehouse (staging is an intermediate place where data from source is copied)

  3. Data is copied into a set of conformed data marts Data is copied from one database to anther using a technology called ETL – Extract, Transform, Load. Please find below a typical Data Warehousing Environment.

Why to create a Data Warehousing?

The main reason of creating a data warehousing is because a corporation has all data assets fragmented in many systems, running on different platforms and in different physical locations. More, usually  a databases created for specific sectors of the corporation, are “relational”, and not “dimensional”, being designed for operational and data entry. The challenges of a data warehousing is to be able to quickly consolidate, cleanse and integrate data form multiple databases that run on different platforms. This job is performed using a technology called ETL, as mentioned above in this article. The ETL software extracts data, transforms values of inconsistent data, cleanses “bad” data, filters data and loads it into a target database. A very important step is scheduling the ETL jobs, because even thought there is a failure in one ETL, the rest of the jobs should respond correctly.

Above in this article I talked about a layer called “staging”. As mentioned, this is the place where data is copied from source systems. This is mainly required in a Data Warehousing for timing reasons, because all data needs to be available before integrating in the final Data Warehouse. (an example were the data needs to be available before the final integration is testing). After data is copied to staging, this is moved to Data Warehouse. However, decision makers do not use the Data Warehouse directly for queries, they use a layer called Data Mart. ETL jobs extract and populate one or more Data Marts for use by different groups in the organization. The Data Marts can be dimensional (Star Schema) or relational, depending on how the information will be used.

Data Warehousing Testing

With your mind full with the information about the concepts of data warehousing and the importance of it, let’s proceed and talk about the importance of testing the ETL. A well planned and well defined testing scope, guarantees a smooth conversion of the project to production. A business is more confident when the ETL process is checked by a different team, which has as purpose to make sure that the data warehouse is concrete and robust. Testing procedure is categorized into four different engagements:

  • New Data Warehouse Testing – a new data warehouse is build and checked from scratch.
  • Migration Testing – in this situation, the customer has a Data Warehouse, ETL jobs are running correctly, but the business needs to improve the efficiency, so the system is ported to a platform.
  • Change Request – in this situation, new data is added from different source to an existing Data Warehouse. Also situation also can mean that the customer needs to change their existing business rule or they might need to integrate a new rule.
  • Reporting Testing – reports are the end result of a Data Warehouse and basically the entire purpose of it.

ETL testing techniques

  • Check if the data that is extracted from source is transformed correctly and according to various business requirements
  • Data integrity – make sure that all project relevant data is loaded into the Data Warehouse, without any data loss or truncation
  • Check if the ETL job, rejects or replaces invalid data
  • Make sure that the ETL jobs run and loads the data within the time frames

Beside the above four main ETL testing procedures, other methods like integration testing and user acceptance testing are also carried out to make sure that the system is smooth and reliable. A well defined test strategy will make sure that correct approach has been followed meeting the testing aspiration. ETL testing, might require writing SQL statements extensively by testing team or may modify the SQLs provided by development team. In any case, testing team must be aware of the results they are trying by using the SQL statements.