My developer connection home My Developer Connection
knowledgebase for software developers

Datawarehousing - guidelines for planning and development

 Introduction

Data warehousing has established a new middleware layer in corporate application architecture. Such a middleware layer is necessary because the direct, individual access of decision support applications to data of operational, transaction-oriented applications has proved to be technically or economically infeasible. Data quality problems and complex integration requirements usually make it impossible to provide a consistent and real time integrated data to decision support systems.

 

Datawarehousing system as middleware layer.

 

This paper outlines the common guidelines followed across the Datawarehousing solutions. It also highlights the risks involved at various stages of Datawarehousing application development. Each of these guidelines needs to be discussed in detail to arrive at a solution best suitable for your business requirements.
 

Planning

Requirement Analysis guidelines:

  •  We should allow 80% of the task effort to focus on immediate requirements. The remaining 20% of the effort should be spent trying to understand what future requirements are.
  •  The source systems (accounting, ERP, sales, inventory…), which will provide information to data, warehouse needs to be identified.
  • The business rules that needs to be applied to the data needs to be well understood.
  • If medium term requirements are not well understood, it could put limit the candidate architecture to evolve and satisfy longer-term requirements.
  • Produce the logical model for the data warehouse and produce initial query profile. Expect it to change on an ongoing basis.
  • Identify initial sizing estimates for database.
  • Understand the existing IT infrastructure and identify hardware preference.
  • Refine and rework initial conclusions. Always document the present.

 

Risks:

  • Because the data warehousing requirements are never fully understood, beware of the tendency to extend this task. A common comment is “we still don’t fully understand the requirements.” A time box is required as such requirements will be impossible to define completely.
  • Focus on designing flexibility into data warehouse and business rule execution. Failure to do so ma lead to substantial cost penalties in the future.
  • Since a logical model of enterprise may not exist, spend some time trying to brainstorm what a potential one might look like.
  • If the midpoint feedback from right people does not take place, the initial decision may be suspect or risky, and further work on them may be inappropriate.


 

Architecture and design guidelines:

  • The following input are definitely required to start the architecture and blueprint of a Datawarehousing system:
    • Identification of the mechanism for the data transfer and load.
    • Database sizing and query performance expectations.
    • Access control, backup and recovery guidelines.
    • Overall scope definition. (Not just the first build definition).

IT infrastructure.

  • Data extraction might be done in one of the two ways:
    • Using flat files.
    • Using data management tool (EAI tools).
  • The worst-case scenarios must be defined in terms of the size of feed data and allowed processing time for the maximum load.
  • It is advisable to design a preliminary set extraction and load scripts and test the performance. Expect the statistics to change as the scripts evolves and database grows.
  • Design a process controller on the source system to initiate and track the load and validation process.
  • Design physical databases. Convert the logical model to a detailed database design within the overall technical blueprint.
  • Design and build scripts to generate indexes.
  • Design and build basic data validation functions, which need to be performed within Datawarehousing schema.
  • Design and build scripts to generate summary and error tables. A summary table contains the process and status information regarding extraction and load. Error tables’ records dirty data, which are rejected by Datawarehousing schema validations.
  • Document the technical blueprint and decisions on which load and validation tools will be used.
  • Access control facility could be sometime very complex, so the elapsed time on these steps could increase.
  • Design and management of the data warehouse on a MPP System is considerably more difficult than on an SMP or cluster.
  • Do not design the data warehouse around a specific tool or tool type.
  • Ensure that the data warehousing design scales as the data scales.

Development

Guidelines for development and testing Datawarehousing solution:

  • Create Physical database and partition the table spaces and disks etc.
  • Avoid “queries of desert” which never seems to be ending. These are the queries that occupy the entire resource of the machine and effectively run forever.
  • First, profiles can be used to limit the amount of resource a user process can use. When that amount is exceeded, the query will be terminated automatically. It will stop it from running forever.
  • Queuing of queries can be used to control their resource usage. If all the queries must be submitted via a query/load manager, then the degree of parallelism can be controlled.
  • Do not load data directly into the data warehouse tables until it has been cleaned up. That is, use temporary tables that emulate the structure within the data warehouse.
  • Consider partitioning the fact table on the basis of refresh cycle. For example, if data is loaded on a weekly basis, consider using weekly partition. Small partitions could be amalgamated into larger partition on a regular basis: at month end, for example.
  • Consider dropping all indexes against fact tables prior to load or bulk inserts. Create indexes once the data is cleaned or whenever required for validation performance.
  • Expect dimension information to change on a regular basis, and structure the data warehouse to minimize the cost of the change. Specially isolate dimension from fact data, and carefully select foreign keys for the fact table.
  • Populate data marts via data warehouse in order to clean up the data and ensure consistency.
  • Do not store aggregated columns within fact tables. It is usually cheaper to aggregate the columns on the fly.
  • Ensure that every byte in the column definitions within a large fact table is needed. Savings here will have a substantial effect on the size and complexity of the fact table.
  • Design and test the archiving process into the data warehouse from the beginning, even if it will not be used for sometime. This will avoid any nasty shocks down the line.
  • Direct query on data warehouse should not be allowed. Ideally all the query access to data warehouse should go via query manager.
  • Take the time to test the most complex and awkward queries that the business is likely to ask against different index and aggregation strategies.
  • Generate test data with correct balance and skew. Make sure the ratio of the fact to dimension is correct and so on.
  • Double the amount of time you would normally spend for testing. Test everything twice.
  • Test I/O operation across different disk stripe widths.
  • Spread the load source files and the destination load files to avoid I/O bottlenecks.
  • Do not allow cost or budget considerations to affect capacity estimates. In other words if the system will function with 500GB of RAID disks and super tuned DB servers, don’t do the capacity planning for less just to avoid the cost. Such limitation can cause difficulty in query tuning and load manager. Once the system is designed for best possible hardware, you can try to test the system with less sophisticated hardware in a step-by-step manner.

 

Risks:

  • Very often the complexity of extraction is underestimated.
  • It can be very difficult to determine exactly when the extraction should be initiated.
  • It is critical that a correct balance is achieved between queries loading the star flake schema and accessing it. The step to generate the summary and error tables tends to continually grow in response to new user requirements.
  • Very often the cleanliness of the data is underestimated. It results in neglecting the performance consideration for validation functions.
  • It may be impossible to tell the difference between and acceptable query, which takes up to 72 hours to run and the one that never finishes.
  • You need to be aware of any limitations, imposed by either the operating system or hardware, on striping and mirroring disks.


 

 

Glossary:

Data Mart: Data Mart differs from data warehouse in the scope of the data that it deals with. Typically, it is focused on a specific department or business process rather than being a corporate data repository. A data mart can be stand alone or fed centrally from the corporate data warehouse.

 

Dimension: Av dimension of a piece of data is any part of that data on which it can e usefully viewed, divided or summarized. Typical dimensions are items such as time, date, region, location or account_id.

 

Fact Data: The fact data is the basic core data that will be stored in the data warehouse. For example: Sales, Call records or account transactions.

 

MPP: An acronym for massively parallel processing. These are large multi-node machines with larger number of CPUs.

 

NUMA: An acronym for non-uniform memory architecture. A NUMA machine is composed of multiple nodes, but unlike cluster or MPP machines, it has shared memory address space.

 

SMP: An acronym for symmetric multi processing. An SMP machine consists of many CPUs, which share memory and disk.

Cluster: A tightly coupled group of SMP machines, with shared disks. 

References:

Technical article: The Current and Future Role of Data Warehousing in Corporate Application Architecture, by Robert Winter, Institute of Information Management, University of St. Gallen.

Inmon, W.H., Zachman, J.A., Geiger, J.G.: Data Stores Data Warehousing and the Zachman Framework, McGraw-

Hill:

Modern Data warehousing, Mining and Visualization: Core Concepts, George M. Marakas

Database Systems: A Practical Approach to design, Implementation, and Management, 3/e Thomas Connolly, Carolyn Begg