|












| |
Datawarehousing -
guidelines for planning and development
Introduction
Data warehousing has established a new middleware layer in corporate application
architecture. Such a middleware layer is necessary because the direct,
individual access of decision support applications to data of operational,
transaction-oriented applications has proved to be technically or economically
infeasible. Data quality problems and complex integration requirements usually
make it impossible to provide a consistent and real time integrated data to
decision support systems.

Datawarehousing system as middleware layer.
This paper outlines the common
guidelines followed across the Datawarehousing solutions. It also highlights the
risks involved at various stages of Datawarehousing application development.
Each of these guidelines needs to be discussed in detail to arrive at a solution
best suitable for your business requirements.
Planning
Requirement Analysis
guidelines:
- We should allow 80% of the task
effort to focus on immediate requirements. The remaining 20% of the effort
should be spent trying to understand what future requirements are.
- The source systems (accounting, ERP,
sales, inventory…), which will provide information to data, warehouse needs to
be identified.
- The business rules that needs to be
applied to the data needs to be well understood.
- If medium term requirements are not
well understood, it could put limit the candidate architecture to evolve and
satisfy longer-term requirements.
- Produce the logical model for the
data warehouse and produce initial query profile. Expect it to change on an
ongoing basis.
- Identify initial sizing estimates for
database.
- Understand the existing IT
infrastructure and identify hardware preference.
- Refine and rework initial
conclusions. Always document the present.
Risks:
- Because the data warehousing
requirements are never fully understood, beware of the tendency to extend this
task. A common comment is “we still don’t fully understand the requirements.” A
time box is required as such requirements will be impossible to define
completely.
- Focus on designing flexibility into
data warehouse and business rule execution. Failure to do so ma lead to
substantial cost penalties in the future.
- Since a logical model of enterprise
may not exist, spend some time trying to brainstorm what a potential one might
look like.
- If the midpoint feedback from right
people does not take place, the initial decision may be suspect or risky, and
further work on them may be inappropriate.
Architecture and design
guidelines:
- The following input are definitely
required to start the architecture and blueprint of a Datawarehousing system:
- Identification of the mechanism for
the data transfer and load.
- Database sizing and query performance
expectations.
- Access control, backup and recovery
guidelines.
- Overall scope definition. (Not just
the first build definition).
IT infrastructure.
- Data extraction might be done in one
of the two ways:
- Using flat files.
- Using data management tool (EAI
tools).
- The worst-case scenarios must be
defined in terms of the size of feed data and allowed processing time for the
maximum load.
- It is advisable to design a
preliminary set extraction and load scripts and test the performance. Expect the
statistics to change as the scripts evolves and database grows.
- Design a process controller on the
source system to initiate and track the load and validation process.
- Design physical databases. Convert
the logical model to a detailed database design within the overall technical
blueprint.
- Design and build scripts to generate
indexes.
- Design and build basic data
validation functions, which need to be performed within Datawarehousing schema.
- Design and build scripts to generate
summary and error tables. A summary table contains the process and status
information regarding extraction and load. Error tables’ records dirty data,
which are rejected by Datawarehousing schema validations.
- Document the technical blueprint and
decisions on which load and validation tools will be used.
- Access control facility could be
sometime very complex, so the elapsed time on these steps could increase.
- Design and management of the data
warehouse on a MPP System is considerably more difficult than on an SMP or
cluster.
- Do not design the data warehouse
around a specific tool or tool type.
- Ensure that the data warehousing
design scales as the data scales.
Development
Guidelines for
development and testing Datawarehousing solution:
- Create Physical database and
partition the table spaces and disks etc.
- Avoid “queries of desert” which never
seems to be ending. These are the queries that occupy the entire resource of the
machine and effectively run forever.
- First, profiles can be used to limit
the amount of resource a user process can use. When that amount is exceeded, the
query will be terminated automatically. It will stop it from running forever.
- Queuing of queries can be used to
control their resource usage. If all the queries must be submitted via a
query/load manager, then the degree of parallelism can be controlled.
- Do not load data directly into the
data warehouse tables until it has been cleaned up. That is, use temporary
tables that emulate the structure within the data warehouse.
- Consider partitioning the fact table
on the basis of refresh cycle. For example, if data is loaded on a weekly basis,
consider using weekly partition. Small partitions could be amalgamated into
larger partition on a regular basis: at month end, for example.
- Consider dropping all indexes against
fact tables prior to load or bulk inserts. Create indexes once the data is
cleaned or whenever required for validation performance.
- Expect dimension information to
change on a regular basis, and structure the data warehouse to minimize the cost
of the change. Specially isolate dimension from fact data, and carefully select
foreign keys for the fact table.
- Populate data marts via data
warehouse in order to clean up the data and ensure consistency.
- Do not store aggregated columns
within fact tables. It is usually cheaper to aggregate the columns on the fly.
- Ensure that every byte in the column
definitions within a large fact table is needed. Savings here will have a
substantial effect on the size and complexity of the fact table.
- Design and test the archiving process
into the data warehouse from the beginning, even if it will not be used for
sometime. This will avoid any nasty shocks down the line.
- Direct query on data warehouse should
not be allowed. Ideally all the query access to data warehouse should go via
query manager.
- Take the time to test the most
complex and awkward queries that the business is likely to ask against different
index and aggregation strategies.
- Generate test data with correct
balance and skew. Make sure the ratio of the fact to dimension is correct and so
on.
- Double the amount of time you would
normally spend for testing. Test everything twice.
- Test I/O operation across different
disk stripe widths.
- Spread the load source files and the
destination load files to avoid I/O bottlenecks.
- Do not allow cost or budget
considerations to affect capacity estimates. In other words if the system will
function with 500GB of RAID disks and super tuned DB servers, don’t do the
capacity planning for less just to avoid the cost. Such limitation can cause
difficulty in query tuning and load manager. Once the system is designed for
best possible hardware, you can try to test the system with less sophisticated
hardware in a step-by-step manner.
Risks:
- Very often the complexity of
extraction is underestimated.
- It can be very difficult to determine
exactly when the extraction should be initiated.
- It is critical that a correct balance
is achieved between queries loading the star flake schema and accessing it. The
step to generate the summary and error tables tends to continually grow in
response to new user requirements.
- Very often the cleanliness of the
data is underestimated. It results in neglecting the performance consideration
for validation functions.
- It may be impossible to tell the
difference between and acceptable query, which takes up to 72 hours to run and
the one that never finishes.
- You need to be aware of any
limitations, imposed by either the operating system or hardware, on striping and
mirroring disks.
Glossary:
Data Mart:
Data Mart differs from data warehouse in the scope of the data that it deals
with. Typically, it is focused on a specific department or business process
rather than being a corporate data repository. A data mart can be stand alone or
fed centrally from the corporate data warehouse.
Dimension:
Av dimension of a piece of data is any part of that data on which it can e
usefully viewed, divided or summarized. Typical dimensions are items such as
time, date, region, location or account_id.
Fact Data:
The fact data is the basic core data that will be stored in the data warehouse.
For example: Sales, Call records or account transactions.
MPP: An acronym for massively parallel processing. These
are large multi-node machines with larger number of CPUs.
NUMA: An acronym for non-uniform memory architecture. A
NUMA machine is composed of multiple nodes, but unlike cluster or MPP machines,
it has shared memory address space.
SMP: An acronym for symmetric multi processing. An SMP
machine consists of many CPUs, which share memory and disk.
Cluster:
A tightly coupled group of SMP machines, with shared disks.
References:
Technical article: The Current and Future Role of Data Warehousing in Corporate
Application Architecture, by Robert Winter, Institute of Information Management,
University of St. Gallen.
Inmon, W.H., Zachman, J.A.,
Geiger, J.G.: Data Stores Data Warehousing and the Zachman Framework, McGraw-
Hill:
Modern Data warehousing, Mining and Visualization: Core Concepts, George M.
Marakas
Database Systems: A
Practical Approach to design, Implementation, and Management, 3/e Thomas
Connolly, Carolyn Begg
|