Thursday, October 02, 2008
1. No change to the data by default
2. Provide easy access to, and traceability of, errors back to the source system
3. Put a process in place to notify the business user
4. If no action is taken for a specific period of time, archive and send a notification
Exception Handling: Exceptions fall into two categories, namely Critical and Warning. Critical exceptions are those which stop the incoming row from reaching the business target table. Instead, the row information is captured in exception table(s) and presented to the business team to analyze the error. Warning exceptions are business-rule failures which do not stop the row from moving to the business target table(s), but warning information is captured in exception table(s) for later review.
As in any other tool, errors can be captured programmatically in Informatica. But Informatica also provides a built-in capability to capture errors. This needs some architectural design and a development standard for exception handling. Informatica provides four tables (or a file) where all errors can be captured. Post-process these tables (or file) to parse the error information and load it into an exception data model. This provides a standard method of error capture.
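The Critical/Warning routing described above can be sketched in a few lines of Python. This is a minimal illustration of the rule-classification idea, not Informatica's actual error tables; the rule names, column names, and table structures are all hypothetical.

```python
# Sketch of Critical vs. Warning exception routing. Rule and column
# names are hypothetical, used only to illustrate the two categories.

CRITICAL, WARNING = "CRITICAL", "WARNING"

# Each rule: (name, severity, predicate that returns True when the row passes)
RULES = [
    ("customer_id_present", CRITICAL, lambda r: r.get("customer_id") is not None),
    ("amount_non_negative", WARNING,  lambda r: r.get("amount", 0) >= 0),
]

def route_row(row, target_table, exception_table):
    """Apply all rules; a critical failure blocks the row, warnings do not."""
    blocked = False
    for name, severity, passes in RULES:
        if not passes(row):
            exception_table.append({"rule": name, "severity": severity, "row": dict(row)})
            if severity == CRITICAL:
                blocked = True
    if not blocked:
        target_table.append(row)

target, exceptions = [], []
route_row({"customer_id": 1, "amount": -5}, target, exceptions)     # warning only: loaded
route_row({"customer_id": None, "amount": 10}, target, exceptions)  # critical: blocked
```

Note how the warning row still lands in the target table while its exception record is kept for later review, exactly the split described above.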
Labels: Exception Handling using Informatica
Friday, September 26, 2008
These are a few questions about which any data warehouse consultant should be very clear.
ODS - also known as the Operational Data Store. The ODS is the first persistent data layer in the layered data warehouse architecture. In practice, the ODS data model is kept very close to the source data model. Then what is the value of the ODS layer? Data is extracted from the source using very simple queries, so that there is minimum impact on the source database. The following are the typical processing tasks performed on the ODS:
- Data Standardization
- Naming standardization
- Data Profiling
When data is extracted from the source, data standardization is performed to convert data values to enterprise standard values (it is recommended to store both the original and the standardized value in the ODS for future reference). Table column names are also standardized to enterprise standard names. This enables business users to get a first preview of what is coming next in the data warehouse. This data store is also open for any data profiling activity for the next data store, Integration.
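A minimal sketch of that standardization step, keeping the original value alongside the standardized one as recommended above. The code mapping and column names here are illustrative assumptions, not a real enterprise standard.

```python
# Sketch of ODS data standardization: map raw source values to an
# enterprise standard value while keeping the original for reference.
# The mapping and column names are hypothetical.

GENDER_STD = {"M": "MALE", "MALE": "MALE", "F": "FEMALE", "FEMALE": "FEMALE"}

def standardize(row):
    raw = (row.get("gender") or "").strip().upper()
    return {
        **row,
        "gender_orig": row.get("gender"),               # keep the source value
        "gender_std": GENDER_STD.get(raw, "UNKNOWN"),   # enterprise standard value
    }

row = standardize({"cust_id": 7, "gender": "f"})
# row keeps both: gender_orig = "f", gender_std = "FEMALE"
```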
Staging Area: Staging is a transient data store used to hold data before loading it into a persistent data store. In the staging area, data is prepared and made ready to load into the persistent area. The table structure in the staging area is kept the same as the persistent data store's table structure. This helps move data from the staging area to the persistent data store efficiently, without keeping the persistent data store locked for a long time. The staging area is an ETL/ELT requirement that enables an effective, controlled data load. The staging area is truncated after every successful data load into the persistent data store. There are different approaches used to load data from the staging area into the persistent data store:
- ELT - use SQL to move data
- Partition pinning - move a partition from staging to the persistent data store.
- ETL - standard extract and load.
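The first approach, "use SQL to move data," can be sketched with an in-memory SQLite database standing in for the warehouse. The table names are hypothetical; the point is that one set-based SQL statement does the move and the staging table is truncated after a successful load, as described above.

```python
# Sketch of the "ELT - use SQL to move data" staging pattern, using
# SQLite as a stand-in database. Table names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_sales (id INTEGER, amount REAL)")  # transient staging
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")      # persistent store

# 1. Bulk-load the staging table; no locks are held on the persistent table.
con.executemany("INSERT INTO stg_sales VALUES (?, ?)", [(1, 10.0), (2, 20.5)])

# 2. One set-based SQL statement moves the prepared data, keeping the
#    lock window on the persistent store short.
con.execute("INSERT INTO sales SELECT id, amount FROM stg_sales")

# 3. Truncate staging after the successful load.
con.execute("DELETE FROM stg_sales")
con.commit()
```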
For further discussion, please write to contact_ak@yahoo.com (Arvind Kumar)
Labels: ODS and Staging Area
Monday, March 31, 2008
Recycle: This is the approach defined in the ETL process to handle and re-process rows rejected due to technical dependencies. A very simple example: the reference data load is delayed while the regular payload runs on schedule. This scenario can potentially create many failed rows due to reference-code validation. But this failure is not a business error; it is purely a technical dependency, and there is no need for any business involvement to resolve the issue. Such failed records are candidates for automatic recycle. The technical design should take care of reprocessing such failed rows at regular intervals and cleaning up the rejected-rows table.
Recycle Process Approach: The idea here is to design a process which can re-process the failed records and bring them back into the mainstream process without any manual intervention. When the failed rows are brought back into mainstream processing, care needs to be taken to make sure a failed record will not create duplicates in the mainstream processing. Such duplicates need to be handled and processed accordingly.
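The recycle loop above can be sketched as follows. Rejected rows are retried once their technical dependency (here, a reference-code table) is satisfied, and already-loaded keys are skipped to avoid the duplicates mentioned above. All names and key choices here are illustrative assumptions.

```python
# Sketch of an automatic recycle pass: retry rejects whose reference
# code has now arrived, skipping rows already loaded. Names are
# hypothetical.

def recycle(rejects, reference_codes, target):
    loaded_keys = {r["id"] for r in target}
    still_rejected = []
    for row in rejects:
        if row["ref_code"] not in reference_codes:
            still_rejected.append(row)        # dependency still unmet; keep for next cycle
        elif row["id"] not in loaded_keys:    # avoid duplicates in the main stream
            target.append(row)
            loaded_keys.add(row["id"])
    return still_rejected

target = [{"id": 1, "ref_code": "A"}]
rejects = [{"id": 2, "ref_code": "B"}, {"id": 1, "ref_code": "A"}, {"id": 3, "ref_code": "Z"}]
rejects = recycle(rejects, reference_codes={"A", "B"}, target=target)
# id 2 is loaded, id 1 is skipped as a duplicate, id 3 waits for code "Z"
```

Running this pass on a schedule, until the rejected-rows table drains, is the "regular interval" reprocessing described above.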
Labels: Recycle Process - Challenge
Data integration has become an organizational requirement for survival in today's data-dominated competitive world. Every organization is looking for a quick data integration solution, with results in less than a couple of months, and if possible in a couple of weeks. This is the biggest challenge for any DI architect and designer: implementing a quick solution that meets higher management's goals.
Data integration is not a packaged product or solution which can be bought off the shelf. It is an iterative solution, highly customized to the organization's needs. Every organization has its own setup and requirements. It is very important to understand that a solution implemented at one place cannot be ported to a different organization with end results in a few weeks. It is true the concepts remain the same, but the approach and implementation are very dependent on the individual organization.
It has always been a topic for debate whether one should use available tools or build in-house code for integration. But by now the drawbacks and long-term cost overhead of in-house-built solutions are understood. The biggest challenge with an in-house solution is long-term support and maintenance. But I have seen that sometimes, due to lack of time and budget (pressure from higher management) to meet immediate objectives, architects and designers tend to go for an in-house solution, because that is the only skill set available at the time, and acquiring a tool would take longer. And if a tool is used in such a case, it will be used only to satisfy management, and the tool will not be used effectively.
Labels: Data Integration Challenges
Friday, March 28, 2008
For the past decade, ETL (Extract Transform Load) has been the most commonly used architecture for ETL tools like Informatica, DataStage, Ab Initio, etc. But now database servers are getting very powerful, and it is making less sense to go with the ETL architecture.
What is ETL: ETL is the approach where most of the heavy-lifting transformation is performed outside the database server, on the ETL server. In this approach, data is first extracted from the source and cached on the ETL server. The data is then transformed, mapped back to the database's native datatypes, and loaded. There is a dependency on the latency between the database server and the ETL server.
ELT approach: Data is loaded into the database once, and all transformation is performed on the database server. This approach removes any dependency on the latency between the ETL server and the database server, and it also removes one extra hop of I/O between the DB server and the ETL server.
Different ETL tools are providing architectures to perform efficient ELT while still maintaining all code metadata in the ETL tool repository. The ETL tool will also capture all run-time statistics. Informatica's pushdown optimization license enables the ELT architecture.
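The contrast between the two architectures can be sketched with SQLite standing in for the database server and the Python process standing in for the ETL server. Table names, columns, and the conversion rate are all hypothetical; the point is where the transformation runs.

```python
# Illustrative ETL-vs-ELT contrast. In ETL the rows round-trip through
# the "ETL server" (this Python process); in ELT one SQL statement does
# the transformation inside the database. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (amount REAL)")
con.execute("CREATE TABLE tgt_etl (amount_usd REAL)")
con.execute("CREATE TABLE tgt_elt (amount_usd REAL)")
con.executemany("INSERT INTO src VALUES (?)", [(10.0,), (20.0,)])

RATE = 1.1

# ETL: extract to the ETL server, transform there, load back.
rows = con.execute("SELECT amount FROM src").fetchall()          # extract
transformed = [(amount * RATE,) for (amount,) in rows]           # transform outside the DB
con.executemany("INSERT INTO tgt_etl VALUES (?)", transformed)   # load

# ELT: push the transformation down into one SQL statement, so the
# data never leaves the database server.
con.execute(f"INSERT INTO tgt_elt SELECT amount * {RATE} FROM src")
```

Both targets end up with the same data; the ELT path simply avoids the extra hop of I/O described above.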
Labels: ETL VS ELT
Wednesday, October 04, 2006
1. How do you change the default ORDER BY clause in the Lookup Transformation?
Answer: Informatica by default adds an ORDER BY clause to the lookup query. To override the default, add "--" at the end of the lookup override query. Informatica will still append the ORDER BY clause at the end of the SQL override, but it will be treated as a commented statement because of the --.
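A quick sketch of what happens to the SQL text with this trick. The query itself is illustrative, not taken from a real lookup; the point is that everything after "--" on that line becomes a SQL comment, so only the ORDER BY written in the override survives.

```python
# Sketch of the lookup-override trick: the trailing "--" comments out
# the ORDER BY that Informatica appends. The query text is hypothetical.

override = "SELECT cust_id, cust_name FROM customers ORDER BY cust_id --"
final_sql = override + " ORDER BY cust_id, cust_name"  # appended by the tool

# From the database's point of view, the effective query is everything
# before the "--" comment marker.
effective = final_sql.split("--")[0].strip()
# effective == "SELECT cust_id, cust_name FROM customers ORDER BY cust_id"
```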
For any further questions, send email to arvind.pushpam@gmail.com
Wednesday, August 16, 2006
ETL Tool
There is a strong correlation between DWH, BI, data quality, and data profiling tools, and it makes sense to bring all these tools together. But this also means the architect should understand the correlation between them very clearly. Until a few years ago, ETL tools were mostly batch processing, loading the data warehouse. But now the demand is to get data to the end user as soon as possible, and this pushes the DWH and DI to real time (right time!!). And this requires BI reporting to be more efficient at presenting the data.
Time being an important criterion, data quality is also one of the most critical components in the life cycle. No customer wants to present incorrect data to their users. This brings the data quality tool in line with ETL and BI.
To achieve all this, it is important to have a short development cycle for any new development. And data profiling plays a critical role for the data analyst in capturing data anomalies and the hidden business rules in the data.
