Thursday, October 02, 2008
1. No change to the data by default
2. Provide easy access to, and traceability of, errors back to the source system
3. Put a process in place to notify the business user
4. If no action is taken for a specific period of time, archive and send a notification
Exception Handling: Exceptions fall into two categories, namely Critical and Warning. Critical exceptions are those which stop the incoming row from reaching the business target table. Instead, the row information is captured in exception table(s) and presented to the business team to analyze the error. Warning exceptions are business-rule failures which do not stop the row from moving to the business target table(s), but warning information is captured in exception table(s) for later review.
As in any other tool, errors can be captured programmatically in Informatica. But Informatica also provides a built-in capability to capture errors. This needs some architectural design and a development standard for exception handling. Informatica provides four tables (or a file) where all errors can be captured. Post-process these tables (or file) to parse the error information and load it into an exception data model. This provides a standard method of error capture.
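The Critical/Warning routing described above can be sketched in a few lines of Python. This is a minimal illustration of the rule-classification idea, not Informatica's actual error tables; the rule names, column names, and table structures are all hypothetical.

```python
# Sketch of Critical vs. Warning exception routing. Rule and column
# names are hypothetical, used only to illustrate the two categories.

CRITICAL, WARNING = "CRITICAL", "WARNING"

# Each rule: (name, severity, predicate that returns True when the row passes)
RULES = [
    ("customer_id_present", CRITICAL, lambda r: r.get("customer_id") is not None),
    ("amount_non_negative", WARNING,  lambda r: r.get("amount", 0) >= 0),
]

def route_row(row, target_table, exception_table):
    """Apply all rules; a critical failure blocks the row, warnings do not."""
    blocked = False
    for name, severity, passes in RULES:
        if not passes(row):
            exception_table.append({"rule": name, "severity": severity, "row": dict(row)})
            if severity == CRITICAL:
                blocked = True
    if not blocked:
        target_table.append(row)

target, exceptions = [], []
route_row({"customer_id": 1, "amount": -5}, target, exceptions)     # warning only: loaded
route_row({"customer_id": None, "amount": 10}, target, exceptions)  # critical: blocked
```

Note how the warning row still lands in the target table while its exception record is kept for later review, exactly the split described above.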
Labels: Exception Handling using Informatica
Friday, September 26, 2008
These are a few questions about which any data warehouse consultant should be very clear.
ODS - also known as the Operational Data Store. The ODS is the first persistent data layer in the layered data warehouse architecture. In practice, the ODS data model is kept very close to the source data model. Then what is the value of the ODS layer? Data is extracted from the source using very simple queries, so that there is minimum impact on the source database. The following are the typical processing tasks performed on the ODS:
- Data Standardization
- Naming standardization
- Data Profiling
When data is extracted from the source, data standardization is performed to convert data values to enterprise standard values (it is recommended to store both the original and the standardized value in the ODS for future reference). Table column names are also standardized to enterprise standard names. This enables business users to get a first preview of what is coming next in the data warehouse. This data store is also open for any data profiling activity for the next data store, Integration.
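A minimal sketch of that standardization step, keeping the original value alongside the standardized one as recommended above. The code mapping and column names here are illustrative assumptions, not a real enterprise standard.

```python
# Sketch of ODS data standardization: map raw source values to an
# enterprise standard value while keeping the original for reference.
# The mapping and column names are hypothetical.

GENDER_STD = {"M": "MALE", "MALE": "MALE", "F": "FEMALE", "FEMALE": "FEMALE"}

def standardize(row):
    raw = (row.get("gender") or "").strip().upper()
    return {
        **row,
        "gender_orig": row.get("gender"),               # keep the source value
        "gender_std": GENDER_STD.get(raw, "UNKNOWN"),   # enterprise standard value
    }

row = standardize({"cust_id": 7, "gender": "f"})
# row keeps both: gender_orig = "f", gender_std = "FEMALE"
```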
Staging Area: Staging is a transient data store used to hold data before loading it into a persistent data store. In the staging area, data is prepared and made ready to load into the persistent area. The table structure in the staging area is kept the same as the persistent data store's table structure. This helps move data from the staging area to the persistent data store efficiently, without keeping the persistent data store locked for a long time. The staging area is an ETL/ELT requirement that enables an effective, controlled data load. The staging area is truncated after every successful data load into the persistent data store. There are different approaches used to load data from the staging area into the persistent data store:
- ELT - use SQL to move data
- Partition pinning - move a partition from staging to the persistent data store.
- ETL - standard extract and load.
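The first approach, "use SQL to move data," can be sketched with an in-memory SQLite database standing in for the warehouse. The table names are hypothetical; the point is that one set-based SQL statement does the move and the staging table is truncated after a successful load, as described above.

```python
# Sketch of the "ELT - use SQL to move data" staging pattern, using
# SQLite as a stand-in database. Table names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_sales (id INTEGER, amount REAL)")  # transient staging
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")      # persistent store

# 1. Bulk-load the staging table; no locks are held on the persistent table.
con.executemany("INSERT INTO stg_sales VALUES (?, ?)", [(1, 10.0), (2, 20.5)])

# 2. One set-based SQL statement moves the prepared data, keeping the
#    lock window on the persistent store short.
con.execute("INSERT INTO sales SELECT id, amount FROM stg_sales")

# 3. Truncate staging after the successful load.
con.execute("DELETE FROM stg_sales")
con.commit()
```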
For further discussion, please write to contact_ak@yahoo.com (Arvind Kumar)
Labels: ODS and Staging Area
Monday, March 31, 2008
Recycle: This is the approach defined in the ETL process to handle and re-process rows rejected due to technical dependencies. A very simple example: the reference data load is delayed while the regular payload runs on schedule. This scenario can potentially create many failed rows due to reference-code validation. But this failure is not a business error; it is purely a technical dependency, and there is no need for any business involvement to resolve the issue. Such failed records are candidates for automatic recycle. The technical design should take care of reprocessing such failed rows at regular intervals and cleaning up the rejected-rows table.
Recycle Process Approach: The idea here is to design a process which can re-process the failed records and bring them back into the mainstream process without any manual intervention. When the failed rows are brought back into mainstream processing, care needs to be taken to make sure a failed record will not create duplicates in the mainstream processing. Such duplicates need to be handled and processed accordingly.
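The recycle loop above can be sketched as follows. Rejected rows are retried once their technical dependency (here, a reference-code table) is satisfied, and already-loaded keys are skipped to avoid the duplicates mentioned above. All names and key choices here are illustrative assumptions.

```python
# Sketch of an automatic recycle pass: retry rejects whose reference
# code has now arrived, skipping rows already loaded. Names are
# hypothetical.

def recycle(rejects, reference_codes, target):
    loaded_keys = {r["id"] for r in target}
    still_rejected = []
    for row in rejects:
        if row["ref_code"] not in reference_codes:
            still_rejected.append(row)        # dependency still unmet; keep for next cycle
        elif row["id"] not in loaded_keys:    # avoid duplicates in the main stream
            target.append(row)
            loaded_keys.add(row["id"])
    return still_rejected

target = [{"id": 1, "ref_code": "A"}]
rejects = [{"id": 2, "ref_code": "B"}, {"id": 1, "ref_code": "A"}, {"id": 3, "ref_code": "Z"}]
rejects = recycle(rejects, reference_codes={"A", "B"}, target=target)
# id 2 is loaded, id 1 is skipped as a duplicate, id 3 waits for code "Z"
```

Running this pass on a schedule, until the rejected-rows table drains, is the "regular interval" reprocessing described above.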
Labels: Recycle Process - Challenge
Data integration has become an organizational requirement for survival in today's data-dominated competitive world. Every organization is looking for a quick data integration solution, with results in less than a couple of months, and if possible in a couple of weeks. This is the biggest challenge for any DI architect and designer: implementing a quick solution that meets higher management's goals.
Data integration is not a packaged product or solution which can be bought off the shelf. It is an iterative solution, highly customized to the organization's needs. Every organization has its own setup and requirements. It is very important to understand that a solution implemented at one place cannot be ported to a different organization with end results in a few weeks. It is true the concepts remain the same, but the approach and implementation are very dependent on the individual organization.
It has always been a topic for debate whether one should use available tools or build in-house code for integration. But by now the drawbacks and long-term cost overhead of in-house-built solutions are understood. The biggest challenge with an in-house solution is long-term support and maintenance. But I have seen that sometimes, due to lack of time and budget (pressure from higher management) to meet immediate objectives, architects and designers tend to go for an in-house solution, because that is the only skill set available at the time, and acquiring a tool would take longer. And if a tool is used in such a case, it will be used only to satisfy management, and the tool will not be used effectively.
Labels: Data Integration Challenges
Friday, March 28, 2008
For the past decade, ETL (Extract Transform Load) has been the most commonly used architecture for ETL tools like Informatica, DataStage, Ab Initio, etc. But now database servers are getting very powerful, and it is making less sense to go with the ETL architecture.
What is ETL: ETL is the approach where most of the heavy-lifting transformation is performed outside the database server, on the ETL server. In this approach, data is first extracted from the source and cached on the ETL server. The data is then transformed, mapped back to the database's native datatypes, and loaded. There is a dependency on the latency between the database server and the ETL server.
ELT approach: Data is loaded into the database once, and all transformation is performed on the database server. This approach removes any dependency on the latency between the ETL server and the database server, and it also removes one extra hop of I/O between the DB server and the ETL server.
Different ETL tools are providing architectures to perform efficient ELT while still maintaining all code metadata in the ETL tool repository. The ETL tool will also capture all run-time statistics. Informatica's pushdown optimization license enables the ELT architecture.
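The contrast between the two architectures can be sketched with SQLite standing in for the database server and the Python process standing in for the ETL server. Table names, columns, and the conversion rate are all hypothetical; the point is where the transformation runs.

```python
# Illustrative ETL-vs-ELT contrast. In ETL the rows round-trip through
# the "ETL server" (this Python process); in ELT one SQL statement does
# the transformation inside the database. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (amount REAL)")
con.execute("CREATE TABLE tgt_etl (amount_usd REAL)")
con.execute("CREATE TABLE tgt_elt (amount_usd REAL)")
con.executemany("INSERT INTO src VALUES (?)", [(10.0,), (20.0,)])

RATE = 1.1

# ETL: extract to the ETL server, transform there, load back.
rows = con.execute("SELECT amount FROM src").fetchall()          # extract
transformed = [(amount * RATE,) for (amount,) in rows]           # transform outside the DB
con.executemany("INSERT INTO tgt_etl VALUES (?)", transformed)   # load

# ELT: push the transformation down into one SQL statement, so the
# data never leaves the database server.
con.execute(f"INSERT INTO tgt_elt SELECT amount * {RATE} FROM src")
```

Both targets end up with the same data; the ELT path simply avoids the extra hop of I/O described above.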
Labels: ETL VS ELT
Wednesday, October 04, 2006
1. How do you change the default ORDER BY clause in the Lookup Transformation?
Answer: Informatica by default adds an ORDER BY clause to the lookup query. To override the default, add "--" at the end of the lookup override query. Informatica will still append the ORDER BY clause at the end of the SQL override, but it will be treated as a commented statement because of the --.
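A quick sketch of what happens to the SQL text with this trick. The query itself is illustrative, not taken from a real lookup; the point is that everything after "--" on that line becomes a SQL comment, so only the ORDER BY written in the override survives.

```python
# Sketch of the lookup-override trick: the trailing "--" comments out
# the ORDER BY that Informatica appends. The query text is hypothetical.

override = "SELECT cust_id, cust_name FROM customers ORDER BY cust_id --"
final_sql = override + " ORDER BY cust_id, cust_name"  # appended by the tool

# From the database's point of view, the effective query is everything
# before the "--" comment marker.
effective = final_sql.split("--")[0].strip()
# effective == "SELECT cust_id, cust_name FROM customers ORDER BY cust_id"
```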
For any further questions, send email to arvind.pushpam@gmail.com
Wednesday, August 16, 2006
ETL Tool
There is a strong correlation between DWH, BI, data quality, and data profiling tools, and it makes sense to bring all these tools together. But this also means the architect should understand the correlation between them very clearly. Until a few years ago, ETL tools were mostly batch processing, loading the data warehouse. But now the demand is to get data to the end user as soon as possible, and this pushes the DWH and DI to real time (right time!!). And this requires BI reporting to be more efficient at presenting the data.
Time being an important criterion, data quality is also one of the most critical components in the life cycle. No customer wants to present incorrect data to their users. This brings the data quality tool in line with ETL and BI.
To achieve all this, it is important to have a short development cycle for any new development. And data profiling plays a critical role for the data analyst in capturing data anomalies and the hidden business rules in the data.
