Monday, March 31, 2008
The recycle process is one of the buzzwords in any data warehouse or data integration effort. It is very important to understand the recycle requirement before thinking about any solution; an improper recycle solution can become a maintenance nightmare.
Recycle: This is the approach defined in an ETL process to handle and re-process rows rejected due to technical dependencies. A very simple example: a reference data load is delayed while the regular payload runs on schedule. This scenario can potentially create many failed rows during reference-code validation. But this failure is not a business error; it is purely a technical dependency, and no business involvement is needed to resolve it. Such failed records are candidates for automatic recycle. The technical design should take care of reprocessing these failed rows at regular intervals and cleaning up the reject table.
Recycle Process Approach: The idea here is to design a process that picks up the failed records and brings them back into the mainstream process without any manual intervention. When failed rows are brought back into mainstream processing, care must be taken that they do not create duplicates there; any duplicates need to be detected and handled accordingly.
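The approach above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the table and column names (`ref_codes`, `payload`, `rejects`) are hypothetical, and a real warehouse would run the same set-based logic on its own schema and scheduler.

```python
import sqlite3

# Minimal schema: a reference table, the main target table, and a
# reject table holding rows that failed reference-code validation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref_codes (code TEXT PRIMARY KEY);
    CREATE TABLE payload   (id INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE rejects   (id INTEGER, code TEXT, reject_ts TEXT);
""")

def recycle_rejects(conn):
    """Re-process rejected rows once their reference data has arrived.

    Rows whose code now exists in ref_codes are moved into the main
    table; INSERT OR IGNORE guards against creating a duplicate if the
    same row was already loaded by a later run. Recycled rows are then
    cleaned out of the reject table.
    """
    cur = conn.cursor()
    cur.execute("""
        INSERT OR IGNORE INTO payload (id, code)
        SELECT r.id, r.code
        FROM rejects r
        JOIN ref_codes c ON c.code = r.code
    """)
    cur.execute("""
        DELETE FROM rejects
        WHERE code IN (SELECT code FROM ref_codes)
    """)
    conn.commit()

# A row is rejected because code 'B' is not yet in the reference table.
conn.execute("INSERT INTO ref_codes VALUES ('A')")
conn.execute("INSERT INTO rejects VALUES (1, 'B', '2008-03-31')")
recycle_rejects(conn)   # 'B' still unknown: nothing moves

conn.execute("INSERT INTO ref_codes VALUES ('B')")  # late reference load
recycle_rejects(conn)   # the row is now recycled automatically

print(conn.execute("SELECT id, code FROM payload").fetchall())    # [(1, 'B')]
print(conn.execute("SELECT COUNT(*) FROM rejects").fetchone()[0]) # 0
```

Running this on a regular interval gives exactly the behavior described: failed rows flow back into the mainstream table with no manual intervention, and the duplicate check is built into the load statement itself.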
Labels: Recycle Process - Challenge
Data Integration Challenges
Data integration has become an organizational requirement for survival in today's data-dominated, competitive world. Every organization is looking for a quick data integration solution, with results in less than a couple of months, and if possible in a couple of weeks. This is the biggest challenge for any DI architect and designer: implementing a quick solution that meets higher management's goals.
Data integration is not a packaged product or solution that can be bought off the shelf. It is an iterative solution, highly customized to an organization's needs. Every organization has its own setup and requirements, so it is important to understand that a solution implemented in one place cannot simply be ported to a different organization and deliver results in a few weeks. It is true that the concepts remain the same, but the approach and implementation depend heavily on the individual organization.
It has always been a topic of debate whether one should use available tools or build in-house code for integration. By now the drawbacks and long-term cost overhead of in-house solutions are well understood; their biggest challenge is long-term support and maintenance. But I have seen that, due to lack of time and budget (pressure from higher management) to meet immediate objectives, architects and designers sometimes still go for an in-house solution, because that is the only skill set available at the time and procuring a tool would take longer. And when tools are bought in such situations, they are often used only to satisfy management rather than used effectively.
Labels: Data Integration Challenges
Friday, March 28, 2008
ETL Vs ELT
For the past decade, ETL (Extract Transform Load) has been the common architecture for most ETL tools such as Informatica, DataStage, Ab Initio, etc. But now that database servers are getting very powerful, it is making less sense to stay with a pure ETL architecture.
What is ETL: ETL is an approach where most of the heavy-lifting transformation is performed outside the database server, on the ETL server. In this approach, data is first extracted from the source and cached on the ETL server. The data is then transformed, mapped back to database-native datatypes, and loaded. This creates a dependency on the latency between the database server and the ETL server.
ELT approach: Data is loaded into the database once, and all transformation is performed on the database server. This approach removes the dependency on latency between the ETL server and the database server, and it also removes one extra hop of I/O between the DB server and the ETL server.
Several ETL tools now provide architectures that perform efficient ELT while still maintaining all code metadata in the ETL tool's repository, and the tool still captures all run-time statistics. Informatica's pushdown optimization license enables this ELT architecture.
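The ELT pattern can be shown in miniature. Below is a sketch using in-memory SQLite; the staging and target tables (`stg_orders`, `fx_rates`, `dw_orders`) are hypothetical. The point is that after the load step, the entire transformation runs as one set-based SQL statement inside the database engine, rather than round-tripping each row through an external ETL server.

```python
import sqlite3

# "EL": raw data is loaded into staging tables inside the database first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (id INTEGER, amount REAL, currency TEXT);
    CREATE TABLE fx_rates   (currency TEXT PRIMARY KEY, to_usd REAL);
    CREATE TABLE dw_orders  (id INTEGER PRIMARY KEY, amount_usd REAL);
""")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                 [(1, 100.0, 'EUR'), (2, 250.0, 'USD')])
conn.executemany("INSERT INTO fx_rates VALUES (?, ?)",
                 [('EUR', 1.5), ('USD', 1.0)])

# "T": the whole transformation is pushed down to the database engine
# as a single INSERT ... SELECT; no data leaves the database server.
conn.execute("""
    INSERT INTO dw_orders (id, amount_usd)
    SELECT s.id, s.amount * r.to_usd
    FROM stg_orders s
    JOIN fx_rates r ON r.currency = s.currency
""")
conn.commit()

print(conn.execute("SELECT * FROM dw_orders ORDER BY id").fetchall())
# [(1, 150.0), (2, 250.0)]
```

In the ETL style, each staged row would instead be fetched to the ETL server, converted there, and sent back, which is exactly the extra I/O hop and latency that pushdown/ELT eliminates.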
Labels: ETL VS ELT
