Easier To Maintain
When I design ETL processes, I think about high performance fighter aircraft. Sound weird? What do F-16s and ETL processes have in common? Maintenance requirements.
If you get a close look at many modern fighter aircraft, you will notice that the skin of the aircraft is covered in what are clearly panels. You can tell a panel by the outline of the screws you have to undo to remove the panel. These panels are maintenance access hatches. They allow easy access to aircraft systems.
On some older aircraft, in order to maintain some systems, you have to remove other systems that are in the way. In particular, I am thinking of the T-38 where you have to remove the engines to perform certain maintenance functions. The engine does not need fixed, but you have to take it out anyway because the broken system is buried in the fuselage and not near a maintenance access hatch.
I build my ETL processes with a lot of maintenance hatches. When things break, I do not want to have to remove the engine to change a hydraulic line. I just want to pop open a hatch, swap out the bad part, put the hatch back, and start generating sorties.
Ok enough Air Force talk. What does all that look like in my ETL processes?
Identical Independently Executing Processes
In plain English, I build classes of things the exact same way and the execution of one thing has nothing to do with the execution of another thing.
A good example is importing data. All data is pulled into the data warehouse the exact same way. There are some variations on the implementation based on the use case, but in general:
1. Pull data into staging.
2. Clean data.
3. Mark any error records as errors so they are not processed.
Every single source of data has a process developed that is devoted to pulling that specific dataset and nothing else. That process running does not depend on any other data pull process running before or after it.
If it is a demand-pull process that pulls data from a 3rd party vendor, the process will run, but there might be no data to pull because of an error in the 3rd party system. I can only control whatever is in the bounds of the client system. That said, availability of 3rd party data should be addressed in a data contract with a specified SLA. I may not be able to work on their servers, but I sure can make their guys work on their servers.
Last updated