Overall Engineering Philosophy
I build high performing easily maintained processes with six nines of reliability. I let the SQL Server relational engine do what it was designed to do: join data. I use T-SQL to join, cleanse, and load data. I use Python to move data around on disk outside of SQL Server.
The enterprise data warehouse is not something you can cut corners on. It HAS to be right and it has to be highly available, otherwise it will not be used. People will go back to their department specific reports built on transactional systems which slows the transactional system down and everybody gets annoyed which is usually the reason I get hired in the first place. To make everybody stop being annoyed.
The data warehouse is not a place where you can say stuff like, “we can’t let the perfect be the enemy of the good”. In this specific scenario, the “good” is the enemy of a properly functioning system. I build data warehouse ETL to tight tolerances to ensure that everybody maintains faith that the system is popping out reliable, accurate, actionable data in a timely fashion. I get data to the right people, at the right place, at the right time.
ETL processes need care and feeding. There is a human element that you cannot write a workaround for. At the end of the day, somebody has to care that the thing is working right, otherwise everything I just wrote about is totally pointless.
How Does That Compare To SSIS?
SSIS is a hot mess. A lot of the things I have talked about are difficult to impossible to implement with SSIS. Trying to maintain existing processes is a nightmare.
When something breaks in SSIS, you have to open up the package in the IDE which can be a nightmare all its own. Tracking down the error in the system logs, makes you feel like Nicolas Cage’s character in National Treasure, and that is assuming you get good error information at all.
SSIS has a system of event handlers. These are actions that take place when certain events occur. The interface for managing event handlers is not well implemented. Trying to find an event handler, well, let us just say that I have solved a Rubik's Cube faster.
Like I mentioned above, SSIS allows junior engineers to create anti-patterns. A better way to put that is, SSIS allows junior engineers to create unnecessarily labyrinthine packages. They will have task all over the place like some digital Gordian Knot. It will make the wires on your home entertainment center look like a paragon of organization. They will have packages calling other packages. They will use SSIS as an orchestration tool which just unnecessarily adds another layer of orchestration. They will write a huge monolithic package that does everything instead of breaking things down into atomic task that are identical independently executing processes.
You can and should build your SSIS packages to call stored procedures. However, if a stored procedure is busted, you cannot just push the fix. Sometimes, alterations to stored procs require that you add columns to the output query (the last select statement in a proc). Any change to the columns of the output query result in new metadata. SSIS reads the metadata from stored procs. If the metadata that it is reading, does not match what it expects, SSIS will error out.
In order to fix it, you have to open up SSIS, go to the data flow tab, open up either the source or the destination task and refresh the metadata. Then you have to redeploy the package. And I ask you. Why? Why do all that when you can just fix the proc, go through change control, change your connection in SSMS, hit F5, and call it a day?
I have already talked about code deployment, but that topic also overlaps with maintenance issues. Getting SSIS packages promoted through your various environments is just harder than it needs to be.
Even if you have a senior SSIS engineer writing packages and that person knows all the tips, tricks, and tactics for writing relatively high performing packages, it still will not matter. SSIS just does not lend itself well to the concept of having many maintenance hatches that allow quick access to subsystems.
Do I sound a little nutty? That is because I have PTSD from dealing with poorly developed SSIS packages. You would not believe what I have dealt with over the years. I have been working on a book called Mastering Data Warehouse Projects. It has been put on hold due to COVID, but hopefully I can publish this year. In that book, I will be sure to add in some horror stories just so all this will be driven home.
I cannot stress enough. Clean, high performing warehouse ETL. No SSIS. Use T-SQL and Python.
Last updated