Passive Monitoring System Design Theory

Although volumetric data is discrete in nature, it is still useful to think of table loads as stochastic processes. While the daily delta on a table may appear stable over time, there is no reason to expect that to continue in the future. A load will often exceed alarm limits at random because of many unknown and interacting variables.

Once I thought of the load as a stochastic process, I realized I could reuse ideas from my days developing statistical arbitrage trading algorithms to help monitor and control the flow of data into and out of the system.

When it came to writing trading algos, I was 100% self-taught, relying on nothing more than a year of undergrad stats at KState and some math training at Washburn. An interesting discovery arose from my research. Anybody who takes a class in time series analysis knows this, but I had to bump into it in the real world to learn it.

A time series that results from a stochastic process will have parameters that change over time.

Let me illustrate.

I am not a fan of technical analysis. It is basically voodoo with absolutely no mathematical foundation. There is zero reason that a stock price can't blow through a line of support whenever it wants. However, it is good for making pretty pictures. In this case, I'm going to use Bollinger Bands to illustrate how the mean of a time series drifts over time.

I went to my not insignificant library of data sets and pulled out 22 years of Apple's closing stock price. I threw it in the cloud and got to work. Below are the descriptive statistics for the dataset.

import pandas as pd
import ta
import matplotlib.pyplot as plt

# Pull the price history and work with the first 300 observations.
ibm_prices = pd.read_csv('https://query.data.world/s/ovn37kwnxqy4txkj4f7s662wxz5ppu')
ibm_prices_subset = ibm_prices[0:300]
del ibm_prices

# Descriptive statistics for the closing price.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
ibm_prices_subset.close.describe()

Parameter   Value
count       300
mean        35.56
std         0.08

Using the Mk I Eyeball, I can tell right away that this data set is not useful for analysis just by seeing that the mean is not 0. Here is a graph of the data.

plt.plot(ibm_prices_subset.close)
plt.ylabel('value')
plt.xlabel('observation')

It's stock prices, so there are no values less than 0. "So what," you say? The mean is 35.56. Just work with that, right? Wrong. You are looking at a static dataset. The dataset we need to work with will be adding values every day, which changes the number of observations, and the number of observations is a key input to several statistical calculations.

What's that you say? Just use a moving average so you can make the number of observations a constant instead of a variable? Bro. That's like the whole thing we're trying to get away from. Let me illustrate.

We're going to take a look at a 20-day window with the bands set at 3 standard deviations.
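Under the hood those bands are nothing exotic: a rolling mean plus or minus a multiple of the rolling standard deviation. If you wanted to compute them without the ta package, a pandas-only sketch looks like the following. The manual_* column names are mine, and ta uses a slightly different degrees-of-freedom setting for the standard deviation, so treat this as an illustration of the idea rather than an exact reproduction.

# Rolling mean and +/- 3 rolling standard deviations over a 20-observation window.
rolling = ibm_prices_subset.close.rolling(window=20)
ibm_prices_subset['manual_mavg'] = rolling.mean()
ibm_prices_subset['manual_hband'] = rolling.mean() + 3 * rolling.std()
ibm_prices_subset['manual_lband'] = rolling.mean() - 3 * rolling.std()

The ta version of the same calculation is below.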

# Bollinger Bands on the raw close: 20-observation window, 3 standard deviations.
indicator_bb = ta.volatility.BollingerBands(close=ibm_prices_subset.close, window=20, window_dev=3)
ibm_prices_subset['bb_bbm'] = indicator_bb.bollinger_mavg()
ibm_prices_subset['bb_bbh'] = indicator_bb.bollinger_hband()
ibm_prices_subset['bb_bbl'] = indicator_bb.bollinger_lband()

ma = ibm_prices_subset['bb_bbm']  # rolling mean (not plotted here)
upper_band = ibm_prices_subset['bb_bbh']
lower_band = ibm_prices_subset['bb_bbl']
price = ibm_prices_subset.close

# Label each series so plt.legend() has something to display.
plt.plot(price, label='close')
plt.plot(upper_band, label='upper band')
plt.plot(lower_band, label='lower band')
plt.legend()
plt.ylabel('value')
plt.xlabel('observation')

As the Bollinger Bands show, the mean of the price moves over time. That drift prevents us from understanding what is and what is not weird.

In the dataset, I've already applied a first difference to the close price; it's stored in the close_first_difference column. Let's look at that column.
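In case the term is new: a first difference just means each observation minus the one before it. If the column weren't already in the dataset, you could build something like it in one line of pandas. The column name below is mine, and the dataset's precomputed column may have been produced slightly differently (for example, scaled or computed as a return), so take this as a sketch of the idea only.

# First difference: each value minus the value before it (the first row becomes NaN).
ibm_prices_subset['close_first_difference_sketch'] = ibm_prices_subset.close.diff()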

# Compare descriptive statistics of the raw and differenced series.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
ibm_prices_subset[['close','close_first_difference']].describe()

Parameter   Close Price   Close Price Transformed
count       300           300
mean        32.56167      -0.00013
std         0.07914       0.02158

As you can see, differencing the data gives us a mean that sits very close to zero. Here is a graph of what that looks like.

plt.plot(ibm_prices_subset.close_first_difference)
plt.ylabel('value')
plt.xlabel('observation')

We now wrap that in Bollinger Bands and, voilà.

# Bollinger Bands on the differenced series: 20-observation window, 3 standard deviations.
indicator_bb2 = ta.volatility.BollingerBands(close=ibm_prices_subset.close_first_difference, window=20, window_dev=3)
ibm_prices_subset['bb_bbm2'] = indicator_bb2.bollinger_mavg()
ibm_prices_subset['bb_bbh2'] = indicator_bb2.bollinger_hband()
ibm_prices_subset['bb_bbl2'] = indicator_bb2.bollinger_lband()

ma2 = ibm_prices_subset['bb_bbm2']  # rolling mean of the differenced series (not plotted here)
upper_band2 = ibm_prices_subset['bb_bbh2']
lower_band2 = ibm_prices_subset['bb_bbl2']
price2 = ibm_prices_subset.close_first_difference

# Label each series so plt.legend() has something to display.
plt.plot(price2, label='first difference')
plt.plot(upper_band2, label='upper band')
plt.plot(lower_band2, label='lower band')
plt.legend()
plt.ylabel('value')
plt.xlabel('observation')

As you can see, the mean still moves, but the movement is minute; the series is practically stationary.

Stationarity gives us the ability to apply the central limit theorem to the analysis in a consistent manner, no matter how far we are from the first observation. It lets us develop an analysis that is agnostic to direction, so we can focus on the magnitude of the change, or delta in nerd parlance.

Stationarity gives us model parameters that are constants rather than variables that change with time. That lets us raise alarms when our variable of interest exceeds tolerances. Those tolerances default to three standard deviations because of the properties of the Gaussian distribution. However, you will need to set your tolerances based on your system. It's not unusual to have processes so stable that you can use tighter tolerances.
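To tie this back to table loads: the same recipe of differencing plus a rolling tolerance band applies to daily row counts. The sketch below is only an illustration under made-up names (a daily_row_counts DataFrame for a single table with load_date and row_count columns); it is not the framework's built-in monitoring proc, which is covered in Passive Monitoring System Implementation.

import pandas as pd

def flag_unusual_loads(daily_row_counts, window=20, tolerance=3):
    # First difference of the daily row counts.
    deltas = daily_row_counts.sort_values('load_date')['row_count'].diff()

    # Rolling mean and standard deviation of the differenced series.
    rolling = deltas.rolling(window=window)
    center = rolling.mean()
    spread = rolling.std()

    # Flag any delta that lands outside the tolerance band.
    out_of_band = (deltas - center).abs() > tolerance * spread
    return daily_row_counts.assign(delta=deltas, out_of_band=out_of_band)

Because the check uses the absolute distance from the rolling mean, it is direction-agnostic, as described above; tighten the tolerance parameter for very stable loads.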
