Record Count

Whenever you are working with a new dataset, the first thing to do is get a record count. The count helps you determine whether you have the software and hardware you need to process the data.
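For example, a quick way to pull the count from Python (a minimal sketch, assuming a pyodbc connection to SQL Server and a hypothetical table named `dbo.SourceTable`):

```python
import pyodbc

# Hypothetical connection string; adjust driver, server, and database to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-sql-server;DATABASE=MyDatabase;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# COUNT_BIG avoids overflow on tables larger than ~2.1 billion rows.
cursor.execute("SELECT COUNT_BIG(*) FROM dbo.SourceTable;")
record_count = cursor.fetchone()[0]
print(f"Record count: {record_count:,}")

conn.close()
```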

As a general rule of thumb for SQL Server:

| Record Count | Recommendation |
| --- | --- |
| x < 10MM | Good to go. |
| x > 10MM | You may want to add Apache Spark to the historical load. |
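If the count lands above that threshold, a Spark JDBC read is one way to parallelize the historical pull. A minimal sketch, assuming the SQL Server JDBC driver is on the Spark classpath and a hypothetical `dbo.SourceTable` with a numeric `Id` column to partition on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("historical-load").getOrCreate()

historical_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://my-sql-server;databaseName=MyDatabase")
    .option("dbtable", "dbo.SourceTable")
    .option("user", "etl_user")
    .option("password", "********")
    # Partitioning the read lets Spark pull the table in parallel chunks
    # instead of issuing one large query.
    .option("partitionColumn", "Id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")
    .load()
)

print(historical_df.count())
```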

This tab has four columns.

| Column Name | Definition |
| --- | --- |
| Scenario | This column has predefined drop-down values and explains where the data is coming from. |
| Processing Paradigm | This column has predefined drop-down values and explains how the data is processed. |
| Frequency | This is a free-form column and explains how frequently the data is processed. |
| Record Count | How many records the historical load will pull back. |

The values for the Scenario column are defined as:

| Drop Down Value | Definition |
| --- | --- |
| From Database | The process pulls data from another database. |
| From File | The process imports a flat file. |
| From Kafka Producer | The process pulls data from a Kafka producer (a sketch follows this table). |
| From Spark Streaming | A real-time stream process that sources data directly from Apache Spark. |
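A minimal sketch of the From Kafka Producer scenario, assuming the kafka-python package, a broker at localhost:9092, and a hypothetical topic named `orders`:

```python
import json
from kafka import KafkaConsumer

# Pull records off the topic as they are produced upstream.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    record = message.value  # one record per message
    print(record)
```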

The values for Processing Paradigm are defined as:

| Drop Down Value | Definition |
| --- | --- |
| Batch | Data is processed on a schedule; each run processes everything that has accumulated since the last run, all at once. |
| Stream | Data is processed continually, in real time. |
| Micro batch | Data is processed on an interval more frequent than batch, but not in real time. Advances in technology have made micro batching rarer, but it is a good intermediate step when your organization starts moving toward real-time data. Processing data more frequently throughout the day, even on intervals as long as 5 minutes, can be a drastic improvement in reporting efficiency (see the sketch after this table). |
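In practice, the trigger interval is what separates micro batch from stream. A minimal sketch using Spark Structured Streaming, assuming a Kafka broker at localhost:9092 and a hypothetical `orders` topic; the 5-minute trigger below is the micro-batch case from the table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("paradigm-demo").getOrCreate()

# Read the topic as an unbounded stream.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load()
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/data/orders")
    .option("checkpointLocation", "/data/orders_checkpoint")
    # A fixed processing-time trigger is the micro-batch case; shortening or
    # omitting the interval moves the job toward the stream case.
    .trigger(processingTime="5 minutes")
    .start()
)

query.awaitTermination()
```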
