Record Count
Whenever you are working with a new dataset, the first thing you want to do is to get a record count. The record count will help you determine if you have what you need in terms of software and hardware to process the data.
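Getting the count and checking it against the 10MM rule of thumb given below can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for SQL Server so the example runs anywhere, and the `sales` table and the helper names `get_record_count` and `recommend` are hypothetical.

```python
import sqlite3

# Threshold from the rule of thumb: 10MM = 10 million records.
LARGE_LOAD_THRESHOLD = 10_000_000


def get_record_count(conn, table_name):
    """Return the number of rows in table_name."""
    # Table names cannot be bound as query parameters, so only accept
    # plain identifiers before interpolating the name into the SQL.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    cursor = conn.execute(f"SELECT COUNT(*) FROM {table_name}")
    return cursor.fetchone()[0]


def recommend(record_count):
    """Map a record count to the rule-of-thumb recommendation."""
    # Exactly 10MM is treated as a large load here (an assumption).
    if record_count < LARGE_LOAD_THRESHOLD:
        return "Good to go."
    return "You may want to add Apache Spark to the historical load."


# Hypothetical demo table, built in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO sales (amount) VALUES (?)",
    [(9.99,), (24.50,), (3.00,)],
)

count = get_record_count(conn, "sales")
print(count, recommend(count))
```

Against SQL Server the same `SELECT COUNT(*)` query applies. For tables that might exceed the range of a 32-bit integer, `COUNT_BIG(*)` returns a bigint instead.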
As a general rule of thumb for SQL Server:

Record Count | Recommendation
x < 10MM | Good to go.
x ≥ 10MM | You may want to add Apache Spark to the historical load.

This tab has four columns.

Column Name | Definition
Scenario | This column has predefined drop-down values and explains where the data is coming from.
Processing Paradigm | This column has predefined drop-down values and explains how the data is processed.
Frequency | This is a free-form column and explains how frequently the data is processed.
Record Count | How many records you will pull back on the historical load.

The values for the Scenario column are defined as:

Drop Down Value | Definition
From Database | This process pulls data from another database.
From File | This process imports a flat file.
From Kafka Producer | This process pulls data from a Kafka Producer.
From Spark Streaming | This is a real-time stream process that sources data directly from Apache Spark.

The values for Processing Paradigm are defined as:

Drop Down Value | Definition
Batch | Data is processed on a schedule. Each run handles all the data that has accumulated since the last run, and everything is processed at once.
Stream | Data is processed continually in a real-time process.
Micro batch | Data is processed on an interval more frequent than batch, but not in real time. With advances in technology, micro batching is becoming rare. However, it is a good intermediate solution when your organization starts moving towards real-time data. Processing data more frequently throughout the day, even on intervals as long as 5 minutes, can represent a drastic improvement in reporting efficiency.
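To make the micro batch idea concrete, here is a minimal sketch that groups timestamped records into fixed 5-minute windows, so that each window can be processed as one small batch. The function name `micro_batches`, the `(timestamp, payload)` record shape, and the sample events are assumptions made for illustration. A real pipeline would typically rely on a scheduler or a streaming framework rather than an in-memory loop.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def micro_batches(events, interval=timedelta(minutes=5)):
    """Group (timestamp, payload) pairs into fixed windows of length interval."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Floor the timestamp to the start of the window it falls in.
        offset = (ts - datetime.min) % interval
        batches[ts - offset].append(payload)
    # Return the windows in chronological order.
    return dict(sorted(batches.items()))


# Hypothetical sample events.
events = [
    (datetime(2024, 1, 1, 9, 1), "a"),
    (datetime(2024, 1, 1, 9, 4), "b"),
    (datetime(2024, 1, 1, 9, 7), "c"),
]

for window_start, payloads in micro_batches(events).items():
    print(window_start, payloads)
```

Here the first two events share the 09:00 window and the third falls into the 09:05 window. A batch process would instead handle all three together on its next scheduled run, and a stream process would handle each one as it arrives.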