Pipeline steps

  • Data extraction from sources: databases, REST APIs, S3, data lakes, etc.
  • Data ingestion
  • Processing
    • Data enrichment
    • Data cleaning
    • Data transformation
    • Data anonymization
  • Sink: the destination data store (a minimal end-to-end sketch follows this list)
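
A minimal end-to-end sketch of these steps in Python, standard library only; the sample records, the enrichment rule, and the JSON Lines sink are hypothetical stand-ins for real sources and stores:

```python
import hashlib
import json

def extract():
    """Extraction: in practice this would read from a DB, API, or S3."""
    return [{"email": "a@example.com", "amount": "9.99"},
            {"email": "b@example.com", "amount": None}]

def process(records):
    for r in records:
        if r["amount"] is None:           # cleaning: drop incomplete rows
            continue
        r["amount"] = float(r["amount"])  # transformation: fix types
        r["tier"] = "high" if r["amount"] > 5 else "low"  # enrichment
        # anonymization: replace the email with a one-way hash
        r["email"] = hashlib.sha256(r["email"].encode()).hexdigest()
        yield r

def sink(records, path="out.jsonl"):
    """Sink: a JSON Lines file stands in for the real data store."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

sink(process(extract()))
```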

Pipeline design steps

Determine the business value

  • What are our objectives for this data pipeline?
  • What use cases will the data pipeline serve (reporting, analytics, machine learning)?

Choose the data sources

  • What are all the potential sources of data?
  • How will we connect to the data sources? (see the sketch after this list)
  • What format will the data come in (flat files, JSON, XML)?
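
As a rough illustration of connecting to two common source types, here is a hedged sketch; the API URL, database file, and table are hypothetical, requests is a third-party package, and sqlite3 stands in for any DB-API-compatible driver:

```python
import sqlite3
import requests

# REST API source: fetch JSON over HTTP.
resp = requests.get("https://api.example.com/v1/orders", timeout=30)
resp.raise_for_status()
api_records = resp.json()

# Database source: pull rows with plain SQL through the DB-API.
conn = sqlite3.connect("warehouse.db")
db_records = conn.execute("SELECT id, amount FROM orders").fetchall()
conn.close()

print(len(api_records), "records from the API,", len(db_records), "from the DB")
```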

Choose the data formats

How will the data be stored: flat files, JSON, XML, Parquet?
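
To make the trade-off concrete, a small sketch writing the same records as JSON Lines and as Parquet with pandas (Parquet support assumes pyarrow or fastparquet is installed; the file names and sample frame are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "amount": [9.99, 4.50, 12.00],
})

# JSON Lines: human-readable, easy to stream, but verbose.
df.to_json("events.jsonl", orient="records", lines=True)

# Parquet: compressed, columnar, schema-aware; better for analytics scans.
df.to_parquet("events.parquet")

# Both round-trip back into a DataFrame.
print(pd.read_json("events.jsonl", lines=True))
print(pd.read_parquet("events.parquet"))
```

For analytical scans over many rows, the columnar Parquet file is usually smaller and faster to query than row-oriented JSON.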

Choose the data storage solutions

Consider where you are going to sink your data: at the final pipeline stage, and possibly mid-way to support intermediate processing or special use cases.

Plan the data workflow

  • Find job dependencies
  • Can jobs be run in parallel?
  • How do we handle failed jobs? (a scheduler sketch addressing all three questions follows this list)
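
One way to reason about all three questions is a dependency-aware scheduler. Below is a minimal sketch using the standard library's graphlib (Python 3.9+) for job dependencies, a thread pool for parallel execution, and a simple retry count for failed jobs; the job graph and job bodies are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (hypothetical pipeline jobs)
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},           # clean and enrich can run in parallel
    "transform": {"clean", "enrich"},
    "sink": {"transform"},
}

def run_job(name: str, retries: int = 2) -> str:
    """Run one job, retrying on failure up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            print(f"running {name} (attempt {attempt + 1})")
            # ... real work would go here ...
            return name
        except Exception:
            if attempt == retries:
                raise
    return name

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {}
    while sorter.is_active():
        # submit every job whose dependencies are all satisfied
        for job in sorter.get_ready():
            futures[pool.submit(run_job, job)] = job
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        for fut in done:
            job = futures.pop(fut)
            fut.result()       # re-raise if the job ultimately failed
            sorter.done(job)   # unblock jobs that depend on it
```

Orchestrators such as Airflow provide the same dependency, parallelism, and retry handling at production scale.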

Architectural types

  • ETL
  • ELT; useful to offload some processing away from the pipeline into the target's transformation layer, e.g. a database engine (see the sketch after this list)
  • Lambda; offers both a batch layer and a speed layer, serving both batch and streaming use cases
  • Kappa (streaming-only architecture)
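
A hedged ELT illustration: raw rows are loaded first, then the transformation runs inside the database engine as SQL rather than in the pipeline process. Here sqlite3 (standard library) stands in for a real warehouse, and the table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount REAL)")

# Load: raw data lands untransformed.
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [("u1", 9.99), ("u1", 4.50), ("u2", 12.00)])

# Transform: the engine does the aggregation work, not the pipeline.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
""")
print(conn.execute("SELECT * FROM user_totals ORDER BY user_id").fetchall())
```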

Performance considerations of data pipelines

  • Reduce the number of data representations in play to cut translation/serialization/deserialization time
  • Use schema repositories for fast data serialization/deserialization (Avro with an Avro schema registry, for example; see the sketch after this list)
  • Mix batch and stream pipelines (Lambda architectures, for example)
  • Parallel processing
  • “Data duplication”: duplicate data into best-purpose data stores
  • Remove duplicate data
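
As a sketch of schema-based serialization, here is a hedged example using the third-party fastavro package; the Event schema and records are hypothetical, and in practice the schema would be published to a schema registry so producers and consumers share it:

```python
import io
from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [{"user_id": "u1", "amount": 9.99}, {"user_id": "u2", "amount": 4.50}]

# Serialize: compact binary, with the schema written once, not per record.
buf = io.BytesIO()
writer(buf, schema, records)

# Deserialize on the consumer side.
buf.seek(0)
for record in reader(buf):
    print(record)
```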
