Pipeline steps
- Data extraction from sources: databases, REST APIs, S3, data lakes, etc.
- Data ingestion
- Processing
- Data enrichment
- Data cleaning
- Data transformation
- Data anonymization
- Sink: load into the target data store
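A minimal sketch of these steps in Python, assuming pandas is available and using a hypothetical CSV source and SQLite sink; the file, table, and column names are placeholders:

```python
import hashlib
import sqlite3

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extraction: read a flat-file source (could equally be a REST API, S3 object, etc.)
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Cleaning: drop rows missing mandatory fields and normalize casing
    df = df.dropna(subset=["user_id", "amount"])
    df["country"] = df["country"].str.upper()
    return df

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    # Anonymization: replace direct identifiers with a one-way hash
    df["user_id"] = df["user_id"].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()
    )
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transformation/enrichment: aggregate into the shape consumers need
    return df.groupby("country", as_index=False)["amount"].sum()

def sink(df: pd.DataFrame, db_path: str) -> None:
    # Sink: load the result into the target data store
    with sqlite3.connect(db_path) as conn:
        df.to_sql("revenue_by_country", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    sink(transform(anonymize(clean(extract("transactions.csv")))), "warehouse.db")
```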
Pipeline design steps
Determine the business value
- What are our objectives for this data pipeline?
- What use cases will the data pipeline serve (reporting, analytics, machine learning)?
Choose the data sources
- What are all the potential sources of data?
- How will we connect to the data sources?
- What format will the data come in (flat files, JSON, XML)?
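A hedged sketch of connecting to two common source types, a REST API via requests and an S3 bucket via boto3; the endpoint, token, bucket, and key are placeholders, and S3 credentials are assumed to come from the environment:

```python
import json

import boto3
import requests

# REST API source: URL and token are placeholders for a real endpoint
resp = requests.get(
    "https://api.example.com/v1/orders",
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
orders = resp.json()  # typically arrives as JSON

# S3 source: bucket and key are placeholders
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-raw-data", Key="exports/2024/orders.json")
s3_orders = json.loads(obj["Body"].read())
```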
Choose the data formats
- How will the data be stored: flat files, JSON, XML, Parquet?
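For example, the same pandas DataFrame can be persisted in several of these formats; Parquet is columnar, typed, and compressed, which usually makes it the better fit for analytical workloads (writing it assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "amount": [9.99, 14.50]})

df.to_csv("orders.csv", index=False)          # flat file: human-readable, no schema
df.to_json("orders.json", orient="records")   # JSON: nested-friendly, verbose
df.to_parquet("orders.parquet", index=False)  # Parquet: columnar, compressed, typed
```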
Choose the data storage solutions
- Consider where you are going to sink your data, both at the final pipeline stage and mid-way for intermediate processing or special use-case support
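A small sketch of that distinction, assuming an S3-style staging area for intermediate results (s3fs installed) and a relational warehouse reached through SQLAlchemy as the final sink; the URIs and table name are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Mid-way sink: cleaned data staged in object storage, reusable by other jobs
df = pd.read_parquet("s3://my-staging-bucket/cleaned/orders.parquet")

# Final sink: a warehouse table that reporting/analytics tools query directly
engine = create_engine("postgresql+psycopg2://user:password@warehouse.example.com/analytics")
df.to_sql("orders_clean", con=engine, if_exists="append", index=False)
```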
Plan the data workflow
- Find job dependencies
- Can jobs be run in parallel?
- How to handle failed jobs?
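These questions are usually answered by an orchestrator such as Airflow; the sketch below shows the underlying idea with only the standard library: jobs declare their dependencies, independent jobs run in parallel, and failed jobs are retried a fixed number of times (the job names and retry count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Each job lists the jobs it depends on; extract_a and extract_b have no
# dependencies, so they can run in parallel.
JOBS = {
    "extract_a": [],
    "extract_b": [],
    "transform": ["extract_a", "extract_b"],
    "load": ["transform"],
}
MAX_RETRIES = 3

def run_job(name: str) -> None:
    print(f"running {name}")  # real work (extract, transform, load) would go here

def run_with_retries(name: str) -> None:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            run_job(name)
            return
        except Exception as exc:  # failed job: retry, then give up and surface the error
            if attempt == MAX_RETRIES:
                raise
            print(f"{name} failed ({exc}), retrying ({attempt}/{MAX_RETRIES})")

def run_pipeline() -> None:
    done = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(JOBS):
            # Jobs whose dependencies are all satisfied can be submitted together.
            ready = [j for j, deps in JOBS.items() if j not in done and set(deps) <= done]
            futures = {pool.submit(run_with_retries, j): j for j in ready}
            wait(futures)
            for fut, job in futures.items():
                fut.result()  # re-raises if the job ultimately failed
                done.add(job)

if __name__ == "__main__":
    run_pipeline()
```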
Architectural types
- ETL
- ELT: useful to offload some processing away from the pipeline into the transformation layer, e.g. a database engine (see the sketch after this list)
- Lambda: offers both a batch layer and a speed (streaming) layer to serve both use cases
- Kappa (streaming-only architecture)
- Reduce the number of data formats/semantics in use to cut translation/serialization/deserialization time
- Implement semantic repositories for fast data serialization/deserialization (Avro with an Avro schema registry)
- Mix batch and stream pipelines (Lambda architectures, for example)
- Parallel processing
- "Data duplication": duplicate data into best-purpose data stores
- Remove duplicate data
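As referenced in the ELT bullet above, a minimal sketch of pushing the transformation into the database engine, using SQLite from the standard library so it stays self-contained; the table and column names are illustrative:

```python
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # Extract + Load: land the raw data in the database as-is
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (user_id TEXT, country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [("u1", "se", 9.99), ("u2", "SE", 14.50), ("u3", "no", 5.00)],
    )
    # Transform: the database engine does the cleaning/aggregation work (the T in ELT)
    conn.execute("DROP TABLE IF EXISTS revenue_by_country")
    conn.execute(
        """
        CREATE TABLE revenue_by_country AS
        SELECT UPPER(country) AS country, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY UPPER(country)
        """
    )
```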
Further reading: