Pipeline steps

  • Data extraction from sources: databases, REST APIs, S3, data lakes, etc.
  • Data ingestion
  • Processing
    • Data enrichment
    • Data cleaning
    • Data transformation
    • Data anonymization
  • Sink: the destination data store (a minimal end-to-end sketch follows this list)
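
A minimal end-to-end sketch of these steps in Python, standard library only; the sample records, the enrichment rule, and the JSON Lines sink are hypothetical stand-ins for real sources and stores:

```python
import hashlib
import json

def extract():
    """Extraction: in practice this would read from a DB, API, or S3."""
    return [{"email": "a@example.com", "amount": "9.99"},
            {"email": "b@example.com", "amount": None}]

def process(records):
    for r in records:
        if r["amount"] is None:           # cleaning: drop incomplete rows
            continue
        r["amount"] = float(r["amount"])  # transformation: fix types
        r["tier"] = "high" if r["amount"] > 5 else "low"  # enrichment
        # anonymization: replace the email with a one-way hash
        r["email"] = hashlib.sha256(r["email"].encode()).hexdigest()
        yield r

def sink(records, path="out.jsonl"):
    """Sink: a JSON Lines file stands in for the real data store."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

sink(process(extract()))
```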

Pipeline design steps

Determine the business value

  • What are our objectives for this data pipeline?
  • What use cases will the data pipeline serve (reporting, analytics, machine learning)?

Choose the data sources

  • What are all the potential sources of data?
  • How will we connect to the data sources? (see the sketch after this list)
  • What format will the data come in (flat files, JSON, XML)?
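
As a rough illustration of connecting to two common source types, here is a hedged sketch; the API URL, database file, and table are hypothetical, requests is a third-party package, and sqlite3 stands in for any DB-API-compatible driver:

```python
import sqlite3
import requests

# REST API source: fetch JSON over HTTP.
resp = requests.get("https://api.example.com/v1/orders", timeout=30)
resp.raise_for_status()
api_records = resp.json()

# Database source: pull rows with plain SQL through the DB-API.
conn = sqlite3.connect("warehouse.db")
db_records = conn.execute("SELECT id, amount FROM orders").fetchall()
conn.close()

print(len(api_records), "records from the API,", len(db_records), "from the DB")
```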

Choose the data formats

How will the data be stored: flat files, JSON, XML, Parquet?
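
To make the trade-off concrete, a small sketch writing the same records as JSON Lines and as Parquet with pandas (Parquet support assumes pyarrow or fastparquet is installed; the file names and sample frame are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "amount": [9.99, 4.50, 12.00],
})

# JSON Lines: human-readable, easy to stream, but verbose.
df.to_json("events.jsonl", orient="records", lines=True)

# Parquet: compressed, columnar, schema-aware; better for analytics scans.
df.to_parquet("events.parquet")

# Both round-trip back into a DataFrame.
print(pd.read_json("events.jsonl", lines=True))
print(pd.read_parquet("events.parquet"))
```

For analytical scans over many rows, the columnar Parquet file is usually smaller and faster to query than row-oriented JSON.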

Choose the data storage solutions

Consider where you are going to sink your data: at the final pipeline stage, and possibly mid-way to support intermediate processing or special use cases.

Plan the data workflow

  • Find job dependencies
  • Can jobs be run in parallel?
  • How do we handle failed jobs? (a scheduler sketch addressing all three questions follows this list)
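
One way to reason about all three questions is a dependency-aware scheduler. Below is a minimal sketch using the standard library's graphlib (Python 3.9+) for job dependencies, a thread pool for parallel execution, and a simple retry count for failed jobs; the job graph and job bodies are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (hypothetical pipeline jobs)
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},           # clean and enrich can run in parallel
    "transform": {"clean", "enrich"},
    "sink": {"transform"},
}

def run_job(name: str, retries: int = 2) -> str:
    """Run one job, retrying on failure up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            print(f"running {name} (attempt {attempt + 1})")
            # ... real work would go here ...
            return name
        except Exception:
            if attempt == retries:
                raise
    return name

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {}
    while sorter.is_active():
        # submit every job whose dependencies are all satisfied
        for job in sorter.get_ready():
            futures[pool.submit(run_job, job)] = job
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        for fut in done:
            job = futures.pop(fut)
            fut.result()       # re-raise if the job ultimately failed
            sorter.done(job)   # unblock jobs that depend on it
```

Orchestrators such as Airflow provide the same dependency, parallelism, and retry handling at production scale.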

Architectural types

  • ETL
  • ELT; useful to offload some processing away from the pipeline into the target's transformation layer, e.g. a database engine (see the sketch after this list)
  • Lambda; offers both a batch layer and a speed layer, serving both batch and streaming use cases
  • Kappa (streaming-only architecture)
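
A hedged ELT illustration: raw rows are loaded first, then the transformation runs inside the database engine as SQL rather than in the pipeline process. Here sqlite3 (standard library) stands in for a real warehouse, and the table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount REAL)")

# Load: raw data lands untransformed.
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [("u1", 9.99), ("u1", 4.50), ("u2", 12.00)])

# Transform: the engine does the aggregation work, not the pipeline.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM raw_events
    GROUP BY user_id
""")
print(conn.execute("SELECT * FROM user_totals ORDER BY user_id").fetchall())
```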

Performance considerations of data pipelines

  • Reduce the number of data representations in play to cut translation/serialization/deserialization time
  • Use schema repositories for fast data serialization/deserialization (Avro with an Avro schema registry, for example; see the sketch after this list)
  • Mix batch and stream pipelines (Lambda architectures, for example)
  • Parallel processing
  • “Data duplication”: duplicate data into best-purpose data stores
  • Remove duplicate data
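
As a sketch of schema-based serialization, here is a hedged example using the third-party fastavro package; the Event schema and records are hypothetical, and in practice the schema would be published to a schema registry so producers and consumers share it:

```python
import io
from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [{"user_id": "u1", "amount": 9.99}, {"user_id": "u2", "amount": 4.50}]

# Serialize: compact binary, with the schema written once, not per record.
buf = io.BytesIO()
writer(buf, schema, records)

# Deserialize on the consumer side.
buf.seek(0)
for record in reader(buf):
    print(record)
```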
