Skip to the content.

Too many pipes pattern

Case

Situation

A machine learning’s training pipeline will not only do training, but also model tuning, experiment and evaluation. If you want to have various parameter tunings and experiments parallelly, the pipeline will be not one-way straightforward. If you have dependencies among pipeline jobs, its complexity increases. To maintain the pipeline, you may need SRE for the training platform to deal with errors and troubleshooting. When you make the training platform as one of internal systems to be maintained, it is suggested to make the pipeline as simple as possible to isolate fault and increate usability.
Another reason that makes a pipeline complex is various data sources and access methods. If your data is stored in a number of DWH (RDB, NoSQL, NAS, Web API, cloud storage, Hadoop, multiregional, and so on), defining access, authentication and authorization to each of them will be a mess. It is recommended to use a common library to manage data access object, though spreading data to multiple warehouses will give a data engineer hard time and unhealthiness. It is always better to choose appropriate and limited number of DWH.

Diagram

diagram

Pros

Cons

Work around