Train Pipeline

Source Bundle + Data Bundle (+ Hyperparams) = Trained Model

The Train Pipeline consists of two stages: Build and Train. They are separate Pachyderm pipelines that are linked to form one cohesive process. The user can track progress and logs throughout both stages. Progress is shown via kaos train list, while logs are available via kaos train logs.

conceptual train pipeline


The Train Pipeline requires at least a valid source bundle and data bundle for initiating a training job.

Compression of all input bundles is handled by kaos - not by the user!

Source Bundle

The source bundle supplies the code and environment for running a training job. It should be treated as ephemeral and dynamic, since versioning is handled by kaos. In other words, a user does not need to adopt chaotic naming conventions (e.g. mnist-v1, mnist-v1-latest, mnist-v1-final).

The source bundle requires, at minimum, the following basic structure.

$ tree mnist
└── model-train
    └── mnist
        ├── Dockerfile
        └── model
            ├── requirements.txt
            └── train

Submit the above bundle with kaos train deploy -s mnist/model-train
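A quick local check before deploying can catch a missing file early. The following is a minimal sketch, assuming the bundle root (the directory passed to -s) holds one or more model directories laid out as in the tree above. The helper name missing_source_files is hypothetical and is not part of kaos.

```python
from pathlib import Path

# Relative paths every model directory needs, per the tree above.
REQUIRED = ("Dockerfile", "model/requirements.txt", "model/train")


def missing_source_files(bundle_root):
    """Return the required paths that are missing from the source bundle.

    Hypothetical helper (not part of kaos). `bundle_root` is the directory
    passed to `-s`, e.g. mnist/model-train.
    """
    root = Path(bundle_root)
    if not root.is_dir():
        return [f"{root}: not a directory"]

    model_dirs = [p for p in sorted(root.iterdir()) if p.is_dir()]
    if not model_dirs:
        return [f"{root}: no model directory found"]

    problems = []
    for model_dir in model_dirs:
        for rel in REQUIRED:
            if not (model_dir / rel).exists():
                problems.append(f"{model_dir.name}/{rel}")
    return problems
```

An empty list means the minimum structure is present. Anything else names the paths to add before running kaos train deploy.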

Data Bundle

kaos supports both local and remote data bundles for training.


The data bundle supplies the local data required by the source bundle. The same "hands-off versioning" approach applies to the data bundle. Submit data, train a model, change data, submit data, train a new model, and so on. Rinse and repeat!

The data bundle can be any shape, since only the train script (in the source bundle) needs to be able to access its content.

The sample local data bundle for the mnist model is shown below.

$ tree mnist
└── data
    └── features
        ├── test
        │   └── test_mini.csv
        ├── training
        │   └── training_mini.csv
        └── validation
            └── validation_mini.csv

Submit the above with kaos train deploy -d mnist/data
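Because the layout of the data bundle is a contract between the data and the train script only, the script has to discover the files itself. The following is a minimal sketch of one way to do that, assuming the features layout of the sample bundle above. The function name files_by_split is hypothetical, and the location where kaos mounts the data is not covered here.

```python
from collections import defaultdict
from pathlib import Path


def files_by_split(features_dir):
    """Map each sub-directory name (e.g. "training") to its sorted CSV files.

    Hypothetical helper: the directory names are whatever the data bundle
    contains, nothing is hard-coded.
    """
    root = Path(features_dir)
    splits = defaultdict(list)
    for csv_path in sorted(root.rglob("*.csv")):
        # The first path component below the root names the split.
        split = csv_path.relative_to(root).parts[0]
        splits[split].append(csv_path)
    return dict(splits)
```

For the sample bundle this would return the keys test, training and validation, each with a single CSV file.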


Training machine learning models typically involves relatively large datasets. For this reason, kaos supports remote datasets for training through data manifest file(s).

The sample remote data manifest bundle for mnist is shown below.

$ tree mnist
└── data_manifest_mid

Submit a manifest file with this command: kaos train deploy -m mnist/data_manifest_mid/

Internal Structure

The following generalized JSON structure is required for the manifest file.

{"url": string, "path": string}

The "url" specifies the remote address of the desired data, while "path" specifies the relative location within kaos for ingestion based on the location of the manifest file.

A small excerpt from the mnist training manifest file saved at mnist/data_manifest_micro/ is highlighted below.

{"url": "", "path": "test/xaaa.csv"}
{"url": "", "path": "test/xaab.csv"}
{"url": "", "path": "training/xaaaa.csv"}
{"url": "", "path": "training/xaaay.csv"}
{"url": "", "path": "validation/xaaa.csv"}
{"url": "", "path": "validation/xaab.csv"}

The provided example will produce three top-level directories – training, test and validation.
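The mapping from manifest entries to top-level directories can be checked locally. The following is a minimal sketch, assuming the manifest holds one JSON object per line, as the excerpt suggests. The function names are hypothetical, and the empty "url" values are copied as they appear in the excerpt.

```python
import json
from pathlib import PurePosixPath

# Excerpt copied from the manifest above; the "url" values are empty in the source.
MANIFEST = """\
{"url": "", "path": "test/xaaa.csv"}
{"url": "", "path": "test/xaab.csv"}
{"url": "", "path": "training/xaaaa.csv"}
{"url": "", "path": "training/xaaay.csv"}
{"url": "", "path": "validation/xaaa.csv"}
{"url": "", "path": "validation/xaab.csv"}
"""


def parse_manifest(text):
    """Parse one JSON object per line and require the "url" and "path" keys."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = {"url", "path"} - set(entry)
        if missing:
            raise ValueError(f"line {number}: missing {sorted(missing)}")
        entries.append(entry)
    return entries


def top_level_dirs(entries):
    """Return the sorted first path component of every entry."""
    return sorted({PurePosixPath(e["path"]).parts[0] for e in entries})


if __name__ == "__main__":
    print(top_level_dirs(parse_manifest(MANIFEST)))
```

Running the sketch on the excerpt lists test, training and validation, matching the three directories described above.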

Params (Optional)

Params are meant solely for hyperparameter jobs, where multiple parameter values are tested at once (in parallel). A hyperparameter job will function properly when the supplied train script exposes its parameters through a params variable. An example excerpt from the mnist model is shown below. More details can be found by inspecting the source bundle from kaos template get --name mnist.

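import json
from sklearn import svm

def train():
    # load "static" params (params_fid is defined elsewhere in the train script)
    with open(params_fid, 'r') as src:
        params = json.load(src)
    # load params from "hyperopt" job (hyperparams is defined elsewhere in the source bundle)
    params = hyperparams(params)
    # ... remaining steps omitted from this excerpt ...
    classifier = svm.SVC(gamma=float(params['gamma']),
    # (the call is truncated in this excerpt)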

The structure of a valid hyperopt input is a simple JSON with the correct keys (as per train). Note that not all parameters need to be included for a hyperoptimization - only those that should be adapted.

"degree": [
"decision_function_shape": [

Submit the above bundle with kaos train deploy -h <path/to/hyperparams.json>
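To get a feel for how many training runs a hyperparameter file implies, the lists can be expanded locally. The following is a minimal sketch under two loud assumptions: the values for degree and decision_function_shape are illustrative (the source does not show them), and combining the lists as a Cartesian product is an assumption about how a hyperparameter job enumerates runs, not documented kaos behaviour. The function names and the static gamma value in the usage note are hypothetical too.

```python
import itertools
import json

# Illustrative values only: the keys appear in the docs, the values do not.
HYPERPARAMS_JSON = """
{
  "degree": [2, 3],
  "decision_function_shape": ["ovo", "ovr"]
}
"""


def expand_grid(search_space):
    """Yield one dict per combination of the listed values (Cartesian product)."""
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))


def merge(static_params, overrides):
    """Overlay one hyperparameter combination on the static params."""
    return {**static_params, **overrides}


if __name__ == "__main__":
    for combo in expand_grid(json.loads(HYPERPARAMS_JSON)):
        print(combo)
```

With the illustrative file above this yields four combinations. Parameters absent from the file, such as a static gamma, keep their value from the static params when passed through merge.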

Resources (Optional)

Specific resources can be attached to any training job with the following options.


Compute: float defining the desired compute (in cores or time)
Memory: string defining the desired memory (only valid with SI suffixes)
Graphical processing: integer defining the desired graphical processing (in cores)


The result of the Train Pipeline is (ideally) a trained model, but the user is free to save whatever they choose in the output directory. What gets saved is determined by the train script supplied in the source bundle. See Examples for more information on output options.

There are absolutely no restrictions on the output from a training job!
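As an illustration of that freedom, a train script might write both a serialized model and a metrics file. The following is a minimal sketch, assuming the script already knows its output directory. The function name save_outputs and the file names model.pkl and metrics.json are hypothetical choices, not kaos conventions.

```python
import json
import pickle
from pathlib import Path


def save_outputs(output_dir, model, metrics):
    """Write a pickled model and a JSON metrics file into output_dir.

    Hypothetical helper: the file names are arbitrary, since any content
    may be written to the output directory.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    model_path = out / "model.pkl"
    with model_path.open("wb") as dst:
        pickle.dump(model, dst)

    metrics_path = out / "metrics.json"
    metrics_path.write_text(json.dumps(metrics, indent=2))

    return [model_path, metrics_path]
```

Any picklable object works as the model here. A framework-specific format could be swapped in without changing the rest of the pipeline.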