Pipelines Quickstart

Getting started with Kubeflow Pipelines

Use this guide if you want to get a simple pipeline running quickly in Kubeflow Pipelines. If you need a more in-depth guide, see the end-to-end tutorial.

  • This quickstart guide shows you how to use one of the samples that come with the Kubeflow Pipelines installation and are visible on the Kubeflow Pipelines user interface (UI). You can use this guide as an introduction to the Kubeflow Pipelines UI.
  • The end-to-end tutorial shows you how to prepare and compile a pipeline, upload it to Kubeflow Pipelines, then run it.

Deploy Kubeflow and open the pipelines UI

Follow these steps to deploy Kubeflow and open the pipelines dashboard:

  1. Follow the guide to deploying Kubeflow on GCP, including the step to deploy Kubeflow using the Kubeflow deployment UI.

    Due to kubeflow/pipelines#345 and kubeflow/pipelines#337, some non-critical pieces of functionality are currently available only on GKE clusters.

  2. When Kubeflow is running, access the Kubeflow UI at a URL of the form https://<deployment-name>.endpoints.<project>.cloud.goog/, as described in the setup guide. [Screenshot: Kubeflow UI]

    If you skipped the IAP option when deploying Kubeflow, run the following command and then go to http://localhost:8080/:

        kubectl port-forward -n kubeflow `kubectl get pods -n kubeflow --selector=service=ambassador -o jsonpath='{.items[0].metadata.name}'` 8080:80

  3. Click Pipeline Dashboard to access the pipelines UI. [Screenshot: Pipelines UI]
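The URL pattern in step 2 can be sketched as a small Python helper. This is only an illustration of the pattern quoted above; the function name and the example deployment and project names are hypothetical, not part of Kubeflow.

```python
def kubeflow_ui_url(deployment_name: str, project: str) -> str:
    """Build the Kubeflow UI URL for a deployment on GCP.

    Follows the form given in this guide:
    https://<deployment-name>.endpoints.<project>.cloud.goog/
    """
    if not deployment_name or not project:
        raise ValueError("deployment_name and project must be non-empty")
    return f"https://{deployment_name}.endpoints.{project}.cloud.goog/"


# Address to use instead when you skipped the IAP option and are
# running the kubectl port-forward command from this guide.
LOCAL_PORT_FORWARD_URL = "http://localhost:8080/"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(kubeflow_ui_url("my-kubeflow", "my-gcp-project"))
```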

Run a basic pipeline

The pipelines UI offers a few samples that you can use to try out pipelines quickly. The steps below show you how to run a basic sample that includes some Python operations, but doesn’t include a machine learning (ML) workload:

  1. Click the name of the sample, [Sample] Basic - Parallel Join, on the pipelines UI. [Screenshot: Pipelines UI]

  2. Click Create an experiment. [Screenshot: Starting an experiment on the pipelines UI]

  3. Follow the prompts to create an experiment and then create a run. The sample supplies default values for all the parameters you need. The following screenshot assumes you’ve already created an experiment named My experiment and are now creating a run named My first run. [Screenshot: Creating a run on the pipelines UI]

  4. Click Start to create the run.

  5. Click the name of the run on the experiments dashboard. [Screenshot: Experiments dashboard on the pipelines UI]

  6. Explore the graph and other aspects of your run by clicking on the components of the graph and the other UI elements. [Screenshot: Run results on the pipelines UI]

You can find the source code for the basic parallel join sample in the Kubeflow Pipelines repo.
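If the graph shape is unfamiliar, the following plain-Python sketch may help. It is not the sample's source code and does not use the Kubeflow Pipelines SDK. It assumes, based on the sample's name, that a "parallel join" means two independent steps that run side by side, followed by one step that consumes both results. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(label: str) -> str:
    """Stand-in for one independent pipeline step (hypothetical)."""
    return f"result-{label}"


def join(left: str, right: str) -> str:
    """Stand-in for the step that depends on both branches (hypothetical)."""
    return f"{left}+{right}"


def run_parallel_join() -> str:
    # The two branches do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(fetch, "a")
        right = pool.submit(fetch, "b")
        # result() blocks, so join runs only after both branches have finished.
        return join(left.result(), right.result())


if __name__ == "__main__":
    print(run_parallel_join())
```

In the pipelines UI, the run graph shows this kind of dependency as nodes and edges, where each node is one step of the pipeline.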

Run an ML pipeline

This section shows you how to run the XGBoost sample available from the pipelines UI. Unlike the basic sample described above, the XGBoost sample does include ML components. Before running this sample, you need to set up some GCP services for use by the sample.

Follow these steps to set up the necessary GCP services and run the sample:

  1. In addition to the standard GCP APIs that you need for Kubeflow (see the GCP setup guide), ensure that the following APIs are enabled:

  2. Create a Cloud Storage bucket to hold the results of the pipeline run.

    • Your bucket name must be unique across all of Cloud Storage.
    • Each time you create a new run for this pipeline, Kubeflow creates a unique directory within the output bucket, so the output of each run does not overwrite the output of the previous run.

  3. Click the name of the sample, [Sample] ML - XGBoost - Training with Confusion Matrix, on the pipelines UI. [Screenshot: XGBoost sample on the pipelines UI]

  4. Click Create an experiment.

  5. Follow the prompts to create an experiment and then create a run. Supply the following run parameters:

    • output: The Cloud Storage bucket that you created earlier to hold the results of the pipeline run.
    • project: Your GCP project ID.

    The sample supplies the values for the other parameters:

    • region: The GCP geographical region in which the training and evaluation data are stored.
    • train-data: Cloud Storage path to the training data.
    • eval-data: Cloud Storage path to the evaluation data.
    • schema: Cloud Storage path to a JSON file describing the format of the CSV files that contain the training and evaluation data.
    • target: Column name of the target variable.
    • rounds: The number of rounds for XGBoost training.
    • workers: Number of workers used for distributed training.
    • true-label: Column to be used for text representation of the label output by the model.

    The arrows on the following screenshot indicate the run parameters that you must supply. [Screenshot: Starting the XGBoost run on the pipelines UI]

  6. Click Start to create the run.

  7. Click the name of the run on the experiments dashboard.

  8. Explore the graph and other aspects of your run by clicking on the components of the graph and the other UI elements. The following screenshot shows the graph when the pipeline has finished running. [Screenshot: XGBoost results on the pipelines UI]

You can find the source code for the XGBoost training sample in the Kubeflow Pipelines repo.
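The split between the run parameters you supply and those the sample supplies can be sketched as a small pre-flight check in Python. This is a hypothetical helper, not part of Kubeflow Pipelines. It assumes that the output parameter is given as a Cloud Storage URI beginning with gs://, which the guide does not state explicitly. The parameter names come from the list above, and no default values are invented here.

```python
# Parameters you must supply for the XGBoost sample, per this guide.
REQUIRED_PARAMS = ("output", "project")

# Parameters for which the sample supplies its own values, per this guide.
SAMPLE_SUPPLIED_PARAMS = (
    "region", "train-data", "eval-data", "schema",
    "target", "rounds", "workers", "true-label",
)


def build_run_params(output: str, project: str) -> dict:
    """Validate and return the two user-supplied run parameters.

    Assumption: output is a Cloud Storage URI such as gs://my-bucket.
    """
    prefix = "gs://"
    if not project:
        raise ValueError("project must be your GCP project ID")
    # Require the gs:// scheme and a non-empty bucket name after it.
    if not output.startswith(prefix) or not output[len(prefix):].strip("/"):
        raise ValueError("output must look like gs://<bucket-name>")
    # Drop any trailing slash so the value has a single canonical form.
    return {"output": output.rstrip("/"), "project": project}


if __name__ == "__main__":
    # Hypothetical bucket and project names, for illustration only.
    print(build_run_params("gs://my-unique-bucket/", "my-gcp-project"))
```

A check like this is only a convenience before you type the values into the run form; the UI remains the place where the run is created.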

Clean up your GCP environment

As you work through this guide, your project uses billable components of GCP. To minimise costs, follow these steps to clean up resources when you’ve finished with them:

  1. Visit Deployment Manager to delete your deployment and related resources.
  2. Delete your Cloud Storage bucket when you’ve finished examining the output of the pipeline.

Next steps

  • Learn more about the important concepts in Kubeflow Pipelines.
  • Follow the end-to-end tutorial using an MNIST machine-learning model.
  • This page showed you how to run some of the examples supplied in the Kubeflow Pipelines UI. Next, you may want to run a pipeline from a notebook, or compile and run a sample from the code. See the guide to experimenting with the Kubeflow Pipelines samples.