Getting Started

Get started with Training Operator

This guide describes how to get started with Training Operator and run a few simple examples.

Prerequisites

You need to install the following components to run examples:

Getting Started with PyTorchJob

You can create your first Training Operator distributed PyTorchJob using Python SDK. Define the training function that implements end-to-end model training. Each Worker will execute this function on the appropriate Kubernetes Pod. Usually, this function contains logic to download dataset, create model, and train the model.

Training Operator will automatically set WORLD_SIZE and RANK for the appropriate PyTorchJob worker to perform PyTorch Distributed Data Parallel (DDP).

For example:

def train_func():
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms

    # [1] Setup PyTorch DDP. WORLD_SIZE and RANK environments will be set by Training Operator.
    torch.distributed.init_process_group(backend="nccl")
    Distributor = torch.nn.parallel.DistributedDataParallel

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to GPU and distributor.
    device = "cuda"
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

# Start PyTorchJob with 3 Workers and 1 GPUs per Worker.
TrainingClient().create_job(
    name="pytorch-ddp",
    train_func=train_func,
    num_workers=3,
    resources_per_worker={"gpu": "1"},
)

Getting Started with TFJob

Similar to PyTorchJob example, you can use the Python SDK to create your first distributed TensorFlow job. Run the following script to create TFJob with pre-created Docker image: docker.io/kubeflow/tf-mnist-with-summaries:latest that contains distributed TensorFlow code:

from kubeflow.training import TrainingClient

TrainingClient().create_job(
    name="tensorflow-dist",
    job_kind="TFJob",
    base_image="docker.io/kubeflow/tf-mnist-with-summaries:latest",
    num_workers=3,
)

Run the following API to get logs from your TFJob:

TrainingClient().get_job_logs(
    name="tensorflow-dist",
    job_kind="TFJob",
    follow=True,
)

Next steps

Feedback

Was this page helpful?