Develop a Scaleout Edge project

This guide explains how to set up and implement the machine learning code used within a Scaleout Edge project.

Overview

A Scaleout Edge project is a convention for packaging/wrapping machine learning code that will be executed on edge nodes. At the core, a project is a directory of files (often a Git repository), containing your machine learning code, Scaleout Edge project file, and a specification of the runtime environment for the client (Python environment). The Scaleout Edge command-line tools provide functionality to help a user automate deployment and management of a project that follows the conventions.

The structure of a Scaleout Edge project

We recommend that projects have the following folder and file structure, here illustrated by the dummy example ‘importer-client’:

importer-client/
├── client/
│   ├── scaleout.yaml
│   ├── python_env.yaml (optional)
│   ├── build.py
│   ├── startup.py
│   └── .scaleoutignore (optional)
└── README.rst

The content of the client folder is what we commonly refer to as the compute package.

The compute package (client folder)

The Project File (scaleout.yaml)

In version 1.0, the project file defines a build function and a startup script that registers callback functions for training, validation, and prediction.

There are two main entry points:

  • build - used for any kind of setup that needs to be done before the client starts up, such as initializing the global seed model.

  • startup - invoked immediately after the client starts up and the environment has been initialized. Whatever script that is invoked by this entry point should register your train, validate, and predict callbacks.

To illustrate this, we look at the scaleout.yaml from the dummy example ‘importer-client’:

python_env: python_env.yaml

entry_points:
    build:
        build.py
    startup:
        startup.py

In this example, the build entry point points to a build() function in the build.py file:

import os
from scaleoututil.helpers.helpers import get_helper
import numpy as np

HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)


def build():
    output_dir = os.environ.get("SCALEOUT_BUILD_OUTPUT_DIR", ".")
    np.random.seed(42)
    params = np.random.rand(10).astype(np.float32)
    helper.save([params], os.path.join(output_dir, "seed.npz"))
    print(f"Created seed.npz with 10 random parameters.")

This will create a seed model file “seed.npz” with random parameters when you run:

scaleout run build -p client
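Because build() seeds NumPy before drawing the parameters, the seed model is reproducible: every run of the build step produces identical parameters. A quick way to convince yourself, using plain NumPy and independent of Scaleout Edge:

```python
import numpy as np

# Sketch: reproduce the parameter generation from build() above.
# Seeding NumPy makes the seed model deterministic, so every re-run
# of the build step starts from the same initial parameters.
np.random.seed(42)
first = np.random.rand(10).astype(np.float32)

np.random.seed(42)
second = np.random.rand(10).astype(np.float32)

assert np.array_equal(first, second)  # identical on every build
print("seed model is reproducible, first values:", first[:3])
```

This matters in a federated setting: all clients join a session starting from the same shared seed model.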

The startup entry point points to a startup() function in the startup.py file:

from scaleout import EdgeClient, ScaleoutModel
from scaleoututil.helpers.helpers import get_helper


HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)

def startup(client: EdgeClient):
    MyClient(client)

class MyClient:
    def __init__(self, client: EdgeClient):
        self.client = client
        client.set_train_callback(self.train)
        client.set_validate_callback(self.validate)

        client.set_custom_callback("my_command", self.my_command)

    def train(self, model: ScaleoutModel, settings):
        """Train the model with the given parameters and settings."""
        # Implement training logic here
        print("Training with model parameters:", model)
        model_params = model.get_model_params(helper)
        iterations = 100
        for i in range(iterations):
            if i % 10 == 0:
                # It is possible to log metrics during training
                print(f"Training iteration {i}/{iterations}")
                self.client.log_metric({"train_iteration": i})
            # Regularly check if the task has been aborted
            self.client.check_task_abort()  # Throws an exception if the task has been aborted
        # After training, return the updated model parameters and metadata
        new_model = ScaleoutModel.from_model_params(model_params, helper=helper)
        # Train returns updated model parameters and {"training_metadata": {num_examples: int}, ...}
        return new_model, {"training_metadata": {"num_examples": 1}}

    def validate(self, model: ScaleoutModel):
        """Validate the model with the given parameters."""
        # Implement validation logic here
        model_params = model.get_model_params(helper)
        print("Validating with model parameters")
        # Return validation metrics
        return {"validation_accuracy": 0.95}


    def my_command(self, command_params):
        """Handle a custom command with the given parameters."""
        print("Hello from my_command with parameters: ", command_params)
        return {"status": "custom command executed"}

As shown, the startup() function initializes the client (EdgeClient) and registers the training and validation callbacks. There is also an example of a custom command callback, my_command, which can be invoked from the server. The callbacks contain placeholder logic that you would replace with your actual machine learning code:

train - receives the current model and training settings, performs training, and returns the updated model and metadata

The callback receives:

  • model: A ScaleoutModel object containing the model parameters to train. Load parameters using model.get_model_params(helper).

  • settings: A dictionary containing training settings such as number of epochs, batch size, learning rate, etc.

The callback must return:

  • A tuple containing the updated model and a metadata dictionary. The metadata dictionary can include any relevant information about the training process (e.g., number of training steps, loss values, etc.). This metadata can be utilized in the aggregation process or for logging purposes.
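As a concrete illustration of the returned tuple (the specific keys beyond "training_metadata" are hypothetical examples, not a required schema):

```python
# Hypothetical illustration of a train callback's return value.
# "params" stands in for the updated model; metadata keys beyond
# "training_metadata" are free-form and stored alongside the update.
params = [0.5, 1.5, -0.3]
metadata = {
    "training_metadata": {"num_examples": 128},  # used during aggregation
    "final_loss": 0.07,                          # free-form, for logging
    "epochs": 3,
}
result = (params, metadata)
print(result[1]["training_metadata"]["num_examples"])  # 128
```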

Key features of the train callback:

  1. Progress tracking: Use the client's log_metric() method (passing a dictionary of metric names and values, as in the example above) to log metrics during training for real-time monitoring

  2. Task abortion: Call the client's check_task_abort() method regularly to allow graceful stopping when a session is terminated from the server (for example, by an admin user).

  3. Flexible metadata: Include any additional information in the metadata dictionary (hyperparameters, loss values, etc.) that will be stored in the backend

validate (optional) - receives the current model, performs validation, and returns validation metrics

The callback receives:

  • model: A ScaleoutModel object containing the model parameters to validate. Load parameters using model.get_model_params(helper).

The callback must return:

  • A dictionary containing validation metrics. All scalar metrics in this dictionary will be captured and visualized in the Scaleout Edge UI. The entire content is stored in the backend database and accessible via the API and UI.
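For instance, a validate callback might compute a scalar accuracy with plain NumPy. This is a sketch; the real metric computation depends on your framework and data:

```python
import numpy as np

# Sketch: computing a scalar metric to return from validate().
# The dictionary of scalars returned is what gets visualized in the UI.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
metrics = {"validation_accuracy": float((y_true == y_pred).mean())}
print(metrics)  # {'validation_accuracy': 0.75}
```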

my_command (optional) - a custom command that can be invoked from the server with parameters. This can be used for custom operations outside of the standard training/validation/prediction flow.

The callback receives:

  • command_params: A dictionary containing parameters for the custom command.

Note

The command can be invoked from the server using the Scaleout Edge API or CLI by specifying the command name and parameters. Storing command results in the backend is currently not supported; however, the callback must still return a dictionary.

The callback must return:

  • A dictionary containing the results of the custom command execution. This can include any relevant information about the command’s outcome (e.g., success status, output data, etc.).

Environment (python_env.yaml)

In version 1.0, Python environment management is user-controlled by default. You have several options:

  1. Manual environment management (default): Install the dependencies specified in python_env.yaml manually using scaleout run install -p client. This gives you full control over your Python environment.

  2. Managed environment mode (optional): Create a virtual environment in the client root directory, activate it, install Scaleout Edge, and start the client with the --managed-env flag. Scaleout will then manage package installation from python_env.yaml.

  3. Custom environments: You can use Docker containers or other custom environments as needed. Remove the python_env tag from scaleout.yaml if you’re managing everything yourself.
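A minimal python_env.yaml might look like the following. This is an illustrative sketch only; the field names here are assumptions, so consult the Scaleout Edge reference documentation for the exact schema:

```yaml
# Illustrative sketch of a python_env.yaml -- field names are assumptions,
# check the Scaleout Edge reference documentation for the exact schema.
name: importer-client-env
dependencies:
  - numpy
  - scaleout
```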

Note

The previous automatic virtual environment creation is no longer the default. Users now have more flexibility and control over their runtime environments.

Packaging for training on Scaleout Edge

To run a project on Scaleout Edge we compress the entire client folder as a .tgz file. There is a utility command in the Scaleout Edge CLI to do this:

scaleout package create --path client

You can include a .scaleoutignore file in the client folder to exclude files from the package. This is useful for excluding large data files, temporary files, etc.
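For example, a .scaleoutignore might look like the following (assuming gitignore-style patterns, a common convention for ignore files; verify the exact pattern syntax against the CLI documentation):

```text
# Example .scaleoutignore (assuming gitignore-style patterns)
data/
*.tmp
__pycache__/
```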

Note

You don’t have to create and use the compressed package. If you want to avoid distributing executable code over the network, you can stage the project folder on each client node manually and then use the --local-package flag when starting the client:

scaleout client start --api-url <API_URL> --local-package

This assumes there is a client folder in the current working directory.

How does Scaleout Edge use the project?

With an understanding of the Scaleout Edge project and the compute package, we can take a closer look at how Scaleout Edge uses the project during federated training.

Version 1.0 - Importing Client Architecture:

In version 1.0, the architecture has been simplified and made more flexible:

  1. A session is initiated by the controller, which pushes round configurations to the combiner(s)

  2. The Combiner publishes a training request to its ClientManager queue

  3. The Scaleout Edge client polls the ClientManager (via unary RPC) for new task requests

  4. The client imports your startup module and calls the startup() function, which registers your callbacks

  5. When a training request arrives, the client calls your registered train callback with the current model

  6. Your callback performs the training update and returns the new model and metadata

  7. The client streams the model update back to the combiner for aggregation

  8. For validation requests, the same pattern applies with the validate callback after a new global model has been produced

We recommend using the new importing client architecture.
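To make the import-and-dispatch pattern (steps 4-6) concrete, here is a toy sketch of in-process callback registration and dispatch. ToyClient is a hypothetical stand-in, not the actual Scaleout Edge EdgeClient internals:

```python
# Toy sketch of the importing-client pattern (steps 4-6 above).
# ToyClient is a hypothetical stand-in for the client's callback registry.

class ToyClient:
    def __init__(self):
        self._callbacks = {}

    def set_train_callback(self, fn):
        self._callbacks["train"] = fn

    def dispatch(self, task_type, *args):
        # When a task request arrives, the client invokes the callback
        # registered for that task type, in the same process.
        return self._callbacks[task_type](*args)


def train(model, settings):
    # Placeholder update: return the model unchanged plus metadata.
    return model, {"training_metadata": {"num_examples": 1}}


client = ToyClient()
client.set_train_callback(train)  # what your startup() does
update, meta = client.dispatch("train", [0.1, 0.2], {"epochs": 1})
print(update, meta)  # [0.1, 0.2] {'training_metadata': {'num_examples': 1}}
```

Because the callback runs in-process, there is no serialization to temporary files or subprocess overhead, which is the main difference from the legacy dispatcher mode described below.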

Key advantages of the new architecture:

  • Direct import: Your code runs in the same process as the client, improving performance and simplifying debugging

  • Callback-based: More flexible and easier to integrate with existing ML frameworks

  • Real-time monitoring: Use log_metric() to track training progress in real-time

  • Graceful termination: Use check_task_abort() to handle session stops cleanly

  • Better error handling: Exceptions in your callbacks are properly caught and reported

Legacy Dispatcher Architecture:

The previous dispatcher-based architecture is still available using the --dispatcher flag. In this mode:

  1. The Dispatcher reads the Project File (scaleout.yaml) and executes shell commands for train/validate

  2. The client writes model data to temporary files and executes the commands as separate processes

  3. After execution, the client reads the results from files and streams them to the combiner

The dispatcher mode is currently maintained for backward compatibility but might be deprecated in future releases. We recommend migrating to the new importing client architecture for better performance and flexibility.

Where to go from here?

With an understanding of how Scaleout Edge Projects are structured and created, you can explore our library of example projects. They demonstrate different use case scenarios of Scaleout Edge and its integration with popular machine learning frameworks like PyTorch and TensorFlow.

Version 1.0 examples (importing client):

Legacy examples (dispatcher-based):