Develop a Scaleout Edge project

This guide explains how to set up and implement the machine learning code used within a Scaleout Edge project.

Overview

A Scaleout Edge project is a convention for packaging/wrapping machine learning code that will be executed on edge nodes. At the core, a project is a directory of files (often a Git repository), containing your machine learning code, Scaleout Edge project file, and a specification of the runtime environment for the client (Python environment). The Scaleout Edge command-line tools provide functionality to help a user automate deployment and management of a project that follows the conventions.

The structure of a Scaleout Edge project

We recommend that projects have the following folder and file structure, here illustrated by the dummy example ‘importer-client’:

importer-client/
├── client/
│   ├── scaleout.yaml
│   ├── python_env.yaml (optional)
│   ├── build.py
│   ├── startup.py
│   └── .scaleoutignore (optional)
└── README.rst

The content of the client folder is what we commonly refer to as the compute package.

The compute package (client folder)

The Project File (scaleout.yaml)

In version 1.0, the project file defines a build function and a startup script that registers callback functions for training, validation, and prediction.

There are two main entry points:

  • build - used for any kind of setup that needs to be done before the client starts up, such as initializing the global seed model.

  • startup - invoked immediately after the client starts up and the environment has been initialized. Whatever script that is invoked by this entry point should register your train, validate, and predict callbacks.

To illustrate this, we look at the scaleout.yaml from the dummy example ‘importer-client’:

python_env: python_env.yaml

entry_points:
    build:
        build.py
    startup:
        startup.py

In this example, the build entry point points to a build() function in the build.py file:

import os
from scaleoututil.helpers.helpers import get_helper
import numpy as np

HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)


def build():
    output_dir = os.environ.get("SCALEOUT_BUILD_OUTPUT_DIR", ".")
    np.random.seed(42)
    params = np.random.rand(10).astype(np.float32)
    helper.save([params], os.path.join(output_dir, "seed.npz"))
    print(f"Created seed.npz with 10 random parameters.")

This will create a seed model file “seed.npz” with random parameters when you run:

scaleout run build -p client
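Because build() seeds NumPy before drawing the parameters, the seed model is reproducible: every run of the build step produces identical parameters. A quick way to convince yourself, using plain NumPy and independent of Scaleout Edge:

```python
import numpy as np

# Sketch: reproduce the parameter generation from build() above.
# Seeding NumPy makes the seed model deterministic, so every re-run
# of the build step starts from the same initial parameters.
np.random.seed(42)
first = np.random.rand(10).astype(np.float32)

np.random.seed(42)
second = np.random.rand(10).astype(np.float32)

assert np.array_equal(first, second)  # identical on every build
print("seed model is reproducible, first values:", first[:3])
```

This matters in a federated setting: all clients join a session starting from the same shared seed model.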

The startup entry point points to a startup() function in the startup.py file:

from scaleout import EdgeClient, ScaleoutModel
from scaleoututil.helpers.helpers import get_helper


HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)

def startup(client: EdgeClient):
    MyClient(client)

class MyClient:
    def __init__(self, client: EdgeClient):
        self.client = client
        client.set_train_callback(self.train)
        client.set_validate_callback(self.validate)

        client.set_custom_callback("my_command", self.my_command)

    def train(self, model: ScaleoutModel, settings):
        """Train the model with the given parameters and settings."""
        # Implement training logic here
        print("Training with model parameters:", model)
        model_params = model.get_model_params(helper)
        iterations = 100
        for i in range(iterations):
            if i % 10 == 0:
                # It is possible to log metrics during training
                print(f"Training iteration {i}/{iterations}")
                self.client.log_metric({"train_iteration": i})
            # Regularly check if the task has been aborted
            self.client.check_task_abort()  # Throws an exception if the task has been aborted
        # After training, return the updated model parameters and metadata
        new_model = ScaleoutModel.from_model_params(model_params, helper=helper)
        # Train returns updated model parameters and {"training_metadata": {num_examples: int}, ...}
        return new_model, {"training_metadata": {"num_examples": 1}}

    def validate(self, model: ScaleoutModel):
        """Validate the model with the given parameters."""
        # Implement validation logic here
        model_params = model.get_model_params(helper)
        print("Validating with model parameters")
        # Return validation metrics
        return {"validation_accuracy": 0.95}


    def my_command(self, command_params):
        """Handle a custom command with the given parameters."""
        print("Hello from my_command with parameters: ", command_params)
        return {"status": "custom command executed"}

As shown, the startup() function initializes the client (EdgeClient) and registers the training and validation callbacks. There is also an example of a custom command callback, my_command, which can be invoked from the server. The callbacks contain placeholder logic that you would replace with your actual machine learning code:

train - receives the current model and training settings, performs training, and returns the updated model and metadata

The callback receives:

  • model: A ScaleoutModel object containing the model parameters to train. Load parameters using model.get_model_params(helper).

  • settings: A dictionary containing training settings such as number of epochs, batch size, learning rate, etc.

The callback must return:

  • A tuple containing the updated model and a metadata dictionary. The metadata dictionary can include any relevant information about the training process (e.g., number of training steps, loss values, etc.). This metadata can be utilized in the aggregation process or for logging purposes.
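As a concrete illustration of the returned tuple (the specific keys beyond "training_metadata" are hypothetical examples, not a required schema):

```python
# Hypothetical illustration of a train callback's return value.
# "params" stands in for the updated model; metadata keys beyond
# "training_metadata" are free-form and stored alongside the update.
params = [0.5, 1.5, -0.3]
metadata = {
    "training_metadata": {"num_examples": 128},  # used during aggregation
    "final_loss": 0.07,                          # free-form, for logging
    "epochs": 3,
}
result = (params, metadata)
print(result[1]["training_metadata"]["num_examples"])  # 128
```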

Key features of the train callback:

  1. Progress tracking: Use the client's log_metric() method (passing a dictionary of metric names and values, as in the example above) to log metrics during training for real-time monitoring

  2. Task abortion: Call the client's check_task_abort() method regularly to allow graceful stopping when a session is terminated from the server (for example, by an admin user).

  3. Flexible metadata: Include any additional information in the metadata dictionary (hyperparameters, loss values, etc.) that will be stored in the backend

validate (optional) - receives the current model, performs validation, and returns validation metrics

The callback receives:

  • model: A ScaleoutModel object containing the model parameters to validate. Load parameters using model.get_model_params(helper).

The callback must return:

  • A dictionary containing validation metrics. All scalar metrics in this dictionary will be captured and visualized in the Scaleout Edge UI. The entire content is stored in the backend database and accessible via the API and UI.
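For instance, a validate callback might compute a scalar accuracy with plain NumPy. This is a sketch; the real metric computation depends on your framework and data:

```python
import numpy as np

# Sketch: computing a scalar metric to return from validate().
# The dictionary of scalars returned is what gets visualized in the UI.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
metrics = {"validation_accuracy": float((y_true == y_pred).mean())}
print(metrics)  # {'validation_accuracy': 0.75}
```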

my_command (optional) - a custom command that can be invoked from the server with parameters. This can be used for custom operations outside of the standard training/validation/prediction flow.

The callback receives:

  • command_params: A dictionary containing parameters for the custom command.

Note

The command can be invoked from the server using the Scaleout Edge API or CLI by specifying the command name and parameters. Storing command results in the backend is currently not supported; however, the callback must still return a dictionary.

The callback must return:

  • A dictionary containing the results of the custom command execution. This can include any relevant information about the command’s outcome (e.g., success status, output data, etc.).

Environment (python_env.yaml)

In version 1.0, Python environment management is user-controlled by default. You have several options:

  1. Manual environment management (default): Install the dependencies specified in python_env.yaml manually using scaleout run install -p client. This gives you full control over your Python environment.

  2. Managed environment mode (optional): Create a virtual environment in the client root directory, activate it, install Scaleout Edge, and start the client with the --managed-env flag. Scaleout will then manage package installation from python_env.yaml.

  3. Custom environments: You can use Docker containers or other custom environments as needed. Remove the python_env tag from scaleout.yaml if you’re managing everything yourself.
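A minimal python_env.yaml might look like the following. This is an illustrative sketch only; the field names here are assumptions, so consult the Scaleout Edge reference documentation for the exact schema:

```yaml
# Illustrative sketch of a python_env.yaml -- field names are assumptions,
# check the Scaleout Edge reference documentation for the exact schema.
name: importer-client-env
dependencies:
  - numpy
  - scaleout
```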

Note

The previous automatic virtual environment creation is no longer the default. Users now have more flexibility and control over their runtime environments.

Packaging for training on Scaleout Edge

To run a project on Scaleout Edge we compress the entire client folder as a .tgz file. There is a utility command in the Scaleout Edge CLI to do this:

scaleout package create --path client

You can include a .scaleoutignore file in the client folder to exclude files from the package. This is useful for excluding large data files, temporary files, etc.
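For example, a .scaleoutignore might look like the following (assuming gitignore-style patterns, a common convention for ignore files; verify the exact pattern syntax against the CLI documentation):

```text
# Example .scaleoutignore (assuming gitignore-style patterns)
data/
*.tmp
__pycache__/
```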

Note

You don’t have to create and use the compressed package. If you want to avoid distributing executable code over the network, you can stage the project folder on each client node manually and then use the --local-package flag when starting the client:

scaleout client start --api-url <API_URL> --local-package

This assumes there is a client folder in the current working directory.

How does Scaleout Edge use the project?

With an understanding of the Scaleout Edge project and the compute package, we can take a closer look at how Scaleout Edge uses the project during federated training.

Version 1.0 - Importing Client Architecture:

In version 1.0, the architecture has been simplified and made more flexible:

  1. A session is initiated by the controller, which pushes round configurations to the combiner(s)

  2. The Combiner publishes a training request to its ClientManager queue

  3. The Scaleout Edge client polls the ClientManager (via unary RPC) for new task requests

  4. The client imports your startup module and calls the startup() function, which registers your callbacks

  5. When a training request arrives, the client calls your registered train callback with the current model

  6. Your callback performs the training update and returns the new model and metadata

  7. The client streams the model update back to the combiner for aggregation

  8. For validation requests, the same pattern applies with the validate callback after a new global model has been produced

We recommend using the new importing client architecture.
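To make the import-and-dispatch pattern (steps 4-6) concrete, here is a toy sketch of in-process callback registration and dispatch. ToyClient is a hypothetical stand-in, not the actual Scaleout Edge EdgeClient internals:

```python
# Toy sketch of the importing-client pattern (steps 4-6 above).
# ToyClient is a hypothetical stand-in for the client's callback registry.

class ToyClient:
    def __init__(self):
        self._callbacks = {}

    def set_train_callback(self, fn):
        self._callbacks["train"] = fn

    def dispatch(self, task_type, *args):
        # When a task request arrives, the client invokes the callback
        # registered for that task type, in the same process.
        return self._callbacks[task_type](*args)


def train(model, settings):
    # Placeholder update: return the model unchanged plus metadata.
    return model, {"training_metadata": {"num_examples": 1}}


client = ToyClient()
client.set_train_callback(train)  # what your startup() does
update, meta = client.dispatch("train", [0.1, 0.2], {"epochs": 1})
print(update, meta)  # [0.1, 0.2] {'training_metadata': {'num_examples': 1}}
```

Because the callback runs in-process, there is no serialization to temporary files or subprocess overhead, which is the main difference from the legacy dispatcher mode described below.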

Key advantages of the new architecture:

  • Direct import: Your code runs in the same process as the client, improving performance and simplifying debugging

  • Callback-based: More flexible and easier to integrate with existing ML frameworks

  • Real-time monitoring: Use log_metric() to track training progress in real-time

  • Graceful termination: Use check_task_abort() to handle session stops cleanly

  • Better error handling: Exceptions in your callbacks are properly caught and reported

Legacy Dispatcher Architecture:

The previous dispatcher-based architecture is still available using the --dispatcher flag. In this mode:

  1. The Dispatcher reads the Project File (scaleout.yaml) and executes shell commands for train/validate

  2. The client writes model data to temporary files and executes the commands as separate processes

  3. After execution, the client reads the results from files and streams them to the combiner

The dispatcher mode is currently maintained for backward compatibility but might be deprecated in future releases. We recommend migrating to the new importing client architecture for better performance and flexibility.

Where to go from here?

With an understanding of how Scaleout Edge Projects are structured and created, you can explore our library of example projects. They demonstrate different use case scenarios of Scaleout Edge and its integration with popular machine learning frameworks like PyTorch and TensorFlow.

Version 1.0 examples (importing client):

Legacy examples (dispatcher-based):