Develop a Scaleout Edge project
This guide explains how to set up and implement the machine learning code used within a Scaleout Edge project.
Overview
A Scaleout Edge project is a convention for packaging and wrapping the machine learning code that will be executed on edge nodes. At its core, a project is a directory of files (often a Git repository) containing your machine learning code, a Scaleout Edge project file, and a specification of the client's runtime environment (a Python environment). The Scaleout Edge command-line tools provide functionality to help you automate deployment and management of projects that follow these conventions.
The structure of a Scaleout Edge project
We recommend that projects have the following folder and file structure, here illustrated by the dummy example ‘importer-client’:
project/
├-- client/
│ ├-- scaleout.yaml
│ ├-- python_env.yaml (optional)
│ ├-- build.py
│ ├-- startup.py
│ └-- .scaleoutignore (optional)
└-- README.rst
The content of the client folder is what we commonly refer to as the compute package.
The compute package (client folder)
The Project File (scaleout.yaml)
In version 1.0, the project file defines a build function and a startup script that registers callback functions for training, validation, and prediction.
There are two main entry points:
build - used for any kind of setup that needs to be done before the client starts up, such as initializing the global seed model.
startup - invoked immediately after the client starts up and the environment has been initialized. The script invoked by this entry point should register your train, validate, and predict callbacks.
To illustrate this, we look at the scaleout.yaml from the dummy example ‘importer-client’:
python_env: python_env.yaml

entry_points:
  build:
    build.py
  startup:
    startup.py
In this example, the build entrypoint points to a build() function in the build.py file:
import os

from scaleoututil.helpers.helpers import get_helper

import numpy as np

HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)


def build():
    output_dir = os.environ.get("SCALEOUT_BUILD_OUTPUT_DIR", ".")
    np.random.seed(42)
    params = np.random.rand(10).astype(np.float32)
    helper.save([params], os.path.join(output_dir, "seed.npz"))
    print("Created seed.npz with 10 random parameters.")
This will create a seed model file “seed.npz” with random parameters when you run:
scaleout run build -p client
The startup entrypoint points to a startup() function in the startup.py file:
from scaleout import EdgeClient, ScaleoutModel
from scaleoututil.helpers.helpers import get_helper

HELPER_MODULE = "numpyhelper"
helper = get_helper(HELPER_MODULE)


def startup(client: EdgeClient):
    MyClient(client)


class MyClient:
    def __init__(self, client: EdgeClient):
        self.client = client
        client.set_train_callback(self.train)
        client.set_validate_callback(self.validate)
        client.set_custom_callback("my_command", self.my_command)

    def train(self, model: ScaleoutModel, settings):
        """Train the model with the given parameters and settings."""
        # Implement training logic here
        print("Training with model parameters:", model)
        model_params = model.get_model_params(helper)
        iterations = 100
        for i in range(iterations):
            if i % 10 == 0:
                # It is possible to log metrics during training
                print(f"Training iteration {i}/{iterations}")
                self.client.log_metric({"train_iteration": i})
            # Regularly check if the task has been aborted
            self.client.check_task_abort()  # Throws an exception if the task has been aborted
        # After training, return the updated model parameters and metadata
        new_model = ScaleoutModel.from_model_params(model_params, helper=helper)
        # Train returns updated model parameters and {"training_metadata": {num_examples: int}, ...}
        return new_model, {"training_metadata": {"num_examples": 1}}

    def validate(self, model: ScaleoutModel):
        """Validate the model with the given parameters."""
        # Implement validation logic here
        model_params = model.get_model_params(helper)
        print("Validating with model parameters")
        # Return validation metrics
        return {"validation_accuracy": 0.95}

    def my_command(self, command_params):
        """Handle a custom command with the given parameters."""
        print("Hello from my_command with parameters:", command_params)
        return {"status": "custom command executed"}
As shown, the startup() function initializes the client (EdgeClient) and registers the callbacks for training and validation. There is also an example of a custom command callback, my_command, which can be invoked from the server.
The various callbacks contain placeholder logic that you would replace with your actual machine learning code:
train - receives the current model and training settings, performs training, and returns the updated model and metadata
The callback receives:
scaleoutmodel: A ScaleoutModel object containing the model parameters to train. Load parameters using scaleoutmodel.get_model_params(helper).
settings: A dictionary containing training settings such as the number of epochs, batch size, learning rate, etc.
The callback must return:
A tuple containing the updated model and a metadata dictionary. The metadata dictionary can include any relevant information about the training process (e.g., number of training steps, loss values, etc.). This metadata can be utilized in the aggregation process or for logging purposes.
Key features of the train callback:
Progress tracking: Use edge_client.log_metric(key, value) to log metrics during training for real-time monitoring
Task abortion: Call edge_client.check_task_abort() regularly to allow graceful stopping when a session is terminated from the server (can be invoked by the admin user)
Flexible metadata: Include any additional information in the metadata dictionary (hyperparameters, loss values, etc.) that will be stored in the backend
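The train placeholder in the example above returns the incoming parameters unchanged; a real callback updates them on each iteration. As a self-contained illustration (pure NumPy, no Scaleout imports, with an invented toy objective), one gradient-descent update loop might look like:

```python
import numpy as np

# Toy stand-in for a real training step: fit y = w * x by gradient
# descent on the mean squared error. A real train callback would apply
# updates like this to model_params on each iteration.
rng = np.random.default_rng(42)
x = rng.random(100).astype(np.float32)
y = 3.0 * x  # synthetic data; the true weight is 3.0

w = 0.0
lr = 0.1
for i in range(200):
    grad = float(np.mean(2.0 * (w * x - y) * x))  # dMSE/dw
    w -= lr * grad
    # In the real callback, this is where you would call
    # self.client.log_metric(...) and self.client.check_task_abort().

print(round(w, 2))  # converges close to the true weight, 3.0
```

The same structure carries over directly: compute an update, log progress periodically, and check for abort signals inside the loop.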
validate (optional) - receives the current model, performs validation, and returns validation metrics
The callback receives:
scaleoutmodel: A ScaleoutModel object containing the model parameters to validate. Load parameters using scaleoutmodel.get_model_params(helper).
The callback must return:
A dictionary containing validation metrics. All scalar metrics in this dictionary will be captured and visualized in the Scaleout Edge UI. The entire content is stored in the backend database and accessible via the API and UI.
my_command (optional) - a custom command that can be invoked from the server with parameters. This can be used for custom operations outside of the standard training/validation/prediction flow.
The callback receives:
command_params: A dictionary containing parameters for the custom command.
Note
The command can be invoked from the server using the Scaleout Edge API or CLI by specifying the command name and parameters. However, currently storing command results in the backend is not supported. The callback must still return a dictionary.
The callback must return:
A dictionary containing the results of the custom command execution. This can include any relevant information about the command’s outcome (e.g., success status, output data, etc.).
Environment (python_env.yaml)
In version 1.0, Python environment management is user-controlled by default. You have several options:
Manual environment management (default): Install the dependencies specified in python_env.yaml manually using scaleout run install -p client. This gives you full control over your Python environment.
Managed environment mode (optional): Create a virtual environment in the client root directory, activate it, install Scaleout Edge, and start the client with the --managed-env flag. Scaleout will then manage package installation from python_env.yaml.
Custom environments: You can use Docker containers or other custom environments as needed. Remove the python_env tag from scaleout.yaml if you're managing everything yourself.
Note
The previous automatic virtual environment creation is no longer the default. Users now have more flexibility and control over their runtime environments.
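To make the manual and managed modes concrete, here is what a minimal python_env.yaml might look like. The exact schema is not specified in this guide, so treat the keys below (name, dependencies) as an illustrative sketch rather than a definitive format:

```yaml
# Illustrative python_env.yaml sketch; key names are assumptions.
name: importer-client-env
dependencies:
  - numpy
  - scaleout
```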
Packaging for training on Scaleout Edge
To run a project on Scaleout Edge we compress the entire client folder as a .tgz file. There is a utility command in the Scaleout Edge CLI to do this:
scaleout package create --path client
You can include a .scaleoutignore file in the client folder to exclude files from the package. This is useful for excluding large data files, temporary files, etc.
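As an illustration, and assuming .scaleoutignore follows the familiar .gitignore-style pattern syntax (an assumption, not confirmed by this guide), a typical file might exclude data and cache directories:

```
data/
__pycache__/
*.tmp
```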
Note
You don't have to create and use the compressed package. If you want to avoid distributing executable code over the network, you can manually stage the project folder on each client node and then use the --local-package flag when starting the client:
scaleout client start --api-url <API_URL> --local-package
This will assume there is a client folder in the current working directory.
How is Scaleout Edge using the project?
With an understanding of the Scaleout Edge project and the compute package, we can take a closer look at how Scaleout Edge uses the project during federated training.
Version 1.0 - Importing Client Architecture:
In version 1.0, the architecture has been simplified and made more flexible:
A session is initiated by the controller, which pushes round configurations to the combiner(s)
The Combiner publishes a training request to its ClientManager queue
The Scaleout Edge client polls the ClientManager (unary RPC) for new task requests
The client imports your startup module and calls the startup() function, which registers your callbacks
When a training request arrives, the client calls your registered train callback with the current model
Your callback performs the training update and returns the new model and metadata
The client streams the model update back to the combiner for aggregation
For validation requests, the same pattern applies with the validate callback after a new global model has been produced
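The poll-dispatch-return cycle above can be condensed into a conceptual sketch. This is illustrative pseudologic, not the actual Scaleout Edge client; handle_task and the task dictionary layout are invented for the example:

```python
# Conceptual sketch of the importing-client cycle (NOT the actual
# Scaleout Edge implementation; handle_task and the task dict layout
# are invented for illustration).
callbacks = {}

def set_train_callback(fn):
    callbacks["train"] = fn

def set_validate_callback(fn):
    callbacks["validate"] = fn

def handle_task(task):
    """Dispatch a polled task to the registered callback and return its result."""
    fn = callbacks[task["type"]]
    if task["type"] == "train":
        return fn(task["model"], task.get("settings", {}))
    return fn(task["model"])

# startup() registers the callbacks; a trivial "training" step for the demo:
set_train_callback(lambda model, settings: (
    [p + 1 for p in model],
    {"training_metadata": {"num_examples": 1}},
))
set_validate_callback(lambda model: {"validation_accuracy": 0.95})

update, meta = handle_task({"type": "train", "model": [0, 0]})
print(update, meta)  # [1, 1] {'training_metadata': {'num_examples': 1}}
```

The key point the sketch captures is that your code runs in-process: the client polls for a task, looks up the callback you registered in startup(), invokes it directly, and returns the result to the combiner.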
We recommend using the new importing client architecture.
Key advantages of the new architecture:
Direct import: Your code runs in the same process as the client, improving performance and simplifying debugging
Callback-based: More flexible and easier to integrate with existing ML frameworks
Real-time monitoring: Use log_metric() to track training progress in real time
Graceful termination: Use check_task_abort() to handle session stops cleanly
Better error handling: Exceptions in your callbacks are properly caught and reported
Legacy Dispatcher Architecture:
The previous dispatcher-based architecture is still available using the --dispatcher flag. In this mode:
The Dispatcher reads the Project File (scaleout.yaml) and executes shell commands for train/validate
The client writes model data to temporary files and executes the commands as separate processes
After execution, the client reads the results from files and streams them to the combiner
The dispatcher mode is currently maintained for backward compatibility but might be deprecated in future releases. We recommend migrating to the new importing client architecture for better performance and flexibility.
Where to go from here?
With an understanding of how Scaleout Edge Projects are structured and created, you can explore our library of example projects. They demonstrate different use case scenarios of Scaleout Edge and its integration with popular machine learning frameworks like PyTorch and TensorFlow.
Version 1.0 examples (importing client):
Legacy examples (dispatcher-based):