.. _projects-label:

================================================
Develop a Scaleout Edge project
================================================

This guide explains how to set up and implement the machine learning code used within a Scaleout Edge project.

Overview
========

A Scaleout Edge project is a convention for packaging/wrapping machine learning code that will be executed on edge nodes. At its core, a project is a directory of files (often a Git repository) containing your machine learning code, the Scaleout Edge project file, and a specification of the client's runtime environment (Python environment). The Scaleout Edge command-line tools provide functionality to help you automate deployment and management of a project that follows these conventions.

The structure of a Scaleout Edge project
========================================

We recommend that projects use the following folder and file structure, illustrated here by the dummy example 'importer-client':

.. code-block:: text

   project/
   ├── client/
   │   ├── scaleout.yaml
   │   ├── python_env.yaml (optional)
   │   ├── build.py
   │   ├── startup.py
   │   └── .scaleoutignore (optional)
   └── README.rst

The content of the ``client`` folder is what we commonly refer to as the *compute package*.

The compute package (client folder)
===================================

**The Project File (scaleout.yaml)**

In version 1.0, the project file defines a build function and a startup script that registers callback functions for training, validation, and prediction. There are two main entry points:

- **build** - used for any kind of setup that needs to be done before the client starts up, such as initializing the global seed model.
- **startup** - invoked immediately after the client starts up and the environment has been initialized. The script invoked by this entry point should register your train, validate, and predict callbacks.
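If you are starting a project from scratch, the recommended layout can be scaffolded by hand. The commands below only create the empty files from the structure shown above; their contents are developed in the rest of this guide:

.. code-block:: bash

   mkdir -p project/client
   touch project/README.rst
   touch project/client/scaleout.yaml project/client/python_env.yaml
   touch project/client/build.py project/client/startup.py
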
To illustrate this, we look at the ``scaleout.yaml`` from the dummy example 'importer-client':

.. code-block:: yaml

   python_env: python_env.yaml
   entry_points:
     build: build.py
     startup: startup.py

In this example, the ``build`` entry point points to a ``build()`` function in the ``build.py`` file:

.. code-block:: python

   import os

   import numpy as np

   from scaleoututil.helpers.helpers import get_helper

   HELPER_MODULE = "numpyhelper"
   helper = get_helper(HELPER_MODULE)


   def build():
       output_dir = os.environ.get("SCALEOUT_BUILD_OUTPUT_DIR", ".")
       np.random.seed(42)
       params = np.random.rand(10).astype(np.float32)
       helper.save([params], os.path.join(output_dir, "seed.npz"))
       print("Created seed.npz with 10 random parameters.")

This will create a seed model file ``seed.npz`` with random parameters when you run:

.. code-block:: bash

   scaleout run build -p client

The ``startup`` entry point points to a ``startup()`` function in the ``startup.py`` file:

.. code-block:: python

   from scaleout import EdgeClient, ScaleoutModel
   from scaleoututil.helpers.helpers import get_helper

   HELPER_MODULE = "numpyhelper"
   helper = get_helper(HELPER_MODULE)


   def startup(client: EdgeClient):
       MyClient(client)


   class MyClient:
       def __init__(self, client: EdgeClient):
           self.client = client
           client.set_train_callback(self.train)
           client.set_validate_callback(self.validate)
           client.set_custom_callback("my_command", self.my_command)

       def train(self, model: ScaleoutModel, settings):
           """Train the model with the given parameters and settings."""
           # Implement training logic here
           print("Training with model parameters:", model)
           model_params = model.get_model_params(helper)
           iterations = 100
           for i in range(iterations):
               if i % 10 == 0:
                   # It is possible to log metrics during training
                   print(f"Training iteration {i}/{iterations}")
                   self.client.log_metric({"train_iteration": i})
               # Regularly check if the task has been aborted
               self.client.check_task_abort()  # Throws an exception if the task has been aborted
           # After training, return the
           # updated model parameters and metadata
           new_model = ScaleoutModel.from_model_params(model_params, helper=helper)
           # Train returns updated model parameters and {"training_metadata": {num_examples: int}, ...}
           return new_model, {"training_metadata": {"num_examples": 1}}

       def validate(self, model: ScaleoutModel):
           """Validate the model with the given parameters."""
           # Implement validation logic here
           model_params = model.get_model_params(helper)
           print("Validating with model parameters")
           # Return validation metrics
           return {"validation_accuracy": 0.95}

       def my_command(self, command_params):
           """Handle a custom command with the given parameters."""
           print("Hello from my_command with parameters: ", command_params)
           return {"status": "custom command executed"}

As shown, the ``startup()`` function initializes the client (EdgeClient) and sets up the callbacks for training, validation, and prediction. There is also an example of a custom command callback, ``my_command``, which can be invoked from the server. The various callbacks contain placeholder logic that you would replace with your actual machine learning code:

**train** - receives the current model and training settings, performs training, and returns the updated model and metadata.

The callback receives:

- ``scaleoutmodel``: A ScaleoutModel object containing the model parameters to train. Load parameters using ``scaleoutmodel.get_model_params(helper)``.
- ``settings``: A dictionary containing training settings such as number of epochs, batch size, learning rate, etc.

The callback must return:

- A tuple containing the updated model and a metadata dictionary. The metadata dictionary can include any relevant information about the training process (e.g., number of training steps, loss values, etc.). This metadata can be utilized in the aggregation process or for logging purposes.

**Key features of the train callback:**
1. **Progress tracking**: Use ``edge_client.log_metric(key, value)`` to log metrics during training for real-time monitoring.
2. **Task abortion**: Call ``edge_client.check_task_abort()`` regularly to allow graceful stopping when a session is terminated from the server (can be invoked by the admin user).
3. **Flexible metadata**: Include any additional information in the metadata dictionary (hyperparameters, loss values, etc.) that will be stored in the backend.

**validate (optional)** - receives the current model, performs validation, and returns validation metrics.

The callback receives:

- ``scaleoutmodel``: A ScaleoutModel object containing the model parameters to validate. Load parameters using ``scaleoutmodel.get_model_params(helper)``.

The callback must return:

- A dictionary containing validation metrics. All **scalar metrics** in this dictionary will be captured and visualized in the Scaleout Edge UI. The entire content is stored in the backend database and accessible via the API and UI.

**my_command (optional)** - a custom command that can be invoked from the server with parameters. This can be used for custom operations outside of the standard training/validation/prediction flow.

The callback receives:

- ``command_params``: A dictionary containing parameters for the custom command.

.. note::
   The command can be invoked from the server using the Scaleout Edge API or CLI by specifying the command name and parameters. However, storing command results in the backend is currently not supported. The callback must still return a dictionary.

The callback must return:

- A dictionary containing the results of the custom command execution. This can include any relevant information about the command's outcome (e.g., success status, output data, etc.).

**Environment (python_env.yaml)**

In version 1.0, Python environment management is user-controlled by default. You have several options:
1. **Manual environment management (default)**: Install the dependencies specified in ``python_env.yaml`` manually using ``scaleout run install -p client``. This gives you full control over your Python environment.
2. **Managed environment mode (optional)**: Create a virtual environment in the client root directory, activate it, install Scaleout Edge, and start the client with the ``--managed-env`` flag. Scaleout will then manage package installation from ``python_env.yaml``.
3. **Custom environments**: You can use Docker containers or other custom environments as needed. Remove the ``python_env`` tag from ``scaleout.yaml`` if you are managing everything yourself.

.. note::
   The previous automatic virtual environment creation is no longer the default. Users now have more flexibility and control over their runtime environments.

Packaging for training on Scaleout Edge
=======================================

To run a project on Scaleout Edge, we compress the entire client folder as a .tgz file. There is a utility command in the Scaleout Edge CLI to do this:

.. code-block:: bash

   scaleout package create --path client

You can include a ``.scaleoutignore`` file in the client folder to exclude files from the package. This is useful for excluding large data files, temporary files, etc.

.. note::
   You don't have to create and use the compressed package. If you want to avoid distributing executable code over the network, you can stage the project folder on each client node manually and then use the ``--local-package`` flag when starting the client:

   .. code-block:: bash

      scaleout client start --api-url --local-package

   This assumes there is a ``client`` folder in the current working directory.

How is Scaleout Edge using the project?
=======================================

With an understanding of the Scaleout Edge project and the compute package, we can take a closer look at how Scaleout Edge uses the project during federated training.
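At a high level, the client side of this interaction can be sketched in plain Python. The following is an illustrative toy, not the actual Scaleout Edge implementation: the registration methods mirror the API shown in ``startup.py`` earlier, while ``ToyEdgeClient`` and its ``dispatch`` method are invented stand-ins for the real client's polling of the ClientManager.

.. code-block:: python

   # Toy sketch (illustrative names only): the client imports your startup
   # module, startup() registers callbacks, and incoming task requests are
   # dispatched to the registered callbacks.

   class ToyEdgeClient:
       def __init__(self):
           self._callbacks = {}

       def set_train_callback(self, fn):
           self._callbacks["train"] = fn

       def set_validate_callback(self, fn):
           self._callbacks["validate"] = fn

       def dispatch(self, task_type, *args):
           # The real client is driven by polling the ClientManager for
           # task requests; here we invoke the callback directly.
           return self._callbacks[task_type](*args)


   def startup(client):
       # A real startup() would register the methods of your client class
       client.set_train_callback(
           lambda model, settings: (model, {"training_metadata": {"num_examples": 1}})
       )
       client.set_validate_callback(lambda model: {"validation_accuracy": 0.95})


   client = ToyEdgeClient()
   startup(client)                                    # register callbacks
   model, meta = client.dispatch("train", [0.0], {})  # handle a training request
   metrics = client.dispatch("validate", model)       # handle a validation request
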
**Version 1.0 - Importing Client Architecture:**

In version 1.0, the architecture has been simplified and made more flexible:

1. A session is initiated by the controller, which pushes round configurations to the combiner(s).
2. The combiner publishes a training request to its ClientManager queue.
3. The Scaleout Edge client polls the ClientManager (unary RPC) for new task requests.
4. The client **imports** your startup module and calls the ``startup()`` function, which registers your callbacks.
5. When a training request arrives, the client calls your registered ``train`` callback with the current model.
6. Your callback performs the training update and returns the new model and metadata.
7. The client streams the model update back to the combiner for aggregation.
8. For validation requests, the same pattern applies with the ``validate`` callback after a new global model has been produced.

We recommend using the new importing client architecture.

**Key advantages of the new architecture:**

- **Direct import**: Your code runs in the same process as the client, improving performance and simplifying debugging.
- **Callback-based**: More flexible and easier to integrate with existing ML frameworks.
- **Real-time monitoring**: Use ``log_metric()`` to track training progress in real time.
- **Graceful termination**: Use ``check_task_abort()`` to handle session stops cleanly.
- **Better error handling**: Exceptions in your callbacks are properly caught and reported.

**Legacy Dispatcher Architecture:**

The previous dispatcher-based architecture is still available using the ``--dispatcher`` flag. In this mode:

1. The dispatcher reads the project file (``scaleout.yaml``) and executes shell commands for train/validate.
2. The client writes model data to temporary files and executes the commands as separate processes.
3. After execution, the client reads the results from files and streams them to the combiner.

The dispatcher mode is currently maintained for backward compatibility but might be deprecated in future releases. We recommend migrating to the new importing client architecture for better performance and flexibility.

Where to go from here?
======================

With an understanding of how Scaleout Edge projects are structured and created, you can explore our library of example projects. They demonstrate different use case scenarios for Scaleout Edge and its integration with popular machine learning frameworks such as PyTorch and TensorFlow.

**Version 1.0 examples (importing client):**

- `Importing Client Example `__
- `MNIST with Keras (updated) `__

**Legacy examples (dispatcher-based):**

- `Scaleout Edge + PyTorch `__
- `Scaleout Edge + TensorFlow/Keras `__
- `Scaleout Edge + Hugging Face `__
- `Scaleout Edge + Self-supervised learning `__

.. meta::
   :description lang=en: A Scaleout Edge project is a convention for packaging/wrapping machine learning code to be used for federated learning with Scaleout Edge.
   :keywords: Federated Learning, Machine Learning, Federated Learning Framework, Federated Learning Platform, FEDn, Scaleout Systems, Scaleout Edge