
TensorFlow

In this Get Started guide, you will perform distributed training using TensorFlow against your remote Ray endpoint. This guide assumes the following:

  • You have already created a "Ray as Service" tenant using Rafay
  • You have the HTTPS URL and access credentials for the remote endpoint
  • You have Python 3 installed on your laptop



Review Code

Download the source code file "tensorflow_test.py" and open it in your favorite IDE such as VS Code to review it. This code performs distributed training of a simple TensorFlow model using Ray and TensorFlow's MirroredStrategy.

Distributed Training

Uses MirroredStrategy to automatically distribute the model and data across the available GPUs, enabling faster training when more than one GPU is present.

Simple Setup

This example simplifies the setup by using mock data and a basic model architecture, making it easy to understand how to distribute training with TensorFlow.

Integration with Ray

While Ray is initialized in the script, the actual distribution of the model training is managed by TensorFlow’s MirroredStrategy. Ray's initialization sets up a distributed environment, which could be useful if the script were part of a larger workflow managed by Ray.


Step-by-Step

Below is a detailed breakdown of what each part of the code does.

Import Libraries

import tensorflow as tf
import numpy as np
import ray

TensorFlow (tf): Used for building and training neural networks.

NumPy (np): Used to generate random data for training.

Ray: A distributed computing library used here to initialize a distributed environment.


Initialize Ray

ray.init()

ray.init() starts the Ray runtime, enabling the script to use Ray’s distributed computing features.

Note

Ray is not directly involved in managing the TensorFlow training but sets up a distributed context, which could be useful if further distributed tasks were added.
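
As a quick sanity check (not part of the original script), you can ask Ray what resources it sees after initialization; on a laptop this is just the local machine, while on a remote cluster it reflects all nodes.

import ray

ray.init()

# cluster_resources() reports the CPUs, GPUs and memory known to the Ray
# runtime, which is a simple way to confirm the distributed context is up.
print(ray.cluster_resources())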


Define the Distributed Training Function

def train_distributed_model():
    strategy = tf.distribute.MirroredStrategy()
    ...
    with strategy.scope():
        model = tf.keras.Sequential([...])
        model.compile(...)
        model.fit(...)
        return model.summary()

train_distributed_model is a function that encapsulates the TensorFlow model definition and training process using a distribution strategy.

tf.distribute.MirroredStrategy(): This strategy is used to distribute the training across multiple GPUs on a single machine. It mirrors all variables across the GPUs and performs synchronous training, where each batch is divided among the GPUs, and gradients are aggregated before updating the model.

The strategy.scope() ensures that all layers and model parameters are mirrored across available GPUs, enabling them to be trained in parallel.
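
A small check, separate from the original script, shows how many devices the strategy will mirror variables across; with no GPUs present, MirroredStrategy falls back to a single CPU replica.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Each "replica" is one device (typically one GPU) that holds a mirrored
# copy of the model variables during synchronous training.
print("Number of replicas:", strategy.num_replicas_in_sync)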


Generate Random Data

X_train = np.random.randn(1000, 10)
y_train = np.random.randn(1000, 1)

This step generates 1000 samples of training data. This is mock data used for demonstration purposes, not real-world training data.

  • X_train: Features with 10 input variables, each generated randomly.
  • y_train: Corresponding target values, generated randomly.
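
If you later replace the mock arrays with a real dataset, one common pattern (not used in the original script) is to wrap the data in a tf.data.Dataset, which model.fit() accepts directly:

import numpy as np
import tensorflow as tf

X_train = np.random.randn(1000, 10)
y_train = np.random.randn(1000, 1)

# Batching here controls the global batch size that MirroredStrategy later
# splits across the available GPUs.
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)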

Define and Compile the Model

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')

with strategy.scope(): Ensures that the entire model is built within the distribution strategy, allowing it to be distributed across multiple GPUs.

Model Architecture: A simple feedforward neural network with:

  • Dense layer with 64 units and ReLU activation.

  • Output layer with 1 unit, representing a regression output.

Our model takes input data with 10 features.

model.compile():

  • Uses Adam optimizer for gradient-based optimization.

  • Uses mean squared error (MSE) loss, which is standard for regression problems.
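
For reference, the same compile step can be written with an explicit optimizer object. This sketch is not part of the original script and assumes the strategy and model defined above; it only illustrates where a custom learning rate or extra metrics such as mean absolute error would go.

with strategy.scope():
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # explicit learning rate
        loss='mse',
        metrics=['mae'],  # optional extra metric for monitoring
    )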


Train the Model

model.fit(X_train, y_train, epochs=5)

Trains the model on the generated data for 5 epochs. epochs=5 means that the entire dataset is passed through the model 5 times during training. Since the model is distributed using MirroredStrategy, each batch of data is split across the GPUs, allowing parallel training on available devices.
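
An optional refinement, not in the original script and assuming the strategy, model, and data defined above: scale the global batch size with the number of replicas so that each GPU sees a fixed per-device batch.

per_replica_batch_size = 32
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# MirroredStrategy splits each global batch evenly across replicas, so this
# keeps the per-GPU workload constant as more GPUs become available.
model.fit(X_train, y_train, epochs=5, batch_size=global_batch_size)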


Return Model Summary

return model.summary()

model.summary() prints a textual summary of the model architecture, including the layers, the number of parameters in each layer, and the total number of trainable parameters, helping users understand the model's structure and size.
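
Note that model.summary() prints to stdout and returns None. If you want the summary as a string (to log it, for example), one option, assuming the model object above, is to pass a print_fn callback:

lines = []
model.summary(print_fn=lines.append)  # collect each printed line
summary_text = "\n".join(lines)
print(summary_text)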


Run the Training Function

if __name__ == "__main__":
    print("Starting distributed training with TensorFlow...")
    train_distributed_model()

This block checks whether the script is being run directly (not imported as a module). It prints a message indicating that distributed training is starting and then calls train_distributed_model(), initiating the training process.


Job Submission Code

Download the source code file "run.py" and open it in your favorite IDE such as VS Code to review it. As you can see from the code snippet below, we will be using Ray's Job Submission Client to submit a job to the remote Ray endpoint.

from ray.job_submission import JobSubmissionClient
import ray
import urllib3

# Suppress the warning about unverified HTTPS requests
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Ray client
client = JobSubmissionClient(
    "https://<URL> for Ray Endpoint>", 
    headers={"Authorization": "Basic <Base64 Encoded Credentials>"}, 
    verify=False  # Disable SSL verification
)

# Submit job
client.submit_job(entrypoint="python tensorflow_test.py", runtime_env={"working_dir": "./"})

Now, update the authorization credentials with the base64 encoded credentials for your Ray endpoint. You can use the following command to perform the encoding.

echo -n 'admin:PASSWORD' | base64
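
If you prefer to do the encoding in Python instead of the shell, an equivalent snippet is:

import base64

# Replace PASSWORD with the actual password for your Ray endpoint.
credentials = base64.b64encode(b"admin:PASSWORD").decode()
print(credentials)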

Submit Job

To submit the job to your remote Ray endpoint:

  • First, in your web browser, access the Ray Dashboard's URL and keep it open. We will monitor the status and progress of the submitted job here.
  • Now, open a terminal and enter the following command:

python3 ./run.py

This submits the job to the configured Ray endpoint; you can review progress and the results on the Ray Dashboard. Once the Ray endpoint receives the job, it will remain in a pending state for a few seconds.
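
You can also track the job from Python instead of the dashboard. submit_job() returns a submission ID; a minimal sketch, assuming the client object from run.py above:

job_id = client.submit_job(
    entrypoint="python tensorflow_test.py",
    runtime_env={"working_dir": "./"},
)

# Poll the job's lifecycle state (PENDING, RUNNING, SUCCEEDED, FAILED) and
# fetch the driver logs collected so far.
print(client.get_job_status(job_id))
print(client.get_job_logs(job_id))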

The script will print a summary of the TensorFlow model, showing the layers and the number of parameters. Training logs will display the loss value after each epoch, indicating how well the model is learning to fit the randomly generated data.

Shown below is an illustrative example of what you can expect to see.

Starting distributed training with TensorFlow...
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten (Flatten)            (None, 10)                0         
dense (Dense)                (None, 64)                704       
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 769
Trainable params: 769
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
32/32 [==============================] - 1s 19ms/step - loss: 0.9876
Epoch 2/5
32/32 [==============================] - 0s 11ms/step - loss: 0.7564
...