Google Cloud Storage

In this section, we describe how Kubeflow administrators can give users a simple, seamless way to access data stored in Google Cloud Storage (GCS) directly from their Kubeflow notebooks. This guide walks you through setting up the integration between your Kubeflow deployment and GCS so that data in GCS-backed object storage can be queried directly from a Kubeflow notebook.

Info

Note that the approach described below is conceptually identical to the integration with Google BigQuery. Administrators may prefer to grant GCS permissions via the same service account.


Assumptions

Administrators should have the following in place before configuring this integration.

  • An operational deployment of Rafay's Kubeflow-based MLOps offering
  • A Google Cloud Service Account JSON file with access to GCS
  • Kubectl access to the user's namespace where the secret will be created

Info

Please ensure that the GCS buckets and permissions are set up before access is attempted. Also ensure that the service account has the 'Storage Object Viewer' role, which grants read-only access to the bucket.
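
For reference, bucket-level read access can be granted with a command along these lines. This is a sketch assuming a recent gcloud CLI; the bucket name, service account name, and project ID are placeholders for your own values.

gcloud storage buckets add-iam-policy-binding gs://<my-gcs-bucket> \
  --member="serviceAccount:<sa-name>@<project-id>.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"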


Step-by-Step - As Administrator

Create a Kubernetes Secret

Create a secret in the desired namespace that holds your Google Cloud service account credentials:

kubectl create secret generic gcs-secret \
  --from-file=service-account.json=gcp.json \
  --namespace <my-namespace>

Info

Replace <my-namespace> with your Kubeflow namespace and gcp.json with the path to your service account JSON file.
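
As a quick sanity check, you can confirm the secret exists in the namespace before moving on (again with <my-namespace> as a placeholder):

kubectl get secret gcs-secret -n <my-namespace>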


Create the PodDefault Configuration

Create a poddefault.yaml file with the following content to allow access to GCS.

apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: add-gcp-secret
spec:
  selector:
    matchLabels:
      gcs-secret: "true"
  desc: "Allow access to Google Cloud Storage"
  volumeMounts:
    - name: gcs-secret-volume
      mountPath: /var/secrets/google
  volumes:
    - name: gcs-secret-volume
      secret:
        defaultMode: 420
        secretName: gcs-secret
  env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /var/secrets/google/service-account.json

Info

This configuration mounts the GCS secret into any pod with the matching label.
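
For context, when a user selects this configuration while launching a notebook, the notebook pod is labeled to match the selector above. Conceptually, the pod metadata ends up with a label fragment like the following (illustrative only, nothing for you to apply):

metadata:
  labels:
    gcs-secret: "true"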


Apply the PodDefault

Apply the poddefault.yaml file to your namespace:

kubectl apply -f poddefault.yaml -n <my-namespace>
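
You can confirm the PodDefault was created by listing PodDefault resources in the namespace (with <my-namespace> as a placeholder):

kubectl get poddefault add-gcp-secret -n <my-namespace>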

Step-by-Step - As End User

Create a Kubeflow Notebook

  • Log in to the Kubeflow Dashboard.
  • Create a new Jupyter notebook server and provide a name for it.
  • In the Configurations section, select the option that allows access to Google Cloud Storage, i.e. the PodDefault you applied above.
  • Click Launch.

Once the notebook server is running, connect to it and create a new Jupyter notebook.
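
Before installing anything, you can run a quick check in a notebook cell to confirm that the PodDefault mounted the credentials. This is a minimal sketch assuming the mount path and environment variable defined in the PodDefault above.

import os

# Path injected by the PodDefault via GOOGLE_APPLICATION_CREDENTIALS
cred_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
print("GOOGLE_APPLICATION_CREDENTIALS =", cred_path)
print("Credentials file present:", bool(cred_path) and os.path.exists(cred_path))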


Install Python Libraries and Run Queries against GCS

In your Jupyter notebook, install the required Python libraries:

!pip install google-cloud-storage

Use the following Python script to connect to GCS and read data from a bucket.

from google.cloud import storage
from google.auth.exceptions import DefaultCredentialsError
from google.api_core.exceptions import GoogleAPIError, NotFound, BadRequest

# Download an object from a GCS bucket and print its contents
def read_from_gcs(bucket_name, blob_name, destination_file_name):
    try:
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name)

        blob.download_to_filename(destination_file_name)
        print(f"Blob {blob_name} downloaded to {destination_file_name}.")

        with open(destination_file_name, 'r') as file:
            data = file.read()
            print("File contents:")
            print(data)
            return data
    except DefaultCredentialsError as e:
        print("Could not locate Google Cloud credentials; check GOOGLE_APPLICATION_CREDENTIALS.")
        print(f"Error: {e}")
    except NotFound as e:
        print("The specified file or bucket was not found.")
        print(f"Error: {e}")
    except BadRequest as e:
        print("There was an issue with the request or parameters.")
        print(f"Error: {e}")
    except GoogleAPIError as e:
        print("An API error occurred while reading from GCS.")
        print(f"Error: {e}")
    except Exception as e:
        print("An unexpected error occurred while reading the file.")
        print(f"Error: {e}")
    return None

if __name__ == "__main__":
    bucket_name = "my-gcs-bucket"
    blob_name = "abcd.txt"
    downloaded_file_name = "/home/jovyan/aaaa.txt"
    read_from_gcs(bucket_name, blob_name, downloaded_file_name)

Run and Verify the Notebook

Execute the code cells in your Jupyter notebook. If everything is configured correctly, you should be able to access data in the GCS bucket.
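
As an additional smoke test, you can list objects in the bucket from a notebook cell. This is a minimal sketch using the same client library, with my-gcs-bucket as a placeholder bucket name.

from google.cloud import storage

# List the first few objects in the bucket as a quick connectivity check
client = storage.Client()
for blob in client.list_blobs("my-gcs-bucket", max_results=10):
    print(blob.name)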

Considerations

  • Ensure that the GCS bucket and object exist and that the service account has the necessary permissions on them.
  • If issues arise, verify that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to the correct path and that the secret is properly mounted (see the check below).
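
If the credentials do not appear to be mounted, one way to check from outside the notebook is to exec into the notebook pod and list the secret mount path. The pod name below is a placeholder for your notebook server pod.

kubectl exec -n <my-namespace> <notebook-pod-name> -- ls /var/secrets/google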