
Google BigQuery

This section describes how Kubeflow administrators can give users a simple, seamless way to access data stored and managed in Google BigQuery directly from their Kubeflow notebooks. It walks through setting up the integration between your Kubeflow deployment and Google BigQuery so that BigQuery datasets can be queried directly from a notebook.


Assumptions

Administrators should ensure the following are in place before configuring this integration.

  • An operational deployment of Rafay's Kubeflow-based MLOps offering
  • A Google Cloud service account JSON key file with access to BigQuery
  • kubectl access to the user's namespace where the secret will be created

Info

Please ensure that the service account has the BigQuery Job User and BigQuery Data Viewer roles for querying BigQuery datasets.


Benefits

Integrating Google BigQuery with Kubeflow using this approach provides several benefits, especially for organizations looking to leverage cloud-based data analytics alongside scalable machine learning workflows. Some of the advantages of this approach are:

Seamless Data Access for Machine Learning

By mounting Google BigQuery credentials directly into Kubeflow notebooks, data scientists and engineers can access and query large datasets stored in BigQuery without downloading data to local environments. This enables more efficient data handling, especially for large datasets that would be impractical to store locally. It facilitates real-time data integration, where users can update their models with fresh data directly from BigQuery, improving the accuracy and relevance of the models.

Enhanced Security with Kubernetes Secrets

Using Kubernetes Secrets to store the service account credentials ensures that sensitive information, like API keys, is kept secure and separate from application logic. The secrets are mounted into the environment and are only accessible to authorized pods, adhering to best practices for secure access to cloud resources. This approach minimizes the risk of credential leakage, as the sensitive JSON file is not stored directly in the notebook or code repositories.

Scalability and Flexibility

Kubeflow is designed for distributed and scalable machine learning on Kubernetes, allowing data processing workloads to scale according to the needs of the model training process. Integrating BigQuery allows users to seamlessly work with large datasets, leveraging the processing power of Google’s infrastructure. This integration enables the development of pipelines where data ingestion, preprocessing, and model training are all automated, reducing the manual intervention required in typical workflows.

Improved Collaboration and Reproducibility

By creating a PodDefault configuration for BigQuery access, multiple users and teams can set up new notebooks with consistent access configurations, leading to a standardized environment for accessing data. This setup ensures that different team members working on the same project can reproduce data access and training conditions, making it easier to collaborate on experiments and validate results.

Cost Efficiency

BigQuery operates on a serverless architecture, meaning users are charged only for the queries they run rather than for maintaining a database server. This pay-as-you-go model can be more cost-effective than managing and scaling databases manually. Integrating it directly with Kubeflow ensures that machine learning workflows leverage BigQuery’s serverless querying capabilities, allowing organizations to save on infrastructure costs while processing large volumes of data.

Faster Prototyping and Iteration

Data scientists can prototype models more quickly by pulling data directly from BigQuery, running analyses, and updating models without needing complex data transfer procedures. This integration reduces the friction between data storage and machine learning, leading to a more efficient development cycle. Direct access to BigQuery data allows users to iteratively adjust queries and model parameters based on the results, helping to fine-tune models rapidly and make adjustments to datasets as needed.

In summary, this method of integrating BigQuery with Kubeflow allows for a secure, scalable, and efficient approach to managing data access for machine learning tasks. It simplifies the workflow of accessing cloud-based data directly within a scalable machine learning environment, making it particularly beneficial for teams working with large datasets and complex models.


Step-by-Step - As Administrator

Create a Kubernetes Secret

Create a secret in the desired namespace that holds your Google Cloud service account credentials:

kubectl create secret generic bigquery-secret \
  --from-file=service-account.json=gcp.json \
  --namespace <my-namespace>

Info

Replace <my-namespace> with your Kubeflow namespace and gcp.json with the path to your service account JSON file.
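
If you prefer to automate this step, the same secret can be created programmatically. The snippet below is a minimal sketch using the official kubernetes Python client; it is equivalent to the kubectl command above, and the namespace and key file path are placeholders to replace with your own values.

from kubernetes import client, config

# Load kubeconfig from the default location (~/.kube/config)
config.load_kube_config()

v1 = client.CoreV1Api()

# Read the service account key file and store it under the key
# expected by the PodDefault (service-account.json)
with open("gcp.json") as f:
    key_data = f.read()

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="bigquery-secret"),
    string_data={"service-account.json": key_data},
)

# Replace "my-namespace" with the user's Kubeflow namespace
v1.create_namespaced_secret(namespace="my-namespace", body=secret)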


Create the PodDefault Configuration

Create a poddefault.yaml file with the following content to allow access to BigQuery.

apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: add-gcp-secret
spec:
  selector:
    matchLabels:
      bigquery-secret: "true"
  desc: "Allow access to Google BigQuery"
  volumeMounts:
    - name: bigquery-secret-volume
      mountPath: /var/secrets/google
  volumes:
    - name: bigquery-secret-volume
      secret:
        defaultMode: 420
        secretName: bigquery-secret
  env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: /var/secrets/google/service-account.json

Info

This configuration mounts the BigQuery secret into any pod with the matching label.


Apply the PodDefault

Apply the poddefault.yaml file to your namespace:

kubectl apply -f poddefault.yaml -n <my-namespace>
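
Optionally, you can confirm that the PodDefault was created in the namespace. The snippet below is a small sketch using the Kubernetes Python client's CustomObjectsApi; it assumes the PodDefault custom resource is served under the kubeflow.org/v1alpha1 group with the poddefaults resource name, and that you have the same kubeconfig access used above.

from kubernetes import client, config

config.load_kube_config()

custom_api = client.CustomObjectsApi()

# List PodDefault objects in the target namespace (replace "my-namespace")
pod_defaults = custom_api.list_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha1",
    namespace="my-namespace",
    plural="poddefaults",
)

for item in pod_defaults.get("items", []):
    print(item["metadata"]["name"], "-", item["spec"].get("desc", ""))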

Step-by-Step - As End User

Create a Kubeflow Notebook

  • Log in to the Kubeflow Dashboard.
  • Create a new Jupyter notebook server and provide a name for it.
  • In the configurations section, enable access to Google BigQuery by selecting the PodDefault applied by the administrator.
  • Click Launch.

Once the notebook server is running, connect to it and create a new Jupyter notebook.
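
Before installing any libraries, you can quickly verify that the PodDefault was applied to the notebook pod. The following sketch checks that the GOOGLE_APPLICATION_CREDENTIALS environment variable is set and that the mounted key file exists at the expected path.

import os

# The PodDefault sets this variable to /var/secrets/google/service-account.json
cred_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")

if cred_path and os.path.exists(cred_path):
    print(f"Credentials found at {cred_path}")
else:
    print("Credentials not found. Check that the PodDefault was selected when the notebook was created.")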


Install Python Libraries and Run BigQuery Queries

In your Jupyter notebook, install the required Python libraries:

!pip install google-cloud-bigquery
!pip install db-dtypes

Use the following Python script to connect to BigQuery and run queries.

from google.cloud import bigquery
from google.auth.exceptions import DefaultCredentialsError
from google.api_core.exceptions import GoogleAPIError, NotFound, BadRequest

def connect_to_bigquery():
    try:
        client = bigquery.Client()
        print("Connected to BigQuery successfully.")
        return client
    except DefaultCredentialsError as e:
        print("Could not find or authenticate credentials. Please check the configuration.")
        print(f"Error: {e}")
    except GoogleAPIError as e:
        print("An error occurred while connecting to BigQuery.")
        print(f"Error: {e}")
    except Exception as e:
        print("An unexpected error occurred.")
        print(f"Error: {e}")

def run_query(client, query):
    try:
        query_job = client.query(query)
        results = query_job.result()
        return results
    except NotFound as e:
        print("The specified table or dataset was not found.")
        print(f"Error: {e}")
    except BadRequest as e:
        print("There was an issue with the query syntax or logic.")
        print(f"Error: {e}")
    except GoogleAPIError as e:
        print("An API error occurred during the query execution.")
        print(f"Error: {e}")
    except Exception as e:
        print("An unexpected error occurred while running the query.")
        print(f"Error: {e}")

Example Code

Here is an example showing how you can access BigQuery from a Kubeflow notebook.

client = connect_to_bigquery()
if client:
    dataset_id = "mydata_set"  # Replace with your dataset ID
    table_id = f"{dataset_id}.sampletable"  # Replace with your table name
    query = f"SELECT * FROM `{table_id}` LIMIT 100"
    results = run_query(client, query)

    if results:
        for row in results:
            print(f"{row.get('first_name')}: {row.get('last_name')}")
    else:
        print("No rows to display.")

Info

Replace mydata_set and sampletable with your dataset ID and table name. The example assumes the table contains first_name and last_name columns.

Here is an example of what the output could look like for the user.

Connected to BigQuery successfully.
John: Doe
Jane: Smith
Alice: Johnson
Bob: Brown
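
Query results can also be loaded directly into a pandas DataFrame, which is where the db-dtypes package installed earlier comes in: it provides the pandas dtypes used for BigQuery types such as DATE and TIME. The snippet below is a sketch that reuses the placeholder dataset and table names from the example above.

from google.cloud import bigquery

# Reuse the client created earlier, or create a new one
client = bigquery.Client()

# Replace with your own dataset and table
query = "SELECT * FROM `mydata_set.sampletable` LIMIT 100"

# result() returns a RowIterator; to_dataframe() converts it using db-dtypes
df = client.query(query).result().to_dataframe()
print(df.head())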

Run and Verify the Notebook

Execute the code cells in your Jupyter notebook. If everything is configured correctly, you should see the output of the BigQuery query.

Considerations

  • Ensure that the BigQuery table exists and has the necessary permissions for the service account.
  • If issues arise, verify that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to the correct path and that the secret is properly mounted. If needed, credentials can also be loaded explicitly from the mounted file, as shown in the sketch below.
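
If the client cannot locate credentials through GOOGLE_APPLICATION_CREDENTIALS, you can load them explicitly from the mounted secret instead. The sketch below assumes the default mount path from the PodDefault above.

from google.cloud import bigquery
from google.oauth2 import service_account

# Path where the PodDefault mounts the service account key
key_path = "/var/secrets/google/service-account.json"

# Build credentials explicitly rather than relying on the environment variable
credentials = service_account.Credentials.from_service_account_file(key_path)

# The project ID is taken from the service account key file
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
print("Connected to project:", client.project)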