Troubleshooting

While notebooks offer a powerful and flexible environment for data scientists and machine learning engineers to work in a cloud-native infrastructure, there are some common issues and challenges that users may encounter when setting up, running, or managing notebooks.

Facilities for Troubleshooting¶

End Users¶

End users who launch and use notebooks have integrated facilities to troubleshoot issues with their notebooks. Hover over the notebook name and click it to view additional details about the state of the notebook. This presents the user with four tabs: Overview, Logs, Events, and YAML.

Notebook Troubleshooting Menus

Overview¶

Click on the Overview tab to view details about the notebook.

Notebook Troubleshooting Overview

Logs¶

Click on the Logs tab to view the notebook pod's logs from the host Kubernetes cluster.

Notebook Troubleshooting Logs

Events¶

Click on the Events tab to view Kubernetes events for the pods backing the notebook.

Notebook Troubleshooting Events

YAML¶

Click on the YAML tab to view the YAML-based spec for the notebook pod.

Notebook Troubleshooting YAML

Administrators¶

Administrators have elevated privileges (i.e., the required roles) in the Rafay Platform. This allows them to use the integrated facilities (Kubernetes dashboard, alerting, notifications, zero-trust kubectl, etc.) to troubleshoot issues quickly and efficiently.

Common Scenarios¶

Resource Management and Scaling¶

Insufficient Resources (CPU, Memory, GPU)¶

Issue: Notebooks may fail to start or run slowly if they are not allocated sufficient resources (e.g., CPU, memory, GPU). This is especially problematic for resource-intensive machine learning workloads.
Solution: Request appropriate resources when creating the notebook server by setting CPU, memory, and GPU requests and limits to match the workload. These values are set in the web console at creation time.

Note

Once a notebook has been created, it is not possible to update its resources and configuration. Users need to create a new notebook with required configurations instead.
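To confirm what a running notebook was actually given, the memory limit can be read from inside the notebook itself. The following is a minimal sketch, assuming a Linux container that exposes cgroup v1 or v2 at the usual /sys/fs/cgroup paths; the function names are hypothetical and not part of the platform.

```python
import os
from pathlib import Path
from typing import Optional

# Usual cgroup locations for the container memory limit (v2 first, then v1).
# Assumption: the notebook runs in a Linux container with one of these mounted.
CGROUP_MEMORY_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_memory_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the cgroup reports no limit."""
    value = raw.strip()
    if value == "max" or not value.isdigit():
        return None
    return int(value)


def container_memory_limit(paths=CGROUP_MEMORY_FILES) -> Optional[int]:
    """Read the first readable cgroup file; None if no limit can be determined."""
    for path in paths:
        try:
            return parse_memory_limit(Path(path).read_text())
        except OSError:
            continue
    return None


limit = container_memory_limit()
# Note: os.cpu_count() reports the CPUs of the node, not the container CPU limit.
print("CPUs on the node:", os.cpu_count())
print("Memory limit (GiB):", "unknown or unlimited" if limit is None else round(limit / 2**30, 2))
```

If the reported limit is smaller than the working set of the workload, create a new notebook with a larger memory allocation.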

Resource Contention¶

Issue: In a multi-tenant environment, multiple users running notebooks on the same cluster may compete for limited resources, leading to slower performance or job failures.
Solution: Configure resource quotas and limits so that resources are allocated fairly between users and teams. Setting resource quotas on namespaces helps avoid contention and ensures users get the resources they need.
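On a plain Kubernetes cluster, a namespace quota is expressed as a ResourceQuota object. The sketch below builds such a manifest in Python and prints it as JSON, which kubectl accepts alongside YAML. The quota name, namespace, and values are hypothetical placeholders, and the platform may expose its own quota settings that should be preferred where available.

```python
import json


def resource_quota(namespace: str, cpu: str, memory: str, gpus: int) -> dict:
    """Build a Kubernetes ResourceQuota manifest for one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # "notebook-quota" is an illustrative name, not a platform default.
        "metadata": {"name": "notebook-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                # Extended resources such as GPUs are quota-limited via requests.*
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


manifest = resource_quota("team-a", cpu="16", memory="64Gi", gpus=2)
print(json.dumps(manifest, indent=2))
```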

Unused Notebooks Consuming Resources¶

Issue: Idle notebooks can consume cluster resources without being actively used, leading to wasted compute resources and higher costs.
Solution: Implement automatic notebook shutdown (i.e. culling) for idle notebooks. You can configure notebook servers to automatically stop after a period of inactivity, freeing up resources for other tasks.
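The decision behind culling is simple: compare the time of the last recorded activity with an idle threshold. The sketch below illustrates only that logic; the function name and the 60-minute threshold are assumptions, and the actual culling mechanism and its settings are provided by the platform.

```python
from datetime import datetime, timedelta, timezone


def should_cull(last_activity: datetime, now: datetime, idle_limit: timedelta) -> bool:
    """Return True when the notebook has been idle for at least idle_limit."""
    return now - last_activity >= idle_limit


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
idle_limit = timedelta(minutes=60)  # hypothetical threshold

print(should_cull(now - timedelta(minutes=90), now, idle_limit))  # True
print(should_cull(now - timedelta(minutes=10), now, idle_limit))  # False
```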

Storage and Data Management¶

Data Persistence Issues¶

Issue: If data is not stored in persistent volumes (PVs), it can be lost when the notebook server is stopped or deleted. This results in the need to re-upload or re-process data.
Solution: Use Persistent Volume Claims (PVCs) to attach persistent storage to notebooks. Ensure that important datasets, models, and intermediate results are stored in a mounted volume that persists across sessions.

Note

A workspace volume is created by default for every notebook so that this issue can be avoided.

Insufficient Storage Allocation¶

Issue: Notebooks may fail or crash if the allocated storage is insufficient for the size of the datasets or the model artifacts being used.
Solution: Monitor the storage usage of your notebooks and ensure that the PVC has enough storage capacity. You can resize PVCs or attach additional volumes as needed.
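Storage usage can be checked from inside the notebook with the Python standard library. This is a minimal sketch; the path "." (the current working directory) and the 80% warning threshold are assumptions, so substitute the mount path of the volume you want to inspect.

```python
import shutil


def storage_report(path: str, warn_at: float = 0.8) -> dict:
    """Summarize usage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": round(usage.total / 2**30, 2),
        "free_gib": round(usage.free / 2**30, 2),
        "used_fraction": used_fraction,
        "warning": used_fraction >= warn_at,
    }


report = storage_report(".")
print(report)
if report["warning"]:
    print("Volume is nearly full: clean up files or request more storage.")
```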

Networking Issues¶

Difficulty Accessing External Resources¶

Issue: Notebooks may struggle to access external data sources or APIs due to misconfigured networking, firewalls, or permissions.
Solution: Ensure that the necessary networking rules (e.g., firewall or Virtual Private Cloud (VPC) settings) allow the notebook to connect to external resources. Configure proper Kubernetes network policies if needed.
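A quick way to tell a networking problem from an application problem is to test whether a TCP connection to the external endpoint can be opened at all. The sketch below uses only the standard library; the function name is hypothetical, and the host and port in the usage note are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_connect("example.com", 443) returning False from the notebook, while the same endpoint is reachable from elsewhere, points at firewall rules, VPC settings, or network policies rather than at the notebook code.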

Slow Data Transfer¶

Issue: If the notebook requires large datasets or connects to remote data sources, data transfer speeds can become a bottleneck.
Solution: Use data sources that are co-located within the same cloud provider or region to minimize network latency. Consider caching or staging large datasets within the cluster's storage.
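Staging a dataset on the cluster's storage can be as simple as downloading it once and reusing the local copy afterwards. The following sketch shows that pattern; cached_path and the fetch callback are hypothetical names, and the cache directory should point at a persistent volume.

```python
from pathlib import Path
from typing import Callable


def cached_path(cache_dir: Path, name: str, fetch: Callable[[Path], None]) -> Path:
    """Return a local copy of a dataset, calling fetch(dest) only on a cache miss."""
    target = cache_dir / name
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        partial = cache_dir / (name + ".part")
        fetch(partial)            # download or copy into a temporary file
        partial.replace(target)   # rename only after the transfer has finished
    return target
```

The fetch callback receives the destination path and is responsible for the transfer itself, for example by downloading from object storage. Writing to a temporary file first avoids treating an interrupted transfer as a valid cached copy.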

User Access and Permissions¶

Incorrect Permissions (RBAC)¶

Issue: Users may encounter access issues if they do not have the proper Role-Based Access Control (RBAC) permissions to create or manage notebooks, access datasets, or view resources in specific namespaces.
Solution: Properly configure RBAC roles and bindings in Kubernetes. Ensure that users are assigned the correct permissions to perform actions such as creating notebooks, accessing volumes, and running pipelines.

Namespace Isolation¶

Issue: Users may accidentally create resources (e.g., notebooks, pipelines) in the wrong namespace, leading to confusion or unauthorized access.
Solution: Ensure that users are assigned to their specific workspace (each workspace maps to a Kubernetes namespace). For users with access to multiple workspaces, provide clear instructions reminding them to select the correct workspace first.

Environment Management¶

Inconsistent Environments¶

Issue: Notebook environments may vary between different users or sessions, leading to inconsistent results (e.g., different package versions or dependencies).
Solution: Use predefined or custom Docker images that contain all the necessary dependencies, libraries, and versions required for the project. This ensures consistency across environments.

Difficulty Reproducing Results¶

Issue: If environments or dependencies are not properly managed, it can be difficult to reproduce results across different notebook sessions or by different users.
Solution: Leverage the platform's capabilities to manage and standardize environments, including using versioned Docker images and keeping environment configuration files (e.g., requirements.txt or environment.yml) alongside the project.
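To record the exact package versions of a working session, the installed distributions can be listed from inside the notebook and saved next to the project. This is a minimal sketch using the standard library; the function name is hypothetical, and the result is similar in spirit to the output of pip freeze.

```python
from importlib import metadata


def pinned_requirements() -> list:
    """Return 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


requirements = pinned_requirements()
print(f"{len(requirements)} packages pinned")
print("\n".join(requirements[:5]))
```

Writing the returned lines to a requirements.txt file and committing it with the notebook gives other users a concrete list to rebuild the same environment from.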

Performance Issues¶

Slow Start-Up Times¶

Issue: Notebook servers may take a long time to start, especially if they need to pull large container images or initialize many dependencies.
Solution: Use optimized container images that have all necessary libraries and dependencies pre-installed. Also, leverage caching mechanisms to speed up the retrieval of commonly used images.

Slow Performance Due to Resource Allocation¶

Issue: Notebooks may run slowly due to under-allocated resources such as CPU or memory.
Solution: Ensure proper allocation of resources when starting the notebook server. For workloads requiring GPUs, make sure GPU resources are requested and assigned to the notebook server.

Version Control and Collaboration¶

Git Clone Issues¶

Issue: When using a Jupyter image such as "jupyter/scipy-notebook:python-3.11.X" and cloning a repository into the notebook, an error like the one below occurs.

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.11/site-packages/jupyterlab_git/git.py", line 175, in execute
        code, output, error = await call_subprocess_with_authentication(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/jupyterlab_git/git.py", line 119, in call_subprocess_with_authentication
        i = await p.expect(["Username for .*: ", "Password for .*:"], async_=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/pexpect/spawnbase.py", line 343, in expect
        return self.expect_list(compiled_pattern_list,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/pexpect/spawnbase.py", line 369, in expect_list
        from ._async import expect_async
      File "/opt/conda/lib/python3.11/site-packages/pexpect/_async.py", line 7, in <module>
        @asyncio.coroutine
         ^^^^^^^^^^^^^^^^^
    AttributeError: module 'asyncio' has no attribute 'coroutine'

Solution: This error appears when pexpect 4.8.0 is used with Python 3.11.X. As the traceback shows, pexpect 4.8.0 uses the asyncio.coroutine decorator, which no longer exists in Python 3.11, so the two are incompatible. Upgrading pexpect to 4.9.0 resolves the issue. To upgrade, run the following command in a shell.

pip install pexpect==4.9.0 --no-deps

The same fix can be applied from a Jupyter cell by using the ! operator to run the shell command.

# In notebook
!pip install pexpect==4.9.0 --no-deps

For a change that persists within a notebook, uninstall the root-level pexpect and install the fixed version with the --user flag.

pip uninstall pexpect
pip install pexpect==4.9.0 --no-deps --user

NOTE: pip uninstall prompts you to confirm before removing the package.
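Before or after applying the fix, it can help to check whether the current environment is affected. The sketch below compares the installed pexpect version with the Python version. It assumes that pexpect releases earlier than 4.9 behave like 4.8.0 on Python 3.11 and later; the helper names are hypothetical.

```python
import re
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.8.0' into (4, 8, 0), keeping only the leading digits of each part."""
    matches = (re.match(r"\d+", part) for part in version.split("."))
    return tuple(int(m.group()) for m in matches if m)


def is_affected(pexpect_version: str, python_version=sys.version_info) -> bool:
    """True for pexpect older than 4.9 running on Python 3.11 or newer."""
    return tuple(python_version[:2]) >= (3, 11) and version_tuple(pexpect_version) < (4, 9)


try:
    installed = metadata.version("pexpect")
    print(f"pexpect {installed}, affected: {is_affected(installed)}")
except metadata.PackageNotFoundError:
    print("pexpect is not installed in this environment")
```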

Difficulty Collaborating on Notebooks¶

Issue: Multiple users may find it difficult to collaborate on the same notebook in real-time, leading to versioning issues or overwritten work.
Solution: Encourage the use of shared volumes or integrate with version control systems like Git. Notebooks can be configured to push changes to a Git repository, enabling version control and collaboration.

Model Versioning¶

Issue: Managing multiple versions of models, especially across different notebook sessions or users, can become difficult.
Solution: Use model versioning tools, or use the built-in MLflow integration for model storage, to keep track of different model versions.

Security and Compliance¶

Data Security Concerns¶

Issue: Sensitive data may be inadvertently exposed if proper security measures are not in place.
Solution: Ensure that data is stored securely using appropriate encryption and access controls. Use Kubernetes network policies and the built-in multi-user isolation to enforce secure access to data.

Container Security¶

Issue: Containers running notebook servers may introduce security risks if they use outdated or vulnerable images.
Solution: Regularly update the container images used for notebooks and apply security patches. Use security-hardened base images and follow best practices for container security.

Dependency and Compatibility Issues¶

Dependency Conflicts¶

Issue: Different notebooks may require conflicting versions of libraries or packages, leading to compatibility issues.
Solution: Use isolated environments or custom Docker images for each project to manage dependencies. This ensures that each notebook can run with the correct versions of packages without affecting other notebooks.

Library Compatibility with GPUs¶

Issue: Some libraries, particularly those using GPUs, may be difficult to configure correctly (e.g., ensuring that TensorFlow or PyTorch uses the GPU).
Solution: Make sure the GPU driver, CUDA, and cuDNN versions are compatible with the libraries in use. Use the pre-configured notebook images provided with the platform, or images from cloud providers, that come with the necessary GPU configuration.
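A short check from inside the notebook shows whether the framework can actually see a GPU. The sketch below uses PyTorch as an example and degrades gracefully when PyTorch is not installed in the image; the function name and the report keys are hypothetical.

```python
def gpu_report() -> dict:
    """Report whether PyTorch is installed and whether it can use a CUDA GPU."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "cuda_version": None,
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not part of this image
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    report["cuda_version"] = torch.version.cuda  # CUDA version PyTorch was built with
    return report


print(gpu_report())
```

If PyTorch is installed but cuda_available is False, check that a GPU was requested for the notebook and that the image's CUDA build matches the driver on the node.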