nvidia_device_time_slicing_replicas

Description

The nvidia_device_time_slicing_replicas parameter enables NVIDIA GPU time slicing by setting the number of virtual replicas into which each physical GPU is divided. This allows multiple workloads to share a single physical GPU by time-sharing its compute, improving utilization and reducing the cost of expensive GPU hardware.

Default Value

By default, nvidia_device_time_slicing_replicas is not set and time slicing is disabled.

Use Cases

  • Cost Optimization: Maximize GPU utilization by running multiple workloads on expensive GPU hardware.
  • Resource Efficiency: Better allocation of GPU resources for workloads that don't require full GPU capacity.
  • Improved Throughput: Support more concurrent workloads without additional hardware investment.
  • Flexible Workload Scheduling: Enable smaller workloads such as inference serving or development tasks to coexist on the same GPU.
  • Development and Testing: Allow multiple developers to share GPU resources for testing and development purposes.

Prerequisites

The NVIDIA device plugin must be enabled on your rack before using time slicing:

$ convox rack params set nvidia_device_plugin_enable=true -r rackName
Setting parameters... OK
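
Once the device plugin is enabled, you can optionally confirm that its daemonset is running on the rack's GPU nodes. The namespace and daemonset name depend on how the rack deploys the plugin, so the command below is only an illustrative check:

$ kubectl get daemonsets -A | grep -i nvidia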

Setting Parameters

To configure GPU time slicing, set the number of virtual replicas each physical GPU should be divided into:

$ convox rack params set nvidia_device_time_slicing_replicas=5 -r rackName
Setting parameters... OK
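
You can confirm the configured value by listing the rack parameters. The output below is illustrative and shows only the relevant entries; the exact formatting may vary between CLI versions:

$ convox rack params -r rackName
nvidia_device_plugin_enable          true
nvidia_device_time_slicing_replicas  5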

To disable GPU time slicing:

$ convox rack params unset nvidia_device_time_slicing_replicas -r rackName
Unsetting nvidia_device_time_slicing_replicas... OK

Understanding Replicas

The replica count determines how many virtual GPU resources each physical GPU provides; a worked example follows the list:

  • replicas=2: each virtual GPU gets ~50% of the physical GPU's capacity
  • replicas=4: each virtual GPU gets ~25% of the physical GPU's capacity
  • replicas=5: each virtual GPU gets ~20% of the physical GPU's capacity
  • replicas=8: each virtual GPU gets ~12.5% of the physical GPU's capacity
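
For example, with nvidia_device_time_slicing_replicas=4, a node with 2 physical GPUs advertises 2 × 4 = 8 nvidia.com/gpu resources to the Kubernetes scheduler; each of the 8 virtual GPUs time-shares one of the two physical devices rather than receiving dedicated hardware.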

Verification

After applying the configuration, you can verify that nodes are advertising the correct number of GPU resources:

$ kubectl describe node <gpu-node-name>

You should see the GPU capacity multiplied by your replica count (e.g., 1 physical GPU × 5 replicas = 5 advertised GPU resources).
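
For instance, on a node with a single physical GPU and nvidia_device_time_slicing_replicas=5, the output should include five GPU resources (excerpt trimmed to the relevant lines; the real output contains many more fields):

Capacity:
  nvidia.com/gpu:  5
Allocatable:
  nvidia.com/gpu:  5

You can also query the advertised count directly with a jsonpath expression:

$ kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
5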

Important Considerations

  • Shared Access: Time slicing provides shared access to GPU compute, not exclusive access to GPU resources.
  • No Memory Isolation: There is no memory isolation between workloads sharing the same GPU. All workloads share the same GPU memory space.
  • Fault Domain: All workloads sharing a GPU run in the same fault domain. If one workload crashes or behaves poorly, it may affect other workloads on the same GPU.
  • No Proportional Guarantees: Requesting multiple time-sliced GPU resources does not guarantee proportional compute power or performance.
  • Performance Variability: GPU performance may vary depending on the workload characteristics and timing of concurrent tasks.

Best Practices

  • Start with lower replica counts (2-4) and increase based on workload requirements and performance testing.
  • Monitor GPU utilization and memory usage to optimize the replica count for your specific workloads (see the monitoring example after this list).
  • Consider the memory requirements of your workloads when setting replica counts, as GPU memory is shared.
  • Use time slicing primarily for workloads that can tolerate performance variability and shared resources.
  • Reserve dedicated GPUs for performance-critical workloads that require guaranteed resources.
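
A simple way to observe utilization and memory pressure is to run nvidia-smi on the GPU node. This assumes shell access to the node, and the values shown are sample output only; adapt the approach to your own monitoring setup:

$ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
utilization.gpu [%], memory.used [MiB], memory.total [MiB]
87 %, 10240 MiB, 16384 MiB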

Use in convox.yml

Services can request time-sliced GPU resources using the same syntax as regular GPU resources:

services:
  inference-service:
    build: .
    command: python inference.py
    scale:
      count: 3
      gpu: 1

With time slicing enabled, multiple instances of this service can share the same physical GPU.
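
For example, with nvidia_device_time_slicing_replicas=5 on a node with one physical GPU, the three processes above consume three of the five advertised GPU slices, leaving two available for other workloads; without time slicing, the same scale would require three physical GPUs. The three processes still time-share the underlying GPU and its memory, so size the workload with that in mind.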

Version Requirements

This feature requires at least Convox rack version 3.21.4.