nvidia_device_time_slicing_replicas

Description

The nvidia_device_time_slicing_replicas parameter enables NVIDIA GPU time slicing by setting the number of virtual replicas into which each physical GPU is divided. This allows multiple workloads to share a single physical GPU by time-sharing its compute, improving utilization and reducing the cost of expensive GPU hardware.

Default Value

By default, nvidia_device_time_slicing_replicas is not set and time slicing is disabled.

Use Cases

  • Cost Optimization: Maximize GPU utilization by running multiple workloads on expensive GPU hardware.
  • Resource Efficiency: Better allocation of GPU resources for workloads that don't require full GPU capacity.
  • Improved Throughput: Support more concurrent workloads without additional hardware investment.
  • Flexible Workload Scheduling: Enable smaller workloads such as inference serving or development tasks to coexist on the same GPU.
  • Development and Testing: Allow multiple developers to share GPU resources for testing and development purposes.

Prerequisites

The NVIDIA device plugin must be enabled on your rack before using time slicing:

$ convox rack params set nvidia_device_plugin_enable=true -r rackName
Setting parameters... OK
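
Once the device plugin is enabled, you can optionally confirm that its daemonset is running on the rack's GPU nodes. The namespace and daemonset name depend on how the rack deploys the plugin, so the command below is only an illustrative check:

$ kubectl get daemonsets -A | grep -i nvidia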

Setting Parameters

To configure GPU time slicing, set the number of virtual replicas each physical GPU should be divided into:

$ convox rack params set nvidia_device_time_slicing_replicas=5 -r rackName
Setting parameters... OK
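
You can confirm the configured value by listing the rack parameters. The output below is illustrative and shows only the relevant entries; the exact formatting may vary between CLI versions:

$ convox rack params -r rackName
nvidia_device_plugin_enable          true
nvidia_device_time_slicing_replicas  5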

To disable GPU time slicing:

$ convox rack params unset nvidia_device_time_slicing_replicas -r rackName
Unsetting nvidia_device_time_slicing_replicas... OK

Understanding Replicas

The replica count determines how many virtual GPU resources each physical GPU provides; a worked example follows the list:

  • replicas=2: each virtual GPU gets ~50% of the physical GPU's capacity
  • replicas=4: each virtual GPU gets ~25% of the physical GPU's capacity
  • replicas=5: each virtual GPU gets ~20% of the physical GPU's capacity
  • replicas=8: each virtual GPU gets ~12.5% of the physical GPU's capacity
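
For example, with nvidia_device_time_slicing_replicas=4, a node with 2 physical GPUs advertises 2 × 4 = 8 nvidia.com/gpu resources to the Kubernetes scheduler; each of the 8 virtual GPUs time-shares one of the two physical devices rather than receiving dedicated hardware.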

Verification

After applying the configuration, you can verify that nodes are advertising the correct number of GPU resources:

$ kubectl describe node <gpu-node-name>

You should see the GPU capacity multiplied by your replica count (e.g., 1 physical GPU × 5 replicas = 5 advertised GPU resources).
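
For instance, on a node with a single physical GPU and nvidia_device_time_slicing_replicas=5, the output should include five GPU resources (excerpt trimmed to the relevant lines; the real output contains many more fields):

Capacity:
  nvidia.com/gpu:  5
Allocatable:
  nvidia.com/gpu:  5

You can also query the advertised count directly with a jsonpath expression:

$ kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
5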

Important Considerations

  • Shared Access: Time slicing provides shared access to GPU compute, not exclusive access to GPU resources.
  • No Memory Isolation: There is no memory isolation between workloads sharing the same GPU. All workloads share the same GPU memory space.
  • Fault Domain: All workloads sharing a GPU run in the same fault domain. If one workload crashes or behaves poorly, it may affect other workloads on the same GPU.
  • No Proportional Guarantees: Requesting multiple time-sliced GPU resources does not guarantee proportional compute power or performance.
  • Performance Variability: GPU performance may vary depending on the workload characteristics and timing of concurrent tasks.

Best Practices

  • Start with lower replica counts (2-4) and increase based on workload requirements and performance testing.
  • Monitor GPU utilization and memory usage to optimize the replica count for your specific workloads (see the monitoring example after this list).
  • Consider the memory requirements of your workloads when setting replica counts, as GPU memory is shared.
  • Use time slicing primarily for workloads that can tolerate performance variability and shared resources.
  • Reserve dedicated GPUs for performance-critical workloads that require guaranteed resources.
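
A simple way to observe utilization and memory pressure is to run nvidia-smi on the GPU node. This assumes shell access to the node, and the values shown are sample output only; adapt the approach to your own monitoring setup:

$ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
utilization.gpu [%], memory.used [MiB], memory.total [MiB]
87 %, 10240 MiB, 16384 MiB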

Use in convox.yml

Services can request time-sliced GPU resources using the same syntax as regular GPU resources:

services:
  inference-service:
    build: .
    command: python inference.py
    scale:
      count: 3
      gpu: 1

With time slicing enabled, multiple instances of this service can share the same physical GPU.
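
For example, with nvidia_device_time_slicing_replicas=5 on a node with one physical GPU, the three processes above consume three of the five advertised GPU slices, leaving two available for other workloads; without time slicing, the same scale would require three physical GPUs. The three processes still time-share the underlying GPU and its memory, so size the workload with that in mind.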

Version Requirements

This feature requires at least Convox rack version 3.21.4.