Autoscaling
Convox allows you to scale any Service on the following dimensions:
- Horizontal concurrency (number of Processes)
- CPU allocation (in CPU units where 1000 units is one full CPU)
- Memory allocation (in MB)
- GPU allocation (number of GPUs per process)
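Since CPU is expressed in abstract units rather than cores, it can help to sanity-check the arithmetic. A minimal illustrative sketch (not part of the Convox tooling):

```python
# Convox CPU units: 1000 units = 1 full CPU; memory is specified in MB.
def cpu_units_to_cores(units: int) -> float:
    """Convert Convox CPU units to fractional CPU cores."""
    return units / 1000

# A Service with cpu: 250 and count: 2 reserves half a core in total.
total_cores = cpu_units_to_cores(250) * 2
print(total_cores)  # 0.5
```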
Initial Defaults
You can specify the scale for any Service in your convox.yml:

```yaml
services:
  web:
    scale:
      count: 2
      cpu: 250
      memory: 512
```
If you specify a static `count`, it will only be used on the first deploy. Subsequent changes must be made using the `convox` CLI.
For GPU-accelerated workloads, you can specify the number of GPUs required:
```yaml
services:
  ml-worker:
    scale:
      count: 1
      cpu: 1000
      memory: 2048
      gpu: 1
```
Manual Scaling
Determine Current Scale
```shell
$ convox scale
SERVICE  DESIRED  RUNNING  CPU  MEMORY
web      2        2        250  512
```
Scaling Count Horizontally
```shell
$ convox scale web --count=3
Scaling web...
2026-01-15T14:30:00Z system/k8s/web Scaled up replica set web-65f45567d to 3
2026-01-15T14:30:00Z system/k8s/web-65f45567d Created pod: web-65f45567d-c7sdw
2026-01-15T14:30:00Z system/k8s/web-65f45567d-c7sdw Successfully assigned dev-convox/web-65f45567d-c7sdw to node
2026-01-15T14:30:00Z system/k8s/web-65f45567d-c7sdw Container image "registry.dev.convox/convox:web.BABCDEFGHI" already present on machine
2026-01-15T14:30:01Z system/k8s/web-65f45567d-c7sdw Created container main
2026-01-15T14:30:01Z system/k8s/web-65f45567d-c7sdw Started container main
OK
```
Changes to `cpu`, `memory`, or `gpu` should be made in your convox.yml, followed by deploying a new release of your app.
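For example, to raise the memory allocation you might update the scale block as follows (the values here are illustrative) and then deploy a new release:

```yaml
services:
  web:
    scale:
      count: 3
      cpu: 500
      memory: 1024
```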
Horizontal Autoscaling (HPA)
To use autoscaling you must specify a range for allowable Process count and target values for CPU and Memory utilization (in percent):
```yaml
services:
  web:
    scale:
      count: 1-10
      targets:
        cpu: 70
        memory: 90
```
The number of Processes will be continually adjusted to maintain your target metrics.
Note that the CPU and Memory targets are expressed as a percentage of each Process's resource limits. The current metric value is computed by averaging the metric across all replicas of the Service; the Service scales up only when that average exceeds the target. For example, with a CPU target of 70 and two replicas, scaling is triggered only if the sum of the replicas' utilization percentages divided by the replica count is greater than 70%. The desired replica count is calculated as:

```
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
```
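The formula can be sketched in Python as follows. This is an illustrative sketch assuming the `count: 1-10` range from the example above; the actual calculation is performed by the Kubernetes Horizontal Pod Autoscaler, not by Convox itself:

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_count: int = 1,
                     max_count: int = 10) -> int:
    """Kubernetes HPA formula, clamped to the allowed count range.

    current_metric is the utilization percentage averaged across all
    replicas of the Service; target_metric is the configured target.
    """
    raw = ceil(current_replicas * (current_metric / target_metric))
    return max(min_count, min(max_count, raw))

# Two replicas averaging 105% CPU against a 70% target scale up to 3.
print(desired_replicas(2, 105, 70))  # 3
# At or below the target, the replica count stays put.
print(desired_replicas(2, 70, 70))   # 2
```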
GPU Scaling
For workloads that require GPU acceleration, Convox supports requesting GPU resources at the service level. This is particularly useful for machine learning, video processing, and scientific computing applications.
Prerequisites for GPU Scaling
Before using GPU scaling:
- Your rack must be running on GPU-capable instances:
  - AWS: EC2 p3, p4, g4, or g5 instance families
  - Azure: NC, ND, or NV series virtual machines
- The NVIDIA device plugin must be enabled on your rack:

```shell
$ convox rack params set nvidia_device_plugin_enable=true -r rackName
```
See the NVIDIA device plugin rack parameter for your provider: AWS | Azure.
Configuring GPU Requirements
You can specify GPU requirements in the scale section of your service definition:
```yaml
services:
  ml-trainer:
    build: .
    command: python train.py
    scale:
      count: 1-3
      cpu: 1000
      memory: 4096
      gpu: 1
```
This configuration requests 1 GPU for each process of the ml-trainer service.
You can also specify the GPU vendor using the map form:
```yaml
services:
  ml-trainer:
    build: .
    command: python train.py
    scale:
      count: 1-3
      cpu: 1000
      memory: 4096
      gpu:
        count: 1
        vendor: nvidia
```
See the Service scale.gpu reference for the full GPU configuration options.
Important Notes About GPU Scaling
- GPUs are allocated as whole units (you cannot request a fraction of a GPU)
- Services requesting GPUs will only be scheduled on nodes that have available GPUs
- Each process will receive the specified number of GPUs
- If you specify a GPU count without specifying CPU or memory resources, the defaults for those resources will be removed to allow for pure GPU-based scheduling
- When using GPUs, you may need to use a base image that includes the NVIDIA CUDA toolkit
Combining GPU with Autoscaling
GPU-enabled services can be configured with autoscaling:
```yaml
services:
  ml-inference:
    build: .
    command: python serve_model.py
    scale:
      count: 1-5
      cpu: 1000
      memory: 2048
      gpu: 1
      targets:
        cpu: 80
```
The service will scale based on CPU utilization while ensuring that each process has access to a GPU.
Troubleshooting Cluster Scale-Down
If your cluster is not scaling down despite low resource usage, the Kubernetes Cluster Autoscaler may be blocked from removing nodes. Common causes:
- Restrictive PodDisruptionBudgets (PDBs): A PDB with `minAvailable: 1` on a service with one replica prevents that pod from being evicted. Adjust with the `pdb_default_min_available_percentage` rack parameter.
- System pods: Pods in the `kube-system` namespace may have rules preventing eviction.
- Pods without a controller: Pods not managed by a Deployment or ReplicaSet will not be evicted.
- Pods with local storage: Pods using `hostPath` or `emptyDir` volumes cannot be moved.
- Scheduling constraints: Node selectors or anti-affinity rules may prevent rescheduling onto other nodes.
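As an illustration of the first cause, a PDB like the following (a hypothetical hand-written example, with names chosen for this sketch) would leave zero allowed disruptions on a single-replica service, blocking node removal:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 1      # with only 1 replica, allowed disruptions is 0
  selector:
    matchLabels:
      service: web
```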
To diagnose, inspect the Cluster Autoscaler logs:
```shell
$ kubectl logs -n kube-system deployment/cluster-autoscaler
```

Look for messages like `pod <namespace>/<pod_name> is blocking scale down`. You can also check for restrictive PDBs:

```shell
$ kubectl get pdb -A
```

A PDB showing `ALLOWED DISRUPTIONS` of 0 will block eviction of the pods it covers.
See Also
- convox.yml for configuring scale defaults
- VPA for automatic resource right-sizing
- KEDA Autoscaling for event-driven autoscaling
- Datadog Metrics for Datadog-based autoscaling