## Key concepts
### Over-quotas
If a namespace subject to an `ElasticQuota` (or, equivalently, to a `CompositeElasticQuota`) is using all the resources guaranteed by the `min` field of its quota, it can still host new pods by "borrowing" quotas from other namespaces that have available resources (e.g. from namespaces subject to other quotas whose `min` resources are not being completely used).
> **Info**
>
> Pods that are scheduled "borrowing" unused quotas from other namespaces are called **over-quota pods**.
> Over-quota pods can be preempted at any time to free up resources if any of the namespaces lending the quotas claims its resources back.
You can check whether a Pod is in over-quota by looking at the value of the label `nos.nebuly.com/capacity`, which is automatically created and updated by the nos operator for every Pod created in a namespace subject to an `ElasticQuota` or to a `CompositeElasticQuota`. The two possible values for this label are `in-quota` and `over-quota`.

You can use this label to easily find out at any time which pods are over-quota and therefore at risk of preemption:
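For instance, a quick way to list them across the cluster with `kubectl`, using the label value described above, is:

```shell
# List all Pods currently labelled as over-quota, in every namespace
kubectl get pods --all-namespaces -l nos.nebuly.com/capacity=over-quota
```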
### How over-quota pods are labelled
All the pods created within a namespace subject to a quota are labelled as `in-quota` as long as the `used` resources of the quota do not exceed its `min` resources. When the `min` is exceeded and new pods are created in that namespace, they are labelled as `over-quota` once they reach the running phase.
nos re-evaluates the over-quota status of each Pod of a namespace every time a Pod in that namespace changes its phase to or from "Running". With the default configuration, nos sorts the pods by creation date and, if the creation timestamps are equal, by requested resources, placing first the pods with the oldest creation timestamp and the fewest requested resources. After the pods are sorted, nos computes the aggregated requested resources by summing the requests of each Pod, and marks as `over-quota` all the pods for which the aggregated `used` resources exceed the quota's `min`.
🚧 Soon it will be possible to customize the criteria used for sorting the pods during this process through the nos operator configuration.
### Over-quota fair sharing
To prevent a single namespace from consuming all the over-quotas available in the cluster and starving the others, nos implements a fair-sharing mechanism that guarantees each namespace subject to an `ElasticQuota` a share of the available over-quotas proportional to its `min` field.
The fair-sharing mechanism does not enforce any hard limit on the number of over-quota pods that a namespace can have; instead, it implements fair sharing through preemption. Specifically, a Pod-A subject to elastic-quota-A can preempt a Pod-B subject to elastic-quota-B if all the following conditions are met:

- Pod-B is in over-quota
- `used` of elastic-quota-A + Pod-A request <= `min` of elastic-quota-A + guaranteed over-quotas A
- used over-quotas of elastic-quota-B > guaranteed over-quotas B
Where:

- guaranteed over-quotas A = fraction of guaranteed over-quotas A * tot. available over-quotas
- fraction of guaranteed over-quotas A = min A / sum(min_i)
- tot. available over-quotas = sum( max(0, min_i - used_i) )
#### Example
Let's assume we have a K8s cluster with the following Elastic Quota resources:
| Elastic Quota | Min | Max |
|---|---|---|
| Elastic Quota A | `nos.nebuly.com/gpu-memory: 40` | None |
| Elastic Quota B | `nos.nebuly.com/gpu-memory: 10` | None |
| Elastic Quota C | `nos.nebuly.com/gpu-memory: 30` | None |
The table below shows the quotas usage of the cluster at two different times:
| Time | Elastic Quota A | Elastic Quota B | Elastic Quota C |
|---|---|---|---|
| t1 | Used: 40/40 GB | Used: 40/10 GB<br>Over-quota: 30 GB | Used: 0 GB |
| t2 | Used: 50/40 GB | Used: 30/10 GB<br>Over-quota: 20 GB | Used: 0 GB |
The cluster has a total of 30 GB of available over-quota GPU memory, which at time t1 is entirely consumed by the pods in the namespace subject to Elastic Quota B.

At time t2, a new Pod is created in the namespace subject to Elastic Quota A. Even though all the quotas of the cluster are currently in use, the fair-sharing mechanism grants Elastic Quota A a certain amount of over-quotas, and in order to grant them nos can preempt one or more over-quota pods from the namespace subject to Elastic Quota B.
Specifically, the following are the amounts of over-quotas guaranteed to each of the namespaces subject to the Elastic Quotas defined in the table above:
- guaranteed over-quota A = 40 / (40 + 10 + 30) * (0 + 0 + (30 - 0)) = 15
- guaranteed over-quota B = 10 / (40 + 10 + 30) * (0 + 0 + (30 - 0)) = 3.75
Assuming that each Pod in the cluster requests 10 GB of GPU memory, an over-quota Pod from Elastic Quota B is preempted because the following conditions are both true:

- ✅ used quotas A + new Pod A request <= min quota A + guaranteed over-quota A
    - 40 + 10 <= 40 + 15
- ✅ used over-quotas B > guaranteed over-quota B
    - 30 > 3.75
### GPU memory limits
Both `ElasticQuota` and `CompositeElasticQuota` resources support the custom resource `nos.nebuly.com/gpu-memory`. You can use this resource in the `min` and `max` fields of the elastic quota specification to define, respectively, the minimum amount of GPU memory (expressed in GB) guaranteed to a certain namespace and its maximum limit.
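For illustration, an `ElasticQuota` using this resource might look like the sketch below. The API version, names and quantities here are assumptions made for the sake of the example, not values taken from this page:

```yaml
# Hypothetical sketch of an ElasticQuota guaranteeing 40 GB of GPU memory
# to the pods of its namespace, with an 80 GB upper limit.
# apiVersion, names and quantities are illustrative assumptions.
apiVersion: nos.nebuly.com/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-team-a
  namespace: team-a
spec:
  min:
    nos.nebuly.com/gpu-memory: 40
  max:
    nos.nebuly.com/gpu-memory: 80
```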
This resource is particularly useful if you use Elastic Quotas together with automatic GPU partitioning, since it allows you to assign resources to different teams (e.g. namespaces) in terms of GPU memory instead of number of GPUs; users can then request resources in the same terms by claiming GPU slices with a specific amount of memory, enabling fine-grained control over the GPUs of the cluster.
nos automatically computes the GPU memory requested by each Pod from the GPU resources requested by its containers and enforces the limits accordingly. The amount of memory (in GB) corresponding to the generic resource `nvidia.com/gpu` is defined by the field `global.nvidiaGpuResourceMemoryGB` of the installation chart, which is `32` by default.
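If your GPUs have a different memory size, you can override this value when installing the chart. A minimal values sketch, showing only the field mentioned above (the rest of the values file is omitted):

```yaml
# Count each generic nvidia.com/gpu as 80 GB of GPU memory
# (e.g. for nodes with 80 GB GPUs). Only this field is taken from this page.
global:
  nvidiaGpuResourceMemoryGB: 80
```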
For instance, using the default configuration, the value of the resource `nos.nebuly.com/gpu-memory` computed from the Pod specification below is 10 + 32 = 42.
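The original Pod specification is not reproduced here; the following is a sketch of a Pod that would yield that value, assuming one container claiming a 10 GB GPU slice (written here as the MIG-style resource `nvidia.com/mig-1g.10gb`, an illustrative assumption) and one container claiming a generic `nvidia.com/gpu`, counted as 32 GB by default:

```yaml
# Hypothetical Pod: nos would compute nos.nebuly.com/gpu-memory = 10 + 32 = 42
# with the default configuration. The slice resource name and images are
# illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-example
spec:
  containers:
    - name: slice-container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # 10 GB GPU slice
    - name: full-gpu-container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1           # counted as 32 GB by default
```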