I have an OpenStack deployment with one controller node and two compute nodes. Production worked fine until three days ago, but now creating an instance with the flavor RTX-3090-flavor fails, and nova-scheduler and nova-conductor log the following errors:

==> /var/log/nova/nova-scheduler.log <==

2023-12-06 15:55:13.473 4056084 WARNING nova.scheduler.host_manager [req-806e9ab9-1440-48a0-8afa-b262e3b2d5fa 9ea1804cf324402e835199247f9dcd5e 6b51e04295ca42b9a5e3b5cb2d7afef7 - default default] Selected host: compute-23 failed to consume from instance. Error: PCI device request [InstancePCIRequest(alias_name='x1',count=1,is_new=,numa_policy='legacy',request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='2204',vendor_id='10de'}]), InstancePCIRequest(alias_name='x2',count=1,is_new=,numa_policy='legacy',request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='1aef',vendor_id='10de'}])] failed: nova.exception.PciDeviceRequestFailed: PCI device request [InstancePCIRequest(alias_name='x1',count=1,is_new=,numa_policy='legacy',request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='2204',vendor_id='10de'}]), InstancePCIRequest(alias_name='x2',count=1,is_new=,numa_policy='legacy',request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='1aef',vendor_id='10de'}])] failed

==> /var/log/nova/nova-conductor.log <==

2023-12-06 15:55:14.828 4056074 ERROR nova.scheduler.utils [req-806e9ab9-1440-48a0-8afa-b262e3b2d5fa 9ea1804cf324402e835199247f9dcd5e 6b51e04295ca42b9a5e3b5cb2d7afef7 - default default] [instance: 9a1a1875-789b-4174-8d33-4b59b79d14c7] Error from last host: compute-23 (node compute-23): ['Traceback (most recent call last):\n', ' File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2416, in _build_and_run_instance\n with self.rt.instance_claim(context, instance, node, allocs,\n', ' File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 360, in inner\n return f(*args, **kwargs)\n', ' File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 171, in instance_claim\n claim = claims.Claim(context, instance, nodename, self, cn,\n', ' File "/usr/lib/python3/dist-packages/nova/compute/claims.py", line 72, in __init__\n self._claim_test(compute_node, limits)\n', ' File "/usr/lib/python3/dist-packages/nova/compute/claims.py", line 113, in _claim_test\n raise exception.ComputeResourcesUnavailable(reason=\n', 'nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed.\n', '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', ' File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2268, in _do_build_and_run_instance\n self._build_and_run_instance(context, instance, image,\n', ' File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2467, in _build_and_run_instance\n raise exception.RescheduledException(\n', 'nova.exception.RescheduledException: Build of instance 9a1a1875-789b-4174-8d33-4b59b79d14c7 was re-scheduled: Insufficient compute resources: Claim pci failed.\n']

2023-12-06 15:55:14.832 4056074 WARNING nova.scheduler.utils [req-806e9ab9-1440-48a0-8afa-b262e3b2d5fa 9ea1804cf324402e835199247f9dcd5e 6b51e04295ca42b9a5e3b5cb2d7afef7 - default default] Failed to compute_task_build_instances: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 9a1a1875-789b-4174-8d33-4b59b79d14c7.: nova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 9a1a1875-789b-4174-8d33-4b59b79d14c7.

2023-12-06 15:55:14.833 4056074 WARNING nova.scheduler.utils [req-806e9ab9-1440-48a0-8afa-b262e3b2d5fa 9ea1804cf324402e835199247f9dcd5e 6b51e04295ca42b9a5e3b5cb2d7afef7 - default default] [instance: 9a1a1875-789b-4174-8d33-4b59b79d14c7] Setting instance to ERROR state.: nova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 9a1a1875-789b-4174-8d33-4b59b79d14c7.

We have four RTX 3090 GPUs in each compute node. Creating three GPU instances works, but the fourth returns the error above.

I can create a virtual machine directly on the hypervisor (KVM), and GPU passthrough works for the last PCI device. From OpenStack, however, nova-compute returns "Claim pci failed": it thinks all four GPUs are in use, when only three of them are.
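For reference, the scheduler log above shows that each build asks for two PCI devices, not one: alias x1 (vendor 10de, product 2204) and alias x2 (vendor 10de, product 1aef). The claim fails if either of the two cannot be satisfied. Below is a minimal Python sketch for pulling these requests out of such a log line. The function name `parse_pci_requests`, the regular expression, and the shortened `SAMPLE` string are my own illustration and not part of Nova.

```python
import re

# Matches one InstancePCIRequest entry in the form printed in the scheduler log.
# This pattern is an assumption based on the log line in this post only.
REQUEST_RE = re.compile(
    r"InstancePCIRequest\(alias_name='(?P<alias>[^']+)',count=(?P<count>\d+),"
    r".*?product_id='(?P<product>[0-9a-fA-F]{4})',"
    r"vendor_id='(?P<vendor>[0-9a-fA-F]{4})'"
)


def parse_pci_requests(log_line):
    """Return the distinct PCI requests found in a log line, in order of appearance."""
    seen = set()
    requests = []
    for match in REQUEST_RE.finditer(log_line):
        key = (match["alias"], match["vendor"], match["product"])
        if key in seen:
            # The same request list is printed twice in the error message.
            continue
        seen.add(key)
        requests.append(
            {
                "alias": match["alias"],
                "count": int(match["count"]),
                "vendor_id": match["vendor"],
                "product_id": match["product"],
            }
        )
    return requests


# Shortened copy of the request list from the log, repeated as in the original message.
_REQUESTS = (
    "[InstancePCIRequest(alias_name='x1',count=1,is_new=,numa_policy='legacy',"
    "request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='2204',"
    "vendor_id='10de'}]), "
    "InstancePCIRequest(alias_name='x2',count=1,is_new=,numa_policy='legacy',"
    "request_id=,requester_id=,spec=[{dev_type='type-PCI',product_id='1aef',"
    "vendor_id='10de'}])]"
)
SAMPLE = "PCI device request " + _REQUESTS + " failed: PCI device request " + _REQUESTS + " failed"

if __name__ == "__main__":
    for request in parse_pci_requests(SAMPLE):
        print(request)
```

So when comparing what Nova believes with what the hypervisor shows, both device types (2204 and 1aef) need to be checked on compute-23, not only the GPU function itself.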

Does OpenStack cache this kind of resource availability? It worked fine three days ago.

I double-checked the nova-compute and nova-api configuration files, and all the PCI passthrough settings are fine.

Thanks in advance.