Inside the virt plugin
======================
Originally written: 20161111
Last updated: 20161124
This document will explain the new domain tag support introduced
in the virt plugin, and will provide one important use case for this feature.
In the reminder of this document, we consider
* libvirt <= 2.0.0
* QEMU <= 2.6.0
Domain tags and domains partitioning across virt reader instances
-----------------------------------------------------------------
The virt plugin gained the `Instances` option. It allows to start
more than one reader instance, so the the libvirt domains could be queried
by more than one reader thread.
The default value for `Instances` is `1`.
With default settings, the plugin will behave in a fully transparent,
backward compatible way.
It is recommended to set this value to one multiple of the
daemon `ReadThreads` value.
Each reader instance will query only a subset of the libvirt domain.
The subset is identified as follows:
1. Each virt reader instance is named `virt-$NUM`, where `NUM` is
the progressive order of instances. If you configure `Instances 3`
you will have `virt-0`, `virt-1`, `virt-2`. Please note: the `virt-0`
instance is special, and will always be available.
2. Each virt reader instance will iterate over all the libvirt active domains,
and will look for one `tag` attribute (see below) in the domain metadata section.
3. Each virt reader instance will take care *only* of the libvirt domains whose
tag matches with its own
4. The special `virt-0` instance will take care of all the libvirt domains with
no tags, or with tags which are not in the set \[virt-0 ... virt-$NUM\]
Collectd will just use the domain tags, but never enforces or requires them.
It is up to an external entity, like a software management system,
to attach and manage the tags to the domain.
Please note that unless you have such tag-aware management software,
it most likely make no sense to enable more than one reader instance on your
setup.
Libvirt tag metadata format
----------------------------
This is the snipped to be added to libvirt domains:
$TAG
it must be included in the section.
Check the `src/virt_test.c` file for really minimal example of libvirt domains.
Examples
--------
### Example one: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, using 5 different tags
libvirt domain name - tag - read instance - reason
domain-A virt-0 0 tag match
domain-B virt-1 1 tag match
domain-C virt-2 2 tag match
domain-D virt-3 3 tag match
domain-E virt-4 4 tag match
domain-F virt-0 0 tag match
domain-G virt-1 1 tag match
domain-H virt-2 2 tag match
domain-I virt-3 3 tag match
domain-J virt-4 4 tag match
Because the domain where properly tagged, all the read instances have even load. Please note that the the virt plugin
knows nothing, and should know nothing, about *how* the libvirt domain are tagged. This is entirely up to the
management system.
Example two: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=3, using 5 different tags
libvirt domain name - tag - read instance - reason
domain-A virt-0 0 tag match
domain-B virt-1 1 tag match
domain-C virt-2 2 tag match
domain-D virt-3 0 adopted by instance #0
domain-E virt-4 0 adopted by instance #0
domain-F virt-0 0 rag match
domain-G virt-1 1 tag match
domain-H virt-2 2 tag match
domain-I virt-3 0 adopted by instance #0
domain-J virt-4 0 adopted by instance #0
In this case we have uneven load, but no domain is ignored.
### Example three: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, using 3 different tags
libvirt domain name - tag - read instance - reason
domain-A virt-0 0 tag match
domain-B virt-1 1 tag match
domain-C virt-2 2 tag match
domain-D virt-0 0 tag match
domain-E virt-1 1 tag match
domain-F virt-2 2 tag match
domain-G virt-0 0 tag match
domain-H virt-1 1 tag match
domain-I virt-2 2 tag match
domain-J virt-0 0 tag match
Once again we have uneven load and two idle read instances, but besides that no domain is left unmonitored
### Example four: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, partial tagging
libvirt domain name - tag - read instance - reason
domain-A virt-0 0 tag match
domain-B virt-1 1 tag match
domain-C virt-2 2 tag match
domain-D virt-0 0 tag match
domain-E 0 adopted by instance #0
domain-F 0 adopted by instance #0
domain-G 0 adopted by instance #0
domain-H 0 adopted by instance #0
domain-I 0 adopted by instance #0
domain-J 0 adopted by instance #0
The lack of tags causes uneven load, but no domain are unmonitored.
Possible extensions - custom tag format
---------------------------------------
The aformentioned approach relies on fixed tag format, `virt-$N`. The algorithm works fine with any tag, which
is just one string, compared for equality. However, using custom strings for tags creates the need for a mapping
between tags and the read instances.
This mapping needs to be updated as long as domain are created or destroyed, and the virt plugin needs to be
notified of the changes.
This adds a significant amount of complexity, with little gain with respect to the fixed schema adopted initially.
For this reason, the introdution of dynamic, custom mapping was not implemented.
Dealing with datacenters: libvirt, qemu, shared storage
-------------------------------------------------------
When used in a datacenter, QEMU is most often configured to use shared storage. This is
the default configuration of datacenter management solutions like [oVirt](http://www.ovirt.org).
The actual shared storage could be implemented on top of NFS for small installations, or most likely
ISCSI or Fiber Channel. The key takeaway is that the storage is accessed over the network,
not using e.g. the SATA or PCI bus of any given host, so any network issue could cause
one or more storage operations to delay, or to be lost entirely.
In that case, the userspace process that requested the operation can end up in the D state,
and become unresponsive, and unkillable.
Dealing with unresponsive domains
---------------------------------
All the above considered, one robust management or monitoring application must deal with the fact that
the libvirt API can block for a long time, or forever. This is not an issue or a bug of one specific
API, but it is rather a byproduct of how libvirt and QEMU interact.
Whenever we query more than one VM, we should take care to avoid that one blocked VM prevent other,
well behaving VMs to be queried. We don't want one rogue VM to disrupt well-behaving VMs.
Unfortunately, any way we enumerate VMs, either implicitly, using the libvirt bulk stats API,
or explicitly, listing all libvirt domains and query each one in turn, we may unpredictably encounter
one unresponsive VM.
There are many possible approaches to deal with this issue. The virt plugin supports
a simple but effective approach partitioning the domains, as follows.
1. The virt plugin always register one or more `read` callbacks. The `zero` read callback is guaranteed to
be always present, so it performs special duties (more details later)
Each callback will be named 'virt-$N', where `N` ranges from 0 (zero) to M-1, where M is the number of instances configured.
`M` equals to `5` by default, because this is the same default number of threads in the libvirt worker pool.
2. Each of the read callbacks queries libvirt for the list of all the active domains, and retrieves the libvirt domain metadata.
Both of those operations are safe wrt domain blocked in I/O (they affect only the libvirtd daemon).
3. Each of the read callbacks extracts the `tag` from the domain metadata using a well-known format (see below).
Each of the read callbacks discards any domain which has no tag, or whose tag doesn't match with the read callback tag.
3.a. The read callback tag equals to the read callback name, thus `virt-$N`. Remember that `virt-0` is guaranteed to be
always present.
3.b. Since the `virt-0` reader is always present, it will take care of domains with no tag, or with unrecognized tag.
One unrecognized tag is any tag which has not the scheme `virt-$N`.
4. Each read callback only samples the subset of domains with matching tag. The `virt-0` reader will possibly do more,
but worst case the load will be unbalanced, no domain will be left unsampled.
To make this approach work, some entity must attach the tags to the libvirt domains, in such a way that all
the domains which run on a given host and insist on the same network-based storage share the same tag.
This minimizes the disruption, because when using the shared storage, if one domain becomes unresponsive because
of unavailable storage, the most likely thing to happen is that others domain using the same storage will soon become
unavailable; should the box run other libvirt domains using other network-based storage, they could be monitored
safely.
In case of [oVirt](http://www.ovirt.org), the aforementioned tagging is performed by the host agent.
Please note that this approach is ineffective if the host completely lose network access to the storage network.
In this case, however, no recovery is possible and no damage limitation is possible.
Lastly, please note that if the virt plugin is configured with instances=1, it behaves exactly like before.
Addendum: high level overview: libvirt client, libvirt daemon, qemu
--------------------------------------------------------------------
Let's review how the client application (collectd + virt plugin), the libvirtd daemon and the
QEMU processes interact with each other.
The libvirt daemon talks to QEMU using the JSON QMP protcol over one unix domain socket.
The details of the protocol are not important now, but the key part is that the protocol
is a simple request/response, meaning that libvirtd must serialize all the interactions
with the QEMU monitor, and must protects its endpoint with a lock.
No out of order request/responses are possible (e.g. no pipelining or async replies).
This means that if for any reason one QMP request could not be completed, any other caller
trying to access the QEMU monitor will block until the blocked caller returns.
To retrieve some key informations, most notably about the block device state or the balloon
device state, the libvirtd daemon *must* use the QMP protocol.
The QEMU core, including the handling of the QMP protocol, is single-threaded.
All the above combined make it possible for a client to block forever waiting for one QMP
request, if QEMU itself is blocked. The most likely cause of block is I/O, and this is especially
true considering how QEMU is used in a datacenter.