Originally written: 20161111
This document explains the new domain tag support introduced
in the virt plugin, and provides one important use case for this feature.
In the remainder of this document, we consider
Domain tags and domain partitioning across virt reader instances
-----------------------------------------------------------------
The virt plugin gained the `Instances` option. It allows starting
more than one reader instance, so the libvirt domains can be queried
by more than one reader thread.
The default value for `Instances` is `1`.
With the default setting, the plugin behaves in a fully transparent,
backward-compatible way.
It is recommended to set this value to a multiple of the
daemon's `ReadThreads` value.
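For example, a minimal configuration sketch along these lines (the connection URI is only an illustration, and
all other virt plugin options keep their defaults) starts three reader instances on a daemon running three read threads:

```
ReadThreads 3

LoadPlugin virt
<Plugin virt>
  Connection "qemu:///system"
  Instances 3
</Plugin>
```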
Each reader instance will query only a subset of the libvirt domains.
The subset is identified as follows:
1. Each virt reader instance is named `virt-$NUM`, where `NUM` is
   the progressive index of the instance. If you configure `Instances 3`
   you will have `virt-0`, `virt-1`, `virt-2`. Please note: the `virt-0`
   instance is special, and will always be available.
2. Each virt reader instance will iterate over all the active libvirt domains,
   and will look for a `tag` attribute (see below) in the domain metadata section.
3. Each virt reader instance will take care *only* of the libvirt domains whose
   tag matches its own name.
4. The special `virt-0` instance will take care of all the libvirt domains with
   no tag, or with a tag which is not in the set \[virt-0 ... virt-$NUM\]
   (a minimal sketch of this matching rule follows this list).
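To make the matching rule above concrete, here is a minimal, hypothetical C sketch of the decision an instance makes
for each domain. The function and all of its names are illustrative only; this is not the plugin's actual code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: should the reader instance "virt-<my_num>" handle a
 * domain carrying the given tag?  Untagged domains and domains whose tag is
 * outside the set virt-0 ... virt-(num_instances-1) fall back to virt-0. */
static bool instance_handles_domain(int my_num, int num_instances,
                                    const char *domain_tag) {
    char expected[32];
    snprintf(expected, sizeof(expected), "virt-%d", my_num);

    /* Exact match: this instance owns the domain. */
    if (domain_tag != NULL && strcmp(domain_tag, expected) == 0)
        return true;

    /* Only virt-0 adopts leftovers. */
    if (my_num != 0)
        return false;

    /* No tag at all: adopted by virt-0. */
    if (domain_tag == NULL || domain_tag[0] == '\0')
        return true;

    /* Tag present but not matching any configured instance: also virt-0. */
    for (int i = 0; i < num_instances; i++) {
        char known[32];
        snprintf(known, sizeof(known), "virt-%d", i);
        if (strcmp(domain_tag, known) == 0)
            return false; /* some other instance owns it */
    }
    return true;
}

int main(void) {
    /* With Instances 3, check which sample tags end up on virt-0. */
    const char *samples[] = { "virt-0", "virt-2", "virt-7", "" };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("tag '%s' handled by virt-0: %s\n", samples[i],
               instance_handles_domain(0, 3, samples[i]) ? "yes" : "no");
    return 0;
}
```

Running this with three instances in mind shows, for example, that both `virt-7` (unrecognized) and the empty tag
end up on `virt-0`, while `virt-2` belongs to another instance.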
Collectd will just use the domain tags; it never enforces or requires them.
It is up to an external entity, like a software management system,
to attach the tags to the domains and manage them.
Please note that unless you have such tag-aware management software,
it most likely makes no sense to enable more than one reader instance on your setup.
Libvirt tag metadata format
---------------------------
This is the snippet to be added to libvirt domains (here `$TAG` stands for the actual tag value, e.g. `virt-0`):

    <ovirtmap:tag xmlns:ovirtmap="http://ovirt.org/ovirtmap/tag/1.0">
      $TAG
    </ovirtmap:tag>

It must be included in the `<metadata>` section of the domain XML.
Check the `src/virt_test.c` file for a really minimal example of libvirt domains.
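As an illustration of how an external management entity could attach such a tag, here is a hedged sketch using the
libvirt C API (`virDomainSetMetadata`). The connection URI, domain name and tag value are placeholders, and the exact
way libvirt applies the namespace to the stored element is worth double-checking against the libvirt documentation:

```c
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void) {
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "domain-A");
    if (dom == NULL) {
        virConnectClose(conn);
        return 1;
    }

    /* Store a <tag> element under the domain's <metadata> section, in the
     * namespace the virt plugin looks for. */
    int rc = virDomainSetMetadata(dom, VIR_DOMAIN_METADATA_ELEMENT,
                                  "<tag>virt-0</tag>",
                                  "ovirtmap",
                                  "http://ovirt.org/ovirtmap/tag/1.0",
                                  VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
    printf("virDomainSetMetadata: %s\n", rc == 0 ? "ok" : "failed");

    virDomainFree(dom);
    virConnectClose(conn);
    return rc == 0 ? 0 : 1;
}
```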
### Example one: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, using 5 different tags
| libvirt domain name | tag    | read instance | reason    |
|---------------------|--------|---------------|-----------|
| domain-A            | virt-0 | 0             | tag match |
| domain-B            | virt-1 | 1             | tag match |
| domain-C            | virt-2 | 2             | tag match |
| domain-D            | virt-3 | 3             | tag match |
| domain-E            | virt-4 | 4             | tag match |
| domain-F            | virt-0 | 0             | tag match |
| domain-G            | virt-1 | 1             | tag match |
| domain-H            | virt-2 | 2             | tag match |
| domain-I            | virt-3 | 3             | tag match |
| domain-J            | virt-4 | 4             | tag match |
Because the domains were properly tagged, all the read instances have an even load. Please note that the virt plugin
knows nothing, and should know nothing, about *how* the libvirt domains are tagged. This is entirely up to the
management system.
### Example two: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=3, using 5 different tags
| libvirt domain name | tag    | read instance | reason                 |
|---------------------|--------|---------------|------------------------|
| domain-A            | virt-0 | 0             | tag match              |
| domain-B            | virt-1 | 1             | tag match              |
| domain-C            | virt-2 | 2             | tag match              |
| domain-D            | virt-3 | 0             | adopted by instance #0 |
| domain-E            | virt-4 | 0             | adopted by instance #0 |
| domain-F            | virt-0 | 0             | tag match              |
| domain-G            | virt-1 | 1             | tag match              |
| domain-H            | virt-2 | 2             | tag match              |
| domain-I            | virt-3 | 0             | adopted by instance #0 |
| domain-J            | virt-4 | 0             | adopted by instance #0 |
In this case we have uneven load, but no domain is ignored.
### Example three: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, using 3 different tags
| libvirt domain name | tag    | read instance | reason    |
|---------------------|--------|---------------|-----------|
| domain-A            | virt-0 | 0             | tag match |
| domain-B            | virt-1 | 1             | tag match |
| domain-C            | virt-2 | 2             | tag match |
| domain-D            | virt-0 | 0             | tag match |
| domain-E            | virt-1 | 1             | tag match |
| domain-F            | virt-2 | 2             | tag match |
| domain-G            | virt-0 | 0             | tag match |
| domain-H            | virt-1 | 1             | tag match |
| domain-I            | virt-2 | 2             | tag match |
| domain-J            | virt-0 | 0             | tag match |
Once again we have uneven load and two idle read instances, but besides that, no domain is left unmonitored.
### Example four: 10 libvirt domains named "domain-A" ... "domain-J", virt plugin with instances=5, partial tagging
| libvirt domain name | tag    | read instance | reason                 |
|---------------------|--------|---------------|------------------------|
| domain-A            | virt-0 | 0             | tag match              |
| domain-B            | virt-1 | 1             | tag match              |
| domain-C            | virt-2 | 2             | tag match              |
| domain-D            | virt-0 | 0             | tag match              |
| domain-E            | (none) | 0             | adopted by instance #0 |
| domain-F            | (none) | 0             | adopted by instance #0 |
| domain-G            | (none) | 0             | adopted by instance #0 |
| domain-H            | (none) | 0             | adopted by instance #0 |
| domain-I            | (none) | 0             | adopted by instance #0 |
| domain-J            | (none) | 0             | adopted by instance #0 |
The lack of tags causes uneven load, but no domain is left unmonitored.
Possible extensions - custom tag format
---------------------------------------
The aforementioned approach relies on a fixed tag format, `virt-$N`. The algorithm works fine with any tag, since a tag
is just a string compared for equality. However, using custom strings for tags creates the need for a mapping
between tags and the read instances.
This mapping would need to be updated as domains are created or destroyed, and the virt plugin would need to be
notified of the changes.
This adds a significant amount of complexity, with little gain with respect to the fixed scheme adopted initially.
For this reason, dynamic, custom mappings were not implemented.
Dealing with datacenters: libvirt, qemu, shared storage
--------------------------------------------------------
When used in a datacenter, QEMU is most often configured to use shared storage. This is
the default configuration of datacenter management solutions like [oVirt](http://www.ovirt.org).
The actual shared storage could be implemented on top of NFS for small installations, or, more likely,
iSCSI or Fibre Channel. The key takeaway is that the storage is accessed over the network,
not using e.g. the SATA or PCI bus of any given host, so any network issue could cause
one or more storage operations to be delayed, or lost entirely.
In that case, the userspace process that requested the operation can end up in the D state,
becoming unresponsive and unkillable.
Dealing with unresponsive domains
---------------------------------
All the above considered, a robust management or monitoring application must deal with the fact that
the libvirt API can block for a long time, or forever. This is not an issue or a bug of one specific
API; it is rather a byproduct of how libvirt and QEMU interact.
Whenever we query more than one VM, we must take care that one blocked VM does not prevent other,
well-behaving VMs from being queried. We don't want one rogue VM to disrupt well-behaving VMs.
Unfortunately, no matter how we enumerate the VMs, either implicitly, using the libvirt bulk stats API,
or explicitly, listing all libvirt domains and querying each one in turn, we may unpredictably encounter
unresponsive domains.
There are many possible approaches to deal with this issue. The virt plugin supports
a simple but effective approach: partitioning the domains, as follows.
1. The virt plugin always registers one or more `read` callbacks. The `zero` read callback is guaranteed to
   always be present, so it performs special duties (more details later).
   Each callback is named `virt-$N`, where `N` ranges from 0 (zero) to M-1, and M is the number of configured instances.
   `M` equals `5` by default, because this is the same default number of threads in the libvirt worker pool.
2. Each of the read callbacks queries libvirt for the list of all the active domains, and retrieves the libvirt domain
   metadata (a sketch of these two steps follows this list).
   Both of those operations are safe with respect to domains blocked in I/O (they only involve the libvirtd daemon).
3. Each of the read callbacks extracts the `tag` from the domain metadata, using the well-known format described above.
   Each of the read callbacks discards any domain which has no tag, or whose tag doesn't match its own.
   3.a. The read callback tag equals the read callback name, thus `virt-$N`. Remember that `virt-0` is guaranteed to
   always be present.
   3.b. Since the `virt-0` reader is always present, it will take care of domains with no tag, or with an unrecognized tag.
   An unrecognized tag is any tag which does not follow the scheme `virt-$N`.
4. Each read callback samples only the subset of domains with a matching tag. The `virt-0` reader will possibly do more,
   but in the worst case the load will just be unbalanced; no domain will be left unsampled.
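As a rough illustration of steps 2 and 3, the following hypothetical sketch lists the active domains and fetches the
tag metadata element with `virConnectListAllDomains` and `virDomainGetMetadata`. Extracting the actual tag string from
the returned XML (which would require an XML parser) is deliberately omitted; the connection URI is a placeholder:

```c
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void) {
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL)
        return 1;

    virDomainPtr *domains = NULL;
    int n = virConnectListAllDomains(conn, &domains,
                                     VIR_CONNECT_LIST_DOMAINS_ACTIVE);

    for (int i = 0; i < n; i++) {
        /* Returns the XML of the tagged metadata element, or NULL if the
         * domain carries no such element. */
        char *meta = virDomainGetMetadata(domains[i],
                                          VIR_DOMAIN_METADATA_ELEMENT,
                                          "http://ovirt.org/ovirtmap/tag/1.0",
                                          0);
        printf("%s: %s\n", virDomainGetName(domains[i]),
               meta ? meta : "(no tag, adopted by virt-0)");
        free(meta);
        virDomainFree(domains[i]);
    }
    free(domains);
    virConnectClose(conn);
    return 0;
}
```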
To make this approach work, some entity must attach the tags to the libvirt domains, in such a way that all
the domains which run on a given host and rely on the same network-based storage share the same tag.
This minimizes the disruption, because with shared storage, if one domain becomes unresponsive because
of unavailable storage, the most likely outcome is that other domains using the same storage will soon become
unavailable as well; should the box run other libvirt domains using other network-based storage, they could still be monitored without issue.
In the case of [oVirt](http://www.ovirt.org), the aforementioned tagging is performed by the host agent.
Please note that this approach is ineffective if the host completely loses network access to the storage network.
In that case, however, no recovery or damage limitation is possible.
Lastly, please note that if the virt plugin is configured with `Instances 1` (the default), it behaves exactly as before.
Addendum: high-level overview: libvirt client, libvirt daemon, QEMU
--------------------------------------------------------------------
Let's review how the client application (collectd + virt plugin), the libvirtd daemon and the
QEMU processes interact with each other.
The libvirt daemon talks to QEMU using the JSON QMP protocol over a UNIX domain socket.
The details of the protocol are not important now, but the key point is that the protocol
is a simple request/response exchange, meaning that libvirtd must serialize all the interactions
with the QEMU monitor, and must protect its endpoint with a lock.
No out-of-order requests/responses are possible (e.g. no pipelining or async replies).
This means that if, for any reason, one QMP request cannot be completed, any other caller
trying to access the QEMU monitor will block until the blocked caller returns.
To retrieve some key information, most notably about the block device state or the balloon
device state, the libvirtd daemon *must* use the QMP protocol.
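For illustration only, a QMP exchange querying the balloon device might look roughly like the following; the exact
command names and fields depend on the QEMU version, so treat this as a sketch rather than a verbatim capture:

```
-> { "execute": "query-balloon" }
<- { "return": { "actual": 1073741824 } }
```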
The QEMU core, including the handling of the QMP protocol, is single-threaded.
All of the above combined makes it possible for a client to block forever waiting for one QMP
request, if QEMU itself is blocked. The most likely cause of such a block is I/O, and this is especially
true considering how QEMU is used in a datacenter.