<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[<!ENTITY % poky SYSTEM "../poky.ent"> %poky; ] >
4
5<chapter id='profile-manual-usage'>
6
7<title>Basic Usage (with examples) for each of the Yocto Tracing Tools</title>
8
9<para>
10 This chapter presents basic usage examples for each of the tracing
11 tools.
12</para>
13
14<section id='profile-manual-perf'>
15 <title>perf</title>
16
17 <para>
18 The 'perf' tool is the profiling and tracing tool that comes
19 bundled with the Linux kernel.
20 </para>
21
 <para>
 Don't let the fact that it's part of the kernel fool you into thinking
 that it's only for tracing and profiling the kernel. You can indeed
 use it to trace and profile just the kernel, but you can also use it
 to profile specific applications separately (with or without kernel
 context), or to trace and profile the kernel and all applications on
 the system simultaneously to gain a system-wide view of what's going on.
 </para>
31
32 <para>
33 In many ways, perf aims to be a superset of all the tracing and profiling
34 tools available in Linux today, including all the other tools covered
35 in this HOWTO. The past couple of years have seen perf subsume a lot
36 of the functionality of those other tools and, at the same time, those
37 other tools have removed large portions of their previous functionality
38 and replaced it with calls to the equivalent functionality now
39 implemented by the perf subsystem. Extrapolation suggests that at
40 some point those other tools will simply become completely redundant
41 and go away; until then, we'll cover those other tools in these pages
42 and in many cases show how the same things can be accomplished in
43 perf and the other tools when it seems useful to do so.
44 </para>
45
46 <para>
47 The coverage below details some of the most common ways you'll likely
48 want to apply the tool; full documentation can be found either within
49 the tool itself or in the man pages at
50 <ulink url='http://linux.die.net/man/1/perf'>perf(1)</ulink>.
51 </para>
52
53 <section id='perf-setup'>
54 <title>Setup</title>
55
56 <para>
57 For this section, we'll assume you've already performed the basic
58 setup outlined in the General Setup section.
59 </para>
60
 <para>
 In particular, you'll get the most mileage out of perf if you
 profile an image built with the following in your
 <filename>local.conf</filename> file:
 <literallayout class='monospaced'>
 <ulink url='&YOCTO_DOCS_REF_URL;#var-INHIBIT_PACKAGE_STRIP'>INHIBIT_PACKAGE_STRIP</ulink> = "1"
 </literallayout>
 </para>
69
 <para>
 perf runs on the target system for the most part. You can archive
 profile data and copy it to the host for analysis, but for the
 rest of this document we assume you've ssh'ed to the target and
 will be running the perf commands there.
 </para>
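 <para>
 As a rough sketch of the archive-and-copy alternative mentioned
 above (it is not the workflow used in the rest of this chapter),
 the commands might look like the following. The target address
 192.168.1.31 is borrowed from a later example in this chapter.
 The host prompt, the use of scp, and the presence of the
 'perf archive' helper on your image are assumptions, and the name
 of the generated archive can differ between perf versions:
 <literallayout class='monospaced'>
 root@crownbay:~# perf archive

 [host]$ scp root@192.168.1.31:perf.data root@192.168.1.31:perf.data.tar.bz2 .
 [host]$ mkdir -p ~/.debug
 [host]$ tar xvf perf.data.tar.bz2 -C ~/.debug
 [host]$ perf report -i perf.data
 </literallayout>
 Unpacking the archive into ~/.debug populates the build-id cache
 that 'perf report' consults when resolving symbols, so the host
 does not need the target's binaries installed.
 </para>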
76 </section>
77
78 <section id='perf-basic-usage'>
79 <title>Basic Usage</title>
80
81 <para>
82 The perf tool is pretty much self-documenting. To remind yourself
83 of the available commands, simply type 'perf', which will show you
84 basic usage along with the available perf subcommands:
85 <literallayout class='monospaced'>
86 root@crownbay:~# perf
87
88 usage: perf [--version] [--help] COMMAND [ARGS]
89
90 The most commonly used perf commands are:
91 annotate Read perf.data (created by perf record) and display annotated code
92 archive Create archive with object files with build-ids found in perf.data file
93 bench General framework for benchmark suites
94 buildid-cache Manage build-id cache.
95 buildid-list List the buildids in a perf.data file
96 diff Read two perf.data files and display the differential profile
97 evlist List the event names in a perf.data file
98 inject Filter to augment the events stream with additional information
99 kmem Tool to trace/measure kernel memory(slab) properties
100 kvm Tool to trace/measure kvm guest os
101 list List all symbolic event types
102 lock Analyze lock events
103 probe Define new dynamic tracepoints
104 record Run a command and record its profile into perf.data
105 report Read perf.data (created by perf record) and display the profile
106 sched Tool to trace/measure scheduler properties (latencies)
107 script Read perf.data (created by perf record) and display trace output
108 stat Run a command and gather performance counter statistics
109 test Runs sanity tests.
110 timechart Tool to visualize total system behavior during a workload
111 top System profiling tool.
112
113 See 'perf help COMMAND' for more information on a specific command.
114 </literallayout>
115 </para>
116
117 <section id='using-perf-to-do-basic-profiling'>
118 <title>Using perf to do Basic Profiling</title>
119
120 <para>
 As a simple test case, we'll profile the 'wget' of a fairly large
 file, which is at least minimally interesting because it involves
 both file and network I/O. In standard Yocto images, wget is
 implemented as part of busybox, so the methods we use to analyze it
 apply in much the same way to the whole host of busybox applets
 supported in Yocto.
127 <literallayout class='monospaced'>
128 root@crownbay:~# rm linux-2.6.19.2.tar.bz2; \
129 wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
130 </literallayout>
131 The quickest and easiest way to get some basic overall data about
132 what's going on for a particular workload is to profile it using
133 'perf stat'. 'perf stat' basically profiles using a few default
134 counters and displays the summed counts at the end of the run:
135 <literallayout class='monospaced'>
136 root@crownbay:~# perf stat wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
137 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
138 linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
139
140 Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':
141
142 4597.223902 task-clock # 0.077 CPUs utilized
143 23568 context-switches # 0.005 M/sec
144 68 CPU-migrations # 0.015 K/sec
145 241 page-faults # 0.052 K/sec
146 3045817293 cycles # 0.663 GHz
147 &lt;not supported&gt; stalled-cycles-frontend
148 &lt;not supported&gt; stalled-cycles-backend
149 858909167 instructions # 0.28 insns per cycle
150 165441165 branches # 35.987 M/sec
151 19550329 branch-misses # 11.82% of all branches
152
153 59.836627620 seconds time elapsed
154 </literallayout>
155 Many times such a simple-minded test doesn't yield much of
156 interest, but sometimes it does (see Real-world Yocto bug
157 (slow loop-mounted write speed)).
158 </para>
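 <para>
 The annotations in the right-hand column of the 'perf stat' output
 are derived from the raw counts rather than measured separately.
 As a sanity check, the following sketch recomputes a few of them
 with awk from the numbers shown in the run above. The function
 name derived_metrics is purely illustrative and is not part of
 perf:
 <literallayout class='monospaced'>
 derived_metrics() {
     awk 'BEGIN {
         task_clock_ms = 4597.223902    # task-clock, in milliseconds
         elapsed_s     = 59.836627620   # seconds time elapsed
         cycles        = 3045817293
         instructions  = 858909167
         branches      = 165441165
         branch_misses = 19550329

         # CPU time consumed divided by wall-clock time
         printf "CPUs utilized:   %.3f\n", task_clock_ms / 1000 / elapsed_s
         # cycles per second of CPU time, expressed in GHz
         printf "GHz:             %.3f\n", cycles / (task_clock_ms / 1000) / 1e9
         printf "insns per cycle: %.2f\n", instructions / cycles
         printf "branch misses:   %.2f%%\n", 100 * branch_misses / branches
     }'
 }

 derived_metrics
 </literallayout>
 The results (0.077, 0.663, 0.28 and 11.82%) agree with the values
 perf printed, which shows that, for example, the low 'CPUs utilized'
 figure simply reflects how little of the 59.8 seconds of elapsed
 time was spent on the CPU.
 </para>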
159
160 <para>
161 Also, note that 'perf stat' isn't restricted to a fixed set of
162 counters - basically any event listed in the output of 'perf list'
163 can be tallied by 'perf stat'. For example, suppose we wanted to
164 see a summary of all the events related to kernel memory
165 allocation/freeing along with cache hits and misses:
166 <literallayout class='monospaced'>
167 root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
168 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
169 linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
170
171 Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':
172
173 5566 kmem:kmalloc
174 125517 kmem:kmem_cache_alloc
175 0 kmem:kmalloc_node
176 0 kmem:kmem_cache_alloc_node
177 34401 kmem:kfree
178 69920 kmem:kmem_cache_free
179 133 kmem:mm_page_free
180 41 kmem:mm_page_free_batched
181 11502 kmem:mm_page_alloc
182 11375 kmem:mm_page_alloc_zone_locked
183 0 kmem:mm_page_pcpu_drain
184 0 kmem:mm_page_alloc_extfrag
185 66848602 cache-references
186 2917740 cache-misses # 4.365 % of all cache refs
187
188 44.831023415 seconds time elapsed
189 </literallayout>
 So 'perf stat' gives us a nice easy way to get a quick overview of
 what might be happening for a set of events, but normally we'd
 need a little more detail in order to understand what's going on
 well enough to act on it.
194 </para>
195
 <para>
 To dive down into the next level of detail, we can use 'perf
 record'/'perf report', which will collect profiling data and
 present it to us using an interactive text-based UI (or
 simply as text if we specify --stdio to 'perf report').
 </para>
202
203 <para>
204 As our first attempt at profiling this workload, we'll simply
205 run 'perf record', handing it the workload we want to profile
206 (everything after 'perf record' and any perf options we hand
207 it - here none - will be executed in a new shell). perf collects
208 samples until the process exits and records them in a file named
209 'perf.data' in the current working directory.
210 <literallayout class='monospaced'>
211 root@crownbay:~# perf record wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
212
213 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
214 linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
215 [ perf record: Woken up 1 times to write data ]
216 [ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ]
217 </literallayout>
218 To see the results in a 'text-based UI' (tui), simply run
219 'perf report', which will read the perf.data file in the current
220 working directory and display the results in an interactive UI:
221 <literallayout class='monospaced'>
222 root@crownbay:~# perf report
223 </literallayout>
224 </para>
225
226 <para>
227 <imagedata fileref="figures/perf-wget-flat-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
228 </para>
229
230 <para>
231 The above screenshot displays a 'flat' profile, one entry for
232 each 'bucket' corresponding to the functions that were profiled
233 during the profiling run, ordered from the most popular to the
234 least (perf has options to sort in various orders and keys as
235 well as display entries only above a certain threshold and so
236 on - see the perf documentation for details). Note that this
237 includes both userspace functions (entries containing a [.]) and
238 kernel functions accounted to the process (entries containing
239 a [k]). (perf has command-line modifiers that can be used to
240 restrict the profiling to kernel or userspace, among others).
241 </para>
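 <para>
 As a hedged illustration of the options alluded to above, the
 following invocations sort an existing report by a different key
 and then restrict sampling to userspace or to kernel code. The
 --sort option and the ':u' and ':k' event modifiers are standard
 perf features, but the set of supported sort keys depends on the
 perf version built into your image, so treat this as a sketch
 rather than a definitive recipe:
 <literallayout class='monospaced'>
 root@crownbay:~# perf report --stdio --sort dso
 root@crownbay:~# perf record -e cycles:u wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
 root@crownbay:~# perf record -e cycles:k wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
 </literallayout>
 The first command reads the perf.data file already present in the
 current working directory; each 'perf record' run then overwrites
 it with a new profile.
 </para>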
242
243 <para>
244 Notice also that the above report shows an entry for 'busybox',
245 which is the executable that implements 'wget' in Yocto, but that
246 instead of a useful function name in that entry, it displays
247 a not-so-friendly hex value instead. The steps below will show
248 how to fix that problem.
249 </para>
250
251 <para>
252 Before we do that, however, let's try running a different profile,
253 one which shows something a little more interesting. The only
254 difference between the new profile and the previous one is that
255 we'll add the -g option, which will record not just the address
256 of a sampled function, but the entire callchain to the sampled
257 function as well:
258 <literallayout class='monospaced'>
259 root@crownbay:~# perf record -g wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
260 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
261 linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
262 [ perf record: Woken up 3 times to write data ]
263 [ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ]
264
265
266 root@crownbay:~# perf report
267 </literallayout>
268 </para>
269
270 <para>
271 <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
272 </para>
273
274 <para>
275 Using the callgraph view, we can actually see not only which
276 functions took the most time, but we can also see a summary of
277 how those functions were called and learn something about how the
278 program interacts with the kernel in the process.
279 </para>
280
281 <para>
282 Notice that each entry in the above screenshot now contains a '+'
283 on the left-hand side. This means that we can expand the entry and
284 drill down into the callchains that feed into that entry.
285 Pressing 'enter' on any one of them will expand the callchain
286 (you can also press 'E' to expand them all at the same time or 'C'
287 to collapse them all).
288 </para>
289
290 <para>
 In the screenshot above, we've toggled the __copy_to_user_ll()
 entry and several subnodes all the way down. This lets us see
 which callchains led to the profiled __copy_to_user_ll()
 function, which accounts for 1.77% of the total profile.
295 </para>
296
297 <para>
298 As a bit of background explanation for these callchains, think
299 about what happens at a high level when you run wget to get a file
300 out on the network. Basically what happens is that the data comes
301 into the kernel via the network connection (socket) and is passed
302 to the userspace program 'wget' (which is actually a part of
303 busybox, but that's not important for now), which takes the buffers
304 the kernel passes to it and writes it to a disk file to save it.
305 </para>
306
307 <para>
 The part of this process that we're looking at in the above call
 stacks is the part where the kernel passes the data it has read from
 the socket down to wget, i.e. a copy-to-user.
311 </para>
312
313 <para>
 Notice also that a hex value appears in the callstack here as well,
 in the expanded sys_clock_gettime() function. Later we'll see it
 resolve to a userspace function call in busybox.
318 </para>
319
320 <para>
321 <imagedata fileref="figures/perf-wget-g-copy-from-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
322 </para>
323
324 <para>
 The above screenshot shows the other half of the journey for the
 data: from the wget program's userspace buffers to disk. To get
 the buffers to disk, the wget program issues a write(2), which
 does a copy-from-user to the kernel, which then takes care, via
 some circuitous path (probably also present somewhere in the
 profile data), of getting the data safely to disk.
331 </para>
332
333 <para>
334 Now that we've seen the basic layout of the profile data and the
335 basics of how to extract useful information out of it, let's get
336 back to the task at hand and see if we can get some basic idea
337 about where the time is spent in the program we're profiling,
338 wget. Remember that wget is actually implemented as an applet
339 in busybox, so while the process name is 'wget', the executable
340 we're actually interested in is busybox. So let's expand the
341 first entry containing busybox:
342 </para>
343
344 <para>
345 <imagedata fileref="figures/perf-wget-busybox-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
346 </para>
347
348 <para>
349 Again, before we expanded we saw that the function was labeled
350 with a hex value instead of a symbol as with most of the kernel
351 entries. Expanding the busybox entry doesn't make it any better.
352 </para>
353
354 <para>
355 The problem is that perf can't find the symbol information for the
356 busybox binary, which is actually stripped out by the Yocto build
357 system.
358 </para>
359
 <para>
 One way around that is to put the following in your
 <filename>local.conf</filename> file when you build the image:
 <literallayout class='monospaced'>
 <ulink url='&YOCTO_DOCS_REF_URL;#var-INHIBIT_PACKAGE_STRIP'>INHIBIT_PACKAGE_STRIP</ulink> = "1"
 </literallayout>
 However, we already have an image with the binaries stripped,
 so what can we do to get perf to resolve the symbols? Basically
 we need to install the debuginfo for the busybox package.
 </para>
370
 <para>
 To generate the debug info for the packages in the image, we can
 add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
 <literallayout class='monospaced'>
 EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"
 </literallayout>
 Additionally, in order to generate the type of debuginfo that
 perf understands, we also need to set
 <ulink url='&YOCTO_DOCS_REF_URL;#var-PACKAGE_DEBUG_SPLIT_STYLE'><filename>PACKAGE_DEBUG_SPLIT_STYLE</filename></ulink>
 in the <filename>local.conf</filename> file:
 <literallayout class='monospaced'>
 PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
 </literallayout>
 Once we've done that, we can install the debuginfo for busybox.
 Once built, the debug packages can be found in
 build/tmp/deploy/rpm/* on the host system. Find the
 busybox-dbg-...rpm file and copy it to the target. For example:
 <literallayout class='monospaced'>
 [trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2_32/busybox-dbg-1.20.2-r2.core2_32.rpm root@192.168.1.31:
 root@192.168.1.31's password:
 busybox-dbg-1.20.2-r2.core2_32.rpm 100% 1826KB 1.8MB/s 00:01
 </literallayout>
 Now install the debug rpm on the target:
 <literallayout class='monospaced'>
 root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2_32.rpm
 </literallayout>
 Now that the debuginfo is installed, we see that the busybox
 entries display their functions symbolically:
 </para>
400
401 <para>
402 <imagedata fileref="figures/perf-wget-busybox-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
403 </para>
404
405 <para>
406 If we expand one of the entries and press 'enter' on a leaf node,
407 we're presented with a menu of actions we can take to get more
408 information related to that entry:
409 </para>
410
411 <para>
412 <imagedata fileref="figures/perf-wget-busybox-dso-zoom-menu.png" width="6in" depth="2in" align="center" scalefit="1" />
413 </para>
414
415 <para>
416 One of these actions allows us to show a view that displays a
417 busybox-centric view of the profiled functions (in this case we've
418 also expanded all the nodes using the 'E' key):
419 </para>
420
421 <para>
422 <imagedata fileref="figures/perf-wget-busybox-dso-zoom.png" width="6in" depth="7in" align="center" scalefit="1" />
423 </para>
424
425 <para>
426 Finally, we can see that now that the busybox debuginfo is
427 installed, the previously unresolved symbol in the
428 sys_clock_gettime() entry mentioned previously is now resolved,
429 and shows that the sys_clock_gettime system call that was the
430 source of 6.75% of the copy-to-user overhead was initiated by
431 the handle_input() busybox function:
432 </para>
433
434 <para>
435 <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
436 </para>
437
438 <para>
439 At the lowest level of detail, we can dive down to the assembly
440 level and see which instructions caused the most overhead in a
441 function. Pressing 'enter' on the 'udhcpc_main' function, we're
442 again presented with a menu:
443 </para>
444
445 <para>
446 <imagedata fileref="figures/perf-wget-busybox-annotate-menu.png" width="6in" depth="2in" align="center" scalefit="1" />
447 </para>
448
449 <para>
 Selecting 'Annotate udhcpc_main', we get a detailed listing of
 percentages by instruction for the udhcpc_main function. From the
 display, we can see that over 50% of the time spent in this
 function is taken up by a couple of tests and the move of a
 constant (1) to a register:
455 </para>
456
457 <para>
458 <imagedata fileref="figures/perf-wget-busybox-annotate-udhcpc.png" width="6in" depth="7in" align="center" scalefit="1" />
459 </para>
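 <para>
 The same annotation can also be produced non-interactively. The
 following is a minimal sketch which assumes that the perf.data file
 from the run above is in the current working directory, that the
 busybox debuginfo is installed, and that objdump is available on
 the target (perf relies on it for the disassembly):
 <literallayout class='monospaced'>
 root@crownbay:~# perf annotate --stdio udhcpc_main
 </literallayout>
 This prints the per-instruction percentages as plain text, which
 is convenient for saving to a file or comparing between runs.
 </para>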
460
461 <para>
462 As a segue into tracing, let's try another profile using a
463 different counter, something other than the default 'cycles'.
464 </para>
465
466 <para>
467 The tracing and profiling infrastructure in Linux has become
468 unified in a way that allows us to use the same tool with a
469 completely different set of counters, not just the standard
470 hardware counters that traditional tools have had to restrict
471 themselves to (of course the traditional tools can also make use
472 of the expanded possibilities now available to them, and in some
473 cases have, as mentioned previously).
474 </para>
475
476 <para>
477 We can get a list of the available events that can be used to
478 profile a workload via 'perf list':
479 <literallayout class='monospaced'>
480 root@crownbay:~# perf list
481
482 List of pre-defined events (to be used in -e):
483 cpu-cycles OR cycles [Hardware event]
484 stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
485 stalled-cycles-backend OR idle-cycles-backend [Hardware event]
486 instructions [Hardware event]
487 cache-references [Hardware event]
488 cache-misses [Hardware event]
489 branch-instructions OR branches [Hardware event]
490 branch-misses [Hardware event]
491 bus-cycles [Hardware event]
492 ref-cycles [Hardware event]
493
494 cpu-clock [Software event]
495 task-clock [Software event]
496 page-faults OR faults [Software event]
497 minor-faults [Software event]
498 major-faults [Software event]
499 context-switches OR cs [Software event]
500 cpu-migrations OR migrations [Software event]
501 alignment-faults [Software event]
502 emulation-faults [Software event]
503
504 L1-dcache-loads [Hardware cache event]
505 L1-dcache-load-misses [Hardware cache event]
506 L1-dcache-prefetch-misses [Hardware cache event]
507 L1-icache-loads [Hardware cache event]
508 L1-icache-load-misses [Hardware cache event]
509 .
510 .
511 .
512 rNNN [Raw hardware event descriptor]
513 cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
514 (see 'perf list --help' on how to encode it)
515
516 mem:&lt;addr&gt;[:access] [Hardware breakpoint]
517
518 sunrpc:rpc_call_status [Tracepoint event]
519 sunrpc:rpc_bind_status [Tracepoint event]
520 sunrpc:rpc_connect_status [Tracepoint event]
521 sunrpc:rpc_task_begin [Tracepoint event]
522 skb:kfree_skb [Tracepoint event]
523 skb:consume_skb [Tracepoint event]
524 skb:skb_copy_datagram_iovec [Tracepoint event]
525 net:net_dev_xmit [Tracepoint event]
526 net:net_dev_queue [Tracepoint event]
527 net:netif_receive_skb [Tracepoint event]
528 net:netif_rx [Tracepoint event]
529 napi:napi_poll [Tracepoint event]
530 sock:sock_rcvqueue_full [Tracepoint event]
531 sock:sock_exceed_buf_limit [Tracepoint event]
532 udp:udp_fail_queue_rcv_skb [Tracepoint event]
533 hda:hda_send_cmd [Tracepoint event]
534 hda:hda_get_response [Tracepoint event]
535 hda:hda_bus_reset [Tracepoint event]
536 scsi:scsi_dispatch_cmd_start [Tracepoint event]
537 scsi:scsi_dispatch_cmd_error [Tracepoint event]
538 scsi:scsi_eh_wakeup [Tracepoint event]
539 drm:drm_vblank_event [Tracepoint event]
540 drm:drm_vblank_event_queued [Tracepoint event]
541 drm:drm_vblank_event_delivered [Tracepoint event]
542 random:mix_pool_bytes [Tracepoint event]
543 random:mix_pool_bytes_nolock [Tracepoint event]
544 random:credit_entropy_bits [Tracepoint event]
545 gpio:gpio_direction [Tracepoint event]
546 gpio:gpio_value [Tracepoint event]
547 block:block_rq_abort [Tracepoint event]
548 block:block_rq_requeue [Tracepoint event]
549 block:block_rq_issue [Tracepoint event]
550 block:block_bio_bounce [Tracepoint event]
551 block:block_bio_complete [Tracepoint event]
552 block:block_bio_backmerge [Tracepoint event]
553 .
554 .
555 writeback:writeback_wake_thread [Tracepoint event]
556 writeback:writeback_wake_forker_thread [Tracepoint event]
557 writeback:writeback_bdi_register [Tracepoint event]
558 .
559 .
560 writeback:writeback_single_inode_requeue [Tracepoint event]
561 writeback:writeback_single_inode [Tracepoint event]
562 kmem:kmalloc [Tracepoint event]
563 kmem:kmem_cache_alloc [Tracepoint event]
564 kmem:mm_page_alloc [Tracepoint event]
565 kmem:mm_page_alloc_zone_locked [Tracepoint event]
566 kmem:mm_page_pcpu_drain [Tracepoint event]
567 kmem:mm_page_alloc_extfrag [Tracepoint event]
568 vmscan:mm_vmscan_kswapd_sleep [Tracepoint event]
569 vmscan:mm_vmscan_kswapd_wake [Tracepoint event]
570 vmscan:mm_vmscan_wakeup_kswapd [Tracepoint event]
571 vmscan:mm_vmscan_direct_reclaim_begin [Tracepoint event]
572 .
573 .
574 module:module_get [Tracepoint event]
575 module:module_put [Tracepoint event]
576 module:module_request [Tracepoint event]
577 sched:sched_kthread_stop [Tracepoint event]
578 sched:sched_wakeup [Tracepoint event]
579 sched:sched_wakeup_new [Tracepoint event]
580 sched:sched_process_fork [Tracepoint event]
581 sched:sched_process_exec [Tracepoint event]
582 sched:sched_stat_runtime [Tracepoint event]
583 rcu:rcu_utilization [Tracepoint event]
584 workqueue:workqueue_queue_work [Tracepoint event]
585 workqueue:workqueue_execute_end [Tracepoint event]
586 signal:signal_generate [Tracepoint event]
587 signal:signal_deliver [Tracepoint event]
588 timer:timer_init [Tracepoint event]
589 timer:timer_start [Tracepoint event]
590 timer:hrtimer_cancel [Tracepoint event]
591 timer:itimer_state [Tracepoint event]
592 timer:itimer_expire [Tracepoint event]
593 irq:irq_handler_entry [Tracepoint event]
594 irq:irq_handler_exit [Tracepoint event]
595 irq:softirq_entry [Tracepoint event]
596 irq:softirq_exit [Tracepoint event]
597 irq:softirq_raise [Tracepoint event]
598 printk:console [Tracepoint event]
599 task:task_newtask [Tracepoint event]
600 task:task_rename [Tracepoint event]
601 syscalls:sys_enter_socketcall [Tracepoint event]
602 syscalls:sys_exit_socketcall [Tracepoint event]
603 .
604 .
605 .
606 syscalls:sys_enter_unshare [Tracepoint event]
607 syscalls:sys_exit_unshare [Tracepoint event]
608 raw_syscalls:sys_enter [Tracepoint event]
609 raw_syscalls:sys_exit [Tracepoint event]
610 </literallayout>
611 </para>
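 <para>
 Because the full list is long, it can help to narrow it down.
 'perf list' accepts a category keyword, and reasonably recent
 versions also accept an event glob. Whether the glob form works
 depends on the perf version in your image, so take the second
 command below as an assumption to verify against
 'perf list --help':
 <literallayout class='monospaced'>
 root@crownbay:~# perf list tracepoint
 root@crownbay:~# perf list 'sched:*'
 root@crownbay:~# perf list sw
 </literallayout>
 The glob is quoted so that the shell passes it to perf unchanged
 instead of trying to expand it against file names.
 </para>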
612
613 <informalexample>
614 <emphasis>Tying it Together:</emphasis> These are exactly the same set of events defined
615 by the trace event subsystem and exposed by
616 ftrace/tracecmd/kernelshark as files in
617 /sys/kernel/debug/tracing/events, by SystemTap as
618 kernel.trace("tracepoint_name") and (partially) accessed by LTTng.
619 </informalexample>
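 <para>
 To see this correspondence directly, you can inspect the same
 events through the trace event files. This sketch assumes debugfs
 is mounted at /sys/kernel/debug on the target:
 <literallayout class='monospaced'>
 root@crownbay:~# ls /sys/kernel/debug/tracing/events/sched/
 root@crownbay:~# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
 </literallayout>
 Each subdirectory corresponds to one tracepoint in the subsystem,
 and its 'format' file describes the fields recorded for that event.
 </para>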
620
621 <para>
622 Only a subset of these would be of interest to us when looking at
623 this workload, so let's choose the most likely subsystems
624 (identified by the string before the colon in the Tracepoint events)
625 and do a 'perf stat' run using only those wildcarded subsystems:
626 <literallayout class='monospaced'>
627 root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
628 Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':
629
630 23323 skb:kfree_skb
631 0 skb:consume_skb
632 49897 skb:skb_copy_datagram_iovec
633 6217 net:net_dev_xmit
634 6217 net:net_dev_queue
635 7962 net:netif_receive_skb
636 2 net:netif_rx
637 8340 napi:napi_poll
638 0 sched:sched_kthread_stop
639 0 sched:sched_kthread_stop_ret
640 3749 sched:sched_wakeup
641 0 sched:sched_wakeup_new
642 0 sched:sched_switch
643 29 sched:sched_migrate_task
644 0 sched:sched_process_free
645 1 sched:sched_process_exit
646 0 sched:sched_wait_task
647 0 sched:sched_process_wait
648 0 sched:sched_process_fork
649 1 sched:sched_process_exec
650 0 sched:sched_stat_wait
651 2106519415641 sched:sched_stat_sleep
652 0 sched:sched_stat_iowait
653 147453613 sched:sched_stat_blocked
654 12903026955 sched:sched_stat_runtime
655 0 sched:sched_pi_setprio
656 3574 workqueue:workqueue_queue_work
657 3574 workqueue:workqueue_activate_work
658 0 workqueue:workqueue_execute_start
659 0 workqueue:workqueue_execute_end
660 16631 irq:irq_handler_entry
661 16631 irq:irq_handler_exit
662 28521 irq:softirq_entry
663 28521 irq:softirq_exit
664 28728 irq:softirq_raise
665 1 syscalls:sys_enter_sendmmsg
666 1 syscalls:sys_exit_sendmmsg
667 0 syscalls:sys_enter_recvmmsg
668 0 syscalls:sys_exit_recvmmsg
669 14 syscalls:sys_enter_socketcall
670 14 syscalls:sys_exit_socketcall
671 .
672 .
673 .
674 16965 syscalls:sys_enter_read
675 16965 syscalls:sys_exit_read
676 12854 syscalls:sys_enter_write
677 12854 syscalls:sys_exit_write
678 .
679 .
680 .
681
682 58.029710972 seconds time elapsed
683 </literallayout>
684 Let's pick one of these tracepoints and tell perf to do a profile
685 using it as the sampling event:
686 <literallayout class='monospaced'>
687 root@crownbay:~# perf record -g -e sched:sched_wakeup wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
688 </literallayout>
689 </para>
690
691 <para>
692 <imagedata fileref="figures/sched-wakeup-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
693 </para>
694
695 <para>
 The screenshot above shows the results of running a profile using
 the sched:sched_wakeup tracepoint, which shows the relative costs of
 various paths to sched_wakeup. Note that sched_wakeup is the
 name of the tracepoint; the tracepoint itself sits just inside
 ttwu_do_wakeup(), which accounts for the function name actually
 displayed in the profile:
 <literallayout class='monospaced'>
 /*
  * Mark the task runnable and perform wakeup-preemption.
  */
 static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
         trace_sched_wakeup(p, true);
         .
         .
         .
 }
 </literallayout>
715 A couple of the more interesting callchains are expanded and
716 displayed above, basically some network receive paths that
717 presumably end up waking up wget (busybox) when network data is
718 ready.
719 </para>
720
721 <para>
 Note that because tracepoints are normally used for tracing,
 the default sampling period for tracepoints is 1, i.e. for
 tracepoints perf will sample on every event occurrence (this
 can be changed using the -c option). This is in contrast to
 hardware counters such as the default 'cycles' counter used
 for normal profiling, where sampling periods are much higher
 (in the thousands) because profiling should have as low an
 overhead as possible and sampling on every cycle would be
 prohibitively expensive.
731 </para>
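 <para>
 For instance, to record only every 100th occurrence of the
 tracepoint rather than each one, the -c option can be added to the
 earlier command. The value 100 is an arbitrary illustration, not a
 recommendation:
 <literallayout class='monospaced'>
 root@crownbay:~# perf record -g -e sched:sched_wakeup -c 100 wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
 </literallayout>
 A larger period reduces overhead and the size of perf.data, at the
 cost of turning an exact trace of wakeups into a sampled one.
 </para>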
732 </section>
733
734 <section id='using-perf-to-do-basic-tracing'>
735 <title>Using perf to do Basic Tracing</title>
736
737 <para>
738 Profiling is a great tool for solving many problems or for
739 getting a high-level view of what's going on with a workload or
740 across the system. It is however by definition an approximation,
741 as suggested by the most prominent word associated with it,
742 'sampling'. On the one hand, it allows a representative picture of
743 what's going on in the system to be cheaply taken, but on the other
744 hand, that cheapness limits its utility when that data suggests a
745 need to 'dive down' more deeply to discover what's really going
746 on. In such cases, the only way to see what's really going on is
747 to be able to look at (or summarize more intelligently) the
748 individual steps that go into the higher-level behavior exposed
749 by the coarse-grained profiling data.
750 </para>
751
752 <para>
753 As a concrete example, we can trace all the events we think might
754 be applicable to our workload:
755 <literallayout class='monospaced'>
 root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* \
 -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
 wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
759 </literallayout>
760 We can look at the raw trace output using 'perf script' with no
761 arguments:
762 <literallayout class='monospaced'>
763 root@crownbay:~# perf script
764
765 perf 1262 [000] 11624.857082: sys_exit_read: 0x0
766 perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000
767 wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER]
768 wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER]
769 wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER]
770 wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
771 wget 1262 [001] 11624.858177: sys_exit_read: 0x200
772 wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
773 wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308
774 wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER]
775 wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER]
776 wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER]
777 wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
778 wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
779 wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
780 wget 1262 [001] 11624.859233: sys_exit_read: 0x0
781 wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200
782 wget 1262 [001] 11624.859584: sys_exit_read: 0x200
783 wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
784 wget 1262 [001] 11624.859888: sys_exit_read: 0x400
785 wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
786 wget 1262 [001] 11624.859944: sys_exit_read: 0x400
787 </literallayout>
 This gives us a detailed, timestamped sequence of the recorded
 events as they occurred within the workload.
790 </para>
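 <para>
 Because the raw output is line-oriented, it can also be picked apart
 with a few lines of ordinary code. The following is a minimal sketch,
 assuming every line has the layout shown above (command, pid, CPU in
 brackets, timestamp, event name, payload) and that the command name
 contains no spaces; the regular expression and the parse_line() helper
 are illustrative and are not part of perf:
 <literallayout class='monospaced'>
```python
import re

# comm, pid, [cpu], timestamp, event name, then the event-specific payload.
LINE_RE = re.compile(r"^\s*(\S+)\s+(\d+)\s+\[(\d+)\]\s+(\d+\.\d+):\s+(\w+):\s*(.*)$")

def parse_line(line):
    """Return a dict for one 'perf script' line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    comm, pid, cpu, timestamp, event, payload = match.groups()
    return {
        "comm": comm,
        "pid": int(pid),
        "cpu": int(cpu),
        "time": float(timestamp),
        "event": event,
        "payload": payload,
    }

sample = ("wget 1262 [001] 11624.858166: sys_enter_read: "
          "fd: 0x0003, buf: 0xbf82c940, count: 0x0200")
record = parse_line(sample)
print(record["event"], record["pid"], record["payload"])
```
 </literallayout>
 Parsing text this way is fragile, since the payload layout differs
 per event, which is one reason a proper language binding is the
 more robust route.
 </para>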
791
792 <para>
793 In many ways, profiling can be viewed as a subset of tracing -
794 theoretically, if you have a set of trace events that's sufficient
795 to capture all the important aspects of a workload, you can derive
796 any of the results or views that a profiling run can.
797 </para>
798
799 <para>
800 Another aspect of traditional profiling is that while powerful in
801 many ways, it's limited by the granularity of the underlying data.
802 Profiling tools offer various ways of sorting and presenting the
803 sample data, which make it much more useful and amenable to user
804 experimentation, but in the end it can't be used in an open-ended
805 way to extract data that just isn't present as a consequence of
806 the fact that conceptually, most of it has been thrown away.
807 </para>
808
809 <para>
810 Full-blown detailed tracing data does however offer the opportunity
811 to manipulate and present the information collected during a
812 tracing run in an infinite variety of ways.
813 </para>
814
815 <para>
816 Another way to look at it is that there are only so many ways that
817 the 'primitive' counters can be used on their own to generate
818 interesting output; to get anything more complicated than simple
819 counts requires some amount of additional logic, which is typically
820 very specific to the problem at hand. For example, if we wanted to
821 make use of a 'counter' that maps to the value of the time
822 difference between when a process was scheduled to run on a
823 processor and the time it actually ran, we wouldn't expect such
824 a counter to exist on its own, but we could derive one called say
825 'wakeup_latency' and use it to extract a useful view of that metric
826 from trace data. Likewise, we really can't figure out from standard
827 profiling tools how much data every process on the system reads and
828 writes, along with how many of those reads and writes fail
829 completely. If we have sufficient trace data, however, we could
830 with the right tools easily extract and present that information,
831 but we'd need something other than pre-canned profiling tools to
832 do that.
833 </para>
834
835 <para>
836 Luckily, there is a general-purpose way to handle such needs,
837 called 'programming languages'. Making programming languages
838 easily available to apply to such problems given the specific
839 format of data is called a 'programming language binding' for
840 that data and language. Perf supports two programming language
841 bindings, one for Python and one for Perl.
842 </para>
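 <para>
 To make the 'wakeup_latency' idea above concrete, here is a minimal
 sketch of how such a derived metric could be computed once trace
 events are available to a program. The tuple layout, the event kind
 names 'wakeup' and 'switch_in', and the function name are
 assumptions chosen for the example rather than anything perf
 defines; timestamps are integer microseconds with illustrative
 values:
 <literallayout class='monospaced'>
```python
def wakeup_latencies(events):
    """Pair each wakeup of a pid with the next switch to that pid.

    `events` is an iterable of (time, kind, pid) tuples in time order, where
    kind is "wakeup" (the pid became runnable) or "switch_in" (the pid
    started running).  Returns a list of (pid, latency) tuples in the same
    time unit as the input.
    """
    pending = {}   # pid -> time of the earliest unmatched wakeup
    latencies = []
    for time, kind, pid in events:
        if kind == "wakeup":
            # Keep the earliest wakeup if several arrive before the task runs.
            pending.setdefault(pid, time)
        elif kind == "switch_in" and pid in pending:
            latencies.append((pid, time - pending.pop(pid)))
    return latencies

events = [
    (6171460045, "wakeup", 21),
    (6171460066, "switch_in", 21),
    (6171468063, "wakeup", 1209),
    (6171468107, "switch_in", 1209),
]
latencies = wakeup_latencies(events)
print(latencies)
```
 </literallayout>
 No single counter provides this number; it only exists once two
 different event types are correlated by pid, which is the kind of
 problem-specific logic the text describes.
 </para>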
843
844 <informalexample>
845 <emphasis>Tying it Together:</emphasis> Language bindings for manipulating and
846 aggregating trace data are of course not a new
847 idea. One of the first projects to do this was IBM's DProbes
848 dpcc compiler, an ANSI C compiler which targeted a low-level
849 assembly language running on an in-kernel interpreter on the
850 target system. This is exactly analogous to what Sun's DTrace
851 did, except that DTrace invented its own language for the purpose.
 SystemTap, heavily inspired by DTrace, also created its own
853 one-off language, but rather than running the product on an
854 in-kernel interpreter, created an elaborate compiler-based
855 machinery to translate its language into kernel modules written
856 in C.
857 </informalexample>
858
859 <para>
 Now that we have the trace data in perf.data, we can use
 'perf script -g' to generate a skeleton script with handlers
 for each of the event types we recorded, including the
 read/write entry/exit events:
863 <literallayout class='monospaced'>
864 root@crownbay:~# perf script -g python
865 generated Python script: perf-script.py
866 </literallayout>
867 The skeleton script simply creates a python function for each
868 event type in the perf.data file. The body of each function simply
869 prints the event name along with its parameters. For example:
870 <literallayout class='monospaced'>
 def net__netif_rx(event_name, context, common_cpu,
         common_secs, common_nsecs, common_pid, common_comm,
         skbaddr, len, name):
     print_header(event_name, common_cpu, common_secs, common_nsecs,
             common_pid, common_comm)

     print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),
878 </literallayout>
879 We can run that script directly to print all of the events
880 contained in the perf.data file:
881 <literallayout class='monospaced'>
882 root@crownbay:~# perf script -s perf-script.py
883
884 in trace_begin
885 syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0
886 sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0
887 irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER
888 irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER
889 irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER
890 syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512
891 syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512
892 skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0
893 skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0
894 irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER
895 irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER
896 irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER
897 syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024
898 syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471
899 syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024
900 syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0
901 syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512
902 syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512
903 syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024
904 syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024
905 syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024
906 syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
907 </literallayout>
908 That in itself isn't very useful; after all, we can accomplish
909 pretty much the same thing by simply running 'perf script'
910 without arguments in the same directory as the perf.data file.
911 </para>
912
913 <para>
914 We can however replace the print statements in the generated
915 function bodies with whatever we want, and thereby make it
916 infinitely more useful.
917 </para>
918
919 <para>
920 As a simple example, let's just replace the print statements in
921 the function bodies with a simple function that does nothing but
922 increment a per-event count. When the program is run against a
923 perf.data file, each time a particular event is encountered,
924 a tally is incremented for that event. For example:
925 <literallayout class='monospaced'>
 def net__netif_rx(event_name, context, common_cpu,
         common_secs, common_nsecs, common_pid, common_comm,
         skbaddr, len, name):
     inc_counts(event_name)
930 </literallayout>
 Each event handler function in the generated code is modified
 to do this. For convenience, we define a common function called
 inc_counts() that each handler calls; inc_counts() simply tallies
 a count for each event using the 'counts' hash, which is a
 specialized hash (dictionary) that does Perl-like autovivification, a
 capability that's extremely useful for the kinds of multi-level
 aggregation commonly used in processing traces (see perf's
 documentation on the Python language binding for details):
939 <literallayout class='monospaced'>
 counts = autodict()

 def inc_counts(event_name):
     try:
         counts[event_name] += 1
     except TypeError:
         counts[event_name] = 1
947 </literallayout>
948 Finally, at the end of the trace processing run, we want to
949 print the result of all the per-event tallies. For that, we
950 use the special 'trace_end()' function:
951 <literallayout class='monospaced'>
 def trace_end():
     for event_name, count in counts.iteritems():
         print "%-40s %10s\n" % (event_name, count)
955 </literallayout>
956 The end result is a summary of all the events recorded in the
957 trace:
958 <literallayout class='monospaced'>
959 skb__skb_copy_datagram_iovec 13148
960 irq__softirq_entry 4796
961 irq__irq_handler_exit 3805
962 irq__softirq_exit 4795
963 syscalls__sys_enter_write 8990
964 net__net_dev_xmit 652
965 skb__kfree_skb 4047
966 sched__sched_wakeup 1155
967 irq__irq_handler_entry 3804
968 irq__softirq_raise 4799
969 net__net_dev_queue 652
970 syscalls__sys_enter_read 17599
971 net__netif_receive_skb 1743
972 syscalls__sys_exit_read 17598
973 net__netif_rx 2
974 napi__napi_poll 1877
975 syscalls__sys_exit_write 8990
976 </literallayout>
977 Note that this is pretty much exactly the same information we get
978 from 'perf stat', which goes a little way to support the idea
979 mentioned previously that given the right kind of trace data,
980 higher-level profiling-type summaries can be derived from it.
981 </para>
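 <para>
 The try/except pattern in inc_counts() makes more sense once the
 behavior of an autovivifying hash is visible. The sketch below
 runs outside of perf and assumes that autodict behaves like a
 recursively nested defaultdict; the autodict() defined here is a
 stand-in written for the example, and the two-level variant of
 inc_counts() (keyed by process name, then event name) is likewise
 only an illustration of the multi-level aggregation mentioned above:
 <literallayout class='monospaced'>
```python
from collections import defaultdict

def autodict():
    """Stand-in for perf's autodict: missing keys become nested autodicts."""
    return defaultdict(autodict)

counts = autodict()

def inc_counts(comm, event_name):
    """Tally events per process name and per event, two levels deep."""
    try:
        counts[comm][event_name] += 1
    except TypeError:
        # First sighting: the slot holds an empty autodict, not a number.
        counts[comm][event_name] = 1

for comm, event_name in [("wget", "syscalls__sys_enter_read"),
                         ("wget", "syscalls__sys_enter_read"),
                         ("perf", "sched__sched_wakeup")]:
    inc_counts(comm, event_name)

for comm in sorted(counts):
    for event_name in sorted(counts[comm]):
        print("%-10s %-30s %5d" % (comm, event_name, counts[comm][event_name]))
```
 </literallayout>
 Intermediate levels never need to be created explicitly: the first
 access to counts[comm] creates the inner hash, and the TypeError
 branch handles the first increment of a leaf.
 </para>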
982
983 <para>
 For more details, see the documentation on using the
 <ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.
986 </para>
987 </section>
988
989 <section id='system-wide-tracing-and-profiling'>
990 <title>System-Wide Tracing and Profiling</title>
991
992 <para>
 The examples so far have focused on tracing a particular program or
 workload - in other words, every profiling run has specified the
 program to profile on the command line, e.g. 'perf record wget ...'.
996 </para>
997
998 <para>
999 It's also possible, and more interesting in many cases, to run a
1000 system-wide profile or trace while running the workload in a
1001 separate shell.
1002 </para>
1003
1004 <para>
1005 To do system-wide profiling or tracing, you typically use
1006 the -a flag to 'perf record'.
1007 </para>
1008
1009 <para>
1010 To demonstrate this, open up one window and start the profile
1011 using the -a flag (press Ctrl-C to stop tracing):
1012 <literallayout class='monospaced'>
1013 root@crownbay:~# perf record -g -a
1014 ^C[ perf record: Woken up 6 times to write data ]
1015 [ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ]
1016 </literallayout>
1017 In another window, run the wget test:
1018 <literallayout class='monospaced'>
1019 root@crownbay:~# wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
1020 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
1021 linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
1022 </literallayout>
1023 Here we see entries not only for our wget load, but for other
1024 processes running on the system as well:
1025 </para>
1026
1027 <para>
1028 <imagedata fileref="figures/perf-systemwide.png" width="6in" depth="7in" align="center" scalefit="1" />
1029 </para>
1030
1031 <para>
1032 In the snapshot above, we can see callchains that originate in
1033 libc, and a callchain from Xorg that demonstrates that we're
1034 using a proprietary X driver in userspace (notice the presence
1035 of 'PVR' and some other unresolvable symbols in the expanded
1036 Xorg callchain).
1037 </para>
1038
1039 <para>
 Note also that we have both kernel and userspace entries in the
 above snapshot. We can also tell perf to focus on userspace by
 providing a modifier, in this case 'u', to the 'cycles' hardware
 counter when we record a profile:
1044 <literallayout class='monospaced'>
1045 root@crownbay:~# perf record -g -a -e cycles:u
1046 ^C[ perf record: Woken up 2 times to write data ]
1047 [ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ]
1048 </literallayout>
1049 </para>
1050
1051 <para>
1052 <imagedata fileref="figures/perf-report-cycles-u.png" width="6in" depth="7in" align="center" scalefit="1" />
1053 </para>
1054
1055 <para>
 Notice that in the screenshot above, we see only userspace entries ([.]).
1057 </para>
1058
1059 <para>
1060 Finally, we can press 'enter' on a leaf node and select the 'Zoom
1061 into DSO' menu item to show only entries associated with a
1062 specific DSO. In the screenshot below, we've zoomed into the
1063 'libc' DSO which shows all the entries associated with the
1064 libc-xxx.so DSO.
1065 </para>
1066
1067 <para>
1068 <imagedata fileref="figures/perf-systemwide-libc.png" width="6in" depth="7in" align="center" scalefit="1" />
1069 </para>
1070
1071 <para>
1072 We can also use the system-wide -a switch to do system-wide
1073 tracing. Here we'll trace a couple of scheduler events:
1074 <literallayout class='monospaced'>
1075 root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup
1076 ^C[ perf record: Woken up 38 times to write data ]
1077 [ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ]
1078 </literallayout>
1079 We can look at the raw output using 'perf script' with no
1080 arguments:
1081 <literallayout class='monospaced'>
1082 root@crownbay:~# perf script
1083
1084 perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1085 perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
1086 kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
1087 swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
1088 swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
1089 kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
1090 perf 1383 [001] 6171.470039: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1091 perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
1092 kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
1093 perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1094 </literallayout>
1095 </para>
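 <para>
 The sched_switch lines in the raw output above already contain
 enough information to work out how long each task stayed on a CPU.
 The following sketch is one way to do that from the text output,
 assuming timestamps always carry six fractional digits and command
 names contain no spaces; the regular expression and the run_slices()
 helper are made up for this example:
 <literallayout class='monospaced'>
```python
import re

# CPU, seconds, microseconds, then the task being switched in.
SWITCH_RE = re.compile(
    r"\[(\d+)\]\s+(\d+)\.(\d+):\s+sched_switch:.*next_comm=(\S+)\s+next_pid=(\d+)")

def run_slices(lines):
    """Yield (comm, pid, microseconds_on_cpu) for each completed time slice."""
    running = {}  # cpu -> (comm, pid, switch-in time in microseconds)
    for line in lines:
        match = SWITCH_RE.search(line)
        if match is None:
            continue  # not a sched_switch line
        cpu, secs, usecs, next_comm, next_pid = match.groups()
        now = int(secs) * 1000000 + int(usecs)
        if cpu in running:
            comm, pid, since = running[cpu]
            yield comm, pid, now - since
        running[cpu] = (next_comm, int(next_pid), now)

lines = [
    "perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 "
    "prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120",
    "kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 "
    "prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120",
    "swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 "
    "prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120",
    "kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 "
    "prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120",
]
slices = list(run_slices(lines))
print(slices)
```
 </literallayout>
 Each slice is measured per CPU, from the switch that brought a task
 in to the next switch on that same CPU; the task still running when
 the trace ends is not reported.
 </para>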
1096
1097 <section id='perf-filtering'>
1098 <title>Filtering</title>
1099
1100 <para>
1101 Notice that there are a lot of events that don't really have
1102 anything to do with what we're interested in, namely events
1103 that schedule 'perf' itself in and out or that wake perf up.
1104 We can get rid of those by using the '--filter' option -
1105 for each event we specify using -e, we can add a --filter
1106 after that to filter out trace events that contain fields
1107 with specific values:
1108 <literallayout class='monospaced'>
1109 root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf &amp;&amp; prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf'
1110 ^C[ perf record: Woken up 38 times to write data ]
1111 [ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ]
1112
1113
1114 root@crownbay:~# perf script
1115
1116 swapper 0 [000] 7932.162180: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
1117 kworker/0:3 1209 [000] 7932.162236: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
1118 perf 1407 [001] 7932.170048: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1119 perf 1407 [001] 7932.180044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1120 perf 1407 [001] 7932.190038: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1121 perf 1407 [001] 7932.200044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1122 perf 1407 [001] 7932.210044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1123 perf 1407 [001] 7932.220044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1124 swapper 0 [001] 7932.230111: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
1125 swapper 0 [001] 7932.230146: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
1126 kworker/1:1 21 [001] 7932.230205: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
1127 swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
1128 swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
1129 kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
1130 </literallayout>
 In this case, we've filtered out all events that have 'perf'
 in their 'comm', 'prev_comm', or 'next_comm' fields. Notice
 that there are still events recorded for perf, but
 those events don't have values of 'perf' for the filtered
 fields. Completely filtering out anything from perf would
 require a bit more work, but for the purpose of demonstrating
 how to use filters, it's close enough.
1138 </para>
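 <para>
 The reason some 'perf' lines survive the filters above is easier to
 see when the filter logic is written out. The sketch below mimics
 the two filter expressions in ordinary Python over hand-built event
 dictionaries; the dictionary keys, in particular 'current' for the
 task shown in the first column of the 'perf script' output, are
 assumptions made for the example, and the real filtering happens in
 the kernel rather than in a script:
 <literallayout class='monospaced'>
```python
def keep_sched_switch(event):
    """Mirror of the sched_switch filter: next_comm != perf AND prev_comm != perf."""
    return event["next_comm"] != "perf" and event["prev_comm"] != "perf"

def keep_sched_wakeup(event):
    """Mirror of the sched_wakeup filter: comm != perf (comm is the task being woken)."""
    return event["comm"] != "perf"

events = [
    # 'current' is the task running when the event fired; it is shown in the
    # first output column but is not one of the fields the filters test.
    {"current": "perf", "name": "sched_wakeup", "comm": "kworker/1:1"},
    {"current": "perf", "name": "sched_switch",
     "prev_comm": "perf", "next_comm": "kworker/1:1"},
    {"current": "swapper", "name": "sched_switch",
     "prev_comm": "swapper/0", "next_comm": "kworker/0:3"},
]

FILTERS = {"sched_switch": keep_sched_switch, "sched_wakeup": keep_sched_wakeup}
kept = [event for event in events if FILTERS[event["name"]](event)]
for event in kept:
    print(event["current"], event["name"])
```
 </literallayout>
 The wakeup issued while perf was running is kept because its 'comm'
 field names the task being woken (kworker/1:1), not the task that
 was running at the time.
 </para>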
1139
1140 <informalexample>
1141 <emphasis>Tying it Together:</emphasis> These are exactly the same set of event
1142 filters defined by the trace event subsystem. See the
1143 ftrace/tracecmd/kernelshark section for more discussion about
1144 these event filters.
1145 </informalexample>
1146
1147 <informalexample>
1148 <emphasis>Tying it Together:</emphasis> These event filters are implemented by a
1149 special-purpose pseudo-interpreter in the kernel and are an
1150 integral and indispensable part of the perf design as it
 relates to tracing. Kernel-based event filters provide a
1152 mechanism to precisely throttle the event stream that appears
1153 in user space, where it makes sense to provide bindings to real
1154 programming languages for postprocessing the event stream.
1155 This architecture allows for the intelligent and flexible
1156 partitioning of processing between the kernel and user space.
1157 Contrast this with other tools such as SystemTap, which does
1158 all of its processing in the kernel and as such requires a
1159 special project-defined language in order to accommodate that
1160 design, or LTTng, where everything is sent to userspace and
1161 as such requires a super-efficient kernel-to-userspace
1162 transport mechanism in order to function properly. While
1163 perf certainly can benefit from for instance advances in
1164 the design of the transport, it doesn't fundamentally depend
1165 on them. Basically, if you find that your perf tracing
1166 application is causing buffer I/O overruns, it probably
1167 means that you aren't taking enough advantage of the
1168 kernel filtering engine.
1169 </informalexample>
1170 </section>
1171 </section>
1172
1173 <section id='using-dynamic-tracepoints'>
1174 <title>Using Dynamic Tracepoints</title>
1175
1176 <para>
1177 perf isn't restricted to the fixed set of static tracepoints
1178 listed by 'perf list'. Users can also add their own 'dynamic'
1179 tracepoints anywhere in the kernel. For instance, suppose we
1180 want to define our own tracepoint on do_fork(). We can do that
1181 using the 'perf probe' perf subcommand:
1182 <literallayout class='monospaced'>
1183 root@crownbay:~# perf probe do_fork
1184 Added new event:
1185 probe:do_fork (on do_fork)
1186
1187 You can now use it in all perf tools, such as:
1188
1189 perf record -e probe:do_fork -aR sleep 1
1190 </literallayout>
 Adding a new tracepoint via 'perf probe' results in an event
 with all the expected files and format in
 /sys/kernel/debug/tracing/events, just the same as for static
 tracepoints (as discussed in more detail in the trace events
 subsystem section):
1196 <literallayout class='monospaced'>
1197 root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al
1198 drwxr-xr-x 2 root root 0 Oct 28 11:42 .
1199 drwxr-xr-x 3 root root 0 Oct 28 11:42 ..
1200 -rw-r--r-- 1 root root 0 Oct 28 11:42 enable
1201 -rw-r--r-- 1 root root 0 Oct 28 11:42 filter
1202 -r--r--r-- 1 root root 0 Oct 28 11:42 format
1203 -r--r--r-- 1 root root 0 Oct 28 11:42 id
1204
1205 root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# cat format
1206 name: do_fork
1207 ID: 944
1208 format:
1209 field:unsigned short common_type; offset:0; size:2; signed:0;
1210 field:unsigned char common_flags; offset:2; size:1; signed:0;
1211 field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
1212 field:int common_pid; offset:4; size:4; signed:1;
1213 field:int common_padding; offset:8; size:4; signed:1;
1214
1215 field:unsigned long __probe_ip; offset:12; size:4; signed:0;
1216
1217 print fmt: "(%lx)", REC->__probe_ip
1218 </literallayout>
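 The offsets and sizes in the 'format' listing fully describe the
 layout of one event record. As a sketch of what that description
 means in practice, the following Python snippet maps the listed
 fields onto struct codes and decodes a 16-byte record. It assumes a
 32-bit target (so 'unsigned long' is 4 bytes, as listed), assumes
 that common_type carries the event ID, and builds its own sample
 record with a made-up pid and probe address; it does not read
 perf.data, which wraps records in additional headers:
 <literallayout class='monospaced'>
```python
import struct

# Layout taken from the 'format' listing: (name, struct code) in offset order.
FIELDS = [
    ("common_type", "H"),           # unsigned short, offset 0, size 2
    ("common_flags", "B"),          # unsigned char,  offset 2, size 1
    ("common_preempt_count", "B"),  # unsigned char,  offset 3, size 1
    ("common_pid", "i"),            # int,            offset 4, size 4
    ("common_padding", "i"),        # int,            offset 8, size 4
    ("__probe_ip", "I"),            # unsigned long,  offset 12, size 4
]

# "=" means native byte order, standard sizes, and no alignment padding.
RECORD = struct.Struct("=" + "".join(code for _, code in FIELDS))

def decode(raw):
    """Map one raw 16-byte record onto the field names."""
    names = [name for name, _ in FIELDS]
    return dict(zip(names, RECORD.unpack(raw)))

# Sample record built for the demonstration: event ID 944, pid 1197,
# and an illustrative kernel address for the probe location.
raw = RECORD.pack(944, 0, 0, 1197, 0, 0xc1028460)
event = decode(raw)
print("(%x)" % event["__probe_ip"])
```
 </literallayout>
 The final print mirrors the 'print fmt' line of the format file,
 which shows only the probe address in parentheses.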
1219 We can list all dynamic tracepoints currently in existence:
1220 <literallayout class='monospaced'>
1221 root@crownbay:~# perf probe -l
1222 probe:do_fork (on do_fork)
1223 probe:schedule (on schedule)
1224 </literallayout>
 Let's record system-wide ('sleep 30' is a trick for recording
 system-wide for a fixed time: the command itself does nothing but
 sleep, and recording stops when it exits after 30 seconds):
1228 <literallayout class='monospaced'>
1229 root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30
1230 [ perf record: Woken up 1 times to write data ]
1231 [ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ]
1232 </literallayout>
1233 Using 'perf script' we can see each do_fork event that fired:
1234 <literallayout class='monospaced'>
1235 root@crownbay:~# perf script
1236
1237 # ========
1238 # captured on: Sun Oct 28 11:55:18 2012
1239 # hostname : crownbay
1240 # os release : 3.4.11-yocto-standard
1241 # perf version : 3.4.11
1242 # arch : i686
1243 # nrcpus online : 2
1244 # nrcpus avail : 2
1245 # cpudesc : Intel(R) Atom(TM) CPU E660 @ 1.30GHz
1246 # cpuid : GenuineIntel,6,38,1
1247 # total memory : 1017184 kB
1248 # cmdline : /usr/bin/perf record -g -a -e probe:do_fork sleep 30
1249 # event : name = probe:do_fork, type = 2, config = 0x3b0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern
1250 = 0, id = { 5, 6 }
1251 # HEADER_CPU_TOPOLOGY info available, use -I to display
1252 # ========
1253 #
1254 matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
1255 matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
1256 pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
1257 pcmanfm 1296 [000] 34211.639917: do_fork: (c1028460)
1258 matchbox-deskto 1197 [001] 34217.541603: do_fork: (c1028460)
1259 matchbox-deskto 1299 [001] 34217.543584: do_fork: (c1028460)
1260 gthumb 1300 [001] 34217.697451: do_fork: (c1028460)
1261 gthumb 1300 [001] 34219.085734: do_fork: (c1028460)
1262 gthumb 1300 [000] 34219.121351: do_fork: (c1028460)
1263 gthumb 1300 [001] 34219.264551: do_fork: (c1028460)
1264 pcmanfm 1296 [000] 34219.590380: do_fork: (c1028460)
1265 matchbox-deskto 1197 [001] 34224.955965: do_fork: (c1028460)
1266 matchbox-deskto 1306 [001] 34224.957972: do_fork: (c1028460)
1267 matchbox-termin 1307 [000] 34225.038214: do_fork: (c1028460)
1268 matchbox-termin 1307 [001] 34225.044218: do_fork: (c1028460)
1269 matchbox-termin 1307 [000] 34225.046442: do_fork: (c1028460)
1270 matchbox-deskto 1197 [001] 34237.112138: do_fork: (c1028460)
1271 matchbox-deskto 1311 [001] 34237.114106: do_fork: (c1028460)
1272 gaku 1312 [000] 34237.202388: do_fork: (c1028460)
1273 </literallayout>
1274 And using 'perf report' on the same file, we can see the
1275 callgraphs from starting a few programs during those 30 seconds:
1276 </para>
1277
1278 <para>
1279 <imagedata fileref="figures/perf-probe-do_fork-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
1280 </para>
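 <para>
 The 'perf script' listing of do_fork events shown earlier can also
 be summarized with a short script. The following sketch counts
 probe hits per command name from lines in that text format; the
 forks_per_command() helper is hypothetical, and the sample lines are
 a subset of the output above. Note that the command names are
 truncated by the kernel (for example 'matchbox-deskto'), so the
 tally is keyed by the truncated name:
 <literallayout class='monospaced'>
```python
from collections import Counter

def forks_per_command(lines):
    """Count do_fork probe hits per command name, skipping '#' header lines."""
    tally = Counter()
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if "do_fork:" in fields:
            tally[fields[0]] += 1
    return tally

sample = """\
# hostname : crownbay
matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
gthumb 1300 [001] 34217.697451: do_fork: (c1028460)
""".splitlines()

tally = forks_per_command(sample)
for comm, count in tally.most_common():
    print("%-16s %d" % (comm, count))
```
 </literallayout>
 </para>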
1281
1282 <informalexample>
 <emphasis>Tying it Together:</emphasis> The trace events subsystem accommodates static
1284 and dynamic tracepoints in exactly the same way - there's no
1285 difference as far as the infrastructure is concerned. See the
1286 ftrace section for more details on the trace event subsystem.
1287 </informalexample>
1288
1289 <informalexample>
1290 <emphasis>Tying it Together:</emphasis> Dynamic tracepoints are implemented under the
1291 covers by kprobes and uprobes. kprobes and uprobes are also used
1292 by and in fact are the main focus of SystemTap.
1293 </informalexample>
1294 </section>
1295 </section>
1296
1297 <section id='perf-documentation'>
1298 <title>Documentation</title>
1299
1300 <para>
1301 Online versions of the man pages for the commands discussed in this
1302 section can be found here:
1303 <itemizedlist>
1304 <listitem><para>The <ulink url='http://linux.die.net/man/1/perf-stat'>'perf stat' manpage</ulink>.
1305 </para></listitem>
1306 <listitem><para>The <ulink url='http://linux.die.net/man/1/perf-record'>'perf record' manpage</ulink>.
1307 </para></listitem>
1308 <listitem><para>The <ulink url='http://linux.die.net/man/1/perf-report'>'perf report' manpage</ulink>.
1309 </para></listitem>
1310 <listitem><para>The <ulink url='http://linux.die.net/man/1/perf-probe'>'perf probe' manpage</ulink>.
1311 </para></listitem>
1312 <listitem><para>The <ulink url='http://linux.die.net/man/1/perf-script'>'perf script' manpage</ulink>.
1313 </para></listitem>
1314 <listitem><para>Documentation on using the
1315 <ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.
1316 </para></listitem>
1317 <listitem><para>The top-level
1318 <ulink url='http://linux.die.net/man/1/perf'>perf(1) manpage</ulink>.
1319 </para></listitem>
1320 </itemizedlist>
1321 </para>
1322
1323 <para>
1324 Normally, you should be able to invoke the man pages via perf
1325 itself e.g. 'perf help' or 'perf help record'.
1326 </para>
1327
1328 <para>
 However, by default Yocto doesn't install man pages, but perf
 invokes the man pages for most help functionality. This is a bug
 and is being tracked in the Yocto Bugzilla:
 <ulink url='https://bugzilla.yoctoproject.org/show_bug.cgi?id=3388'>Bug 3388 - perf: enable man pages for basic 'help' functionality</ulink>.
1333 </para>
1334
1335 <para>
1336 The man pages in text form, along with some other files, such as
1337 a set of examples, can be found in the 'perf' directory of the
1338 kernel tree:
1339 <literallayout class='monospaced'>
1340 tools/perf/Documentation
1341 </literallayout>
1342 There's also a nice perf tutorial on the perf wiki that goes
1343 into more detail than we do here in certain areas:
1344 <ulink url='https://perf.wiki.kernel.org/index.php/Tutorial'>Perf Tutorial</ulink>
1345 </para>
1346 </section>
1347</section>
1348
1349<section id='profile-manual-ftrace'>
1350 <title>ftrace</title>
1351
1352 <para>
1353 'ftrace' literally refers to the 'ftrace function tracer' but in
1354 reality this encompasses a number of related tracers along with
1355 the infrastructure that they all make use of.
1356 </para>
1357
1358 <section id='ftrace-setup'>
1359 <title>Setup</title>
1360
1361 <para>
1362 For this section, we'll assume you've already performed the basic
1363 setup outlined in the General Setup section.
1364 </para>
1365
1366 <para>
 ftrace, trace-cmd, and kernelshark run on the target system,
 and are ready to go out-of-the-box - no additional setup is
 necessary. For the rest of this section we assume you've ssh'ed
 to the target and will be running ftrace there. kernelshark
 is a GUI application, and if you use the '-X' option to ssh you
 can have the kernelshark GUI run on the target but display
 remotely on the host if you want.
1374 </para>
1375 </section>
1376
1377 <section id='basic-ftrace-usage'>
1378 <title>Basic ftrace usage</title>
1379
1380 <para>
1381 'ftrace' essentially refers to everything included in
1382 the /tracing directory of the mounted debugfs filesystem
1383 (Yocto follows the standard convention and mounts it
1384 at /sys/kernel/debug). Here's a listing of all the files
1385 found in /sys/kernel/debug/tracing on a Yocto system:
1386 <literallayout class='monospaced'>
1387 root@sugarbay:/sys/kernel/debug/tracing# ls
1388 README kprobe_events trace
1389 available_events kprobe_profile trace_clock
1390 available_filter_functions options trace_marker
1391 available_tracers per_cpu trace_options
1392 buffer_size_kb printk_formats trace_pipe
1393 buffer_total_size_kb saved_cmdlines tracing_cpumask
1394 current_tracer set_event tracing_enabled
1395 dyn_ftrace_total_info set_ftrace_filter tracing_on
1396 enabled_functions set_ftrace_notrace tracing_thresh
1397 events set_ftrace_pid
1398 free_buffer set_graph_function
1399 </literallayout>
        The files listed above are used for various purposes -
        some relate directly to the tracers themselves, others are
        used to set tracing options, and yet others actually contain
        the tracing output when a tracer is in effect. The purpose of
        some of these files can be guessed from their names, while
        others need explanation. We'll cover some of them below;
        for an explanation of the others, please see the ftrace
        documentation.
1408 </para>
1409
1410 <para>
1411 We'll start by looking at some of the available built-in
1412 tracers.
1413 </para>
1414
1415 <para>
1416 cat'ing the 'available_tracers' file lists the set of
1417 available tracers:
1418 <literallayout class='monospaced'>
1419 root@sugarbay:/sys/kernel/debug/tracing# cat available_tracers
1420 blk function_graph function nop
1421 </literallayout>
1422 The 'current_tracer' file contains the tracer currently in
1423 effect:
1424 <literallayout class='monospaced'>
1425 root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
1426 nop
1427 </literallayout>
1428 The above listing of current_tracer shows that
1429 the 'nop' tracer is in effect, which is just another
1430 way of saying that there's actually no tracer
1431 currently in effect.
1432 </para>
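    <para>
        The two files above can also be combined into a single view.
        The following is a minimal sketch rather than anything provided
        by ftrace itself: list_tracers is a hypothetical helper, and
        TRACING_DIR is an assumed convenience variable that defaults to
        the standard location used in this section.
        <literallayout class='monospaced'>
```shell
# List the available tracers, one per line, marking the tracer that is
# currently in effect with a '*'.
# TRACING_DIR is a hypothetical override; by default the standard
# debugfs location is used.
list_tracers() {
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}
    current=$(cat "$dir/current_tracer")
    for t in $(cat "$dir/available_tracers"); do
        if [ "$t" = "$current" ]; then
            echo "* $t"
        else
            echo "  $t"
        fi
    done
}
```
        </literallayout>
        On the system shown above, calling list_tracers would list
        the four tracers with 'nop' marked as the current one.
    </para>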
1433
1434 <para>
1435 echo'ing one of the available_tracers into current_tracer
1436 makes the specified tracer the current tracer:
1437 <literallayout class='monospaced'>
1438 root@sugarbay:/sys/kernel/debug/tracing# echo function > current_tracer
1439 root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
1440 function
1441 </literallayout>
1442 The above sets the current tracer to be the
1443 'function tracer'. This tracer traces every function
1444 call in the kernel and makes it available as the
1445 contents of the 'trace' file. Reading the 'trace' file
1446 lists the currently buffered function calls that have been
1447 traced by the function tracer:
1448 <literallayout class='monospaced'>
1449 root@sugarbay:/sys/kernel/debug/tracing# cat trace | less
1450
1451 # tracer: function
1452 #
1453 # entries-in-buffer/entries-written: 310629/766471 #P:8
1454 #
1455 # _-----=&gt; irqs-off
1456 # / _----=&gt; need-resched
1457 # | / _---=&gt; hardirq/softirq
1458 # || / _--=&gt; preempt-depth
1459 # ||| / delay
1460 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
1461 # | | | |||| | |
1462 &lt;idle&gt;-0 [004] d..1 470.867169: ktime_get_real &lt;-intel_idle
1463 &lt;idle&gt;-0 [004] d..1 470.867170: getnstimeofday &lt;-ktime_get_real
1464 &lt;idle&gt;-0 [004] d..1 470.867171: ns_to_timeval &lt;-intel_idle
1465 &lt;idle&gt;-0 [004] d..1 470.867171: ns_to_timespec &lt;-ns_to_timeval
1466 &lt;idle&gt;-0 [004] d..1 470.867172: smp_apic_timer_interrupt &lt;-apic_timer_interrupt
1467 &lt;idle&gt;-0 [004] d..1 470.867172: native_apic_mem_write &lt;-smp_apic_timer_interrupt
1468 &lt;idle&gt;-0 [004] d..1 470.867172: irq_enter &lt;-smp_apic_timer_interrupt
1469 &lt;idle&gt;-0 [004] d..1 470.867172: rcu_irq_enter &lt;-irq_enter
1470 &lt;idle&gt;-0 [004] d..1 470.867173: rcu_idle_exit_common.isra.33 &lt;-rcu_irq_enter
1471 &lt;idle&gt;-0 [004] d..1 470.867173: local_bh_disable &lt;-irq_enter
1472 &lt;idle&gt;-0 [004] d..1 470.867173: add_preempt_count &lt;-local_bh_disable
1473 &lt;idle&gt;-0 [004] d.s1 470.867174: tick_check_idle &lt;-irq_enter
1474 &lt;idle&gt;-0 [004] d.s1 470.867174: tick_check_oneshot_broadcast &lt;-tick_check_idle
1475 &lt;idle&gt;-0 [004] d.s1 470.867174: ktime_get &lt;-tick_check_idle
1476 &lt;idle&gt;-0 [004] d.s1 470.867174: tick_nohz_stop_idle &lt;-tick_check_idle
1477 &lt;idle&gt;-0 [004] d.s1 470.867175: update_ts_time_stats &lt;-tick_nohz_stop_idle
1478 &lt;idle&gt;-0 [004] d.s1 470.867175: nr_iowait_cpu &lt;-update_ts_time_stats
1479 &lt;idle&gt;-0 [004] d.s1 470.867175: tick_do_update_jiffies64 &lt;-tick_check_idle
1480 &lt;idle&gt;-0 [004] d.s1 470.867175: _raw_spin_lock &lt;-tick_do_update_jiffies64
1481 &lt;idle&gt;-0 [004] d.s1 470.867176: add_preempt_count &lt;-_raw_spin_lock
1482 &lt;idle&gt;-0 [004] d.s2 470.867176: do_timer &lt;-tick_do_update_jiffies64
1483 &lt;idle&gt;-0 [004] d.s2 470.867176: _raw_spin_lock &lt;-do_timer
1484 &lt;idle&gt;-0 [004] d.s2 470.867176: add_preempt_count &lt;-_raw_spin_lock
1485 &lt;idle&gt;-0 [004] d.s3 470.867177: ntp_tick_length &lt;-do_timer
1486 &lt;idle&gt;-0 [004] d.s3 470.867177: _raw_spin_lock_irqsave &lt;-ntp_tick_length
1487 .
1488 .
1489 .
1490 </literallayout>
1491 Each line in the trace above shows what was happening in
1492 the kernel on a given cpu, to the level of detail of
1493 function calls. Each entry shows the function called,
1494 followed by its caller (after the arrow).
1495 </para>
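    <para>
        When switching tracers from a script rather than by hand, it
        can be useful to check the requested name against
        'available_tracers' first. The sketch below does that under a
        few assumptions: set_tracer is a hypothetical helper name, and
        TRACING_DIR is an assumed override for the tracing directory.
        <literallayout class='monospaced'>
```shell
# Make TRACER the current tracer, but only if the kernel lists it in
# available_tracers. Returns non-zero and changes nothing otherwise.
# usage: set_tracer TRACER
set_tracer() {
    tracer=$1
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}
    for t in $(cat "$dir/available_tracers"); do
        if [ "$t" = "$tracer" ]; then
            echo "$tracer" > "$dir/current_tracer"
            return 0
        fi
    done
    return 1
}
```
        </literallayout>
        For example, 'set_tracer function' has the same effect as the
        echo command shown above, while 'set_tracer nop' turns the
        tracer off again.
    </para>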
1496
1497 <para>
1498 The function tracer gives you an extremely detailed idea
1499 of what the kernel was doing at the point in time the trace
1500 was taken, and is a great way to learn about how the kernel
1501 code works in a dynamic sense.
1502 </para>
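    <para>
        Because the function tracer output is plain text, ordinary
        text tools can summarize it. As a rough sketch, the
        hypothetical top_functions helper below counts how often each
        function appears in a saved copy of the 'trace' file. It
        assumes the default line layout shown above and task names
        without spaces, so that the function name is always the fifth
        field.
        <literallayout class='monospaced'>
```shell
# Print the N most frequently called functions found in saved
# function-tracer output, most frequent first (default N is 10).
# Assumes task names contain no spaces, so the function is field 5.
# usage: top_functions FILE [N]
top_functions() {
    file=$1
    n=${2:-10}
    awk '$1 !~ /^#/ { if (NF >= 6) count[$5]++ }
         END { for (f in count) print count[f], f }' "$file" |
        sort -k1,1nr -k2,2 | head -n "$n"
}
```
        </literallayout>
        A typical use would be to save a snapshot with
        'cat trace > /tmp/trace.txt' and then run
        'top_functions /tmp/trace.txt 20'.
    </para>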
1503
1504 <informalexample>
1505 <emphasis>Tying it Together:</emphasis> The ftrace function tracer is also
1506 available from within perf, as the ftrace:function tracepoint.
1507 </informalexample>
1508
1509 <para>
1510 It is a little more difficult to follow the call chains than
1511 it needs to be - luckily there's a variant of the function
1512 tracer that displays the callchains explicitly, called the
1513 'function_graph' tracer:
1514 <literallayout class='monospaced'>
1515 root@sugarbay:/sys/kernel/debug/tracing# echo function_graph &gt; current_tracer
1516 root@sugarbay:/sys/kernel/debug/tracing# cat trace | less
1517
1518 tracer: function_graph
1519
1520 CPU DURATION FUNCTION CALLS
1521 | | | | | | |
1522 7) 0.046 us | pick_next_task_fair();
1523 7) 0.043 us | pick_next_task_stop();
1524 7) 0.042 us | pick_next_task_rt();
1525 7) 0.032 us | pick_next_task_fair();
1526 7) 0.030 us | pick_next_task_idle();
1527 7) | _raw_spin_unlock_irq() {
1528 7) 0.033 us | sub_preempt_count();
1529 7) 0.258 us | }
1530 7) 0.032 us | sub_preempt_count();
1531 7) + 13.341 us | } /* __schedule */
1532 7) 0.095 us | } /* sub_preempt_count */
1533 7) | schedule() {
1534 7) | __schedule() {
1535 7) 0.060 us | add_preempt_count();
1536 7) 0.044 us | rcu_note_context_switch();
1537 7) | _raw_spin_lock_irq() {
1538 7) 0.033 us | add_preempt_count();
1539 7) 0.247 us | }
1540 7) | idle_balance() {
1541 7) | _raw_spin_unlock() {
1542 7) 0.031 us | sub_preempt_count();
1543 7) 0.246 us | }
1544 7) | update_shares() {
1545 7) 0.030 us | __rcu_read_lock();
1546 7) 0.029 us | __rcu_read_unlock();
1547 7) 0.484 us | }
1548 7) 0.030 us | __rcu_read_lock();
1549 7) | load_balance() {
1550 7) | find_busiest_group() {
1551 7) 0.031 us | idle_cpu();
1552 7) 0.029 us | idle_cpu();
1553 7) 0.035 us | idle_cpu();
1554 7) 0.906 us | }
1555 7) 1.141 us | }
1556 7) 0.022 us | msecs_to_jiffies();
1557 7) | load_balance() {
1558 7) | find_busiest_group() {
1559 7) 0.031 us | idle_cpu();
1560 .
1561 .
1562 .
1563 4) 0.062 us | msecs_to_jiffies();
1564 4) 0.062 us | __rcu_read_unlock();
1565 4) | _raw_spin_lock() {
1566 4) 0.073 us | add_preempt_count();
1567 4) 0.562 us | }
1568 4) + 17.452 us | }
1569 4) 0.108 us | put_prev_task_fair();
1570 4) 0.102 us | pick_next_task_fair();
1571 4) 0.084 us | pick_next_task_stop();
1572 4) 0.075 us | pick_next_task_rt();
1573 4) 0.062 us | pick_next_task_fair();
1574 4) 0.066 us | pick_next_task_idle();
1575 ------------------------------------------
1576 4) kworker-74 =&gt; &lt;idle&gt;-0
1577 ------------------------------------------
1578
1579 4) | finish_task_switch() {
1580 4) | _raw_spin_unlock_irq() {
1581 4) 0.100 us | sub_preempt_count();
1582 4) 0.582 us | }
1583 4) 1.105 us | }
1584 4) 0.088 us | sub_preempt_count();
1585 4) ! 100.066 us | }
1586 .
1587 .
1588 .
1589 3) | sys_ioctl() {
1590 3) 0.083 us | fget_light();
1591 3) | security_file_ioctl() {
1592 3) 0.066 us | cap_file_ioctl();
1593 3) 0.562 us | }
1594 3) | do_vfs_ioctl() {
1595 3) | drm_ioctl() {
1596 3) 0.075 us | drm_ut_debug_printk();
1597 3) | i915_gem_pwrite_ioctl() {
1598 3) | i915_mutex_lock_interruptible() {
1599 3) 0.070 us | mutex_lock_interruptible();
1600 3) 0.570 us | }
1601 3) | drm_gem_object_lookup() {
1602 3) | _raw_spin_lock() {
1603 3) 0.080 us | add_preempt_count();
1604 3) 0.620 us | }
1605 3) | _raw_spin_unlock() {
1606 3) 0.085 us | sub_preempt_count();
1607 3) 0.562 us | }
1608 3) 2.149 us | }
1609 3) 0.133 us | i915_gem_object_pin();
1610 3) | i915_gem_object_set_to_gtt_domain() {
1611 3) 0.065 us | i915_gem_object_flush_gpu_write_domain();
1612 3) 0.065 us | i915_gem_object_wait_rendering();
1613 3) 0.062 us | i915_gem_object_flush_cpu_write_domain();
1614 3) 1.612 us | }
1615 3) | i915_gem_object_put_fence() {
1616 3) 0.097 us | i915_gem_object_flush_fence.constprop.36();
1617 3) 0.645 us | }
1618 3) 0.070 us | add_preempt_count();
1619 3) 0.070 us | sub_preempt_count();
1620 3) 0.073 us | i915_gem_object_unpin();
1621 3) 0.068 us | mutex_unlock();
1622 3) 9.924 us | }
1623 3) + 11.236 us | }
1624 3) + 11.770 us | }
1625 3) + 13.784 us | }
1626 3) | sys_ioctl() {
1627 </literallayout>
1628 As you can see, the function_graph display is much easier to
1629 follow. Also note that in addition to the function calls and
1630 associated braces, other events such as scheduler events
1631 are displayed in context. In fact, you can freely include
1632 any tracepoint available in the trace events subsystem described
1633 in the next section by simply enabling those events, and they'll
1634 appear in context in the function graph display. Quite a
1635 powerful tool for understanding kernel dynamics.
1636 </para>
1637
1638 <para>
        Also notice that there are various annotations on the left
        hand side of the display. For example, if the total time a
        given function took to execute exceeds a certain threshold,
        a marker appears next to its duration: a plus sign for calls
        that took longer than 10 microseconds, and an exclamation
        point for calls that took longer than 100 microseconds.
        Please see the ftrace documentation for details on all
        these fields.
1645 </para>
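    <para>
        The plus sign and exclamation point annotations also make it
        easy to pull the slowest calls out of a long function_graph
        trace. The following sketch assumes the default column layout
        shown above (CPU number, optional marker, duration in 'us');
        slow_calls is a hypothetical helper name.
        <literallayout class='monospaced'>
```shell
# Print only the function_graph lines whose duration carries a '+' or
# '!' annotation, i.e. the calls the tracer itself flagged as slow.
# usage: slow_calls FILE
slow_calls() {
    grep -E '^ *[0-9]+\) +[+!] +[0-9.]+ us' "$1"
}
```
        </literallayout>
        Applied to a saved copy of the trace above, this would keep
        lines such as the 13.341 us '__schedule' exit and the
        100.066 us entry, and drop everything else.
    </para>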
1646 </section>
1647
1648 <section id='the-trace-events-subsystem'>
1649 <title>The 'trace events' Subsystem</title>
1650
1651 <para>
1652 One especially important directory contained within
1653 the /sys/kernel/debug/tracing directory is the 'events'
1654 subdirectory, which contains representations of every
1655 tracepoint in the system. Listing out the contents of
1656 the 'events' subdirectory, we see mainly another set of
1657 subdirectories:
1658 <literallayout class='monospaced'>
1659 root@sugarbay:/sys/kernel/debug/tracing# cd events
1660 root@sugarbay:/sys/kernel/debug/tracing/events# ls -al
1661 drwxr-xr-x 38 root root 0 Nov 14 23:19 .
1662 drwxr-xr-x 5 root root 0 Nov 14 23:19 ..
1663 drwxr-xr-x 19 root root 0 Nov 14 23:19 block
1664 drwxr-xr-x 32 root root 0 Nov 14 23:19 btrfs
1665 drwxr-xr-x 5 root root 0 Nov 14 23:19 drm
1666 -rw-r--r-- 1 root root 0 Nov 14 23:19 enable
1667 drwxr-xr-x 40 root root 0 Nov 14 23:19 ext3
1668 drwxr-xr-x 79 root root 0 Nov 14 23:19 ext4
1669 drwxr-xr-x 14 root root 0 Nov 14 23:19 ftrace
1670 drwxr-xr-x 8 root root 0 Nov 14 23:19 hda
1671 -r--r--r-- 1 root root 0 Nov 14 23:19 header_event
1672 -r--r--r-- 1 root root 0 Nov 14 23:19 header_page
1673 drwxr-xr-x 25 root root 0 Nov 14 23:19 i915
1674 drwxr-xr-x 7 root root 0 Nov 14 23:19 irq
1675 drwxr-xr-x 12 root root 0 Nov 14 23:19 jbd
1676 drwxr-xr-x 14 root root 0 Nov 14 23:19 jbd2
1677 drwxr-xr-x 14 root root 0 Nov 14 23:19 kmem
1678 drwxr-xr-x 7 root root 0 Nov 14 23:19 module
1679 drwxr-xr-x 3 root root 0 Nov 14 23:19 napi
1680 drwxr-xr-x 6 root root 0 Nov 14 23:19 net
1681 drwxr-xr-x 3 root root 0 Nov 14 23:19 oom
1682 drwxr-xr-x 12 root root 0 Nov 14 23:19 power
1683 drwxr-xr-x 3 root root 0 Nov 14 23:19 printk
1684 drwxr-xr-x 8 root root 0 Nov 14 23:19 random
1685 drwxr-xr-x 4 root root 0 Nov 14 23:19 raw_syscalls
1686 drwxr-xr-x 3 root root 0 Nov 14 23:19 rcu
1687 drwxr-xr-x 6 root root 0 Nov 14 23:19 rpm
1688 drwxr-xr-x 20 root root 0 Nov 14 23:19 sched
1689 drwxr-xr-x 7 root root 0 Nov 14 23:19 scsi
1690 drwxr-xr-x 4 root root 0 Nov 14 23:19 signal
1691 drwxr-xr-x 5 root root 0 Nov 14 23:19 skb
1692 drwxr-xr-x 4 root root 0 Nov 14 23:19 sock
1693 drwxr-xr-x 10 root root 0 Nov 14 23:19 sunrpc
1694 drwxr-xr-x 538 root root 0 Nov 14 23:19 syscalls
1695 drwxr-xr-x 4 root root 0 Nov 14 23:19 task
1696 drwxr-xr-x 14 root root 0 Nov 14 23:19 timer
1697 drwxr-xr-x 3 root root 0 Nov 14 23:19 udp
1698 drwxr-xr-x 21 root root 0 Nov 14 23:19 vmscan
1699 drwxr-xr-x 3 root root 0 Nov 14 23:19 vsyscall
1700 drwxr-xr-x 6 root root 0 Nov 14 23:19 workqueue
1701 drwxr-xr-x 26 root root 0 Nov 14 23:19 writeback
1702 </literallayout>
1703 Each one of these subdirectories corresponds to a
1704 'subsystem' and contains yet again more subdirectories,
1705 each one of those finally corresponding to a tracepoint.
1706 For example, here are the contents of the 'kmem' subsystem:
1707 <literallayout class='monospaced'>
1708 root@sugarbay:/sys/kernel/debug/tracing/events# cd kmem
1709 root@sugarbay:/sys/kernel/debug/tracing/events/kmem# ls -al
1710 drwxr-xr-x 14 root root 0 Nov 14 23:19 .
1711 drwxr-xr-x 38 root root 0 Nov 14 23:19 ..
1712 -rw-r--r-- 1 root root 0 Nov 14 23:19 enable
1713 -rw-r--r-- 1 root root 0 Nov 14 23:19 filter
1714 drwxr-xr-x 2 root root 0 Nov 14 23:19 kfree
1715 drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc
1716 drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc_node
1717 drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc
1718 drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc_node
1719 drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_free
1720 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc
1721 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_extfrag
1722 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_zone_locked
1723 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free
1724 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free_batched
1725 drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_pcpu_drain
1726 </literallayout>
1727 Let's see what's inside the subdirectory for a specific
1728 tracepoint, in this case the one for kmalloc:
1729 <literallayout class='monospaced'>
1730 root@sugarbay:/sys/kernel/debug/tracing/events/kmem# cd kmalloc
1731 root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# ls -al
1732 drwxr-xr-x 2 root root 0 Nov 14 23:19 .
1733 drwxr-xr-x 14 root root 0 Nov 14 23:19 ..
1734 -rw-r--r-- 1 root root 0 Nov 14 23:19 enable
1735 -rw-r--r-- 1 root root 0 Nov 14 23:19 filter
1736 -r--r--r-- 1 root root 0 Nov 14 23:19 format
1737 -r--r--r-- 1 root root 0 Nov 14 23:19 id
1738 </literallayout>
        The 'format' file for the tracepoint describes the layout of
        the event in memory, which the various tracing tools that
        make use of these tracepoints rely on to parse the event
        and make sense of it. It also contains a 'print fmt' field
        that allows tools like ftrace to display the event as text.
        Here's what the format of the kmalloc event looks like:
1745 <literallayout class='monospaced'>
1746 root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# cat format
1747 name: kmalloc
1748 ID: 313
1749 format:
1750 field:unsigned short common_type; offset:0; size:2; signed:0;
1751 field:unsigned char common_flags; offset:2; size:1; signed:0;
1752 field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
1753 field:int common_pid; offset:4; size:4; signed:1;
1754 field:int common_padding; offset:8; size:4; signed:1;
1755
1756 field:unsigned long call_site; offset:16; size:8; signed:0;
1757 field:const void * ptr; offset:24; size:8; signed:0;
1758 field:size_t bytes_req; offset:32; size:8; signed:0;
1759 field:size_t bytes_alloc; offset:40; size:8; signed:0;
1760 field:gfp_t gfp_flags; offset:48; size:4; signed:0;
1761
1762 print fmt: "call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s", REC->call_site, REC->ptr, REC->bytes_req, REC->bytes_alloc,
1763 (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {(unsigned long)(((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
1764 gfp_t)0x20000u) | (( gfp_t)0x02u) | (( gfp_t)0x08u)) | (( gfp_t)0x4000u) | (( gfp_t)0x10000u) | (( gfp_t)0x1000u) | (( gfp_t)0x200u) | ((
1765 gfp_t)0x400000u)), "GFP_TRANSHUGE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x20000u) | ((
1766 gfp_t)0x02u) | (( gfp_t)0x08u)), "GFP_HIGHUSER_MOVABLE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
1767 gfp_t)0x20000u) | (( gfp_t)0x02u)), "GFP_HIGHUSER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
 gfp_t)0x20000u)), "GFP_USER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x80000u)), "GFP_TEMPORARY"},
1769 {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u)), "GFP_KERNEL"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u)),
1770 "GFP_NOFS"}, {(unsigned long)((( gfp_t)0x20u)), "GFP_ATOMIC"}, {(unsigned long)((( gfp_t)0x10u)), "GFP_NOIO"}, {(unsigned long)((
1771 gfp_t)0x20u), "GFP_HIGH"}, {(unsigned long)(( gfp_t)0x10u), "GFP_WAIT"}, {(unsigned long)(( gfp_t)0x40u), "GFP_IO"}, {(unsigned long)((
1772 gfp_t)0x100u), "GFP_COLD"}, {(unsigned long)(( gfp_t)0x200u), "GFP_NOWARN"}, {(unsigned long)(( gfp_t)0x400u), "GFP_REPEAT"}, {(unsigned
1773 long)(( gfp_t)0x800u), "GFP_NOFAIL"}, {(unsigned long)(( gfp_t)0x1000u), "GFP_NORETRY"}, {(unsigned long)(( gfp_t)0x4000u), "GFP_COMP"},
1774 {(unsigned long)(( gfp_t)0x8000u), "GFP_ZERO"}, {(unsigned long)(( gfp_t)0x10000u), "GFP_NOMEMALLOC"}, {(unsigned long)(( gfp_t)0x20000u),
1775 "GFP_HARDWALL"}, {(unsigned long)(( gfp_t)0x40000u), "GFP_THISNODE"}, {(unsigned long)(( gfp_t)0x80000u), "GFP_RECLAIMABLE"}, {(unsigned
1776 long)(( gfp_t)0x08u), "GFP_MOVABLE"}, {(unsigned long)(( gfp_t)0), "GFP_NOTRACK"}, {(unsigned long)(( gfp_t)0x400000u), "GFP_NO_KSWAPD"},
1777 {(unsigned long)(( gfp_t)0x800000u), "GFP_OTHER_NODE"} ) : "GFP_NOWAIT"
1778 </literallayout>
1779 The 'enable' file in the tracepoint directory is what allows
1780 the user (or tools such as trace-cmd) to actually turn the
1781 tracepoint on and off. When enabled, the corresponding
1782 tracepoint will start appearing in the ftrace 'trace'
1783 file described previously. For example, this turns on the
1784 kmalloc tracepoint:
1785 <literallayout class='monospaced'>
1786 root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 1 > enable
1787 </literallayout>
        At the moment, we're not interested in the function tracer or
        any other tracer that might be in effect, so we first switch
        the current tracer to 'nop'. Having done that, we still need
        to make sure tracing is turned on in order to see the events
        in the output buffer:
1792 <literallayout class='monospaced'>
1793 root@sugarbay:/sys/kernel/debug/tracing# echo nop > current_tracer
1794 root@sugarbay:/sys/kernel/debug/tracing# echo 1 > tracing_on
1795 </literallayout>
        Now, if we look at the 'trace' file, we see nothing
1797 but the kmalloc events we just turned on:
1798 <literallayout class='monospaced'>
1799 root@sugarbay:/sys/kernel/debug/tracing# cat trace | less
1800 # tracer: nop
1801 #
1802 # entries-in-buffer/entries-written: 1897/1897 #P:8
1803 #
1804 # _-----=&gt; irqs-off
1805 # / _----=&gt; need-resched
1806 # | / _---=&gt; hardirq/softirq
1807 # || / _--=&gt; preempt-depth
1808 # ||| / delay
1809 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
1810 # | | | |||| | |
1811 dropbear-1465 [000] ...1 18154.620753: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1812 &lt;idle&gt;-0 [000] ..s3 18154.621640: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1813 &lt;idle&gt;-0 [000] ..s3 18154.621656: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1814 matchbox-termin-1361 [001] ...1 18154.755472: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f0e00 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
1815 Xorg-1264 [002] ...1 18154.755581: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
1816 Xorg-1264 [002] ...1 18154.755583: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
1817 Xorg-1264 [002] ...1 18154.755589: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
1818 matchbox-termin-1361 [001] ...1 18155.354594: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db35400 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT
1819 Xorg-1264 [002] ...1 18155.354703: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
1820 Xorg-1264 [002] ...1 18155.354705: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
1821 Xorg-1264 [002] ...1 18155.354711: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
1822 &lt;idle&gt;-0 [000] ..s3 18155.673319: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1823 dropbear-1465 [000] ...1 18155.673525: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1824 &lt;idle&gt;-0 [000] ..s3 18155.674821: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1825 &lt;idle&gt;-0 [000] ..s3 18155.793014: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1826 dropbear-1465 [000] ...1 18155.793219: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1827 &lt;idle&gt;-0 [000] ..s3 18155.794147: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1828 &lt;idle&gt;-0 [000] ..s3 18155.936705: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1829 dropbear-1465 [000] ...1 18155.936910: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1830 &lt;idle&gt;-0 [000] ..s3 18155.937869: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1831 matchbox-termin-1361 [001] ...1 18155.953667: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f2000 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
1832 Xorg-1264 [002] ...1 18155.953775: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
1833 Xorg-1264 [002] ...1 18155.953777: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
1834 Xorg-1264 [002] ...1 18155.953783: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
1835 &lt;idle&gt;-0 [000] ..s3 18156.176053: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1836 dropbear-1465 [000] ...1 18156.176257: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1837 &lt;idle&gt;-0 [000] ..s3 18156.177717: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1838 &lt;idle&gt;-0 [000] ..s3 18156.399229: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
 dropbear-1465 [000] ...1 18156.399434: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
1840 &lt;idle&gt;-0 [000] ..s3 18156.400660: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
1841 matchbox-termin-1361 [001] ...1 18156.552800: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db34800 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT
1842 </literallayout>
1843 To again disable the kmalloc event, we need to send 0 to the
1844 enable file:
1845 <literallayout class='monospaced'>
1846 root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 0 > enable
1847 </literallayout>
1848 You can enable any number of events or complete subsystems
1849 (by using the 'enable' file in the subsystem directory) and
1850 get an arbitrarily fine-grained idea of what's going on in the
1851 system by enabling as many of the appropriate tracepoints
1852 as applicable.
1853 </para>
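    <para>
        The 'enable' files at the tracepoint and subsystem levels
        follow the same pattern, so turning events on and off is easy
        to script. The sketch below is one possible wrapper, not an
        existing tool: set_event is a hypothetical helper, and
        TRACING_DIR is an assumed override for the tracing directory.
        <literallayout class='monospaced'>
```shell
# Turn a single tracepoint, or a whole subsystem when EVENT is omitted,
# on (STATE=1) or off (STATE=0). Returns non-zero if the corresponding
# 'enable' file does not exist or is not writable.
# usage: set_event STATE SUBSYSTEM [EVENT]
set_event() {
    state=$1
    subsystem=$2
    event=$3
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}
    if [ -n "$event" ]; then
        target="$dir/events/$subsystem/$event/enable"
    else
        target="$dir/events/$subsystem/enable"
    fi
    if [ ! -w "$target" ]; then
        return 1
    fi
    echo "$state" > "$target"
}
```
        </literallayout>
        With this in place, 'set_event 1 kmem kmalloc' corresponds to
        the echo command used earlier in this section, and
        'set_event 1 kmem' would enable every tracepoint in the 'kmem'
        subsystem at once.
    </para>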
1854
1855 <para>
1856 A number of the tools described in this HOWTO do just that,
1857 including trace-cmd and kernelshark in the next section.
1858 </para>
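    <para>
        Short of using those tools, the kmalloc events captured in the
        'trace' file earlier in this section can be post-processed
        with ordinary text tools. As an illustration only, the
        hypothetical kmalloc_by_task helper below totals the
        'bytes_alloc' values per task from a saved copy of the trace.
        It assumes the line layout shown above and task names without
        spaces.
        <literallayout class='monospaced'>
```shell
# Sum the bytes_alloc values of kmalloc events per task-pid, largest
# total first. Assumes task names contain no spaces, so the event name
# is field 5 and the key=value pairs start at field 6.
# usage: kmalloc_by_task FILE
kmalloc_by_task() {
    awk '$5 == "kmalloc:" {
             for (i = NF; i >= 6; i--) {
                 split($i, kv, "=")
                 if (kv[1] == "bytes_alloc") total[$1] += kv[2]
             }
         }
         END { for (t in total) print total[t], t }' "$1" |
        sort -k1,1nr -k2,2
}
```
        </literallayout>
        Each output line consists of a byte total followed by the
        task-pid it belongs to.
    </para>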
1859
1860 <informalexample>
        <emphasis>Tying it Together:</emphasis> These tracepoints and their representation
        are used not only by ftrace, but by many of the other tools
        covered in this document, and they form a central point of
        integration for the various tracers available in Linux.
        They are a central part of the instrumentation for the
        following tools: perf, lttng, ftrace, blktrace and SystemTap.
1867 </informalexample>
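    <para>
        To give a feel for how a tool might consume that
        representation, the sketch below extracts the name, offset
        and size of every field from a tracepoint's 'format' file.
        It is only an illustration of the file layout shown earlier:
        event_fields is a hypothetical helper, and TRACING_DIR is an
        assumed override for the tracing directory.
        <literallayout class='monospaced'>
```shell
# Print "name offset size" for each field described in the format file
# of the given tracepoint.
# usage: event_fields SUBSYSTEM EVENT
event_fields() {
    subsystem=$1
    event=$2
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}
    awk -F';' '/field:/ {
            n = split($1, decl, " ")   # words of the C declaration
            sub(/.*offset:/, "", $2)   # keep only the offset value
            sub(/.*size:/, "", $3)     # keep only the size value
            print decl[n], $2, $3      # last word is the field name
        }' "$dir/events/$subsystem/$event/format"
}
```
        </literallayout>
        For the kmalloc event, 'event_fields kmem kmalloc' would
        print one line per field, covering both the common fields and
        the event-specific ones such as 'call_site' and 'bytes_req'.
    </para>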
1868
1869 <informalexample>
1870 <emphasis>Tying it Together:</emphasis> Eventually all the special-purpose tracers
1871 currently available in /sys/kernel/debug/tracing will be
1872 removed and replaced with equivalent tracers based on the
1873 'trace events' subsystem.
1874 </informalexample>
1875 </section>
1876
1877 <section id='trace-cmd-kernelshark'>
1878 <title>trace-cmd/kernelshark</title>
1879
1880 <para>
        trace-cmd is essentially an extensive command-line 'wrapper'
        interface that hides the details of all the individual files
        in /sys/kernel/debug/tracing, allowing users to specify
        particular events within the
        /sys/kernel/debug/tracing/events/ subdirectory and to collect
        traces without having to deal with those details directly.
1887 </para>
1888
1889 <para>
1890 As yet another layer on top of that, kernelshark provides a GUI
1891 that allows users to start and stop traces and specify sets
1892 of events using an intuitive interface, and view the
1893 output as both trace events and as a per-CPU graphical
1894 display. It directly uses 'trace-cmd' as the plumbing
1895 that accomplishes all that underneath the covers (and
1896 actually displays the trace-cmd command it uses, as we'll see).
1897 </para>
1898
1899 <para>
1900 To start a trace using kernelshark, first start kernelshark:
1901 <literallayout class='monospaced'>
1902 root@sugarbay:~# kernelshark
1903 </literallayout>
1904 Then bring up the 'Capture' dialog by choosing from the
1905 kernelshark menu:
1906 <literallayout class='monospaced'>
1907 Capture | Record
1908 </literallayout>
1909 That will display the following dialog, which allows you to
1910 choose one or more events (or even one or more complete
1911 subsystems) to trace:
1912 </para>
1913
1914 <para>
1915 <imagedata fileref="figures/kernelshark-choose-events.png" width="6in" depth="6in" align="center" scalefit="1" />
1916 </para>
1917
1918 <para>
        Note that these are exactly the same sets of events described
        in the previous trace events subsystem section; in fact,
        that is where trace-cmd gets them for kernelshark.
1922 </para>
1923
1924 <para>
1925 In the above screenshot, we've decided to explore the
1926 graphics subsystem a bit and so have chosen to trace all
1927 the tracepoints contained within the 'i915' and 'drm'
1928 subsystems.
1929 </para>
1930
1931 <para>
        After doing that, we can start and stop the trace using
        the 'Run' button in the lower right corner of the dialog
        (the same button turns into the 'Stop' button after the
        trace has started):
1936 </para>
1937
1938 <para>
1939 <imagedata fileref="figures/kernelshark-output-display.png" width="6in" depth="6in" align="center" scalefit="1" />
1940 </para>
1941
1942 <para>
1943 Notice that the right-hand pane shows the exact trace-cmd
1944 command-line that's used to run the trace, along with the
1945 results of the trace-cmd run.
1946 </para>
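    <para>
        The same kind of capture can be made without the GUI by
        calling trace-cmd directly. The sketch below wraps
        'trace-cmd record' in a hypothetical record_events helper.
        The exact command line kernelshark displays may differ from
        this one, and TRACE_CMD is an assumed variable for pointing
        at a different trace-cmd binary.
        <literallayout class='monospaced'>
```shell
# Record the given events (or whole subsystems) system-wide for
# SECONDS seconds and store the result in OUTPUT.
# usage: record_events OUTPUT SECONDS EVENT...
record_events() {
    output=$1
    seconds=$2
    shift 2
    events=""
    for e in "$@"; do
        events="$events -e $e"
    done
    # $events is left unquoted on purpose so that it splits into
    # separate '-e NAME' arguments.
    "${TRACE_CMD:-trace-cmd}" record -o "$output" $events sleep "$seconds"
}
```
        </literallayout>
        For example, 'record_events /tmp/graphics.dat 5 i915 drm'
        would trace the two graphics subsystems chosen above for five
        seconds, after which 'trace-cmd report -i /tmp/graphics.dat'
        displays the recorded events as text.
    </para>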
1947
1948 <para>
1949 Once the 'Stop' button is pressed, the graphical view magically
1950 fills up with a colorful per-cpu display of the trace data,
1951 along with the detailed event listing below that:
1952 </para>
1953
1954 <para>
1955 <imagedata fileref="figures/kernelshark-i915-display.png" width="6in" depth="7in" align="center" scalefit="1" />
1956 </para>
1957
1958 <para>
1959 Here's another example, this time a display resulting
1960 from tracing 'all events':
1961 </para>
1962
1963 <para>
1964 <imagedata fileref="figures/kernelshark-all.png" width="6in" depth="7in" align="center" scalefit="1" />
1965 </para>
1966
1967 <para>
1968 The tool is pretty self-explanatory, but for more detailed
1969 information on navigating through the data, see the
1970 <ulink url='http://rostedt.homelinux.com/kernelshark/'>kernelshark website</ulink>.
1971 </para>
1972 </section>
1973
1974 <section id='ftrace-documentation'>
1975 <title>Documentation</title>
1976
1977 <para>
1978 The documentation for ftrace can be found in the kernel
1979 Documentation directory:
1980 <literallayout class='monospaced'>
1981 Documentation/trace/ftrace.txt
1982 </literallayout>
1983 The documentation for the trace event subsystem can also
1984 be found in the kernel Documentation directory:
1985 <literallayout class='monospaced'>
1986 Documentation/trace/events.txt
1987 </literallayout>
1988 There is a nice series of articles on using
1989 ftrace and trace-cmd at LWN:
1990 <itemizedlist>
1991 <listitem><para><ulink url='http://lwn.net/Articles/365835/'>Debugging the kernel using Ftrace - part 1</ulink>
1992 </para></listitem>
1993 <listitem><para><ulink url='http://lwn.net/Articles/366796/'>Debugging the kernel using Ftrace - part 2</ulink>
1994 </para></listitem>
1995 <listitem><para><ulink url='http://lwn.net/Articles/370423/'>Secrets of the Ftrace function tracer</ulink>
1996 </para></listitem>
1997 <listitem><para><ulink url='https://lwn.net/Articles/410200/'>trace-cmd: A front-end for Ftrace</ulink>
1998 </para></listitem>
1999 </itemizedlist>
2000 </para>
2001
2002 <para>
        There's more detailed documentation on kernelshark usage here:
2004 <ulink url='http://rostedt.homelinux.com/kernelshark/'>KernelShark</ulink>
2005 </para>
2006
2007 <para>
2008 An amusing yet useful README (a tracing mini-HOWTO) can be
2009 found in /sys/kernel/debug/tracing/README.
2010 </para>
2011 </section>
2012</section>
2013
2014<section id='profile-manual-systemtap'>
2015 <title>systemtap</title>
2016
2017 <para>
2018 SystemTap is a system-wide script-based tracing and profiling tool.
2019 </para>
2020
2021 <para>
2022 SystemTap scripts are C-like programs that are executed in the
2023 kernel to gather/print/aggregate data extracted from the context
2024 they end up being invoked under.
2025 </para>
2026
2027 <para>
2028 For example, this probe from the
2029 <ulink url='http://sourceware.org/systemtap/tutorial/'>SystemTap tutorial</ulink>
2030 simply prints a line every time any process on the system open()s
2031 a file. For each line, it prints the executable name of the
2032 program that opened the file, along with its PID, and the name
2033 of the file it opened (or tried to open), which it extracts
2034 from the open syscall's argstr.
2035 <literallayout class='monospaced'>
2036 probe syscall.open
2037 {
2038 printf ("%s(%d) open (%s)\n", execname(), pid(), argstr)
2039 }
2040
2041 probe timer.ms(4000) # after 4 seconds
2042 {
2043 exit ()
2044 }
2045 </literallayout>
2046 Normally, to execute this probe, you'd simply install
2047 systemtap on the system you want to probe and run the probe
2048 directly on that system. For example, assuming the file
2049 containing the above text is named trace_open.stp:
2050 <literallayout class='monospaced'>
2051 # stap trace_open.stp
2052 </literallayout>
2053 What systemtap does under the covers to run this probe is 1)
2054 parse and convert the probe to an equivalent 'C' form, 2)
2055 compile the 'C' form into a kernel module, 3) insert the
2056 module into the kernel, which arms it, and 4) collect the data
2057 generated by the probe and display it to the user.
2058 </para>
2059
2060 <para>
2061 In order to accomplish steps 1 and 2, the 'stap' program needs
2062 access to the kernel build system that produced the kernel
2063 that the probed system is running. In the case of a typical
2064 embedded system (the 'target'), the kernel build system
2065 unfortunately isn't typically part of the image running on
2066 the target. It is, however, normally available on the 'host' system
2067 that produced the target image; in such cases,
2068 steps 1 and 2 are executed on the host system, and steps
2069 3 and 4 are executed on the target system, using only the
2070 systemtap 'runtime'.
2071 </para>
2072
2073 <para>
2074 The systemtap support in Yocto assumes that only steps
2075 3 and 4 are run on the target; it is possible to do
2076 everything on the target, but this section assumes only
2077 the typical embedded use-case.
2078 </para>
2079
2080 <para>
2081 To run a systemtap script on the target, you therefore need
2082 to 1) compile the probe, on the host system, into a kernel
2083 module built for the kernel the target is running, 2)
2084 copy the module onto the target system, 3) insert the
2085 module into the target kernel, which arms it, and 4) collect
2086 the data generated by the probe and display it to the user.
2087 </para>
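<para>
    The four steps above can be sketched as a small shell function.
    This is only an illustration of the manual procedure under stated
    assumptions, not a description of what 'crosstap' itself does:
    the function name, the kernel build path, the architecture, and the
    cross-compiler prefix are hypothetical placeholders, and the sketch
    assumes 'stap' is installed on the host and 'staprun' on the target.
    Setting RUN=echo prints the commands instead of running them.
<literallayout class='monospaced'>
```shell
# Hypothetical defaults: adjust these for your own build and target.
KERNEL_BUILD=${KERNEL_BUILD:-/path/to/kernel/build}
TARGET_ARCH=${TARGET_ARCH:-i386}
CROSS_PREFIX=${CROSS_PREFIX:-i586-poky-linux-}
RUN=${RUN:-}    # set RUN=echo for a dry run that only prints the commands

cross_stap() {
    target=$1                           # e.g. root@192.168.7.2
    script=$2                           # e.g. trace_open.stp
    module=$(basename "$script" .stp)   # module name derived from the script

    # Steps 1 and 2 from the text, on the host: stop after pass 4 (compile),
    # which leaves $module.ko in the current directory.
    $RUN stap -p4 -r "$KERNEL_BUILD" -a "$TARGET_ARCH" \
        -B "CROSS_COMPILE=$CROSS_PREFIX" -m "$module" "$script" || return 1

    # Copy the compiled module into the login directory on the target.
    $RUN scp "$module.ko" "$target:" || return 1

    # Steps 3 and 4 from the text, on the target: insert the module and
    # show the probe output. The leading ./ makes staprun treat it as a path.
    $RUN ssh "$target" staprun "./$module.ko"
}
```
</literallayout>
    For example, "RUN=echo; cross_stap root@192.168.7.2 trace_open.stp"
    prints the three commands without touching the target.
    Note that the module name is derived from the script filename, so
    this sketch assumes a filename made up of lowercase letters, digits,
    and underscores.
</para>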
2088
2089 <section id='systemtap-setup'>
2090 <title>Setup</title>
2091
2092 <para>
2093 Those are a lot of steps and a lot of details, but
2094 fortunately Yocto includes a script called 'crosstap'
2095 that will take care of those details, allowing you to
2096 simply execute a systemtap script on the remote target,
2097 with arguments if necessary.
2098 </para>
2099
2100 <para>
2101 In order to do this from a remote host, however, you
2102 need to have access to the build for the image you
2103 booted. The 'crosstap' script provides details on how
2104 to do this if you run the script on the host without having
2105 done a build:
2106 <note>
2107 SystemTap, which uses 'crosstap', assumes you can establish an
2108 ssh connection to the remote target.
2109 Please refer to the crosstap wiki page for details on verifying
2110 ssh connections at
2111 <ulink url='https://wiki.yoctoproject.org/wiki/Tracing_and_Profiling#systemtap'></ulink>.
2112 Also, the ability to ssh into the target system is not enabled
2113 by default in *-minimal images.
2114 </note>
2115 <literallayout class='monospaced'>
2116 $ crosstap root@192.168.1.88 trace_open.stp
2117
2118 Error: No target kernel build found.
2119 Did you forget to create a local build of your image?
2120
2121 'crosstap' requires a local sdk build of the target system
2122 (or a build that includes 'tools-profile') in order to build
2123 kernel modules that can probe the target system.
2124
2125 Practically speaking, that means you need to do the following:
2126 - If you're running a pre-built image, download the release
2127 and/or BSP tarballs used to build the image.
2128 - If you're working from git sources, just clone the metadata
2129 and BSP layers needed to build the image you'll be booting.
2130 - Make sure you're properly set up to build a new image (see
2131 the BSP README and/or the widely available basic documentation
2132 that discusses how to build images).
2133 - Build an -sdk version of the image e.g.:
2134 $ bitbake core-image-sato-sdk
2135 OR
2136 - Build a non-sdk image but include the profiling tools:
2137 [ edit local.conf and add 'tools-profile' to the end of
2138 the EXTRA_IMAGE_FEATURES variable ]
2139 $ bitbake core-image-sato
2140
2141 Once you've build the image on the host system, you're ready to
2142 boot it (or the equivalent pre-built image) and use 'crosstap'
2143 to probe it (you need to source the environment as usual first):
2144
2145 $ source oe-init-build-env
2146 $ cd ~/my/systemtap/scripts
2147 $ crosstap root@192.168.1.xxx myscript.stp
2148 </literallayout>
2149 So essentially what you need to do is build an SDK image or
2150 image with 'tools-profile' as detailed in the
2151 "<link linkend='profile-manual-general-setup'>General Setup</link>"
2152 section of this manual, and boot the resulting target image.
2153 </para>
2154
2155 <note>
2156 If you have a build directory containing multiple machines,
2157 you need to have the MACHINE you're connecting to selected
2158 in local.conf, and the kernel in that machine's build
2159 directory must match the kernel on the booted system exactly,
2160 or you'll get the above 'crosstap' message when you try to
2161 invoke a script.
2162 </note>
2163 </section>
2164
2165 <section id='running-a-script-on-a-target'>
2166 <title>Running a Script on a Target</title>
2167
2168 <para>
2169 Once you've done that, you should be able to run a systemtap
2170 script on the target:
2171 <literallayout class='monospaced'>
2172 $ cd /path/to/yocto
2173 $ source oe-init-build-env
2174
2175 ### Shell environment set up for builds. ###
2176
2177 You can now run 'bitbake &lt;target&gt;'
2178
2179 Common targets are:
2180 core-image-minimal
2181 core-image-sato
2182 meta-toolchain
2183 meta-ide-support
2184
2185 You can also run generated qemu images with a command like 'runqemu qemux86-64'
2186
2187 </literallayout>
2188 Once you've done that, you can cd to whatever directory
2189 contains your scripts and use 'crosstap' to run the script:
2190 <literallayout class='monospaced'>
2191 $ cd /path/to/my/systemtap/script
2192 $ crosstap root@192.168.7.2 trace_open.stp
2193 </literallayout>
2194 If you get an error connecting to the target e.g.:
2195 <literallayout class='monospaced'>
2196 $ crosstap root@192.168.7.2 trace_open.stp
2197 error establishing ssh connection on remote 'root@192.168.7.2'
2198 </literallayout>
2199 Try ssh'ing to the target and see what happens:
2200 <literallayout class='monospaced'>
2201 $ ssh root@192.168.7.2
2202 </literallayout>
2203 A lot of the time, connection problems are due to specifying a
2204 wrong IP address or having a 'host key verification error'.
2205 </para>
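<para>
    The two checks suggested above can be wrapped in a pair of small
    helpers. This is a minimal sketch: the function names are
    hypothetical, and it assumes the standard OpenSSH 'ssh' and
    'ssh-keygen' clients on the host. Only forget a host key when you
    expect it to have changed, for example after reflashing the target
    with a new image. Setting RUN=echo prints the commands instead of
    running them.
<literallayout class='monospaced'>
```shell
RUN=${RUN:-}    # set RUN=echo for a dry run that only prints the commands

# Try a trivial remote command, giving up after 5 seconds if the
# address is wrong or the target is unreachable.
check_target() {
    target=$1                   # e.g. root@192.168.7.2
    $RUN ssh -o ConnectTimeout=5 "$target" true
}

# Remove the stale known_hosts entry that causes a
# 'host key verification' failure for this target.
forget_target_key() {
    target=$1
    host=${target#*@}           # strip the leading "user@" part, if any
    $RUN ssh-keygen -R "$host"
}
```
</literallayout>
    If "check_target root@192.168.7.2" fails with a host key error,
    running "forget_target_key root@192.168.7.2" and then retrying
    should let ssh prompt you to accept the new key.
</para>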
2206
2207 <para>
2208 If everything worked as planned, you should see something
2209 like this (enter the password when prompted, or press enter
2210 if it's set up to use no password):
2211 <literallayout class='monospaced'>
2212 $ crosstap root@192.168.7.2 trace_open.stp
2213 root@192.168.7.2's password:
2214 matchbox-termin(1036) open ("/tmp/vte3FS2LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)
2215 matchbox-termin(1036) open ("/tmp/vteJMC7LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)
2216 </literallayout>
2217 </para>
2218 </section>
2219
2220 <section id='systemtap-documentation'>
2221 <title>Documentation</title>
2222
2223 <para>
2224 The SystemTap language reference can be found here:
2225 <ulink url='http://sourceware.org/systemtap/langref/'>SystemTap Language Reference</ulink>
2226 </para>
2227
2228 <para>
2229 Links to other SystemTap documents, tutorials, and examples can be
2230 found here:
2231 <ulink url='http://sourceware.org/systemtap/documentation.html'>SystemTap documentation page</ulink>
2232 </para>
2233 </section>
2234</section>
2235
2236<section id='profile-manual-sysprof'>
2237 <title>Sysprof</title>
2238
2239 <para>
2240 Sysprof is a very easy to use system-wide profiler that consists
2241 of a single window with three panes and a few buttons which allow
2242 you to start, stop, and view the profile from one place.
2243 </para>
2244
2245 <section id='sysprof-setup'>
2246 <title>Setup</title>
2247
2248 <para>
2249 For this section, we'll assume you've already performed the
2250 basic setup outlined in the General Setup section.
2251 </para>
2252
2253 <para>
2254 Sysprof is a GUI-based application that runs on the target
2255 system. For the rest of this document we assume you've
2256 ssh'ed to the target and will be running Sysprof there
2257 (you can use the '-X' option to ssh and have the
2258 Sysprof GUI run on the target but display remotely on the
2259 host if you want).
2260 </para>
2261 </section>
2262
2263 <section id='sysprof-basic-usage'>
2264 <title>Basic Usage</title>
2265
2266 <para>
2267 To start profiling the system, you simply press the 'Start'
2268 button. To stop profiling and to start viewing the profile data
2269 in one easy step, press the 'Profile' button.
2270 </para>
2271
2272 <para>
2273 Once you've pressed the profile button, the three panes will
2274 fill up with profiling data:
2275 </para>
2276
2277 <para>
2278 <imagedata fileref="figures/sysprof-copy-to-user.png" width="6in" depth="4in" align="center" scalefit="1" />
2279 </para>
2280
2281 <para>
2282 The left pane shows a list of functions and processes.
2283 Selecting one of those expands that function in the right
2284 pane, showing all its callees. Note that this caller-oriented
2285 display is essentially the inverse of perf's default
2286 callee-oriented callchain display.
2287 </para>
2288
2289 <para>
2290 In the screenshot above, we're focusing on __copy_to_user_ll().
2291 Looking up the callchain, we can see that one of the callers
2292 of __copy_to_user_ll() is sys_read(), along with the complete callpath
2293 between them. Notice that this is essentially a portion of the
2294 same information we saw in the perf display shown in the perf
2295 section of this chapter.
2296 </para>
2297
2298 <para>
2299 <imagedata fileref="figures/sysprof-copy-from-user.png" width="6in" depth="4in" align="center" scalefit="1" />
2300 </para>
2301
2302 <para>
2303 Similarly, the above is a snapshot of the Sysprof display of a
2304 copy-from-user callchain.
2305 </para>
2306
2307 <para>
2308 Finally, looking at the third Sysprof pane in the lower left,
2309 we can see a list of all the callers of a particular function
2310 selected in the top left pane. In this case, the lower pane is
2311 showing all the callers of __mark_inode_dirty:
2312 </para>
2313
2314 <para>
2315 <imagedata fileref="figures/sysprof-callers.png" width="6in" depth="4in" align="center" scalefit="1" />
2316 </para>
2317
2318 <para>
2319 Double-clicking on one of those functions will in turn change the
2320 focus to the selected function, and so on.
2321 </para>
2322
2323 <informalexample>
2324 <emphasis>Tying it Together:</emphasis> If you like sysprof's 'caller-oriented'
2325 display, you may be able to approximate it in other tools as
2326 well. For example, 'perf report' has the -g (--call-graph)
2327 option that you can experiment with; one of the options is
2328 'caller' for an inverted caller-based callgraph display.
2329 </informalexample>
2330 </section>
2331
2332 <section id='sysprof-documentation'>
2333 <title>Documentation</title>
2334
2335 <para>
2336 There doesn't seem to be any documentation for Sysprof, but
2337 maybe that's because it's pretty self-explanatory.
2338 The Sysprof website, however, is here:
2339 <ulink url='http://sysprof.com/'>Sysprof, System-wide Performance Profiler for Linux</ulink>
2340 </para>
2341 </section>
2342</section>
2343
2344<section id='lttng-linux-trace-toolkit-next-generation'>
2345 <title>LTTng (Linux Trace Toolkit, next generation)</title>
2346
2347 <section id='lttng-setup'>
2348 <title>Setup</title>
2349
2350 <para>
2351 For this section, we'll assume you've already performed the
2352 basic setup outlined in the General Setup section.
2353 LTTng is run on the target system by ssh'ing to it.
2354 </para>
2355 </section>
2356
2357 <section id='collecting-and-viewing-traces'>
2358 <title>Collecting and Viewing Traces</title>
2359
2360 <para>
2361 Once you've built and booted your
2362 image (you need to build the core-image-sato-sdk image or use one of the
2363 other methods described in the General Setup section), you're
2364 ready to start tracing.
2365 </para>
2366
2367 <section id='collecting-and-viewing-a-trace-on-the-target-inside-a-shell'>
2368 <title>Collecting and viewing a trace on the target (inside a shell)</title>
2369
2370 <para>
2371 First, from the host, ssh to the target:
2372 <literallayout class='monospaced'>
2373 $ ssh -l root 192.168.1.47
2374 The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
2375 RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
2376 Are you sure you want to continue connecting (yes/no)? yes
2377 Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
2378 root@192.168.1.47's password:
2379 </literallayout>
2380 Once on the target, use these steps to create a trace:
2381 <literallayout class='monospaced'>
2382 root@crownbay:~# lttng create
2383 Spawning a session daemon
2384 Session auto-20121015-232120 created.
2385 Traces will be written in /home/root/lttng-traces/auto-20121015-232120
2386 </literallayout>
2387 Enable the events you want to trace (in this case all
2388 kernel events):
2389 <literallayout class='monospaced'>
2390 root@crownbay:~# lttng enable-event --kernel --all
2391 All kernel events are enabled in channel channel0
2392 </literallayout>
2393 Start the trace:
2394 <literallayout class='monospaced'>
2395 root@crownbay:~# lttng start
2396 Tracing started for session auto-20121015-232120
2397 </literallayout>
2398 And then stop the trace after a while or after running
2399 a particular workload that you want to trace:
2400 <literallayout class='monospaced'>
2401 root@crownbay:~# lttng stop
2402 Tracing stopped for session auto-20121015-232120
2403 </literallayout>
2404 You can now view the trace in text form on the target:
2405 <literallayout class='monospaced'>
2406 root@crownbay:~# lttng view
2407 [23:21:56.989270399] (+?.?????????) sys_geteuid: { 1 }, { }
2408 [23:21:56.989278081] (+0.000007682) exit_syscall: { 1 }, { ret = 0 }
2409 [23:21:56.989286043] (+0.000007962) sys_pipe: { 1 }, { fildes = 0xB77B9E8C }
2410 [23:21:56.989321802] (+0.000035759) exit_syscall: { 1 }, { ret = 0 }
2411 [23:21:56.989329345] (+0.000007543) sys_mmap_pgoff: { 1 }, { addr = 0x0, len = 10485760, prot = 3, flags = 131362, fd = 4294967295, pgoff = 0 }
2412 [23:21:56.989351694] (+0.000022349) exit_syscall: { 1 }, { ret = -1247805440 }
2413 [23:21:56.989432989] (+0.000081295) sys_clone: { 1 }, { clone_flags = 0x411, newsp = 0xB5EFFFE4, parent_tid = 0xFFFFFFFF, child_tid = 0x0 }
2414 [23:21:56.989477129] (+0.000044140) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 681660, vruntime = 43367983388 }
2415 [23:21:56.989486697] (+0.000009568) sched_migrate_task: { 1 }, { comm = "lttng-consumerd", tid = 1193, prio = 20, orig_cpu = 1, dest_cpu = 1 }
2416 [23:21:56.989508418] (+0.000021721) hrtimer_init: { 1 }, { hrtimer = 3970832076, clockid = 1, mode = 1 }
2417 [23:21:56.989770462] (+0.000262044) hrtimer_cancel: { 1 }, { hrtimer = 3993865440 }
2418 [23:21:56.989771580] (+0.000001118) hrtimer_cancel: { 0 }, { hrtimer = 3993812192 }
2419 [23:21:56.989776957] (+0.000005377) hrtimer_expire_entry: { 1 }, { hrtimer = 3993865440, now = 79815980007057, function = 3238465232 }
2420 [23:21:56.989778145] (+0.000001188) hrtimer_expire_entry: { 0 }, { hrtimer = 3993812192, now = 79815980008174, function = 3238465232 }
2421 [23:21:56.989791695] (+0.000013550) softirq_raise: { 1 }, { vec = 1 }
2422 [23:21:56.989795396] (+0.000003701) softirq_raise: { 0 }, { vec = 1 }
2423 [23:21:56.989800635] (+0.000005239) softirq_raise: { 0 }, { vec = 9 }
2424 [23:21:56.989807130] (+0.000006495) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 330710, vruntime = 43368314098 }
2425 [23:21:56.989809993] (+0.000002863) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 1015313, vruntime = 36976733240 }
2426 [23:21:56.989818514] (+0.000008521) hrtimer_expire_exit: { 0 }, { hrtimer = 3993812192 }
2427 [23:21:56.989819631] (+0.000001117) hrtimer_expire_exit: { 1 }, { hrtimer = 3993865440 }
2428 [23:21:56.989821866] (+0.000002235) hrtimer_start: { 0 }, { hrtimer = 3993812192, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
2429 [23:21:56.989822984] (+0.000001118) hrtimer_start: { 1 }, { hrtimer = 3993865440, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
2430 [23:21:56.989832762] (+0.000009778) softirq_entry: { 1 }, { vec = 1 }
2431 [23:21:56.989833879] (+0.000001117) softirq_entry: { 0 }, { vec = 1 }
2432 [23:21:56.989838069] (+0.000004190) timer_cancel: { 1 }, { timer = 3993871956 }
2433 [23:21:56.989839187] (+0.000001118) timer_cancel: { 0 }, { timer = 3993818708 }
2434 [23:21:56.989841492] (+0.000002305) timer_expire_entry: { 1 }, { timer = 3993871956, now = 79515980, function = 3238277552 }
2435 [23:21:56.989842819] (+0.000001327) timer_expire_entry: { 0 }, { timer = 3993818708, now = 79515980, function = 3238277552 }
2436 [23:21:56.989854831] (+0.000012012) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 49237, vruntime = 43368363335 }
2437 [23:21:56.989855949] (+0.000001118) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 45121, vruntime = 36976778361 }
2438 [23:21:56.989861257] (+0.000005308) sched_stat_sleep: { 1 }, { comm = "kworker/1:1", tid = 21, delay = 9451318 }
2439 [23:21:56.989862374] (+0.000001117) sched_stat_sleep: { 0 }, { comm = "kworker/0:0", tid = 4, delay = 9958820 }
2440 [23:21:56.989868241] (+0.000005867) sched_wakeup: { 0 }, { comm = "kworker/0:0", tid = 4, prio = 120, success = 1, target_cpu = 0 }
2441 [23:21:56.989869358] (+0.000001117) sched_wakeup: { 1 }, { comm = "kworker/1:1", tid = 21, prio = 120, success = 1, target_cpu = 1 }
2442 [23:21:56.989877460] (+0.000008102) timer_expire_exit: { 1 }, { timer = 3993871956 }
2443 [23:21:56.989878577] (+0.000001117) timer_expire_exit: { 0 }, { timer = 3993818708 }
2444 .
2445 .
2446 .
2447 </literallayout>
2448 You can now safely destroy the trace session (note that
2449 this doesn't delete the trace - it's still there
2450 in ~/lttng-traces):
2451 <literallayout class='monospaced'>
2452 root@crownbay:~# lttng destroy
2453 Session auto-20121015-232120 destroyed at /home/root
2454 </literallayout>
2455 Note that the trace is saved in a directory of the same
2456 name as returned by 'lttng create', under the ~/lttng-traces
2457 directory (note that you can change this by supplying your
2458 own name to 'lttng create'):
2459 <literallayout class='monospaced'>
2460 root@crownbay:~# ls -al ~/lttng-traces
2461 drwxrwx--- 3 root root 1024 Oct 15 23:21 .
2462 drwxr-xr-x 5 root root 1024 Oct 15 23:57 ..
2463 drwxrwx--- 3 root root 1024 Oct 15 23:21 auto-20121015-232120
2464 </literallayout>
2465 </para>
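<para>
    A full kernel trace is long, so it can help to summarize it before
    reading it line by line. The following is a minimal sketch that
    counts how often each event occurs. It assumes the text format
    shown above, where each line starts with a bracketed timestamp and
    the event name is the third whitespace-separated field followed by
    a colon; 'count_events' is a hypothetical helper name, not part of
    LTTng.
<literallayout class='monospaced'>
```shell
# Read 'lttng view' text output on stdin and print "count event-name"
# pairs, most frequent first.
count_events() {
    awk '$1 ~ /^\[/ {
             if (NF >= 3) {
                 name = $3
                 sub(/:$/, "", name)    # drop the trailing colon
                 count[name]++
             }
         }
         END { for (n in count) print count[n], n }' |
    LC_ALL=C sort -k1,1nr -k2,2
}
```
</literallayout>
    On the target, you could then run "lttng view | count_events | head"
    before destroying the session to see which events dominate the
    trace.
</para>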
2466 </section>
2467
2468 <section id='collecting-and-viewing-a-userspace-trace-on-the-target-inside-a-shell'>
2469 <title>Collecting and viewing a userspace trace on the target (inside a shell)</title>
2470
2471 <para>
2472 For LTTng userspace tracing, you need to have a properly
2473 instrumented userspace program. For this example, we'll use
2474 the 'hello' test program generated by the lttng-ust build.
2475 </para>
2476
2477 <para>
2478 The 'hello' test program isn't installed on the rootfs by
2479 the lttng-ust build, so we need to copy it over manually.
2480 First cd into the build directory that contains the hello
2481 executable:
2482 <literallayout class='monospaced'>
2483 $ cd build/tmp/work/core2_32-poky-linux/lttng-ust/2.0.5-r0/git/tests/hello/.libs
2484 </literallayout>
2485 Copy that over to the target machine:
2486 <literallayout class='monospaced'>
2487 $ scp hello root@192.168.1.20:
2488 </literallayout>
2489 You now have the instrumented lttng 'hello world' test
2490 program on the target, ready to test.
2491 </para>
2492
2493 <para>
2494 First, from the host, ssh to the target:
2495 <literallayout class='monospaced'>
2496 $ ssh -l root 192.168.1.47
2497 The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
2498 RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
2499 Are you sure you want to continue connecting (yes/no)? yes
2500 Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
2501 root@192.168.1.47's password:
2502 </literallayout>
2503 Once on the target, use these steps to create a trace:
2504 <literallayout class='monospaced'>
2505 root@crownbay:~# lttng create
2506 Session auto-20190303-021943 created.
2507 Traces will be written in /home/root/lttng-traces/auto-20190303-021943
2508 </literallayout>
2509 Enable the events you want to trace (in this case all
2510 userspace events):
2511 <literallayout class='monospaced'>
2512 root@crownbay:~# lttng enable-event --userspace --all
2513 All UST events are enabled in channel channel0
2514 </literallayout>
2515 Start the trace:
2516 <literallayout class='monospaced'>
2517 root@crownbay:~# lttng start
2518 Tracing started for session auto-20190303-021943
2519 </literallayout>
2520 Run the instrumented hello world program:
2521 <literallayout class='monospaced'>
2522 root@crownbay:~# ./hello
2523 Hello, World!
2524 Tracing... done.
2525 </literallayout>
2526 And then stop the trace after a while or after running a
2527 particular workload that you want to trace:
2528 <literallayout class='monospaced'>
2529 root@crownbay:~# lttng stop
2530 Tracing stopped for session auto-20190303-021943
2531 </literallayout>
2532 You can now view the trace in text form on the target:
2533 <literallayout class='monospaced'>
2534 root@crownbay:~# lttng view
2535 [02:31:14.906146544] (+?.?????????) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 0, intfield2 = 0x0, longfield = 0, netintfield = 0, netintfieldhex = 0x0, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
2536 [02:31:14.906170360] (+0.000023816) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 1, intfield2 = 0x1, longfield = 1, netintfield = 1, netintfieldhex = 0x1, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
2537 [02:31:14.906183140] (+0.000012780) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 2, intfield2 = 0x2, longfield = 2, netintfield = 2, netintfieldhex = 0x2, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
2538 [02:31:14.906194385] (+0.000011245) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 3, intfield2 = 0x3, longfield = 3, netintfield = 3, netintfieldhex = 0x3, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
2539 .
2540 .
2541 .
2542 </literallayout>
2543 You can now safely destroy the trace session (note that
2544 this doesn't delete the trace - it's still
2545 there in ~/lttng-traces):
2546 <literallayout class='monospaced'>
2547 root@crownbay:~# lttng destroy
2548 Session auto-20190303-021943 destroyed at /home/root
2549 </literallayout>
2550 </para>
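<para>
    If you repeat this sequence often, the create, enable, start, and
    stop steps above can be wrapped around a workload. This is only a
    sketch built from the commands already shown: 'trace_workload' and
    the session name are hypothetical, and setting LTTNG to
    'echo lttng' prints the lttng commands instead of running them.
<literallayout class='monospaced'>
```shell
LTTNG=${LTTNG:-lttng}   # set LTTNG='echo lttng' for a dry run

# usage: trace_workload SESSION COMMAND [ARGS...]
trace_workload() {
    session=$1
    shift
    $LTTNG create "$session" || return 1
    $LTTNG enable-event --userspace --all || return 1
    $LTTNG start || return 1
    "$@"                    # run the instrumented workload
    status=$?               # remember how the workload exited
    $LTTNG stop || return 1
    return $status
}
```
</literallayout>
    For example, "trace_workload hello-test ./hello" traces one run of
    the hello program. The session is left in place afterwards, so you
    can still use 'lttng view' to inspect it and 'lttng destroy' to
    clean up.
</para>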
2551 </section>
2552
2553 </section>
2554
2555 <section id='lltng-documentation'>
2556 <title>Documentation</title>
2557
2558 <para>
2559 You can find the primary LTTng Documentation on the
2560 <ulink url='https://lttng.org/docs/'>LTTng Documentation</ulink>
2561 site.
2562 The documentation on this site is appropriate for intermediate to
2563 advanced software developers who are working in a Linux environment
2564 and are interested in efficient software tracing.
2565 </para>
2566
2567 <para>
2568 For information on LTTng in general, visit the
2569 <ulink url='http://lttng.org/lttng2.0'>LTTng Project</ulink>
2570 site.
2571 You can find a "Getting Started" link on this site that takes
2572 you to an LTTng Quick Start.
2573 </para>
2574 </section>
2575</section>
2576
2577<section id='profile-manual-blktrace'>
2578 <title>blktrace</title>
2579
2580 <para>
2581 blktrace is a tool for tracing and reporting low-level disk I/O.
2582 blktrace provides the tracing half of the equation; its output can
2583 be piped into the blkparse program, which renders the data in a
2584 human-readable form and does some basic analysis.
2585 </para>
2586
2587 <section id='blktrace-setup'>
2588 <title>Setup</title>
2589
2590 <para>
2591 For this section, we'll assume you've already performed the
2592 basic setup outlined in the
2593 "<link linkend='profile-manual-general-setup'>General Setup</link>"
2594 section.
2595 </para>
2596
2597 <para>
2598 blktrace is an application that runs on the target system.
2599 You can run the entire blktrace and blkparse pipeline on the
2600 target, or you can run blktrace in 'listen' mode on the target
2601 and have blktrace and blkparse collect and analyze the data on
2602 the host (see the
2603 "<link linkend='using-blktrace-remotely'>Using blktrace Remotely</link>"
2604 section below).
2605 For the rest of this section we assume you've ssh'ed to the
2606 target and will be running blktrace there.
2607 </para>
2608 </section>
2609
2610 <section id='blktrace-basic-usage'>
2611 <title>Basic Usage</title>
2612
2613 <para>
2614 To record a trace, simply run the 'blktrace' command, giving it
2615 the name of the block device you want to trace activity on:
2616 <literallayout class='monospaced'>
2617 root@crownbay:~# blktrace /dev/sdc
2618 </literallayout>
2619 In another shell, execute a workload you want to trace.
2620 <literallayout class='monospaced'>
2621 root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
2622 Connecting to downloads.yoctoproject.org (140.211.169.59:80)
2623 linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
2624 </literallayout>
2625 Press Ctrl-C in the blktrace shell to stop the trace. It will
2626 display how many events were logged, along with the per-cpu file
2627 sizes (blktrace records traces in per-cpu kernel buffers and
2628 simply dumps them to userspace for blkparse to merge and sort
2629 later).
2630 <literallayout class='monospaced'>
2631 ^C=== sdc ===
2632 CPU 0: 7082 events, 332 KiB data
2633 CPU 1: 1578 events, 74 KiB data
2634 Total: 8660 events (dropped 0), 406 KiB data
2635 </literallayout>
2636 If you examine the files saved to disk, you see multiple files,
2637 one per CPU and with the device name as the first part of the
2638 filename:
2639 <literallayout class='monospaced'>
2640 root@crownbay:~# ls -al
2641 drwxr-xr-x 6 root root 1024 Oct 27 22:39 .
2642 drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
2643 -rw-r--r-- 1 root root 339938 Oct 27 22:40 sdc.blktrace.0
2644 -rw-r--r-- 1 root root 75753 Oct 27 22:40 sdc.blktrace.1
2645 </literallayout>
2646 To view the trace events, simply invoke 'blkparse' in the
2647 directory containing the trace files, giving it the device name
2648 that forms the first part of the filenames:
2649 <literallayout class='monospaced'>
2650 root@crownbay:~# blkparse sdc
2651
2652 8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8]
2653 8,32 1 2 0.000025213 1225 G WS 3417048 + 8 [jbd2/sdc-8]
2654 8,32 1 3 0.000033384 1225 P N [jbd2/sdc-8]
2655 8,32 1 4 0.000043301 1225 I WS 3417048 + 8 [jbd2/sdc-8]
2656 8,32 1 0 0.000057270 0 m N cfq1225 insert_request
2657 8,32 1 0 0.000064813 0 m N cfq1225 add_to_rr
2658 8,32 1 5 0.000076336 1225 U N [jbd2/sdc-8] 1
2659 8,32 1 0 0.000088559 0 m N cfq workload slice:150
2660 8,32 1 0 0.000097359 0 m N cfq1225 set_active wl_prio:0 wl_type:1
2661 8,32 1 0 0.000104063 0 m N cfq1225 Not idling. st->count:1
2662 8,32 1 0 0.000112584 0 m N cfq1225 fifo= (null)
2663 8,32 1 0 0.000118730 0 m N cfq1225 dispatch_insert
2664 8,32 1 0 0.000127390 0 m N cfq1225 dispatched a request
2665 8,32 1 0 0.000133536 0 m N cfq1225 activate rq, drv=1
2666 8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8]
2667 8,32 1 7 0.000360381 1225 Q WS 3417056 + 8 [jbd2/sdc-8]
2668 8,32 1 8 0.000377422 1225 G WS 3417056 + 8 [jbd2/sdc-8]
2669 8,32 1 9 0.000388876 1225 P N [jbd2/sdc-8]
2670 8,32 1 10 0.000397886 1225 Q WS 3417064 + 8 [jbd2/sdc-8]
2671 8,32 1 11 0.000404800 1225 M WS 3417064 + 8 [jbd2/sdc-8]
2672 8,32 1 12 0.000412343 1225 Q WS 3417072 + 8 [jbd2/sdc-8]
2673 8,32 1 13 0.000416533 1225 M WS 3417072 + 8 [jbd2/sdc-8]
2674 8,32 1 14 0.000422121 1225 Q WS 3417080 + 8 [jbd2/sdc-8]
2675 8,32 1 15 0.000425194 1225 M WS 3417080 + 8 [jbd2/sdc-8]
2676 8,32 1 16 0.000431968 1225 Q WS 3417088 + 8 [jbd2/sdc-8]
2677 8,32 1 17 0.000435251 1225 M WS 3417088 + 8 [jbd2/sdc-8]
2678 8,32 1 18 0.000440279 1225 Q WS 3417096 + 8 [jbd2/sdc-8]
2679 8,32 1 19 0.000443911 1225 M WS 3417096 + 8 [jbd2/sdc-8]
2680 8,32 1 20 0.000450336 1225 Q WS 3417104 + 8 [jbd2/sdc-8]
2681 8,32 1 21 0.000454038 1225 M WS 3417104 + 8 [jbd2/sdc-8]
2682 8,32 1 22 0.000462070 1225 Q WS 3417112 + 8 [jbd2/sdc-8]
2683 8,32 1 23 0.000465422 1225 M WS 3417112 + 8 [jbd2/sdc-8]
2684 8,32 1 24 0.000474222 1225 I WS 3417056 + 64 [jbd2/sdc-8]
2685 8,32 1 0 0.000483022 0 m N cfq1225 insert_request
2686 8,32 1 25 0.000489727 1225 U N [jbd2/sdc-8] 1
2687 8,32 1 0 0.000498457 0 m N cfq1225 Not idling. st->count:1
2688 8,32 1 0 0.000503765 0 m N cfq1225 dispatch_insert
2689 8,32 1 0 0.000512914 0 m N cfq1225 dispatched a request
2690 8,32 1 0 0.000518851 0 m N cfq1225 activate rq, drv=2
2691 .
2692 .
2693 .
2694 8,32 0 0 58.515006138 0 m N cfq3551 complete rqnoidle 1
2695 8,32 0 2024 58.516603269 3 C WS 3156992 + 16 [0]
2696 8,32 0 0 58.516626736 0 m N cfq3551 complete rqnoidle 1
2697 8,32 0 0 58.516634558 0 m N cfq3551 arm_idle: 8 group_idle: 0
2698 8,32 0 0 58.516636933 0 m N cfq schedule dispatch
2699 8,32 1 0 58.516971613 0 m N cfq3551 slice expired t=0
2700 8,32 1 0 58.516982089 0 m N cfq3551 sl_used=13 disp=6 charge=13 iops=0 sect=80
2701 8,32 1 0 58.516985511 0 m N cfq3551 del_from_rr
2702 8,32 1 0 58.516990819 0 m N cfq3551 put_queue
2703
2704 CPU0 (sdc):
2705 Reads Queued: 0, 0KiB Writes Queued: 331, 26,284KiB
2706 Read Dispatches: 0, 0KiB Write Dispatches: 485, 40,484KiB
2707 Reads Requeued: 0 Writes Requeued: 0
2708 Reads Completed: 0, 0KiB Writes Completed: 511, 41,000KiB
2709 Read Merges: 0, 0KiB Write Merges: 13, 160KiB
2710 Read depth: 0 Write depth: 2
2711 IO unplugs: 23 Timer unplugs: 0
2712 CPU1 (sdc):
2713 Reads Queued: 0, 0KiB Writes Queued: 249, 15,800KiB
2714 Read Dispatches: 0, 0KiB Write Dispatches: 42, 1,600KiB
2715 Reads Requeued: 0 Writes Requeued: 0
2716 Reads Completed: 0, 0KiB Writes Completed: 16, 1,084KiB
2717 Read Merges: 0, 0KiB Write Merges: 40, 276KiB
2718 Read depth: 0 Write depth: 2
2719 IO unplugs: 30 Timer unplugs: 1
2720
2721 Total (sdc):
2722 Reads Queued: 0, 0KiB Writes Queued: 580, 42,084KiB
2723 Read Dispatches: 0, 0KiB Write Dispatches: 527, 42,084KiB
2724 Reads Requeued: 0 Writes Requeued: 0
2725 Reads Completed: 0, 0KiB Writes Completed: 527, 42,084KiB
2726 Read Merges: 0, 0KiB Write Merges: 53, 436KiB
2727 IO unplugs: 53 Timer unplugs: 1
2728
2729 Throughput (R/W): 0KiB/s / 719KiB/s
2730 Events (sdc): 6,592 entries
2731 Skips: 0 forward (0 - 0.0%)
2732 Input file sdc.blktrace.0 added
2733 Input file sdc.blktrace.1 added
2734 </literallayout>
2735 The report shows each event that was found in the blktrace data,
2736 along with a summary of the overall block I/O traffic during
2737 the run. You can look at the
2738 <ulink url='http://linux.die.net/man/1/blkparse'>blkparse</ulink>
2739 manpage to learn the
2740 meaning of each field displayed in the trace listing.
2741 </para>
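            <para>
                As a rough guide, each event line in blkparse's default
                output format consists of the device (major,minor), the
                CPU, a sequence number, a timestamp in seconds, the PID,
                a one-letter action (for example Q for queued, D for
                issued to the driver, and C for completed), and the RWBS
                flags, usually followed by the starting sector, a '+',
                the request size in sectors, and the process name.
                The following is a minimal sketch, not part of the
                blktrace tools, that relies on that layout to tally
                events per action. The count_actions name is made up
                for illustration, and the field positions assume the
                default output format:
                <literallayout class='monospaced'>
```shell
# count_actions: tally blkparse event lines by action code (field 6).
# Summary lines are skipped because they do not start with "major,minor".
count_actions() {
    awk '$1 ~ /^[0-9]+,[0-9]+$/ { if (NF >= 7) n[$6]++ }
         END { for (a in n) print a, n[a] }' | LC_ALL=C sort
}
```
                </literallayout>
                With a saved trace, this could be used as, for example,
                'blkparse sdc | count_actions'.
            </para>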
2742
2743 <section id='blktrace-live-mode'>
2744 <title>Live Mode</title>
2745
2746 <para>
                blktrace and blkparse are designed to work together in a
                'pipe mode', in which the stdout of blktrace is fed
                directly into the stdin of blkparse:
                <literallayout class='monospaced'>
     root@crownbay:~# blktrace /dev/sdc -o - | blkparse -i -
                </literallayout>
                This enables long-lived tracing sessions to run without
                writing anything to disk. It also lets you look for
                certain conditions in the trace data in 'real time',
                either by watching the trace output as it scrolls by on
                the screen or by piping it to another program, such as
                grep, that can identify and capture conditions of
                interest.
2761 </para>
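            <para>
                As one example of passing the live output along to
                another program, the following sketch uses awk in place
                of grep to show only events whose request size is at
                least a given number of sectors. The big_requests name
                and the threshold are illustrative assumptions, and the
                field positions assume blkparse's default output format:
                <literallayout class='monospaced'>
```shell
# big_requests MIN: print only event lines of the form
# "... sector + size [process]" whose size is at least MIN sectors.
big_requests() {
    awk -v min="$1" '$9 == "+" { if ($10 + 0 >= min + 0) print }'
}
```
                </literallayout>
                It could be appended to the pipeline above, for example
                'blktrace /dev/sdc -o - | blkparse -i - | big_requests 64'.
            </para>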
2762
2763 <para>
                The blktrace tools also include the btrace command,
                which implements the above pipeline as a single command
                so that you don't have to type the full sequence:
                <literallayout class='monospaced'>
     root@crownbay:~# btrace /dev/sdc
                </literallayout>
2770 </para>
2771 </section>
2772
2773 <section id='using-blktrace-remotely'>
2774 <title>Using blktrace Remotely</title>
2775
2776 <para>
                blktrace traces block I/O, yet it normally writes its
                trace data to a block device. Because it's not a great
                idea to make the device being traced the same as the
                device the tracer writes to, blktrace provides native
                support for sending all trace data over the network,
                which lets you trace without perturbing the traced
                device at all.
2784 </para>
2785
2786 <para>
                To have blktrace operate in this mode, start blktrace in
                server mode with the -l option on the host system, which
                is the system that will store the captured data. In
                server mode, blktrace only collects data sent by
                clients, so no device is specified:
                <literallayout class='monospaced'>
     $ blktrace -l
     server: waiting for connections...
                </literallayout>
                On the target system being traced, start blktrace in
                client mode, using the -h option to connect to the host
                system and the -d option to name the device to trace:
                <literallayout class='monospaced'>
     root@crownbay:~# blktrace -d /dev/sdc -h 192.168.1.43
     blktrace: connecting to 192.168.1.43
     blktrace: connected!
                </literallayout>
                On the host system, you should see this:
                <literallayout class='monospaced'>
     server: connection from 192.168.1.43
                </literallayout>
                In another shell on the target, execute a workload you
                want to trace.
                <literallayout class='monospaced'>
     root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
                </literallayout>
                When it's done, do a Ctrl-C on the target system to
                stop the trace:
                <literallayout class='monospaced'>
     ^C=== sdc ===
     CPU 0: 7691 events, 361 KiB data
     CPU 1: 4109 events, 193 KiB data
     Total: 11800 events (dropped 0), 554 KiB data
                </literallayout>
                On the host system, you should also see a trace
                summary for the trace just ended:
2821 <literallayout class='monospaced'>
2822 server: end of run for 192.168.1.43:sdc
2823 === sdc ===
2824 CPU 0: 7691 events, 361 KiB data
2825 CPU 1: 4109 events, 193 KiB data
2826 Total: 11800 events (dropped 0), 554 KiB data
2827 </literallayout>
2828 The blktrace instance on the host will save the target
2829 output inside a hostname-timestamp directory:
2830 <literallayout class='monospaced'>
2831 $ ls -al
2832 drwxr-xr-x 10 root root 1024 Oct 28 02:40 .
2833 drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
2834 drwxr-xr-x 2 root root 1024 Oct 28 02:40 192.168.1.43-2012-10-28-02:40:56
2835 </literallayout>
2836 cd into that directory to see the output files:
2837 <literallayout class='monospaced'>
2838 $ ls -l
2839 -rw-r--r-- 1 root root 369193 Oct 28 02:44 sdc.blktrace.0
2840 -rw-r--r-- 1 root root 197278 Oct 28 02:44 sdc.blktrace.1
2841 </literallayout>
2842 And run blkparse on the host system using the device name:
2843 <literallayout class='monospaced'>
2844 $ blkparse sdc
2845
2846 8,32 1 1 0.000000000 1263 Q RM 6016 + 8 [ls]
2847 8,32 1 0 0.000036038 0 m N cfq1263 alloced
2848 8,32 1 2 0.000039390 1263 G RM 6016 + 8 [ls]
2849 8,32 1 3 0.000049168 1263 I RM 6016 + 8 [ls]
2850 8,32 1 0 0.000056152 0 m N cfq1263 insert_request
2851 8,32 1 0 0.000061600 0 m N cfq1263 add_to_rr
2852 8,32 1 0 0.000075498 0 m N cfq workload slice:300
2853 .
2854 .
2855 .
2856 8,32 0 0 177.266385696 0 m N cfq1267 arm_idle: 8 group_idle: 0
2857 8,32 0 0 177.266388140 0 m N cfq schedule dispatch
2858 8,32 1 0 177.266679239 0 m N cfq1267 slice expired t=0
2859 8,32 1 0 177.266689297 0 m N cfq1267 sl_used=9 disp=6 charge=9 iops=0 sect=56
2860 8,32 1 0 177.266692649 0 m N cfq1267 del_from_rr
2861 8,32 1 0 177.266696560 0 m N cfq1267 put_queue
2862
2863 CPU0 (sdc):
2864 Reads Queued: 0, 0KiB Writes Queued: 270, 21,708KiB
2865 Read Dispatches: 59, 2,628KiB Write Dispatches: 495, 39,964KiB
2866 Reads Requeued: 0 Writes Requeued: 0
2867 Reads Completed: 90, 2,752KiB Writes Completed: 543, 41,596KiB
2868 Read Merges: 0, 0KiB Write Merges: 9, 344KiB
2869 Read depth: 2 Write depth: 2
2870 IO unplugs: 20 Timer unplugs: 1
2871 CPU1 (sdc):
2872 Reads Queued: 688, 2,752KiB Writes Queued: 381, 20,652KiB
2873 Read Dispatches: 31, 124KiB Write Dispatches: 59, 2,396KiB
2874 Reads Requeued: 0 Writes Requeued: 0
2875 Reads Completed: 0, 0KiB Writes Completed: 11, 764KiB
2876 Read Merges: 598, 2,392KiB Write Merges: 88, 448KiB
2877 Read depth: 2 Write depth: 2
2878 IO unplugs: 52 Timer unplugs: 0
2879
2880 Total (sdc):
2881 Reads Queued: 688, 2,752KiB Writes Queued: 651, 42,360KiB
2882 Read Dispatches: 90, 2,752KiB Write Dispatches: 554, 42,360KiB
2883 Reads Requeued: 0 Writes Requeued: 0
2884 Reads Completed: 90, 2,752KiB Writes Completed: 554, 42,360KiB
2885 Read Merges: 598, 2,392KiB Write Merges: 97, 792KiB
2886 IO unplugs: 72 Timer unplugs: 1
2887
2888 Throughput (R/W): 15KiB/s / 238KiB/s
2889 Events (sdc): 9,301 entries
2890 Skips: 0 forward (0 - 0.0%)
2891 </literallayout>
2892 You should see the trace events and summary just as
2893 you would have if you'd run the same command on the target.
2894 </para>
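            <para>
                As a simple sanity check on a saved copy of the
                blktrace summary, you can verify that the per-CPU
                event counts add up to the reported total. The
                following is a minimal sketch that assumes the summary
                format shown above; check_event_totals is a hypothetical
                helper name, not a blktrace command:
                <literallayout class='monospaced'>
```shell
# check_event_totals: read a blktrace summary on stdin, print the sum of
# the per-CPU event counts and the reported total, and return success
# only if the two numbers match.
check_event_totals() {
    awk '$1 == "CPU" { if ($4 == "events,") sum += $3 }
         $1 == "Total:" { total = $2 }
         END { print sum + 0, total + 0; exit (sum == total ? 0 : 1) }'
}
```
                </literallayout>
                For the summary shown above, the two numbers printed
                are both 11800 (7691 + 4109).
            </para>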
2895 </section>
2896
2897 <section id='tracing-block-io-via-ftrace'>
2898 <title>Tracing Block I/O via 'ftrace'</title>
2899
2900 <para>
                It's also possible to trace block I/O using only the
                <link linkend='the-trace-events-subsystem'>trace events subsystem</link>,
                which can be useful for casual tracing
                if you don't want to bother dealing with the userspace tools.
2905 </para>
2906
2907 <para>
                To enable tracing for a given device, write 1 to
                /sys/block/xxx/trace/enable, where xxx is the device name.
                For example, the following enables tracing for /dev/sdc:
                <literallayout class='monospaced'>
     root@crownbay:/sys/kernel/debug/tracing# echo 1 > /sys/block/sdc/trace/enable
                </literallayout>
                Once you've enabled the device(s) you want to trace,
                select the 'blk' tracer to turn block tracing on:
2916 <literallayout class='monospaced'>
2917 root@crownbay:/sys/kernel/debug/tracing# cat available_tracers
2918 blk function_graph function nop
2919
2920 root@crownbay:/sys/kernel/debug/tracing# echo blk > current_tracer
2921 </literallayout>
2922 Execute the workload you're interested in:
2923 <literallayout class='monospaced'>
2924 root@crownbay:/sys/kernel/debug/tracing# cat /media/sdc/testfile.txt
2925 </literallayout>
                Then look at the output. Note that we read
                'trace_pipe' rather than 'trace' here, which lets us
                wait on the pipe for data to appear:
2930 <literallayout class='monospaced'>
2931 root@crownbay:/sys/kernel/debug/tracing# cat trace_pipe
2932 cat-3587 [001] d..1 3023.276361: 8,32 Q R 1699848 + 8 [cat]
2933 cat-3587 [001] d..1 3023.276410: 8,32 m N cfq3587 alloced
2934 cat-3587 [001] d..1 3023.276415: 8,32 G R 1699848 + 8 [cat]
2935 cat-3587 [001] d..1 3023.276424: 8,32 P N [cat]
2936 cat-3587 [001] d..2 3023.276432: 8,32 I R 1699848 + 8 [cat]
2937 cat-3587 [001] d..1 3023.276439: 8,32 m N cfq3587 insert_request
2938 cat-3587 [001] d..1 3023.276445: 8,32 m N cfq3587 add_to_rr
2939 cat-3587 [001] d..2 3023.276454: 8,32 U N [cat] 1
2940 cat-3587 [001] d..1 3023.276464: 8,32 m N cfq workload slice:150
2941 cat-3587 [001] d..1 3023.276471: 8,32 m N cfq3587 set_active wl_prio:0 wl_type:2
2942 cat-3587 [001] d..1 3023.276478: 8,32 m N cfq3587 fifo= (null)
2943 cat-3587 [001] d..1 3023.276483: 8,32 m N cfq3587 dispatch_insert
2944 cat-3587 [001] d..1 3023.276490: 8,32 m N cfq3587 dispatched a request
2945 cat-3587 [001] d..1 3023.276497: 8,32 m N cfq3587 activate rq, drv=1
2946 cat-3587 [001] d..2 3023.276500: 8,32 D R 1699848 + 8 [cat]
2947 </literallayout>
2948 And this turns off tracing for the specified device:
2949 <literallayout class='monospaced'>
2950 root@crownbay:/sys/kernel/debug/tracing# echo 0 > /sys/block/sdc/trace/enable
2951 </literallayout>
2952 </para>
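            <para>
                The enable and disable steps above can be wrapped in a
                small helper. This is a sketch only: the
                blk_trace_switch name and its optional third argument
                (an alternative sysfs root, which makes it possible to
                try the function out without touching a real device)
                are assumptions made for illustration:
                <literallayout class='monospaced'>
```shell
# blk_trace_switch DEV VALUE [SYSFS_ROOT]
# Write VALUE (1 = on, 0 = off) to DEV's block trace enable file.
# SYSFS_ROOT defaults to /sys.
blk_trace_switch() {
    dev=$1
    val=$2
    root=${3:-/sys}
    f="$root/block/$dev/trace/enable"
    # Fail if the control file is missing or not writable.
    if [ ! -w "$f" ]; then
        return 1
    fi
    echo "$val" > "$f"
}
```
                </literallayout>
                For example, 'blk_trace_switch sdc 1' would correspond
                to the enable step shown above, and
                'blk_trace_switch sdc 0' to the disable step.
            </para>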
2953 </section>
2954 </section>
2955
2956 <section id='blktrace-documentation'>
2957 <title>Documentation</title>
2958
2959 <para>
2960 Online versions of the man pages for the commands discussed
2961 in this section can be found here:
2962 <itemizedlist>
2963 <listitem><para><ulink url='http://linux.die.net/man/8/blktrace'>http://linux.die.net/man/8/blktrace</ulink>
2964 </para></listitem>
2965 <listitem><para><ulink url='http://linux.die.net/man/1/blkparse'>http://linux.die.net/man/1/blkparse</ulink>
2966 </para></listitem>
2967 <listitem><para><ulink url='http://linux.die.net/man/8/btrace'>http://linux.die.net/man/8/btrace</ulink>
2968 </para></listitem>
2969 </itemizedlist>
2970 </para>
2971
2972 <para>
            The above manpages, along with manpages for the other
            blktrace utilities (btt, blkiomon, etc.), can be found in the
            /doc directory of the blktrace tools git repo:
2976 <literallayout class='monospaced'>
2977 $ git clone git://git.kernel.dk/blktrace.git
2978 </literallayout>
2979 </para>
2980 </section>
2981</section>
2982</chapter>
2983<!--
2984vim: expandtab tw=80 ts=4
2985-->