
wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

</literallayout>
The quickest and easiest way to get some basic overall data about
what's going on for a particular workload is to profile it using
'perf stat'. 'perf stat' profiles using a few default counters
and displays the summed counts at the end of the run:
<literallayout class='monospaced'>
     root@crownbay:~# perf stat wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |***************************************************| 41727k  0:00:00 ETA

     Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

          4597.223902 task-clock                #    0.077 CPUs utilized
                23568 context-switches          #    0.005 M/sec
                   68 CPU-migrations            #    0.015 K/sec
                  241 page-faults               #    0.052 K/sec
           3045817293 cycles                    #    0.663 GHz
      <not supported> stalled-cycles-frontend
      <not supported> stalled-cycles-backend
            858909167 instructions              #    0.28  insns per cycle
            165441165 branches                  #   35.987 M/sec
             19550329 branch-misses             #   11.82% of all branches

         59.836627620 seconds time elapsed
</literallayout>
Many times such a simple-minded test doesn't yield much of
interest, but sometimes it does (see Real-world Yocto bug
(slow loop-mounted write speed)).

</para>
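<para>
The figures perf prints after the '#' are derived from the raw
counts, and it can help to see how. The following is a minimal
sketch, not part of perf itself, that recomputes a few of them
from the numbers in the run above; the function name
derived_metrics() is made up for illustration.
</para>

```python
# Counts copied from the 'perf stat' output above.
task_clock_ms = 4597.223902      # task-clock, in milliseconds
elapsed_s = 59.836627620         # wall-clock seconds for the whole run
cycles = 3045817293
instructions = 858909167
branches = 165441165
branch_misses = 19550329
context_switches = 23568


def derived_metrics():
    """Recompute the values perf prints after the '#' from the raw counts."""
    task_clock_s = task_clock_ms / 1000.0
    return {
        # CPU time consumed divided by wall-clock time.
        "cpus_utilized": task_clock_s / elapsed_s,
        # Cycles per second of CPU time, expressed in GHz.
        "ghz": cycles / task_clock_s / 1e9,
        "insns_per_cycle": instructions / cycles,
        "branch_miss_pct": 100.0 * branch_misses / branches,
        "context_switches_per_sec": context_switches / task_clock_s,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name:>26}: {value:.3f}")
```

<para>
Note that the rates are relative to the task-clock (CPU time used
by the workload), not to the elapsed wall-clock time, which is why
a download that spends most of its time waiting on the network
shows only a small fraction of a CPU utilized.
</para>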

<para>

Also, note that 'perf stat' isn't restricted to a fixed set of
counters - basically any event listed in the output of 'perf list'
can be tallied by 'perf stat'. For example, suppose we wanted to
see a summary of all the events related to kernel memory
allocation/freeing along with cache hits and misses:
<literallayout class='monospaced'>
     root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |***************************************************| 41727k  0:00:00 ETA

     Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

                 5566 kmem:kmalloc
               125517 kmem:kmem_cache_alloc
                    0 kmem:kmalloc_node
                    0 kmem:kmem_cache_alloc_node
                34401 kmem:kfree
                69920 kmem:kmem_cache_free
                  133 kmem:mm_page_free
                   41 kmem:mm_page_free_batched
                11502 kmem:mm_page_alloc
                11375 kmem:mm_page_alloc_zone_locked
                    0 kmem:mm_page_pcpu_drain
                    0 kmem:mm_page_alloc_extfrag
             66848602 cache-references
              2917740 cache-misses              #    4.365 % of all cache refs

         44.831023415 seconds time elapsed
</literallayout>
So 'perf stat' gives us a nice easy way to get a quick overview of
what might be happening for a set of events, but normally we'd
need more detail to understand what's going on well enough to
act on it.

</para>
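<para>
If we want to post-process such a summary rather than just read
it, the counter lines are easy to pick apart. The following is a
rough sketch, assuming the plain-text layout shown above (an
integer count followed by an event name); parse_perf_stat() is a
hypothetical helper, not something perf provides, and it
deliberately ignores non-integer lines such as task-clock and the
elapsed time.
</para>

```python
import re

# One counter line: an integer count, an event name, an optional '# comment'.
LINE_RE = re.compile(r"^\s*([\d,]+)\s+(\S+)(?:\s+#.*)?$")


def parse_perf_stat(text):
    """Return {event_name: count} for the integer counter lines in the text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts


# A few lines copied from the run above.
SAMPLE = """\
      5566 kmem:kmalloc
    125517 kmem:kmem_cache_alloc
     34401 kmem:kfree
     69920 kmem:kmem_cache_free
  66848602 cache-references
   2917740 cache-misses              #    4.365 % of all cache refs

  44.831023415 seconds time elapsed
"""

counts = parse_perf_stat(SAMPLE)
miss_pct = 100.0 * counts["cache-misses"] / counts["cache-references"]
print(f"cache miss rate: {miss_pct:.3f} %")
```

<para>
With the counts in a dictionary, comparisons that perf doesn't
print for us, such as allocations versus frees, become simple
arithmetic.
</para>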

<para>

To dive down into the next level of detail, we can use 'perf
record'/'perf report', which will collect profiling data and
present it to us using an interactive text-based UI (or
simply as text if we specify --stdio to 'perf report').

</para>

<para>

As our first attempt at profiling this workload, we'll simply
run 'perf record', handing it the workload we want to profile
(everything after 'perf record' and any perf options we hand
it - here none - will be executed in a new shell). perf collects
samples until the process exits and records them in a file named
'perf.data' in the current working directory.
<literallayout class='monospaced'>
     root@crownbay:~# perf record wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |************************************************| 41727k  0:00:00 ETA
     [ perf record: Woken up 1 times to write data ]
     [ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ]
</literallayout>
To see the results in a 'text-based UI' (tui), simply run
'perf report', which will read the perf.data file in the current
working directory and display the results in an interactive UI:
<literallayout class='monospaced'>
     root@crownbay:~# perf report
</literallayout>

</para>

<para>

</para>

<para>

The above screenshot displays a 'flat' profile, one entry for
each 'bucket' corresponding to the functions that were profiled
during the profiling run, ordered from the most popular to the
least (perf has options to sort in various orders and keys as
well as display entries only above a certain threshold and so
on - see the perf documentation for details). Note that this
includes both userspace functions (entries containing a [.]) and
kernel functions accounted to the process (entries containing
a [k]). (perf has command-line modifiers that can be used to
restrict the profiling to kernel or userspace, among others.)

</para>
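<para>
Conceptually, a flat profile is just a tally of samples per
function, sorted by count. The following minimal sketch shows the
idea; the sample list and symbol names are invented for
illustration and flat_profile() is a hypothetical helper, not
how perf is implemented internally.
</para>

```python
from collections import Counter


def flat_profile(samples):
    """Return [(symbol, percent)] ordered from most to least sampled."""
    buckets = Counter(samples)
    total = sum(buckets.values())
    if total == 0:
        return []
    return [(sym, 100.0 * n / total) for sym, n in buckets.most_common()]


# Hypothetical samples: each entry is the function that was executing
# when a sample was taken ([k] = kernel, [.] = userspace).
samples = ["[k] func_a"] * 5 + ["[.] func_b"] * 3 + ["[k] func_c"] * 2

for symbol, pct in flat_profile(samples):
    print(f"{pct:6.2f}%  {symbol}")
```

<para>
Each distinct symbol is one 'bucket'; the percentage is simply that
bucket's share of all the samples collected.
</para>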

<para>

Notice also that the above report shows an entry for 'busybox',
which is the executable that implements 'wget' in Yocto, but that
instead of a useful function name, that entry displays
a not-so-friendly hex value. The steps below will show
how to fix that problem.

</para>

<para>

Before we do that, however, let's try running a different profile,
one which shows something a little more interesting. The only
difference between the new profile and the previous one is that
we'll add the -g option, which will record not just the address
of a sampled function, but the entire callchain to the sampled
function as well:
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |************************************************| 41727k  0:00:00 ETA
     [ perf record: Woken up 3 times to write data ]
     [ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ]


     root@crownbay:~# perf report
</literallayout>

</para>

<para>

</para>

<para>

Using the callgraph view, we can actually see not only which
functions took the most time, but we can also see a summary of
how those functions were called and learn something about how the
program interacts with the kernel in the process.

</para>

<para>

Notice that each entry in the above screenshot now contains a '+'
on the left-hand side. This means that we can expand the entry and
drill down into the callchains that feed into that entry.
Pressing 'enter' on any one of them will expand the callchain
(you can also press 'E' to expand them all at the same time or 'C'
to collapse them all).

</para>

<para>

In the screenshot above, we've toggled the __copy_to_user_ll()
entry and several subnodes all the way down. This lets us see
which callchains contributed to the profiled __copy_to_user_ll()
function, which accounts for 1.77% of the total profile.

</para>
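<para>
To make the idea of 'callchains that feed into an entry' concrete,
here is a small sketch that takes samples recorded with their
callchains and reports what share of a given function's samples
arrived via each chain. The chains and counts below are invented
for illustration, and caller_breakdown() is a hypothetical helper
rather than anything perf exposes.
</para>

```python
from collections import Counter


def caller_breakdown(samples, leaf):
    """For samples whose innermost function is `leaf`, return the
    percentage contributed by each distinct callchain.

    Each sample is a tuple of function names, outermost caller first.
    """
    chains = Counter(c for c in samples if c and c[-1] == leaf)
    total = sum(chains.values())
    if total == 0:
        return {}
    return {chain: 100.0 * n / total for chain, n in chains.items()}


# Hypothetical callchains, not taken from the profile above.
samples = (
    [("sys_read", "tcp_recvmsg", "__copy_to_user_ll")] * 3
    + [("sys_clock_gettime", "__copy_to_user_ll")] * 1
    + [("sys_write", "__copy_from_user_ll")] * 4
)

for chain, pct in caller_breakdown(samples, "__copy_to_user_ll").items():
    print(f"{pct:6.2f}%  {' -> '.join(chain)}")
```

<para>
The percentages shown when expanding an entry in 'perf report' can
be read in much the same way: as shares of that entry's samples,
broken down by how the function was reached.
</para>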

<para>

As a bit of background explanation for these callchains, think
about what happens at a high level when you run wget to get a file
out on the network. Basically what happens is that the data comes
into the kernel via the network connection (socket) and is passed
to the userspace program 'wget' (which is actually a part of
busybox, but that's not important for now), which takes the buffers
the kernel passes to it and writes them to a disk file to save them.

</para>

<para>

The part of this process that we're looking at in the above call
stacks is the part where the kernel passes the data it's read from
the socket down to wget, i.e. a copy-to-user.

</para>

<para>

Notice also that there's a case here where a hex value
is displayed in the callstack, in the expanded
sys_clock_gettime() function. Later we'll see it resolve to a
userspace function call in busybox.

</para>

<para>

</para>

<para>

The above screenshot shows the other half of the journey for the
data - from the wget program's userspace buffers to disk. To get
the buffers to disk, the wget program issues a write(2), which
does a copy-from-user to the kernel, which then takes care, via
some circuitous path (probably also present somewhere in the
profile data), to get it safely to disk.

</para>

<para>

Now that we've seen the basic layout of the profile data and the
basics of how to extract useful information out of it, let's get
back to the task at hand and see if we can get some basic idea
about where the time is spent in the program we're profiling,
wget. Remember that wget is actually implemented as an applet
in busybox, so while the process name is 'wget', the executable
we're actually interested in is busybox. So let's expand the
first entry containing busybox:

</para>

<para>

</para>

<para>

Again, before we expanded we saw that the function was labeled
with a hex value instead of a symbol as with most of the kernel
entries. Expanding the busybox entry doesn't make it any better.

</para>

<para>

The problem is that perf can't find the symbol information for the
busybox binary, which is actually stripped out by the Yocto build
system.

</para>

<para>

One way around that is to put the following in your local.conf
when you build the image:
<literallayout class='monospaced'>
     INHIBIT_PACKAGE_STRIP = "1"
</literallayout>
However, we already have an image with the binaries stripped,
so what can we do to get perf to resolve the symbols? Basically
we need to install the debuginfo for the busybox package.

</para>

<para>

To generate the debug info for the packages in the image, we can
add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
<literallayout class='monospaced'>
     EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"
</literallayout>
Additionally, in order to generate the type of debuginfo that
perf understands, we also need to add the following to local.conf:
<literallayout class='monospaced'>
     PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
</literallayout>
Once we've done that, we can install the debuginfo for busybox.
The debug packages once built can be found in
build/tmp/deploy/rpm/* on the host system. Find the
busybox-dbg-...rpm file and copy it to the target. For example:
<literallayout class='monospaced'>
     [trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2_32/busybox-dbg-1.20.2-r2.core2_32.rpm root@192.168.1.31:
     root@192.168.1.31's password:
     busybox-dbg-1.20.2-r2.core2_32.rpm                     100% 1826KB   1.8MB/s   00:01
</literallayout>
Now install the debug rpm on the target:
<literallayout class='monospaced'>
     root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2_32.rpm
</literallayout>
Now that the debuginfo is installed, we see that the busybox
entries now display their functions symbolically:

</para>

<para>

</para>

<para>

If we expand one of the entries and press 'enter' on a leaf node,
we're presented with a menu of actions we can take to get more
information related to that entry:

</para>

<para>

</para>

<para>

One of these actions allows us to display a busybox-centric view
of the profiled functions (in this case we've
also expanded all the nodes using the 'E' key):

</para>

<para>

</para>

<para>

Finally, we can see that now that the busybox debuginfo is
installed, the unresolved symbol in the
sys_clock_gettime() entry mentioned earlier is now resolved,
and shows that the sys_clock_gettime system call that was the
source of 6.75% of the copy-to-user overhead was initiated by
the handle_input() busybox function:

</para>

<para>

</para>

<para>

At the lowest level of detail, we can dive down to the assembly
level and see which instructions caused the most overhead in a
function. Pressing 'enter' on the 'udhcpc_main' function, we're
again presented with a menu:

</para>

<para>

</para>

<para>

Selecting 'Annotate udhcpc_main', we get a detailed listing of
percentages by instruction for the udhcpc_main function. From the
display, we can see that over 50% of the time spent in this
function is taken up by a couple of tests and the move of a
constant (1) to a register:

</para>

<para>

</para>

<para>

As a segue into tracing, let's try another profile using a
different counter, something other than the default 'cycles'.

</para>

<para>

The tracing and profiling infrastructure in Linux has become
unified in a way that allows us to use the same tool with a
completely different set of counters, not just the standard
hardware counters that traditional tools have had to restrict
themselves to (of course the traditional tools can also make use
of the expanded possibilities now available to them, and in some
cases have, as mentioned previously).

</para>

<para>

We can get a list of the available events that can be used to
profile a workload via 'perf list':
<literallayout class='monospaced'>
     root@crownbay:~# perf list

     List of pre-defined events (to be used in -e):
      cpu-cycles OR cycles [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend [Hardware event]
      instructions [Hardware event]
      cache-references [Hardware event]
      cache-misses [Hardware event]
      branch-instructions OR branches [Hardware event]
      branch-misses [Hardware event]
      bus-cycles [Hardware event]
      ref-cycles [Hardware event]

      cpu-clock [Software event]
      task-clock [Software event]
      page-faults OR faults [Software event]
      minor-faults [Software event]
      major-faults [Software event]
      context-switches OR cs [Software event]
      cpu-migrations OR migrations [Software event]
      alignment-faults [Software event]
      emulation-faults [Software event]

      L1-dcache-loads [Hardware cache event]
      L1-dcache-load-misses [Hardware cache event]
      L1-dcache-prefetch-misses [Hardware cache event]
      L1-icache-loads [Hardware cache event]
      L1-icache-load-misses [Hardware cache event]
      .
      .
      .
      rNNN [Raw hardware event descriptor]
      cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
       (see 'perf list --help' on how to encode it)

      mem:<addr>[:access] [Hardware breakpoint]

      sunrpc:rpc_call_status [Tracepoint event]
      sunrpc:rpc_bind_status [Tracepoint event]
      sunrpc:rpc_connect_status [Tracepoint event]
      sunrpc:rpc_task_begin [Tracepoint event]
      skb:kfree_skb [Tracepoint event]
      skb:consume_skb [Tracepoint event]
      skb:skb_copy_datagram_iovec [Tracepoint event]
      net:net_dev_xmit [Tracepoint event]
      net:net_dev_queue [Tracepoint event]
      net:netif_receive_skb [Tracepoint event]
      net:netif_rx [Tracepoint event]
      napi:napi_poll [Tracepoint event]
      sock:sock_rcvqueue_full [Tracepoint event]
      sock:sock_exceed_buf_limit [Tracepoint event]
      udp:udp_fail_queue_rcv_skb [Tracepoint event]
      hda:hda_send_cmd [Tracepoint event]
      hda:hda_get_response [Tracepoint event]
      hda:hda_bus_reset [Tracepoint event]
      scsi:scsi_dispatch_cmd_start [Tracepoint event]
      scsi:scsi_dispatch_cmd_error [Tracepoint event]
      scsi:scsi_eh_wakeup [Tracepoint event]
      drm:drm_vblank_event [Tracepoint event]
      drm:drm_vblank_event_queued [Tracepoint event]
      drm:drm_vblank_event_delivered [Tracepoint event]
      random:mix_pool_bytes [Tracepoint event]
      random:mix_pool_bytes_nolock [Tracepoint event]
      random:credit_entropy_bits [Tracepoint event]
      gpio:gpio_direction [Tracepoint event]
      gpio:gpio_value [Tracepoint event]
      block:block_rq_abort [Tracepoint event]
      block:block_rq_requeue [Tracepoint event]
      block:block_rq_issue [Tracepoint event]
      block:block_bio_bounce [Tracepoint event]
      block:block_bio_complete [Tracepoint event]
      block:block_bio_backmerge [Tracepoint event]
      .
      .
      writeback:writeback_wake_thread [Tracepoint event]
      writeback:writeback_wake_forker_thread [Tracepoint event]
      writeback:writeback_bdi_register [Tracepoint event]
      .
      .
      writeback:writeback_single_inode_requeue [Tracepoint event]
      writeback:writeback_single_inode [Tracepoint event]
      kmem:kmalloc [Tracepoint event]
      kmem:kmem_cache_alloc [Tracepoint event]
      kmem:mm_page_alloc [Tracepoint event]
      kmem:mm_page_alloc_zone_locked [Tracepoint event]
      kmem:mm_page_pcpu_drain [Tracepoint event]
      kmem:mm_page_alloc_extfrag [Tracepoint event]
      vmscan:mm_vmscan_kswapd_sleep [Tracepoint event]
      vmscan:mm_vmscan_kswapd_wake [Tracepoint event]
      vmscan:mm_vmscan_wakeup_kswapd [Tracepoint event]
      vmscan:mm_vmscan_direct_reclaim_begin [Tracepoint event]
      .
      .
      module:module_get [Tracepoint event]
      module:module_put [Tracepoint event]
      module:module_request [Tracepoint event]
      sched:sched_kthread_stop [Tracepoint event]
      sched:sched_wakeup [Tracepoint event]
      sched:sched_wakeup_new [Tracepoint event]
      sched:sched_process_fork [Tracepoint event]
      sched:sched_process_exec [Tracepoint event]
      sched:sched_stat_runtime [Tracepoint event]
      rcu:rcu_utilization [Tracepoint event]
      workqueue:workqueue_queue_work [Tracepoint event]
      workqueue:workqueue_execute_end [Tracepoint event]
      signal:signal_generate [Tracepoint event]
      signal:signal_deliver [Tracepoint event]
      timer:timer_init [Tracepoint event]
      timer:timer_start [Tracepoint event]
      timer:hrtimer_cancel [Tracepoint event]
      timer:itimer_state [Tracepoint event]
      timer:itimer_expire [Tracepoint event]
      irq:irq_handler_entry [Tracepoint event]
      irq:irq_handler_exit [Tracepoint event]
      irq:softirq_entry [Tracepoint event]
      irq:softirq_exit [Tracepoint event]
      irq:softirq_raise [Tracepoint event]
      printk:console [Tracepoint event]
      task:task_newtask [Tracepoint event]
      task:task_rename [Tracepoint event]
      syscalls:sys_enter_socketcall [Tracepoint event]
      syscalls:sys_exit_socketcall [Tracepoint event]
      .
      .
      .
      syscalls:sys_enter_unshare [Tracepoint event]
      syscalls:sys_exit_unshare [Tracepoint event]
      raw_syscalls:sys_enter [Tracepoint event]
      raw_syscalls:sys_exit [Tracepoint event]
</literallayout>

</para>

<informalexample>
<emphasis>Tying it Together:</emphasis> These are exactly the same set of events defined
by the trace event subsystem and exposed by
ftrace/trace-cmd/kernelshark as files in
/sys/kernel/debug/tracing/events, by SystemTap as
kernel.trace("tracepoint_name") and (partially) accessed by LTTng.
</informalexample>

<para>

Only a subset of these would be of interest to us when looking at
this workload, so let's choose the most likely subsystems
(identified by the string before the colon in the Tracepoint events)
and do a 'perf stat' run using only those wildcarded subsystems:
<literallayout class='monospaced'>
     root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
     Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

                23323 skb:kfree_skb
                    0 skb:consume_skb
                49897 skb:skb_copy_datagram_iovec
                 6217 net:net_dev_xmit
                 6217 net:net_dev_queue
                 7962 net:netif_receive_skb
                    2 net:netif_rx
                 8340 napi:napi_poll
                    0 sched:sched_kthread_stop
                    0 sched:sched_kthread_stop_ret
                 3749 sched:sched_wakeup
                    0 sched:sched_wakeup_new
                    0 sched:sched_switch
                   29 sched:sched_migrate_task
                    0 sched:sched_process_free
                    1 sched:sched_process_exit
                    0 sched:sched_wait_task
                    0 sched:sched_process_wait
                    0 sched:sched_process_fork
                    1 sched:sched_process_exec
                    0 sched:sched_stat_wait
        2106519415641 sched:sched_stat_sleep
                    0 sched:sched_stat_iowait
            147453613 sched:sched_stat_blocked
          12903026955 sched:sched_stat_runtime
                    0 sched:sched_pi_setprio
                 3574 workqueue:workqueue_queue_work
                 3574 workqueue:workqueue_activate_work
                    0 workqueue:workqueue_execute_start
                    0 workqueue:workqueue_execute_end
                16631 irq:irq_handler_entry
                16631 irq:irq_handler_exit
                28521 irq:softirq_entry
                28521 irq:softirq_exit
                28728 irq:softirq_raise
                    1 syscalls:sys_enter_sendmmsg
                    1 syscalls:sys_exit_sendmmsg
                    0 syscalls:sys_enter_recvmmsg
                    0 syscalls:sys_exit_recvmmsg
                   14 syscalls:sys_enter_socketcall
                   14 syscalls:sys_exit_socketcall
                      .
                      .
                      .
                16965 syscalls:sys_enter_read
                16965 syscalls:sys_exit_read
                12854 syscalls:sys_enter_write
                12854 syscalls:sys_exit_write
                      .
                      .
                      .

         58.029710972 seconds time elapsed
</literallayout>
Let's pick one of these tracepoints and tell perf to do a profile
using it as the sampling event:
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g -e sched:sched_wakeup wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
</literallayout>

</para>
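<para>
Choosing subsystems by hand works fine for a handful of them. As a
small sketch of the same idea in code, the following groups
tracepoint names by the string before the colon and builds the
wildcarded '-e' arguments. The helper names are made up for
illustration, and the input is assumed to contain only names taken
from the '[Tracepoint event]' rows of 'perf list'.
</para>

```python
def subsystems(event_names):
    """Return the sorted tracepoint subsystems (the text before the colon)."""
    return sorted({name.split(":", 1)[0] for name in event_names if ":" in name})


def wildcard_args(event_names):
    """Build '-e subsystem:*' arguments, one pair per subsystem."""
    args = []
    for subsystem in subsystems(event_names):
        args += ["-e", f"{subsystem}:*"]
    return args


# A few names from the listing above; 'cycles' has no colon and is skipped.
events = ["skb:kfree_skb", "skb:consume_skb", "net:netif_rx",
          "napi:napi_poll", "cycles"]

print(" ".join(["perf", "stat"] + wildcard_args(events)))
```

<para>
When pasting the resulting command into a shell, the wildcards may
need quoting so that the shell doesn't try to expand them against
files in the current directory.
</para>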

<para>

</para>

<para>

The screenshot above shows the results of running a profile using
the sched:sched_wakeup tracepoint, which shows the relative costs of
various paths to sched_wakeup (note that sched_wakeup is the
name of the tracepoint - it's actually defined just inside
ttwu_do_wakeup(), which accounts for the function name actually
displayed in the profile):
<literallayout class='monospaced'>
     /*
      * Mark the task runnable and perform wakeup-preemption.
      */
     static void
     ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
     {
          trace_sched_wakeup(p, true);
          .
          .
          .
     }
</literallayout>
A couple of the more interesting callchains are expanded and
displayed above, basically some network receive paths that
presumably end up waking up wget (busybox) when network data is
ready.

</para>

<para>

Note that because tracepoints are normally used for tracing,
the default sampling period for tracepoints is 1, i.e. for
tracepoints perf will sample on every event occurrence (this
can be changed using the -c option). This is in contrast to
hardware counters such as, for example, the default 'cycles'
hardware counter used for normal profiling, where sampling
periods are much higher (in the thousands) because profiling should
have as low an overhead as possible and sampling on every cycle
would be prohibitively expensive.

</para>
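<para>
The effect of the sampling period is easy to estimate: with a
period of N, roughly one sample is recorded for every N
occurrences of the event. The sketch below applies that to counts
from the earlier runs. The period of 100000 used for 'cycles' is
an arbitrary assumption for illustration; perf may instead choose
the period dynamically to hit a target sampling frequency.
</para>

```python
def expected_samples(event_count, period):
    """Approximate samples recorded when sampling once every `period` events."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return event_count // period


# sched:sched_wakeup fired 3749 times in the 'perf stat' run above;
# with the tracepoint default period of 1, every occurrence is sampled.
print(expected_samples(3749, 1))

# The 'cycles' count from the first 'perf stat' run, with an assumed
# (hypothetical) period of 100000 cycles per sample.
print(expected_samples(3045817293, 100000))
```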

</section>

<title>Using perf to do Basic Tracing</title>

<para>
Profiling is a great tool for solving many problems or for
getting a high-level view of what's going on with a workload or
across the system. It is however by definition an approximation,
as suggested by the most prominent word associated with it,
'sampling'. On the one hand, it allows a representative picture of
what's going on in the system to be cheaply taken, but on the other
hand, that cheapness limits its utility when that data suggests a
need to 'dive down' more deeply to discover what's really going
on. In such cases, the only way to see what's really going on is
to be able to look at (or summarize more intelligently) the
individual steps that go into the higher-level behavior exposed
by the coarse-grained profiling data.

</para>

<para>

As a concrete example, we can trace all the events we think might
be applicable to our workload:
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* \
      -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
      wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
</literallayout>
We can look at the raw trace output using 'perf script' with no
arguments:
<literallayout class='monospaced'>
     root@crownbay:~# perf script

     perf 1262 [000] 11624.857082: sys_exit_read: 0x0
     perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000
     wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER]
     wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER]
     wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER]
     wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
     wget 1262 [001] 11624.858177: sys_exit_read: 0x200
     wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
     wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308
     wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER]
     wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER]
     wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER]
     wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
     wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
     wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
     wget 1262 [001] 11624.859233: sys_exit_read: 0x0
     wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200
     wget 1262 [001] 11624.859584: sys_exit_read: 0x200
     wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
     wget 1262 [001] 11624.859888: sys_exit_read: 0x400
     wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
     wget 1262 [001] 11624.859944: sys_exit_read: 0x400
</literallayout>
This gives us a detailed timestamped sequence of the traced events
that occurred within the workload.

</para>
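<para>
Because every event carries a timestamp, even the plain-text output
can be mined for things a profile can't show. As a rough sketch,
the following pairs each sys_enter_read with the following
sys_exit_read from the same pid and reports the time spent in each
read, in microseconds. It assumes the line layout shown above (and
a command name without spaces); read_latencies_us() is a
hypothetical helper, not a perf feature.
</para>

```python
import re

# comm, pid, [cpu], seconds.microseconds, then the event name.
EVENT_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<secs>\d+)\.(?P<usecs>\d{6}):\s+(?P<event>\w+):"
)


def read_latencies_us(lines):
    """Pair each sys_enter_read with the next sys_exit_read of the same pid."""
    pending = {}     # pid -> timestamp (in us) of the outstanding read entry
    latencies = []
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue
        pid = int(match["pid"])
        # Integer microseconds avoid floating-point rounding of timestamps.
        stamp = int(match["secs"]) * 1_000_000 + int(match["usecs"])
        if match["event"] == "sys_enter_read":
            pending[pid] = stamp
        elif match["event"] == "sys_exit_read" and pid in pending:
            latencies.append(stamp - pending.pop(pid))
    return latencies


# Lines copied from the 'perf script' output above.
TRACE = """\
wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
wget 1262 [001] 11624.858177: sys_exit_read: 0x200
wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
""".splitlines()

print(read_latencies_us(TRACE))
```

<para>
An exit event with no matching entry, such as the very first line
of the trace above, is simply skipped.
</para>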

<para>

In many ways, profiling can be viewed as a subset of tracing -
theoretically, if you have a set of trace events that's sufficient
to capture all the important aspects of a workload, you can derive
any of the results or views that a profiling run can.

</para>

<para>

Another aspect of traditional profiling is that while powerful in
many ways, it's limited by the granularity of the underlying data.
Profiling tools offer various ways of sorting and presenting the
sample data, which make it much more useful and amenable to user
experimentation, but in the end it can't be used in an open-ended
way to extract data that just isn't present as a consequence of
the fact that conceptually, most of it has been thrown away.

</para>

<para>

Full-blown detailed tracing data does however offer the opportunity
to manipulate and present the information collected during a
tracing run in an infinite variety of ways.

</para>

<para>

Another way to look at it is that there are only so many ways that
the 'primitive' counters can be used on their own to generate
interesting output; to get anything more complicated than simple
counts requires some amount of additional logic, which is typically
very specific to the problem at hand. For example, if we wanted to
make use of a 'counter' that maps to the value of the time
difference between when a process was scheduled to run on a
processor and the time it actually ran, we wouldn't expect such
a counter to exist on its own, but we could derive one called say
'wakeup_latency' and use it to extract a useful view of that metric
from trace data. Likewise, we really can't figure out from standard
profiling tools how much data every process on the system reads and
writes, along with how many of those reads and writes fail
completely. If we have sufficient trace data, however, we could
with the right tools easily extract and present that information,
but we'd need something other than pre-canned profiling tools to
do that.

</para>
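<para>
To make the 'wakeup_latency' idea concrete, here is a minimal
sketch of how such a derived metric could be computed from a
stream of scheduler events. The event tuples are invented for
illustration (timestamps in nanoseconds, and a pid that is the task
being woken for sched_wakeup or the task being switched in for
sched_switch); real trace records carry more fields than this.
</para>

```python
def wakeup_latencies(events):
    """Derive (pid, latency_ns) pairs from (timestamp_ns, name, pid) tuples.

    Latency is the time from a task's sched_wakeup to the sched_switch
    that actually puts it on a CPU.
    """
    woken_at = {}
    latencies = []
    for stamp, name, pid in sorted(events):
        if name == "sched_wakeup":
            # Keep only the first wakeup if several arrive before the switch.
            woken_at.setdefault(pid, stamp)
        elif name == "sched_switch" and pid in woken_at:
            latencies.append((pid, stamp - woken_at.pop(pid)))
    return latencies


# Hypothetical events, not taken from the trace above.
events = [
    (1000, "sched_wakeup", 1262),
    (1400, "sched_wakeup", 6),
    (1750, "sched_switch", 1262),
    (2100, "sched_switch", 6),
    (2500, "sched_switch", 1262),   # no pending wakeup, so it is ignored
]

for pid, latency in wakeup_latencies(events):
    print(f"pid {pid}: woke up after {latency} ns")
```

<para>
No single counter holds this value; it only exists once two
different events have been correlated, which is exactly the kind
of problem-specific logic that trace data makes possible.
</para>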

<para>

Luckily, there is a general-purpose way to handle such needs,
called 'programming languages'. Making programming languages
easily available to apply to such problems given the specific
format of data is called a 'programming language binding' for
that data and language. Perf supports two programming language
bindings, one for Python and one for Perl.

</para>
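<para>
As a taste of what the Python binding makes possible, the sketch
below tallies the bytes read per command and counts failed reads,
using a handler named after the event it services. The handler
signature is an assumption modelled on the event fields perf
reports for sys_exit_read (nr and ret); newer perf versions may
pass additional arguments. So that the example runs on its own,
the handler is called by hand at the bottom rather than by perf.
</para>

```python
from collections import defaultdict

# comm -> [bytes successfully read, number of failed reads]
read_totals = defaultdict(lambda: [0, 0])


def syscalls__sys_exit_read(event_name, context, common_cpu,
                            common_secs, common_nsecs, common_pid,
                            common_comm, nr, ret):
    """Tally read(2) results; a negative return value counts as a failure."""
    if ret < 0:
        read_totals[common_comm][1] += 1
    else:
        read_totals[common_comm][0] += ret


def trace_end():
    """Print the per-command summary once all events have been handled."""
    for comm, (nbytes, failed) in sorted(read_totals.items()):
        print(f"{comm}: {nbytes} bytes read, {failed} failed reads")


# Stand-in for perf driving the handlers: a few made-up return values.
for ret in (512, 471, 0, -11):
    syscalls__sys_exit_read("syscalls__sys_exit_read", None, 1,
                            11624, 858177924, 1262, "wget", 3, ret)
trace_end()
```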

<informalexample>
<emphasis>Tying it Together:</emphasis> Language bindings for manipulating and
aggregating trace data are of course not a new
idea. One of the first projects to do this was IBM's DProbes
dpcc compiler, an ANSI C compiler which targeted a low-level
assembly language running on an in-kernel interpreter on the
target system. This is exactly analogous to what Sun's DTrace
did, except that DTrace invented its own language for the purpose.
SystemTap, heavily inspired by DTrace, also created its own
one-off language, but rather than running the product on an
in-kernel interpreter, created an elaborate compiler-based
machinery to translate its language into kernel modules written
in C.
</informalexample>

<para>

Now that we have the trace data in perf.data, we can use

856

'perf script -g' to generate a skeleton script with handlers

857

for the read/write entry/exit events we recorded:

858

859

root@crownbay:~# perf script -g python

860

generated Python script: perf-script.py

861

</literallayout>

862

The skeleton script simply creates a python function for each

863

event type in the perf.data file. The body of each function simply

864

prints the event name along with its parameters. For example:

865

866

def net__netif_rx(event_name, context, common_cpu,

867

common_secs, common_nsecs, common_pid, common_comm,

868

skbaddr, len, name):

869

print_header(event_name, common_cpu, common_secs, common_nsecs,

870

common_pid, common_comm)

871

872

print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),

873

</literallayout>

874

We can run that script directly to print all of the events
contained in the perf.data file:
<literallayout class='monospaced'>
     root@crownbay:~# perf script -s perf-script.py

     in trace_begin
     syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0
     sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0
     irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER
     irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER
     irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER
     syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512
     syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512
     skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0
     skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0
     irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER
     irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER
     irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER
     syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024
     syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471
     syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024
     syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0
     syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512
     syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512
     syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024
     syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024
     syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024
     syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
</literallayout>

903

That in itself isn't very useful; after all, we can accomplish
pretty much the same thing by simply running 'perf script'
without arguments in the same directory as the perf.data file.

</para>

<para>

We can, however, replace the print statements in the generated
function bodies with whatever we want, and thereby make the
script far more useful.

</para>

<para>

As a simple example, let's just replace the print statements in
the function bodies with a simple function that does nothing but
increment a per-event count. When the program is run against a
perf.data file, each time a particular event is encountered,
a tally is incremented for that event. For example:
<literallayout class='monospaced'>
     def net__netif_rx(event_name, context, common_cpu,
            common_secs, common_nsecs, common_pid, common_comm,
            skbaddr, len, name):
                inc_counts(event_name)
</literallayout>

926

Each event handler function in the generated code is modified
to do this. For convenience, we define a common function called
inc_counts() that each handler calls; inc_counts() simply tallies
a count for each event using the 'counts' hash, which is a
specialized dictionary that does Perl-like autovivification, a
capability that's extremely useful for the kinds of multi-level
aggregation commonly used in processing traces (see perf's
documentation on the Python language binding for details):
<literallayout class='monospaced'>
     counts = autodict()

     def inc_counts(event_name):
            try:
                    counts[event_name] += 1
            except TypeError:
                    counts[event_name] = 1
</literallayout>

943

Finally, at the end of the trace processing run, we want to
print the result of all the per-event tallies. For that, we
use the special 'trace_end()' function:
<literallayout class='monospaced'>
     def trace_end():
            for event_name, count in counts.iteritems():
                    print "%-40s %10s\n" % (event_name, count)
</literallayout>

951

The end result is a summary of all the events recorded in the
trace:
<literallayout class='monospaced'>
     skb__skb_copy_datagram_iovec                  13148
     irq__softirq_entry                             4796
     irq__irq_handler_exit                          3805
     irq__softirq_exit                              4795
     syscalls__sys_enter_write                      8990
     net__net_dev_xmit                               652
     skb__kfree_skb                                 4047
     sched__sched_wakeup                            1155
     irq__irq_handler_entry                         3804
     irq__softirq_raise                             4799
     net__net_dev_queue                              652
     syscalls__sys_enter_read                      17599
     net__netif_receive_skb                         1743
     syscalls__sys_exit_read                       17598
     net__netif_rx                                     2
     napi__napi_poll                                1877
     syscalls__sys_exit_write                       8990
</literallayout>

972

Note that this is pretty much exactly the same information we get
from 'perf stat', which goes a little way to support the idea
mentioned previously that given the right kind of trace data,
higher-level profiling-type summaries can be derived from it.

</para>
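<para>
The tally described above relies on helpers and syntax from the
Python 2 era (autodict, iteritems() and the print statement). As a
minimal sketch of the same idea, assuming a Python 3 interpreter and
substituting the standard library's collections.Counter for autodict
(a swap made here purely for illustration, not what 'perf script -g'
generates), the handlers could look something like the following. The
handler names and argument lists mirror the skeleton and the event
fields shown earlier; format_counts() is a hypothetical helper
introduced only so that the summary can be inspected without running perf.
</para>

```python
from collections import Counter

# Stand-in for perf's autodict(): a Counter starts every key at zero,
# so no try/except is needed when incrementing.
counts = Counter()


def inc_counts(event_name):
    """Tally one occurrence of the named event."""
    counts[event_name] += 1


def net__netif_rx(event_name, context, common_cpu,
                  common_secs, common_nsecs, common_pid, common_comm,
                  skbaddr, len, name):
    inc_counts(event_name)


def syscalls__sys_enter_read(event_name, context, common_cpu,
                             common_secs, common_nsecs, common_pid,
                             common_comm, nr, fd, buf, count):
    inc_counts(event_name)


def format_counts():
    """Hypothetical helper: one formatted line per event, busiest first."""
    return ["%-40s %10d" % (name, n) for name, n in counts.most_common()]


def trace_end():
    # perf calls trace_end() once, after the last event has been processed.
    for line in format_counts():
        print(line)
```

<para>
Using a Counter avoids the try/except around the first increment, at
the cost of losing autodict's multi-level autovivification, so this
variant only suits flat, single-level tallies like the one above.
</para>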

<para>

For more details, see the documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.

</para>

</section>

<title>System-Wide Tracing and Profiling</title>

<para>
The examples so far have focused on tracing a particular program or
workload - in other words, every profiling run has specified the
program to profile on the command line, e.g. 'perf record wget ...'.

</para>

<para>

It's also possible, and more interesting in many cases, to run a
system-wide profile or trace while running the workload in a
separate shell.

</para>

<para>

To do system-wide profiling or tracing, you typically use
the -a flag to 'perf record'.

</para>

<para>

To demonstrate this, open up one window and start the profile
using the -a flag (press Ctrl-C to stop tracing):
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g -a
     ^C[ perf record: Woken up 6 times to write data ]
     [ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ]
</literallayout>
In another window, run the wget test:
<literallayout class='monospaced'>
     root@crownbay:~# wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
     Connecting to downloads.yoctoproject.org (140.211.169.59:80)
     linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
</literallayout>

1018

Here we see entries not only for our wget load, but for other
processes running on the system as well:

</para>

<para>

</para>

<para>

In the snapshot above, we can see callchains that originate in
libc, and a callchain from Xorg that demonstrates that we're
using a proprietary X driver in userspace (notice the presence
of 'PVR' and some other unresolvable symbols in the expanded
Xorg callchain).

</para>

<para>

Note also that we have both kernel and userspace entries in the
above snapshot. We can also tell perf to focus on userspace by
providing a modifier, in this case 'u', to the 'cycles' hardware
counter when we record a profile:
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g -a -e cycles:u
     ^C[ perf record: Woken up 2 times to write data ]
     [ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ]
</literallayout>

</para>

<para>

</para>

<para>

Notice that in the screenshot above, we see only userspace entries ([.]).

</para>

<para>

Finally, we can press 'enter' on a leaf node and select the 'Zoom
into DSO' menu item to show only entries associated with a
specific DSO. In the screenshot below, we've zoomed into the
'libc' DSO which shows all the entries associated with the
libc-xxx.so DSO.

</para>

<para>

</para>

<para>

We can also use the system-wide -a switch to do system-wide
tracing. Here we'll trace a couple of scheduler events:
<literallayout class='monospaced'>
     root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup
     ^C[ perf record: Woken up 38 times to write data ]
     [ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ]
</literallayout>
We can look at the raw output using 'perf script' with no
arguments:
<literallayout class='monospaced'>
     root@crownbay:~# perf script

     perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
     kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
     swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
     swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
     kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
     perf 1383 [001] 6171.470039: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
     kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
     perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
</literallayout>

</para>

<title>Filtering</title>

<para>
Notice that there are a lot of events that don't really have
anything to do with what we're interested in, namely events
that schedule 'perf' itself in and out or that wake perf up.
We can get rid of those by using the '--filter' option -
for each event we specify using -e, we can follow it with a
--filter expression that keeps only the trace events whose
fields satisfy the given condition:

1103

1104

<literallayout class='monospaced'>
     root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf &amp;&amp; prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf'
     ^C[ perf record: Woken up 38 times to write data ]
     [ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ]


     root@crownbay:~# perf script

     swapper 0 [000] 7932.162180: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
     kworker/0:3 1209 [000] 7932.162236: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
     perf 1407 [001] 7932.170048: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1407 [001] 7932.180044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1407 [001] 7932.190038: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1407 [001] 7932.200044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1407 [001] 7932.210044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     perf 1407 [001] 7932.220044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     swapper 0 [001] 7932.230111: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
     swapper 0 [001] 7932.230146: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
     kworker/1:1 21 [001] 7932.230205: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
     swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
     kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
</literallayout>

1126

In this case, we've filtered out all events that have 'perf'
in their 'comm', 'prev_comm' or 'next_comm' fields. Notice
that there are still events recorded for perf, but that
those events don't have values of 'perf' for the filtered
fields. To completely filter out anything from perf will
require a bit more work, but for the purpose of demonstrating
how to use filters, it's close enough.

</para>
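<para>
One possible way to do that "bit more work" is to post-process the
text that 'perf script' prints. The following is only a sketch under
stated assumptions: it treats the first whitespace-separated token
of each line as the task name (which breaks for task names containing
spaces), and it looks for the 'comm', 'prev_comm' and 'next_comm'
fields in the textual form shown above. The function names
mentions_comm() and drop_comm() are made up for this illustration
and are not part of perf.
</para>

```python
import re

# Matches comm=, prev_comm= and next_comm= fields in a 'perf script' text line.
COMM_FIELD = re.compile(r"\b(?:prev_|next_)?comm=(\S+)")


def mentions_comm(line, comm="perf"):
    """Return True if the line was emitted by, or refers to, the given task."""
    parts = line.split()
    if not parts:
        return False
    if parts[0] == comm:          # the task that was running when the event fired
        return True
    return comm in COMM_FIELD.findall(line)


def drop_comm(lines, comm="perf"):
    """Keep only the lines that have nothing to do with the given task."""
    return [line for line in lines if not mentions_comm(line, comm)]
```

<para>
Note that this kind of filtering happens in user space after the
events have already been recorded, so unlike '--filter' it does
nothing to reduce the volume of data the kernel has to deliver.
</para>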

<informalexample>
<emphasis>Tying it Together:</emphasis> These are exactly the same set of event
filters defined by the trace event subsystem. See the
ftrace/tracecmd/kernelshark section for more discussion about
these event filters.
</informalexample>

<informalexample>
<emphasis>Tying it Together:</emphasis> These event filters are implemented by a
special-purpose pseudo-interpreter in the kernel and are an
integral and indispensable part of the perf design as it
relates to tracing. Kernel-based event filters provide a
mechanism to precisely throttle the event stream that appears
in user space, where it makes sense to provide bindings to real
programming languages for postprocessing the event stream.
This architecture allows for the intelligent and flexible
partitioning of processing between the kernel and user space.
Contrast this with other tools such as SystemTap, which does
all of its processing in the kernel and as such requires a
special project-defined language in order to accommodate that
design, or LTTng, where everything is sent to userspace and
as such requires a super-efficient kernel-to-userspace
transport mechanism in order to function properly. While
perf certainly can benefit from, for instance, advances in
the design of the transport, it doesn't fundamentally depend
on them. Basically, if you find that your perf tracing
application is causing buffer I/O overruns, it probably
means that you aren't taking enough advantage of the
kernel filtering engine.
</informalexample>

</section>

</section>

<title>Using Dynamic Tracepoints</title>

<para>
perf isn't restricted to the fixed set of static tracepoints
listed by 'perf list'. Users can also add their own 'dynamic'
tracepoints anywhere in the kernel. For instance, suppose we
want to define our own tracepoint on do_fork(). We can do that
using the 'perf probe' perf subcommand:
<literallayout class='monospaced'>
     root@crownbay:~# perf probe do_fork
     Added new event:
       probe:do_fork        (on do_fork)

     You can now use it in all perf tools, such as:

       perf record -e probe:do_fork -aR sleep 1
</literallayout>

1186

Adding a new tracepoint via 'perf probe' results in an event
with all the expected files and format in
/sys/kernel/debug/tracing/events, just the same as for static
tracepoints (as discussed in more detail in the trace events
subsystem section):
<literallayout class='monospaced'>
     root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al
     drwxr-xr-x 2 root root 0 Oct 28 11:42 .
     drwxr-xr-x 3 root root 0 Oct 28 11:42 ..
     -rw-r--r-- 1 root root 0 Oct 28 11:42 enable
     -rw-r--r-- 1 root root 0 Oct 28 11:42 filter
     -r--r--r-- 1 root root 0 Oct 28 11:42 format
     -r--r--r-- 1 root root 0 Oct 28 11:42 id

     root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# cat format
     name: do_fork
     ID: 944
     format:
             field:unsigned short common_type; offset:0; size:2; signed:0;
             field:unsigned char common_flags; offset:2; size:1; signed:0;
             field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
             field:int common_pid; offset:4; size:4; signed:1;
             field:int common_padding; offset:8; size:4; signed:1;

             field:unsigned long __probe_ip; offset:12; size:4; signed:0;

     print fmt: "(%lx)", REC->__probe_ip
</literallayout>

1214

We can list all dynamic tracepoints currently in existence:
<literallayout class='monospaced'>
     root@crownbay:~# perf probe -l
      probe:do_fork (on do_fork)
      probe:schedule (on schedule)
</literallayout>

1220

Let's record system-wide ('sleep 30' is a trick for recording
system-wide: the command itself does essentially nothing and
simply exits after 30 seconds, which ends the recording):
<literallayout class='monospaced'>
     root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30
     [ perf record: Woken up 1 times to write data ]
     [ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ]
</literallayout>

1228

Using 'perf script' we can see each do_fork event that fired:
<literallayout class='monospaced'>
     root@crownbay:~# perf script

     # ========
     # captured on: Sun Oct 28 11:55:18 2012
     # hostname : crownbay
     # os release : 3.4.11-yocto-standard
     # perf version : 3.4.11
     # arch : i686
     # nrcpus online : 2
     # nrcpus avail : 2
     # cpudesc : Intel(R) Atom(TM) CPU E660 @ 1.30GHz
     # cpuid : GenuineIntel,6,38,1
     # total memory : 1017184 kB
     # cmdline : /usr/bin/perf record -g -a -e probe:do_fork sleep 30
     # event : name = probe:do_fork, type = 2, config = 0x3b0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern
      = 0, id = { 5, 6 }
     # HEADER_CPU_TOPOLOGY info available, use -I to display
     # ========
     #
      matchbox-deskto  1197 [001] 34211.378318: do_fork: (c1028460)
      matchbox-deskto  1295 [001] 34211.380388: do_fork: (c1028460)
              pcmanfm  1296 [000] 34211.632350: do_fork: (c1028460)
              pcmanfm  1296 [000] 34211.639917: do_fork: (c1028460)
      matchbox-deskto  1197 [001] 34217.541603: do_fork: (c1028460)
      matchbox-deskto  1299 [001] 34217.543584: do_fork: (c1028460)
               gthumb  1300 [001] 34217.697451: do_fork: (c1028460)
               gthumb  1300 [001] 34219.085734: do_fork: (c1028460)
               gthumb  1300 [000] 34219.121351: do_fork: (c1028460)
               gthumb  1300 [001] 34219.264551: do_fork: (c1028460)
              pcmanfm  1296 [000] 34219.590380: do_fork: (c1028460)
      matchbox-deskto  1197 [001] 34224.955965: do_fork: (c1028460)
      matchbox-deskto  1306 [001] 34224.957972: do_fork: (c1028460)
      matchbox-termin  1307 [000] 34225.038214: do_fork: (c1028460)
      matchbox-termin  1307 [001] 34225.044218: do_fork: (c1028460)
      matchbox-termin  1307 [000] 34225.046442: do_fork: (c1028460)
      matchbox-deskto  1197 [001] 34237.112138: do_fork: (c1028460)
      matchbox-deskto  1311 [001] 34237.114106: do_fork: (c1028460)
                 gaku  1312 [000] 34237.202388: do_fork: (c1028460)
</literallayout>

1269

And using 'perf report' on the same file, we can see the
callgraphs from starting a few programs during those 30 seconds:

</para>
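<para>
The 'perf script' text output for the do_fork probe is also easy to
summarize by hand. As a rough sketch, assuming each event line has the
shape "comm pid [cpu] timestamp: event: (address)" as in the listing
above and that task names contain no spaces, the number of forks per
command could be tallied as follows. The function name
forks_per_command() is hypothetical and chosen only for this example.
</para>

```python
from collections import Counter


def forks_per_command(lines, event="do_fork"):
    """Count how many times each command triggered the given probe event."""
    tally = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue                      # skip the header 'perf script' prints
        parts = line.split()
        # Expected fields: comm, pid, [cpu], "timestamp:", "event:", "(address)"
        if len(parts) >= 5 and parts[4] == event + ":":
            tally[parts[0]] += 1
    return tally
```

<para>
Feeding the listing above through such a function would group the
events by the (truncated) command names that perf reports, for
example 'matchbox-deskto' and 'gthumb'.
</para>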

<para>

</para>

<informalexample>
<emphasis>Tying it Together:</emphasis> The trace events subsystem accommodates static
and dynamic tracepoints in exactly the same way - there's no
difference as far as the infrastructure is concerned. See the
ftrace section for more details on the trace event subsystem.
</informalexample>

<informalexample>
<emphasis>Tying it Together:</emphasis> Dynamic tracepoints are implemented under the
covers by kprobes and uprobes. kprobes and uprobes are also used
by and in fact are the main focus of SystemTap.
</informalexample>

</section>

</section>

<title>Documentation</title>

<para>
Online versions of the man pages for the commands discussed in this
section can be found here:
<itemizedlist>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-stat'>'perf stat' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-record'>'perf record' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-report'>'perf report' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-probe'>'perf probe' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-script'>'perf script' manpage</ulink>.
</para></listitem>
<listitem><para>Documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.
</para></listitem>
<listitem><para>The top-level
<ulink url='http://linux.die.net/man/1/perf'>perf(1) manpage</ulink>.
</para></listitem>
</itemizedlist>

</para>

<para>

Normally, you should be able to invoke the man pages via perf
itself, e.g. 'perf help' or 'perf help record'.

</para>

<para>

However, by default Yocto doesn't install man pages, but perf
invokes the man pages for most help functionality. This is a
known problem and is tracked as a Yocto bug:
<ulink url='https://bugzilla.yoctoproject.org/show_bug.cgi?id=3388'>Bug 3388 - perf: enable man pages for basic 'help' functionality</ulink>.

</para>

<para>

The man pages in text form, along with some other files, such as
a set of examples, can be found in the 'perf' directory of the
kernel tree:
<literallayout class='monospaced'>
     tools/perf/Documentation
</literallayout>
There's also a nice perf tutorial on the perf wiki that goes
into more detail than we do here in certain areas:
<ulink url='https://perf.wiki.kernel.org/index.php/Tutorial'>Perf Tutorial</ulink>

</para>

</section>

</section>

<title>ftrace</title>

<para>
'ftrace' literally refers to the 'ftrace function tracer' but in
reality this encompasses a number of related tracers along with
the infrastructure that they all make use of.

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the basic
setup outlined in the General Setup section.

</para>

<para>

ftrace, trace-cmd, and kernelshark run on the target system,
and are ready to go out-of-the-box - no additional setup is
necessary. For the rest of this section we assume you've ssh'ed
from the host to the target and will be running ftrace on the
target. kernelshark is a GUI application and if you use the '-X'
option to ssh you can have the kernelshark GUI run on the target
but display remotely on the host if you want.

</para>

</section>

<title>Basic ftrace usage</title>

<para>
'ftrace' essentially refers to everything included in
the /tracing directory of the mounted debugfs filesystem
(Yocto follows the standard convention and mounts it
at /sys/kernel/debug). Here's a listing of all the files
found in /sys/kernel/debug/tracing on a Yocto system:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# ls
     README                      kprobe_events       trace
     available_events            kprobe_profile      trace_clock
     available_filter_functions  options             trace_marker
     available_tracers           per_cpu             trace_options
     buffer_size_kb              printk_formats      trace_pipe
     buffer_total_size_kb        saved_cmdlines      tracing_cpumask
     current_tracer              set_event           tracing_enabled
     dyn_ftrace_total_info       set_ftrace_filter   tracing_on
     enabled_functions           set_ftrace_notrace  tracing_thresh
     events                      set_ftrace_pid
     free_buffer                 set_graph_function
</literallayout>

1395

The files listed above are used for various purposes -
some relate directly to the tracers themselves, others are
used to set tracing options, and yet others actually contain
the tracing output when a tracer is in effect. Some of their
functions can be guessed from their names, others need
explanation; in any case, we'll cover some of the files we
see here below, but for an explanation of the others, please
see the ftrace documentation.

</para>

<para>

We'll start by looking at some of the available built-in

tracers.

</para>

<para>

cat'ing the 'available_tracers' file lists the set of
available tracers:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# cat available_tracers
     blk function_graph function nop
</literallayout>
The 'current_tracer' file contains the tracer currently in
effect:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
     nop
</literallayout>
The above listing of current_tracer shows that
the 'nop' tracer is in effect, which is just another
way of saying that there's actually no tracer
currently in effect.

</para>

<para>

echo'ing one of the available_tracers into current_tracer
makes the specified tracer the current tracer:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# echo function > current_tracer
     root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
     function
</literallayout>
The above sets the current tracer to be the
'function tracer'. This tracer traces every function
call in the kernel and makes it available as the
contents of the 'trace' file. Reading the 'trace' file
lists the currently buffered function calls that have been
traced by the function tracer:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# cat trace | less

     # tracer: function
     #
     # entries-in-buffer/entries-written: 310629/766471   #P:8
     #
     #                              _-----=> irqs-off
     #                             / _----=> need-resched
     #                            | / _---=> hardirq/softirq
     #                            || / _--=> preempt-depth
     #                            ||| /     delay
     #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
     #              | |       |   ||||       |         |
              <idle>-0     [004] d..1   470.867169: ktime_get_real <-intel_idle
              <idle>-0     [004] d..1   470.867170: getnstimeofday <-ktime_get_real
              <idle>-0     [004] d..1   470.867171: ns_to_timeval <-intel_idle
              <idle>-0     [004] d..1   470.867171: ns_to_timespec <-ns_to_timeval
              <idle>-0     [004] d..1   470.867172: smp_apic_timer_interrupt <-apic_timer_interrupt
              <idle>-0     [004] d..1   470.867172: native_apic_mem_write <-smp_apic_timer_interrupt
              <idle>-0     [004] d..1   470.867172: irq_enter <-smp_apic_timer_interrupt
              <idle>-0     [004] d..1   470.867172: rcu_irq_enter <-irq_enter
              <idle>-0     [004] d..1   470.867173: rcu_idle_exit_common.isra.33 <-rcu_irq_enter
              <idle>-0     [004] d..1   470.867173: local_bh_disable <-irq_enter
              <idle>-0     [004] d..1   470.867173: add_preempt_count <-local_bh_disable
              <idle>-0     [004] d.s1   470.867174: tick_check_idle <-irq_enter
              <idle>-0     [004] d.s1   470.867174: tick_check_oneshot_broadcast <-tick_check_idle
              <idle>-0     [004] d.s1   470.867174: ktime_get <-tick_check_idle
              <idle>-0     [004] d.s1   470.867174: tick_nohz_stop_idle <-tick_check_idle
              <idle>-0     [004] d.s1   470.867175: update_ts_time_stats <-tick_nohz_stop_idle
              <idle>-0     [004] d.s1   470.867175: nr_iowait_cpu <-update_ts_time_stats
              <idle>-0     [004] d.s1   470.867175: tick_do_update_jiffies64 <-tick_check_idle
              <idle>-0     [004] d.s1   470.867175: _raw_spin_lock <-tick_do_update_jiffies64
              <idle>-0     [004] d.s1   470.867176: add_preempt_count <-_raw_spin_lock
              <idle>-0     [004] d.s2   470.867176: do_timer <-tick_do_update_jiffies64
              <idle>-0     [004] d.s2   470.867176: _raw_spin_lock <-do_timer
              <idle>-0     [004] d.s2   470.867176: add_preempt_count <-_raw_spin_lock
              <idle>-0     [004] d.s3   470.867177: ntp_tick_length <-do_timer
              <idle>-0     [004] d.s3   470.867177: _raw_spin_lock_irqsave <-ntp_tick_length
              .
              .
              .

</literallayout>

Each line in the trace above shows what was happening in
the kernel on a given cpu, to the level of detail of
function calls. Each entry shows the function called,
followed by its caller (after the arrow).

</para>
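<para>
The echo and cat steps used above to select a tracer are ordinary
file reads and writes, so they can also be scripted. The following is
a small sketch, assuming the tracing directory is mounted at
/sys/kernel/debug/tracing as described in this section and that the
script runs with sufficient privileges. The function names and the
tracing_dir parameter are hypothetical; the parameter exists only so
the helpers can be pointed at a different directory.
</para>

```python
import os

# Location used throughout this section; adjust if debugfs is mounted elsewhere.
TRACING_DIR = "/sys/kernel/debug/tracing"


def available_tracers(tracing_dir=TRACING_DIR):
    """Return the tracer names listed in the 'available_tracers' file."""
    with open(os.path.join(tracing_dir, "available_tracers")) as f:
        return f.read().split()


def current_tracer(tracing_dir=TRACING_DIR):
    """Return the name of the tracer currently in effect."""
    with open(os.path.join(tracing_dir, "current_tracer")) as f:
        return f.read().strip()


def set_tracer(name, tracing_dir=TRACING_DIR):
    """Equivalent of 'echo name > current_tracer', with a sanity check first."""
    if name not in available_tracers(tracing_dir):
        raise ValueError("tracer %r is not available" % name)
    with open(os.path.join(tracing_dir, "current_tracer"), "w") as f:
        f.write(name + "\n")
```

<para>
Checking the name against 'available_tracers' before writing is a
design choice made for this sketch; it simply gives a clearer error
message than the failed write would.
</para>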

<para>

The function tracer gives you an extremely detailed idea
of what the kernel was doing at the point in time the trace
was taken, and is a great way to learn about how the kernel
code works in a dynamic sense.

</para>
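<para>
Because every function tracer entry has the same "callee &lt;-caller"
shape, the 'trace' output lends itself to simple post-processing. The
sketch below assumes the line layout shown in the listing above
(task-pid, cpu in brackets, a flags field, a timestamp, then the
function and its caller); other kernel versions may format lines
differently. The names ENTRY, parse_entry() and busiest_callees() are
invented for this illustration.
</para>

```python
import re
from collections import Counter

# "<task>-<pid> [cpu] flags timestamp: callee <-caller"
ENTRY = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<timestamp>\d+\.\d+):\s+(?P<callee>\S+)\s+<-(?P<caller>\S+)\s*$"
)


def parse_entry(line):
    """Return the fields of one trace entry as a dict, or None for other lines."""
    match = ENTRY.match(line)
    return match.groupdict() if match else None


def busiest_callees(lines, top=5):
    """Return the most frequently called functions as (name, count) pairs."""
    tally = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry:
            tally[entry["callee"]] += 1
    return tally.most_common(top)
```

<para>
Header lines beginning with '#' do not match the pattern and are
silently skipped, so the whole contents of the 'trace' file can be
passed in unchanged.
</para>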

<informalexample>
<emphasis>Tying it Together:</emphasis> The ftrace function tracer is also
available from within perf, as the ftrace:function tracepoint.
</informalexample>

<para>

It is a little more difficult to follow the call chains than
it needs to be - luckily there's a variant of the function
tracer that displays the callchains explicitly, called the
'function_graph' tracer:
<literallayout class='monospaced'>
     root@sugarbay:/sys/kernel/debug/tracing# echo function_graph > current_tracer
     root@sugarbay:/sys/kernel/debug/tracing# cat trace | less


     tracer: function_graph

     CPU DURATION FUNCTION CALLS
     | | | | | | |
     7) 0.046 us | pick_next_task_fair();
     7) 0.043 us | pick_next_task_stop();
     7) 0.042 us | pick_next_task_rt();
     7) 0.032 us | pick_next_task_fair();
     7) 0.030 us | pick_next_task_idle();
     7) | _raw_spin_unlock_irq() {
     7) 0.033 us | sub_preempt_count();
     7) 0.258 us | }
     7) 0.032 us | sub_preempt_count();
     7) + 13.341 us | } /* __schedule */
     7) 0.095 us | } /* sub_preempt_count */
     7) | schedule() {
     7) | __schedule() {
     7) 0.060 us | add_preempt_count();
     7) 0.044 us | rcu_note_context_switch();
     7) | _raw_spin_lock_irq() {
     7) 0.033 us | add_preempt_count();
     7) 0.247 us | }
     7) | idle_balance() {
     7) | _raw_spin_unlock() {
     7) 0.031 us | sub_preempt_count();
     7) 0.246 us | }
     7) | update_shares() {
     7) 0.030 us | __rcu_read_lock();
     7) 0.029 us | __rcu_read_unlock();
     7) 0.484 us | }
     7) 0.030 us | __rcu_read_lock();
     7) | load_balance() {
     7) | find_busiest_group() {
     7) 0.031 us | idle_cpu();
     7) 0.029 us | idle_cpu();
     7) 0.035 us | idle_cpu();
     7) 0.906 us | }
     7) 1.141 us | }
     7) 0.022 us | msecs_to_jiffies();
     7) | load_balance() {
     7) | find_busiest_group() {
     7) 0.031 us | idle_cpu();
     .
     .
     .
     4) 0.062 us | msecs_to_jiffies();
     4) 0.062 us | __rcu_read_unlock();
     4) | _raw_spin_lock() {
     4) 0.073 us | add_preempt_count();
     4) 0.562 us | }
     4) + 17.452 us | }
     4) 0.108 us | put_prev_task_fair();
     4) 0.102 us | pick_next_task_fair();
     4) 0.084 us | pick_next_task_stop();
     4) 0.075 us | pick_next_task_rt();
     4) 0.062 us | pick_next_task_fair();
     4) 0.066 us | pick_next_task_idle();
     ------------------------------------------
     4) kworker-74 => <idle>-0
     ------------------------------------------

     4) | finish_task_switch() {
     4) | _raw_spin_unlock_irq() {
     4) 0.100 us | sub_preempt_count();
     4) 0.582 us | }
     4) 1.105 us | }
     4) 0.088 us | sub_preempt_count();
     4) ! 100.066 us | }
     .
     .
     .
     3) | sys_ioctl() {
     3) 0.083 us | fget_light();
     3) | security_file_ioctl() {
     3) 0.066 us | cap_file_ioctl();
     3) 0.562 us | }
     3) | do_vfs_ioctl() {
     3) | drm_ioctl() {
     3) 0.075 us | drm_ut_debug_printk();
     3) | i915_gem_pwrite_ioctl() {
     3) | i915_mutex_lock_interruptible() {
     3) 0.070 us | mutex_lock_interruptible();
     3) 0.570 us | }
     3) | drm_gem_object_lookup() {
     3) | _raw_spin_lock() {
     3) 0.080 us | add_preempt_count();
     3) 0.620 us | }
     3) | _raw_spin_unlock() {
     3) 0.085 us | sub_preempt_count();
     3) 0.562 us | }
     3) 2.149 us | }
     3) 0.133 us | i915_gem_object_pin();
     3) | i915_gem_object_set_to_gtt_domain() {
     3) 0.065 us | i915_gem_object_flush_gpu_write_domain();
     3) 0.065 us | i915_gem_object_wait_rendering();
     3) 0.062 us | i915_gem_object_flush_cpu_write_domain();
     3) 1.612 us | }
     3) | i915_gem_object_put_fence() {
     3) 0.097 us | i915_gem_object_flush_fence.constprop.36();
     3) 0.645 us | }
     3) 0.070 us | add_preempt_count();
     3) 0.070 us | sub_preempt_count();
     3) 0.073 us | i915_gem_object_unpin();
     3) 0.068 us | mutex_unlock();
     3) 9.924 us | }
     3) + 11.236 us | }
     3) + 11.770 us | }
     3) + 13.784 us | }
     3) | sys_ioctl() {

</literallayout>

As you can see, the function_graph display is much easier to
follow. Also note that, in addition to the function calls and
associated braces, other events such as scheduler events
are displayed in context. In fact, you can freely include
any tracepoint available in the trace events subsystem described
in the next section simply by enabling those events, and they'll
appear in context in the function graph display. This makes it
quite a powerful tool for understanding kernel dynamics.

</para>

<para>

Also notice that there are various annotations on the left
hand side of the display. For example, if the total time it
took for a given function to execute is above a certain
threshold, an exclamation point or plus sign appears on the
left hand side. Please see the ftrace documentation for
details on all these fields.

</para>
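<para>
As a rough aid for spotting those annotations, the following shell
sketch keeps only the function_graph lines that carry a '+' or '!'
duration marker. According to the ftrace documentation, '+' flags a
call that took longer than 10 microseconds and '!' one that took
longer than 100 microseconds. The helper name slow_calls is made up
for this illustration, and it assumes the default column layout shown
above (CPU number first); newer kernels define further markers that
it does not look for.
</para>

```shell
# Hypothetical helper: read function_graph output on stdin and print only
# the lines whose duration column is flagged with '+' or '!'.
slow_calls() {
    grep -E '^ *[0-9]+\) +[+!] '
}
```

<para>
On a live system it could be used as
"cat /sys/kernel/debug/tracing/trace | slow_calls".
</para>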

</section>

<title>The 'trace events' Subsystem</title>

<para>
One especially important directory contained within
the /sys/kernel/debug/tracing directory is the 'events'
subdirectory, which contains representations of every
tracepoint in the system. Listing the contents of
the 'events' subdirectory, we see mainly another set of
subdirectories:

root@sugarbay:/sys/kernel/debug/tracing# cd events
root@sugarbay:/sys/kernel/debug/tracing/events# ls -al
drwxr-xr-x 38 root root 0 Nov 14 23:19 .
drwxr-xr-x 5 root root 0 Nov 14 23:19 ..
drwxr-xr-x 19 root root 0 Nov 14 23:19 block
drwxr-xr-x 32 root root 0 Nov 14 23:19 btrfs
drwxr-xr-x 5 root root 0 Nov 14 23:19 drm
-rw-r--r-- 1 root root 0 Nov 14 23:19 enable
drwxr-xr-x 40 root root 0 Nov 14 23:19 ext3
drwxr-xr-x 79 root root 0 Nov 14 23:19 ext4
drwxr-xr-x 14 root root 0 Nov 14 23:19 ftrace
drwxr-xr-x 8 root root 0 Nov 14 23:19 hda
-r--r--r-- 1 root root 0 Nov 14 23:19 header_event
-r--r--r-- 1 root root 0 Nov 14 23:19 header_page
drwxr-xr-x 25 root root 0 Nov 14 23:19 i915
drwxr-xr-x 7 root root 0 Nov 14 23:19 irq
drwxr-xr-x 12 root root 0 Nov 14 23:19 jbd
drwxr-xr-x 14 root root 0 Nov 14 23:19 jbd2
drwxr-xr-x 14 root root 0 Nov 14 23:19 kmem
drwxr-xr-x 7 root root 0 Nov 14 23:19 module
drwxr-xr-x 3 root root 0 Nov 14 23:19 napi
drwxr-xr-x 6 root root 0 Nov 14 23:19 net
drwxr-xr-x 3 root root 0 Nov 14 23:19 oom
drwxr-xr-x 12 root root 0 Nov 14 23:19 power
drwxr-xr-x 3 root root 0 Nov 14 23:19 printk
drwxr-xr-x 8 root root 0 Nov 14 23:19 random
drwxr-xr-x 4 root root 0 Nov 14 23:19 raw_syscalls
drwxr-xr-x 3 root root 0 Nov 14 23:19 rcu
drwxr-xr-x 6 root root 0 Nov 14 23:19 rpm
drwxr-xr-x 20 root root 0 Nov 14 23:19 sched
drwxr-xr-x 7 root root 0 Nov 14 23:19 scsi
drwxr-xr-x 4 root root 0 Nov 14 23:19 signal
drwxr-xr-x 5 root root 0 Nov 14 23:19 skb
drwxr-xr-x 4 root root 0 Nov 14 23:19 sock
drwxr-xr-x 10 root root 0 Nov 14 23:19 sunrpc
drwxr-xr-x 538 root root 0 Nov 14 23:19 syscalls
drwxr-xr-x 4 root root 0 Nov 14 23:19 task
drwxr-xr-x 14 root root 0 Nov 14 23:19 timer
drwxr-xr-x 3 root root 0 Nov 14 23:19 udp
drwxr-xr-x 21 root root 0 Nov 14 23:19 vmscan
drwxr-xr-x 3 root root 0 Nov 14 23:19 vsyscall
drwxr-xr-x 6 root root 0 Nov 14 23:19 workqueue
drwxr-xr-x 26 root root 0 Nov 14 23:19 writeback

</literallayout>

Each one of these subdirectories corresponds to a
'subsystem' and contains yet more subdirectories,
each one of those finally corresponding to a tracepoint.
For example, here are the contents of the 'kmem' subsystem:

root@sugarbay:/sys/kernel/debug/tracing/events# cd kmem
root@sugarbay:/sys/kernel/debug/tracing/events/kmem# ls -al
drwxr-xr-x 14 root root 0 Nov 14 23:19 .
drwxr-xr-x 38 root root 0 Nov 14 23:19 ..
-rw-r--r-- 1 root root 0 Nov 14 23:19 enable
-rw-r--r-- 1 root root 0 Nov 14 23:19 filter
drwxr-xr-x 2 root root 0 Nov 14 23:19 kfree
drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc
drwxr-xr-x 2 root root 0 Nov 14 23:19 kmalloc_node
drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc
drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_alloc_node
drwxr-xr-x 2 root root 0 Nov 14 23:19 kmem_cache_free
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_extfrag
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_alloc_zone_locked
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_free_batched
drwxr-xr-x 2 root root 0 Nov 14 23:19 mm_page_pcpu_drain

</literallayout>

Let's see what's inside the subdirectory for a specific
tracepoint, in this case the one for kmalloc:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem# cd kmalloc
root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# ls -al
drwxr-xr-x 2 root root 0 Nov 14 23:19 .
drwxr-xr-x 14 root root 0 Nov 14 23:19 ..
-rw-r--r-- 1 root root 0 Nov 14 23:19 enable
-rw-r--r-- 1 root root 0 Nov 14 23:19 filter
-r--r--r-- 1 root root 0 Nov 14 23:19 format
-r--r--r-- 1 root root 0 Nov 14 23:19 id

</literallayout>

The 'format' file for the tracepoint describes the layout of the
event in memory, which the various tracing tools that use these
tracepoints rely on to parse the event and make sense of it.
It also contains a 'print fmt' field that allows tools like
ftrace to display the event as text.
Here's what the format of the kmalloc event looks like:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# cat format
name: kmalloc
ID: 313
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;

field:unsigned long call_site; offset:16; size:8; signed:0;
field:const void * ptr; offset:24; size:8; signed:0;
field:size_t bytes_req; offset:32; size:8; signed:0;
field:size_t bytes_alloc; offset:40; size:8; signed:0;
field:gfp_t gfp_flags; offset:48; size:4; signed:0;

print fmt: "call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s", REC->call_site, REC->ptr, REC->bytes_req, REC->bytes_alloc,
(REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {(unsigned long)(((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x400000u)), "GFP_TRANSHUGE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x20000u) | ((
gfp_t)0x02u) | (( gfp_t)0x08u)), "GFP_HIGHUSER_MOVABLE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x20000u) | (( gfp_t)0x02u)), "GFP_HIGHUSER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x20000u)), "GFP_USER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x80000u)), "GFP_TEMPORARY"},
{(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u)), "GFP_KERNEL"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u)),
"GFP_NOFS"}, {(unsigned long)((( gfp_t)0x20u)), "GFP_ATOMIC"}, {(unsigned long)((( gfp_t)0x10u)), "GFP_NOIO"}, {(unsigned long)((
gfp_t)0x20u), "GFP_HIGH"}, {(unsigned long)(( gfp_t)0x10u), "GFP_WAIT"}, {(unsigned long)(( gfp_t)0x40u), "GFP_IO"}, {(unsigned long)((
gfp_t)0x100u), "GFP_COLD"}, {(unsigned long)(( gfp_t)0x200u), "GFP_NOWARN"}, {(unsigned long)(( gfp_t)0x400u), "GFP_REPEAT"}, {(unsigned
long)(( gfp_t)0x800u), "GFP_NOFAIL"}, {(unsigned long)(( gfp_t)0x1000u), "GFP_NORETRY"}, {(unsigned long)(( gfp_t)0x4000u), "GFP_COMP"},
{(unsigned long)(( gfp_t)0x8000u), "GFP_ZERO"}, {(unsigned long)(( gfp_t)0x10000u), "GFP_NOMEMALLOC"}, {(unsigned long)(( gfp_t)0x20000u),
"GFP_HARDWALL"}, {(unsigned long)(( gfp_t)0x40000u), "GFP_THISNODE"}, {(unsigned long)(( gfp_t)0x80000u), "GFP_RECLAIMABLE"}, {(unsigned
long)(( gfp_t)0x08u), "GFP_MOVABLE"}, {(unsigned long)(( gfp_t)0), "GFP_NOTRACK"}, {(unsigned long)(( gfp_t)0x400000u), "GFP_NO_KSWAPD"},
{(unsigned long)(( gfp_t)0x800000u), "GFP_OTHER_NODE"} ) : "GFP_NOWAIT"

</literallayout>

The 'enable' file in the tracepoint directory is what allows
the user (or tools such as trace-cmd) to actually turn the
tracepoint on and off. When enabled, the corresponding
tracepoint will start appearing in the ftrace 'trace'
file described previously. For example, this turns on the
kmalloc tracepoint:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 1 > enable

</literallayout>

At the moment, we're not interested in the function tracer or
any other tracer that might be in effect, so we first turn
it off by selecting the 'nop' tracer. Having done that, we still
need to turn tracing on in order to see the events in the
output buffer:

root@sugarbay:/sys/kernel/debug/tracing# echo nop > current_tracer
root@sugarbay:/sys/kernel/debug/tracing# echo 1 > tracing_on

</literallayout>

Now, if we look at the 'trace' file, we see nothing
but the kmalloc events we just turned on:

root@sugarbay:/sys/kernel/debug/tracing# cat trace | less
# tracer: nop
#
# entries-in-buffer/entries-written: 1897/1897 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
dropbear-1465 [000] ...1 18154.620753: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18154.621640: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
<idle>-0 [000] ..s3 18154.621656: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
matchbox-termin-1361 [001] ...1 18154.755472: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f0e00 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
Xorg-1264 [002] ...1 18154.755581: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
Xorg-1264 [002] ...1 18154.755583: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
Xorg-1264 [002] ...1 18154.755589: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
matchbox-termin-1361 [001] ...1 18155.354594: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db35400 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT
Xorg-1264 [002] ...1 18155.354703: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
Xorg-1264 [002] ...1 18155.354705: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
Xorg-1264 [002] ...1 18155.354711: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
<idle>-0 [000] ..s3 18155.673319: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
dropbear-1465 [000] ...1 18155.673525: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18155.674821: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
<idle>-0 [000] ..s3 18155.793014: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
dropbear-1465 [000] ...1 18155.793219: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18155.794147: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
<idle>-0 [000] ..s3 18155.936705: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
dropbear-1465 [000] ...1 18155.936910: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18155.937869: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
matchbox-termin-1361 [001] ...1 18155.953667: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f2000 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
Xorg-1264 [002] ...1 18155.953775: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
Xorg-1264 [002] ...1 18155.953777: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
Xorg-1264 [002] ...1 18155.953783: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
<idle>-0 [000] ..s3 18156.176053: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
dropbear-1465 [000] ...1 18156.176257: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18156.177717: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
<idle>-0 [000] ..s3 18156.399229: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
dropbear-1465 [000] ...1 18156.399434: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
<idle>-0 [000] ..s3 18156.400660: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
matchbox-termin-1361 [001] ...1 18156.552800: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db34800 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT

</literallayout>

To disable the kmalloc event again, we need to write 0 to the
enable file:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 0 > enable

</literallayout>

You can enable any number of events or complete subsystems
(by using the 'enable' file in the subsystem directory) and
get an arbitrarily fine-grained idea of what's going on in the
system by enabling as many of the appropriate tracepoints
as you need.

</para>
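<para>
The manual steps above can be wrapped in a couple of small shell
helpers. This is only a sketch: the function names set_event and
events_only are invented for illustration, and the tracing directory
is passed in as a parameter so that the helpers can be tried against
a scratch directory before being pointed at
/sys/kernel/debug/tracing (which normally requires root).
</para>

```shell
# Hypothetical helper: write 1 or 0 to the 'enable' file of one tracepoint,
# or of a whole subsystem when no event name is given.
# usage: set_event TRACING_DIR STATE SUBSYSTEM [EVENT]
set_event() {
    tracing_dir=$1
    state=$2
    subsystem=$3
    event=${4:-}
    target="$tracing_dir/events/$subsystem"
    if [ -n "$event" ]; then
        target="$target/$event"
    fi
    printf '%s\n' "$state" > "$target/enable"
}

# Hypothetical helper: select the 'nop' tracer and switch tracing on, so
# that only explicitly enabled events reach the 'trace' file.
# usage: events_only TRACING_DIR
events_only() {
    tracing_dir=$1
    printf 'nop\n' > "$tracing_dir/current_tracer"
    printf '1\n' > "$tracing_dir/tracing_on"
}
```

<para>
For example, "set_event /sys/kernel/debug/tracing 1 kmem kmalloc"
followed by "events_only /sys/kernel/debug/tracing" would reproduce
the kmalloc example, and "set_event /sys/kernel/debug/tracing 1 kmem"
would enable the complete kmem subsystem.
</para>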

<para>

A number of the tools described in this HOWTO do just that,
including trace-cmd and kernelshark, which are covered in the
next section.

</para>

<emphasis>Tying it Together:</emphasis> These tracepoints and their representation
are used not only by ftrace, but by many of the other tools
covered in this document, and they form a central point of
integration for the various tracers available in Linux.
They are a central part of the instrumentation for the
following tools: perf, lttng, ftrace, blktrace and SystemTap.

</informalexample>
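<para>
To give a feel for how a tool might consume that representation,
here is a minimal shell sketch that pulls the name, offset and size
of each field out of a tracepoint 'format' file. The function name
format_fields is hypothetical, and the parsing assumes the
"field:...; offset:...; size:...;" layout shown for kmalloc above;
real tools use considerably more robust parsers.
</para>

```shell
# Hypothetical helper: print "name offset size" for every field line of a
# tracepoint 'format' file given as an argument (or read from stdin).
format_fields() {
    awk -F';' '/^[[:space:]]*field:/ {
        n = split($1, words, " ")          # last word of the declaration is the field name
        offset = $2; sub(/.*offset:/, "", offset)
        size = $3;   sub(/.*size:/, "", size)
        print words[n], offset, size
    }' "$@"
}
```

<para>
Run against the kmalloc 'format' file, this would list entries such
as call_site, ptr and bytes_req together with their offsets and sizes.
</para>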

<emphasis>Tying it Together:</emphasis> Eventually all the special-purpose tracers
currently available in /sys/kernel/debug/tracing will be
removed and replaced with equivalent tracers based on the
'trace events' subsystem.

</informalexample>

</section>

<title>trace-cmd/kernelshark</title>

<para>
trace-cmd is essentially an extensive command-line 'wrapper'
interface that hides the details of all the individual files
in /sys/kernel/debug/tracing, allowing users to specify
particular events within the
/sys/kernel/debug/tracing/events/ subdirectory and to collect
traces without having to deal with those details directly.

</para>

<para>

As yet another layer on top of that, kernelshark provides a GUI
that allows users to start and stop traces and specify sets
of events using an intuitive interface, and view the
output as both trace events and as a per-CPU graphical
display. It directly uses 'trace-cmd' as the plumbing
that accomplishes all that underneath the covers (and
actually displays the trace-cmd command it uses, as we'll see).

</para>

<para>

To start a trace using kernelshark, first start kernelshark:

root@sugarbay:~# kernelshark
</literallayout>
Then bring up the 'Capture' dialog by choosing from the
kernelshark menu:

Capture | Record
</literallayout>
That will display the following dialog, which allows you to
choose one or more events (or even one or more complete
subsystems) to trace:

</para>

<para>

</para>

<para>

Note that these are exactly the same sets of events described
in the previous trace events subsystem section, and in fact
that is where trace-cmd gets them for kernelshark.

</para>

<para>

In the above screenshot, we've decided to explore the
graphics subsystem a bit and so have chosen to trace all
the tracepoints contained within the 'i915' and 'drm'
subsystems.

</para>

<para>

After doing that, we can start and stop the trace using
the 'Run' and 'Stop' buttons in the lower right corner of
the dialog (the 'Run' button turns into the 'Stop'
button once the trace has started):

</para>

<para>

</para>

<para>

Notice that the right-hand pane shows the exact trace-cmd
command-line that's used to run the trace, along with the
results of the trace-cmd run.

</para>
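<para>
The command line kernelshark displays is an ordinary
'trace-cmd record' invocation, with one '-e' option per selected
event or subsystem. As an illustrative sketch (the helper name
build_record_cmd is made up, and the exact options kernelshark adds
may differ between versions), the following function assembles such
a command line for a list of events without running it:
</para>

```shell
# Hypothetical helper: print a 'trace-cmd record' command line that selects
# each event or subsystem named in the arguments with its own -e option.
build_record_cmd() {
    cmd="trace-cmd record"
    for ev in "$@"; do
        cmd="$cmd -e $ev"
    done
    printf '%s\n' "$cmd"
}
```

<para>
For the 'i915' and 'drm' selection above, "build_record_cmd i915 drm"
prints "trace-cmd record -e i915 -e drm". Running that command as
root on the target records a trace to a file, which
'trace-cmd report' can then display as text.
</para>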

<para>

Once the 'Stop' button is pressed, the graphical view
fills up with a colorful per-CPU display of the trace data,
along with the detailed event listing below that:

</para>

<para>

</para>

<para>

Here's another example, this time a display resulting
from tracing 'all events':

</para>

<para>

</para>

<para>

The tool is pretty self-explanatory, but for more detailed
information on navigating through the data, see the
<ulink url='http://rostedt.homelinux.com/kernelshark/'>kernelshark website</ulink>.

</para>

</section>

<title>Documentation</title>

<para>
The documentation for ftrace can be found in the kernel
Documentation directory:

Documentation/trace/ftrace.txt
</literallayout>
The documentation for the trace event subsystem can also
be found in the kernel Documentation directory:

Documentation/trace/events.txt
</literallayout>
There is a nice series of articles on using
ftrace and trace-cmd at LWN:
<listitem><para><ulink url='http://lwn.net/Articles/365835/'>Debugging the kernel using Ftrace - part 1</ulink>
</para></listitem>
<listitem><para><ulink url='http://lwn.net/Articles/366796/'>Debugging the kernel using Ftrace - part 2</ulink>
</para></listitem>
<listitem><para><ulink url='http://lwn.net/Articles/370423/'>Secrets of the Ftrace function tracer</ulink>
</para></listitem>
<listitem><para><ulink url='https://lwn.net/Articles/410200/'>trace-cmd: A front-end for Ftrace</ulink>
</para></listitem>

</itemizedlist>

</para>

<para>

There's more detailed documentation on kernelshark usage here:
<ulink url='http://rostedt.homelinux.com/kernelshark/'>KernelShark</ulink>

</para>

<para>

An amusing yet useful README (a tracing mini-HOWTO) can be
found in /sys/kernel/debug/tracing/README.

</para>

</section>

</section>

<title>systemtap</title>

<para>
SystemTap is a system-wide script-based tracing and profiling tool.

</para>

<para>

SystemTap scripts are C-like programs that are executed in the
kernel to gather, print and aggregate data extracted from the
context in which they are invoked.

</para>

<para>

For example, this probe from the
<ulink url='http://sourceware.org/systemtap/tutorial/'>SystemTap tutorial</ulink>
simply prints a line every time any process on the system open()s
a file. For each line, it prints the executable name of the
program that opened the file, along with its PID, and the name
of the file it opened (or tried to open), which it extracts
from the open syscall's argstr.

probe syscall.open
{
        printf ("%s(%d) open (%s)\n", execname(), pid(), argstr)
}

probe timer.ms(4000) # after 4 seconds
{
        exit ()
}

</literallayout>

Normally, to execute this probe, you'd simply install
systemtap on the system you want to probe, and directly run
the probe on that system, e.g. assuming the name of the file
containing the above text is trace_open.stp:

# stap trace_open.stp
</literallayout>
What systemtap does under the covers to run this probe is 1)
parse and convert the probe to an equivalent 'C' form, 2)
compile the 'C' form into a kernel module, 3) insert the
module into the kernel, which arms it, and 4) collect the data
generated by the probe and display it to the user.

</para>

<para>

In order to accomplish steps 1 and 2, the 'stap' program needs
access to the kernel build system that produced the kernel
that the probed system is running. In the case of a typical
embedded system (the 'target'), the kernel build system
unfortunately isn't typically part of the image running on
the target. It is normally available on the 'host' system
that produced the target image, however; in such cases,
steps 1 and 2 are executed on the host system, and steps
3 and 4 are executed on the target system, using only the
systemtap 'runtime'.

</para>

<para>

The systemtap support in Yocto assumes that only steps
3 and 4 are run on the target; it is possible to do
everything on the target, but this section assumes only
the typical embedded use-case.

</para>

<para>

So basically what you need to do in order to run a systemtap
script on the target is to 1) on the host system, compile the
probe into a kernel module that makes sense to the target, 2)
copy the module onto the target system, 3) insert the
module into the target kernel, which arms it, and 4) collect
the data generated by the probe and display it to the user.

</para>
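<para>
To make those four steps concrete, the following shell sketch prints
(but does not run) one plausible command sequence for doing them by
hand. It is an illustration only: the function name
print_cross_steps and every argument value are placeholders, and the
'crosstap' script described below may invoke systemtap differently.
The sketch relies on the standard 'stap' options -p4 (stop once the
module is compiled), -m (module name), -r (kernel build directory),
-a (target architecture) and -B (extra kernel build variables), and
on 'staprun' to load a precompiled module.
</para>

```shell
# Hypothetical helper: print the host-side and target-side commands for
# cross-compiling one script and running it on a remote target.
# usage: print_cross_steps SCRIPT MODULE KERNEL_BUILD ARCH CROSS_PREFIX TARGET
print_cross_steps() {
    script=$1
    module=$2
    kernel_build=$3
    arch=$4
    cross_prefix=$5
    target=$6
    # Step 1 (host): translate and compile the probe into MODULE.ko
    echo "stap -p4 -m $module -r $kernel_build -a $arch -B CROSS_COMPILE=$cross_prefix $script"
    # Step 2 (host): copy the module onto the target
    echo "scp $module.ko $target:/tmp/"
    # Steps 3 and 4 (target): insert the module and display its output
    echo "ssh $target staprun /tmp/$module.ko"
}
```

<para>
Printing the commands rather than executing them makes it easy to
review the paths and the cross-compiler prefix before anything
touches the target.
</para>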

<title>Setup</title>

<para>

Those are a lot of steps and a lot of details, but
fortunately Yocto includes a script called 'crosstap'
that will take care of those details, allowing you to
simply execute a systemtap script on the remote target,
with arguments if necessary.

</para>

<para>

In order to do this from a remote host, however, you
need to have access to the build for the image you
booted. The 'crosstap' script provides details on how
to do this if you run the script on the host without having
done a build:
<note>
SystemTap, which uses 'crosstap', assumes you can establish an
ssh connection to the remote target.
Please refer to the crosstap wiki page for details on verifying
ssh connections at
<ulink url='https://wiki.yoctoproject.org/wiki/Tracing_and_Profiling#systemtap'></ulink>.
Also, the ability to ssh into the target system is not enabled
by default in *-minimal images.
</note>

$ crosstap root@192.168.1.88 trace_open.stp

Error: No target kernel build found.
Did you forget to create a local build of your image?

'crosstap' requires a local sdk build of the target system
(or a build that includes 'tools-profile') in order to build
kernel modules that can probe the target system.

Practically speaking, that means you need to do the following:
 - If you're running a pre-built image, download the release
   and/or BSP tarballs used to build the image.
 - If you're working from git sources, just clone the metadata
   and BSP layers needed to build the image you'll be booting.
 - Make sure you're properly set up to build a new image (see
   the BSP README and/or the widely available basic documentation
   that discusses how to build images).
 - Build an -sdk version of the image e.g.:
     $ bitbake core-image-sato-sdk
 OR
 - Build a non-sdk image but include the profiling tools:
     [ edit local.conf and add 'tools-profile' to the end of
       the EXTRA_IMAGE_FEATURES variable ]
     $ bitbake core-image-sato

Once you've build the image on the host system, you're ready to
boot it (or the equivalent pre-built image) and use 'crosstap'
to probe it (you need to source the environment as usual first):

   $ source oe-init-build-env
   $ cd ~/my/systemtap/scripts
   $ crosstap root@192.168.1.xxx myscript.stp

</literallayout>

So essentially what you need to do is build an SDK image or an
image with 'tools-profile' as detailed in the
"<link linkend='profile-manual-general-setup'>General Setup</link>"
section of this manual, and boot the resulting target image.

</para>

<note>

If you have a build directory containing multiple machines,
you need to have the MACHINE you're connecting to selected
in local.conf, and the kernel in that machine's build
directory must match the kernel on the booted system exactly,
or you'll get the above 'crosstap' message when you try to
invoke a script.

</note>

</section>

<title>Running a Script on a Target</title>

<para>
Once you've done that, you should be able to run a systemtap
script on the target:

$ cd /path/to/yocto
$ source oe-init-build-env

### Shell environment set up for builds. ###

You can now run 'bitbake <target>'

Common targets are:
    core-image-minimal
    core-image-sato
    meta-toolchain
    meta-ide-support

You can also run generated qemu images with a command like 'runqemu qemux86'

</literallayout>

Once you've done that, you can cd to whatever directory
contains your scripts and use 'crosstap' to run the script:

$ cd /path/to/my/systemtap/script
$ crosstap root@192.168.7.2 trace_open.stp
</literallayout>
If you get an error connecting to the target e.g.:

$ crosstap root@192.168.7.2 trace_open.stp
error establishing ssh connection on remote 'root@192.168.7.2'
</literallayout>
Try ssh'ing to the target and see what happens:

$ ssh root@192.168.7.2
</literallayout>
A lot of the time, connection problems are due to specifying a
wrong IP address or having a 'host key verification error'.

</para>

<para>

If everything worked as planned, you should see something
like this (enter the password when prompted, or press enter
if it's set up to use no password):

$ crosstap root@192.168.7.2 trace_open.stp
root@192.168.7.2's password:
matchbox-termin(1036) open ("/tmp/vte3FS2LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)
matchbox-termin(1036) open ("/tmp/vteJMC7LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)

</literallayout>

</para>
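<para>
Output in this form is easy to post-process on the host. As a small
sketch (the function name count_opens is hypothetical, and the
parsing assumes the "execname(pid) open (...)" line format produced
by the trace_open.stp probe above), the following shell function
counts how many open() calls each executable made:
</para>

```shell
# Hypothetical helper: read trace_open.stp output on stdin and print
# "count execname" pairs, busiest executable first. Lines that do not
# look like probe output (e.g. the ssh password prompt) are ignored.
count_opens() {
    sed -n 's/^\([^()]*\)([0-9]*) open .*/\1/p' |
        awk '{ n[$0]++ } END { for (name in n) print n[name], name }' |
        sort -rn
}
```

<para>
It could be used as
"crosstap root@192.168.7.2 trace_open.stp | count_opens".
</para>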

</section>

<title>Documentation</title>

<para>
The SystemTap language reference can be found here:
<ulink url='http://sourceware.org/systemtap/langref/'>SystemTap Language Reference</ulink>

</para>

<para>

Links to other SystemTap documents, tutorials, and examples can be
found here:
<ulink url='http://sourceware.org/systemtap/documentation.html'>SystemTap documentation page</ulink>

</para>

</section>

</section>


<title>Sysprof</title>

<para>
Sysprof is a very easy to use system-wide profiler that consists
of a single window with three panes and a few buttons which allow
you to start, stop, and view the profile from one place.

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the
basic setup outlined in the General Setup section.

</para>

<para>

Sysprof is a GUI-based application that runs on the target
system. For the rest of this document we assume you've
ssh'ed to the target and will be running Sysprof there
(you can use the '-X' option to ssh and have the
Sysprof GUI run on the target but display remotely on the
host if you want).

</para>

</section>

<title>Basic Usage</title>

<para>
To start profiling the system, you simply press the 'Start'
button. To stop profiling and to start viewing the profile data
in one easy step, press the 'Profile' button.

</para>

<para>

Once you've pressed the 'Profile' button, the three panes will
fill up with profiling data:

</para>

<para>

</para>

<para>

The left pane shows a list of functions and processes.
Selecting one of those expands that function in the right
pane, showing all its callees. Note that this caller-oriented
display is essentially the inverse of perf's default
callee-oriented callchain display.

</para>

<para>

In the screenshot above, we're focusing on __copy_to_user_ll().
Looking up the callchain, we can see that one of the callers
of __copy_to_user_ll() is sys_read(), along with the complete
callpath between them. Notice that this is essentially a portion
of the same information we saw in the perf display shown in the
perf section of this page.

</para>

<para>

</para>

<para>

Similarly, the above is a snapshot of the Sysprof display of a
copy-from-user callchain.

</para>

<para>

Finally, looking at the third Sysprof pane in the lower left,
we can see a list of all the callers of a particular function
selected in the top left pane. In this case, the lower pane is
showing all the callers of __mark_inode_dirty:

</para>

<para>

</para>

<para>

Double-clicking on one of those functions will in turn change the
focus to the selected function, and so on.

</para>

<emphasis>Tying it Together:</emphasis> If you like Sysprof's 'caller-oriented'
display, you may be able to approximate it in other tools as
well. For example, 'perf report' has the -g (--call-graph)
option that you can experiment with; one of the options is
'caller' for an inverted caller-based callgraph display.

</informalexample>

</section>

<title>Documentation</title>

<para>
There doesn't seem to be any documentation for Sysprof, but
maybe that's because it's pretty self-explanatory.
The Sysprof website, however, is here:
<ulink url='http://sysprof.com/'>Sysprof, System-wide Performance Profiler for Linux</ulink>

</para>

</section>

</section>

<title>LTTng (Linux Trace Toolkit, next generation)</title>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the
basic setup outlined in the General Setup section.

</para>

<para>

LTTng is run on the target system by ssh'ing to it.
However, if you want to see the traces graphically,
install Eclipse as described in section
"<link linkend='manually-copying-a-trace-to-the-host-and-viewing-it-in-eclipse'>Manually copying a trace to the host and viewing it in Eclipse (i.e. using Eclipse without network support)</link>"
and follow the directions to manually copy traces to the host and
view them in Eclipse (i.e. using Eclipse without network support).

</para>

<note>

Be sure to download and install/run the 'SR1' or later Juno release
of Eclipse, e.g.:
<ulink url='http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/juno/SR1/eclipse-cpp-juno-SR1-linux-gtk-x86_64.tar.gz'>http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/juno/SR1/eclipse-cpp-juno-SR1-linux-gtk-x86_64.tar.gz</ulink>

</note>

</section>

<title>Collecting and Viewing Traces</title>

<para>
Once you've applied the above commits and built and booted your
image (you need to build the core-image-sato-sdk image or use one of the
other methods described in the General Setup section), you're
ready to start tracing.

</para>

<title>Collecting and viewing a trace on the target (inside a shell)</title>

<para>
First, from the host, ssh to the target:
<literallayout class='monospaced'>
$ ssh -l root 192.168.1.47
The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
root@192.168.1.47's password:
</literallayout>
Once on the target, use these steps to create a trace:
<literallayout class='monospaced'>
root@crownbay:~# lttng create
Spawning a session daemon
Session auto-20121015-232120 created.
Traces will be written in /home/root/lttng-traces/auto-20121015-232120
</literallayout>
Enable the events you want to trace (in this case all
kernel events):
<literallayout class='monospaced'>
root@crownbay:~# lttng enable-event --kernel --all
All kernel events are enabled in channel channel0
</literallayout>
Start the trace:
<literallayout class='monospaced'>
root@crownbay:~# lttng start
Tracing started for session auto-20121015-232120
</literallayout>
And then stop the trace after a while or after running
a particular workload that you want to trace:
<literallayout class='monospaced'>
root@crownbay:~# lttng stop
Tracing stopped for session auto-20121015-232120
</literallayout>

You can now view the trace in text form on the target:
<literallayout class='monospaced'>
root@crownbay:~# lttng view
[23:21:56.989270399] (+?.?????????) sys_geteuid: { 1 }, { }
[23:21:56.989278081] (+0.000007682) exit_syscall: { 1 }, { ret = 0 }
[23:21:56.989286043] (+0.000007962) sys_pipe: { 1 }, { fildes = 0xB77B9E8C }
[23:21:56.989321802] (+0.000035759) exit_syscall: { 1 }, { ret = 0 }
[23:21:56.989329345] (+0.000007543) sys_mmap_pgoff: { 1 }, { addr = 0x0, len = 10485760, prot = 3, flags = 131362, fd = 4294967295, pgoff = 0 }
[23:21:56.989351694] (+0.000022349) exit_syscall: { 1 }, { ret = -1247805440 }
[23:21:56.989432989] (+0.000081295) sys_clone: { 1 }, { clone_flags = 0x411, newsp = 0xB5EFFFE4, parent_tid = 0xFFFFFFFF, child_tid = 0x0 }
[23:21:56.989477129] (+0.000044140) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 681660, vruntime = 43367983388 }
[23:21:56.989486697] (+0.000009568) sched_migrate_task: { 1 }, { comm = "lttng-consumerd", tid = 1193, prio = 20, orig_cpu = 1, dest_cpu = 1 }
[23:21:56.989508418] (+0.000021721) hrtimer_init: { 1 }, { hrtimer = 3970832076, clockid = 1, mode = 1 }
[23:21:56.989770462] (+0.000262044) hrtimer_cancel: { 1 }, { hrtimer = 3993865440 }
[23:21:56.989771580] (+0.000001118) hrtimer_cancel: { 0 }, { hrtimer = 3993812192 }
[23:21:56.989776957] (+0.000005377) hrtimer_expire_entry: { 1 }, { hrtimer = 3993865440, now = 79815980007057, function = 3238465232 }
[23:21:56.989778145] (+0.000001188) hrtimer_expire_entry: { 0 }, { hrtimer = 3993812192, now = 79815980008174, function = 3238465232 }
[23:21:56.989791695] (+0.000013550) softirq_raise: { 1 }, { vec = 1 }
[23:21:56.989795396] (+0.000003701) softirq_raise: { 0 }, { vec = 1 }
[23:21:56.989800635] (+0.000005239) softirq_raise: { 0 }, { vec = 9 }
[23:21:56.989807130] (+0.000006495) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 330710, vruntime = 43368314098 }
[23:21:56.989809993] (+0.000002863) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 1015313, vruntime = 36976733240 }
[23:21:56.989818514] (+0.000008521) hrtimer_expire_exit: { 0 }, { hrtimer = 3993812192 }
[23:21:56.989819631] (+0.000001117) hrtimer_expire_exit: { 1 }, { hrtimer = 3993865440 }
[23:21:56.989821866] (+0.000002235) hrtimer_start: { 0 }, { hrtimer = 3993812192, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
[23:21:56.989822984] (+0.000001118) hrtimer_start: { 1 }, { hrtimer = 3993865440, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
[23:21:56.989832762] (+0.000009778) softirq_entry: { 1 }, { vec = 1 }
[23:21:56.989833879] (+0.000001117) softirq_entry: { 0 }, { vec = 1 }
[23:21:56.989838069] (+0.000004190) timer_cancel: { 1 }, { timer = 3993871956 }
[23:21:56.989839187] (+0.000001118) timer_cancel: { 0 }, { timer = 3993818708 }
[23:21:56.989841492] (+0.000002305) timer_expire_entry: { 1 }, { timer = 3993871956, now = 79515980, function = 3238277552 }
[23:21:56.989842819] (+0.000001327) timer_expire_entry: { 0 }, { timer = 3993818708, now = 79515980, function = 3238277552 }
[23:21:56.989854831] (+0.000012012) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 49237, vruntime = 43368363335 }
[23:21:56.989855949] (+0.000001118) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 45121, vruntime = 36976778361 }
[23:21:56.989861257] (+0.000005308) sched_stat_sleep: { 1 }, { comm = "kworker/1:1", tid = 21, delay = 9451318 }
[23:21:56.989862374] (+0.000001117) sched_stat_sleep: { 0 }, { comm = "kworker/0:0", tid = 4, delay = 9958820 }
[23:21:56.989868241] (+0.000005867) sched_wakeup: { 0 }, { comm = "kworker/0:0", tid = 4, prio = 120, success = 1, target_cpu = 0 }
[23:21:56.989869358] (+0.000001117) sched_wakeup: { 1 }, { comm = "kworker/1:1", tid = 21, prio = 120, success = 1, target_cpu = 1 }
[23:21:56.989877460] (+0.000008102) timer_expire_exit: { 1 }, { timer = 3993871956 }
[23:21:56.989878577] (+0.000001117) timer_expire_exit: { 0 }, { timer = 3993818708 }
.
.
.
</literallayout>

You can now safely destroy the trace session (note that
this doesn't delete the trace - it's still there
in ~/lttng-traces):
<literallayout class='monospaced'>
root@crownbay:~# lttng destroy
Session auto-20121015-232120 destroyed at /home/root
</literallayout>
Note that the trace is saved in a directory of the same
name as returned by 'lttng create', under the ~/lttng-traces
directory (note that you can change this by supplying your
own name to 'lttng create'):
<literallayout class='monospaced'>
root@crownbay:~# ls -al ~/lttng-traces
drwxrwx--- 3 root root 1024 Oct 15 23:21 .
drwxr-xr-x 5 root root 1024 Oct 15 23:57 ..
drwxrwx--- 3 root root 1024 Oct 15 23:21 auto-20121015-232120
</literallayout>

</para>
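<para>
The manual steps above can be wrapped around a single workload.
The following is a minimal sketch, not part of LTTng itself: the
helper name 'trace_workload' and the 'LTTNG' override variable are
assumptions made for illustration, and only the 'lttng' subcommands
already shown in this section are used.
</para>

```shell
#!/bin/sh
# LTTNG can be overridden for a dry run, e.g. LTTNG="echo lttng",
# which prints each lttng command instead of executing it.
LTTNG="${LTTNG:-lttng}"

# trace_workload SESSION COMMAND [ARGS...]
# (hypothetical helper) Creates a named session, enables all kernel
# events, runs COMMAND while tracing, then stops the trace.
trace_workload() {
    session="$1"
    shift
    # $LTTNG is intentionally unquoted so that "echo lttng" splits
    # into a command plus its first argument.
    $LTTNG create "$session" || return 1
    $LTTNG enable-event --kernel --all || return 1
    $LTTNG start || return 1
    "$@"                 # the workload runs while tracing is active
    status=$?
    $LTTNG stop
    return $status       # report the workload's own exit status
}
```

<para>
For example, 'trace_workload mysession sleep 5' would trace the
system for roughly five seconds. The session is left in place, so
'lttng view' and 'lttng destroy' can still be run afterwards as
shown above.
</para>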

</section>

<title>Collecting and viewing a userspace trace on the target (inside a shell)</title>

<para>
For LTTng userspace tracing, you need to have a properly
instrumented userspace program. For this example, we'll use
the 'hello' test program generated by the lttng-ust build.

</para>

<para>

The 'hello' test program isn't installed on the rootfs by
the lttng-ust build, so we need to copy it over manually.
First cd into the build directory that contains the hello
executable:
<literallayout class='monospaced'>
$ cd build/tmp/work/core2_32-poky-linux/lttng-ust/2.0.5-r0/git/tests/hello/.libs
</literallayout>
Copy that over to the target machine:
<literallayout class='monospaced'>
$ scp hello root@192.168.1.20:
</literallayout>
You now have the instrumented lttng 'hello world' test
program on the target, ready to test.

</para>

<para>

First, from the host, ssh to the target:
<literallayout class='monospaced'>
$ ssh -l root 192.168.1.47
The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
root@192.168.1.47's password:
</literallayout>
Once on the target, use these steps to create a trace:
<literallayout class='monospaced'>
root@crownbay:~# lttng create
Session auto-20190303-021943 created.
Traces will be written in /home/root/lttng-traces/auto-20190303-021943
</literallayout>
Enable the events you want to trace (in this case all
userspace events):
<literallayout class='monospaced'>
root@crownbay:~# lttng enable-event --userspace --all
All UST events are enabled in channel channel0
</literallayout>
Start the trace:
<literallayout class='monospaced'>
root@crownbay:~# lttng start
Tracing started for session auto-20190303-021943
</literallayout>
Run the instrumented hello world program:
<literallayout class='monospaced'>
root@crownbay:~# ./hello
Hello, World!
Tracing... done.
</literallayout>
And then stop the trace once the workload you want to
trace has finished:
<literallayout class='monospaced'>
root@crownbay:~# lttng stop
Tracing stopped for session auto-20190303-021943
</literallayout>

You can now view the trace in text form on the target:
<literallayout class='monospaced'>
root@crownbay:~# lttng view
[02:31:14.906146544] (+?.?????????) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 0, intfield2 = 0x0, longfield = 0, netintfield = 0, netintfieldhex = 0x0, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906170360] (+0.000023816) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 1, intfield2 = 0x1, longfield = 1, netintfield = 1, netintfieldhex = 0x1, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906183140] (+0.000012780) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 2, intfield2 = 0x2, longfield = 2, netintfield = 2, netintfieldhex = 0x2, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906194385] (+0.000011245) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 3, intfield2 = 0x3, longfield = 3, netintfield = 3, netintfieldhex = 0x3, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
.
.
.
</literallayout>

You can now safely destroy the trace session (note that
this doesn't delete the trace - it's still
there in ~/lttng-traces):
<literallayout class='monospaced'>
root@crownbay:~# lttng destroy
Session auto-20190303-021943 destroyed at /home/root
</literallayout>

</para>
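<para>
Because 'lttng view' emits plain text, individual fields can be
pulled out with standard tools. The following is a minimal sketch,
assuming the output format shown above; the function name
'ust_intfields' is hypothetical and only the 'tptest' event of the
'hello' program is handled.
</para>

```shell
#!/bin/sh
# ust_intfields (hypothetical helper): read 'lttng view' text on stdin
# and print, for each ust_tests_hello:tptest event, the timestamp
# followed by the value of its 'intfield' field.
ust_intfields() {
    awk '
    / ust_tests_hello:tptest: / {
        # The leftmost match is the plain "intfield" entry; "intfield2"
        # does not match because of the digit before " = ".
        if (match($0, /intfield = [0-9]+/)) {
            # "intfield = " is 11 characters long; keep only the digits.
            value = substr($0, RSTART + 11, RLENGTH - 11)
            print $1, value
        }
    }'
}
```

<para>
On the target this could be used as 'lttng view | ust_intfields'
while the session still exists.
</para>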

</section>

<title>Manually copying a trace to the host and viewing it in Eclipse (i.e. using Eclipse without network support)</title>

<para>
If you already have an LTTng trace on a remote target and
would like to view it in Eclipse on the host, you can easily
copy it from the target to the host and import it into
Eclipse to view it using the LTTng Eclipse plug-in already
bundled in Eclipse (Juno SR1 or later).

</para>

<para>

Using the kernel trace we created earlier, archive
it and copy it to your host system:
<literallayout class='monospaced'>
root@crownbay:~/lttng-traces# tar zcvf auto-20121015-232120.tar.gz auto-20121015-232120
auto-20121015-232120/
auto-20121015-232120/kernel/
auto-20121015-232120/kernel/metadata
auto-20121015-232120/kernel/channel0_1
auto-20121015-232120/kernel/channel0_0

$ scp root@192.168.1.47:lttng-traces/auto-20121015-232120.tar.gz .
root@192.168.1.47's password:
auto-20121015-232120.tar.gz 100% 1566KB 1.5MB/s 00:01
</literallayout>

Unarchive it on the host:
<literallayout class='monospaced'>
$ gunzip -c auto-20121015-232120.tar.gz | tar xvf -
auto-20121015-232120/
auto-20121015-232120/kernel/
auto-20121015-232120/kernel/metadata
auto-20121015-232120/kernel/channel0_1
auto-20121015-232120/kernel/channel0_0
</literallayout>

We can now import the trace into Eclipse and view it:
<orderedlist>
<listitem><para>First, start Eclipse and open the
'LTTng Kernel' perspective by selecting the following
menu item:
<literallayout class='monospaced'>
Window | Open Perspective | Other...
</literallayout></para></listitem>
<listitem><para>In the dialog box that opens, select
'LTTng Kernel' from the list.</para></listitem>
<listitem><para>Back at the main menu, select the
following menu item:
<literallayout class='monospaced'>
File | New | Project...
</literallayout></para></listitem>
<listitem><para>In the dialog box that opens, select
the 'Tracing | Tracing Project' wizard and press
'Next>'.</para></listitem>
<listitem><para>Give the project a name and press
'Finish'.</para></listitem>
<listitem><para>In the 'Project Explorer' pane under
the project you created, right click on the
'Traces' item.</para></listitem>
<listitem><para>Select 'Import...' and in the dialog
that's displayed:</para></listitem>
<listitem><para>Browse the filesystem and select
the 'kernel' directory containing the trace
you copied from the target,
e.g. auto-20121015-232120/kernel</para></listitem>
<listitem><para>'Checkmark' the directory in the tree
that's displayed for the trace</para></listitem>
<listitem><para>Below that, select 'Common Trace Format:
Kernel Trace' for the 'Trace Type'</para></listitem>
<listitem><para>Press 'Finish' to close the dialog
</para></listitem>
<listitem><para>Back in the 'Project Explorer' pane,
double-click on the 'kernel' item for the
trace you just imported under 'Traces'
</para></listitem>
</orderedlist>

You should now see your trace data displayed graphically
in several different views in Eclipse:

</para>

<para>

</para>

<para>

You can access extensive help information on how to use
the LTTng plug-in to search and analyze captured traces via
the Eclipse help system:
<literallayout class='monospaced'>
Help | Help Contents | LTTng Plug-in User Guide
</literallayout>

</para>

</section>

<title>Collecting and viewing a trace in Eclipse</title>

<note>
This section on collecting traces remotely doesn't currently
work because of Eclipse 'RSE' connectivity problems. Manually
tracing on the target, copying the trace files to the host,
and viewing the trace in Eclipse on the host as outlined in
previous steps does work, however. Please use the manual
steps outlined above to view traces in Eclipse.

</note>

<para>

In order to trace a remote target, you also need to add
a 'tracing' group on the target and connect as a user
who's part of that group, e.g.:
<literallayout class='monospaced'>
# adduser tomz
# groupadd -r tracing
# usermod -a -G tracing tomz
</literallayout>

<orderedlist>
<listitem><para>First, start Eclipse and open the
'LTTng Kernel' perspective by selecting the following
menu item:
<literallayout class='monospaced'>
Window | Open Perspective | Other...
</literallayout></para></listitem>
<listitem><para>In the dialog box that opens, select
'LTTng Kernel' from the list.</para></listitem>
<listitem><para>Back at the main menu, select the
following menu item:
<literallayout class='monospaced'>
File | New | Project...
</literallayout></para></listitem>
<listitem><para>In the dialog box that opens, select
the 'Tracing | Tracing Project' wizard and
press 'Next>'.</para></listitem>
<listitem><para>Give the project a name and press
'Finish'. That should result in an entry in the
'Project' subwindow.</para></listitem>
<listitem><para>In the 'Control' subwindow just below
it, press 'New Connection'.</para></listitem>
<listitem><para>Add a new connection, giving it the
hostname or IP address of the target system.
</para></listitem>
<listitem><para>Provide the username and password
of a qualified user (a member of the 'tracing' group)
or root account on the target system.
</para></listitem>
<listitem><para>Provide appropriate answers to whatever
else is asked for, e.g. 'secure storage password'
can be anything you want.
If you get an 'RSE Error' it may be due to proxies.
It may be possible to get around the problem by
changing the following setting:
<literallayout class='monospaced'>
Window | Preferences | Network Connections
</literallayout>
Switch 'Active Provider' to 'Direct'
</para></listitem>
</orderedlist>

</para>

</section>

</section>

<title>Documentation</title>

<para>
You can find the primary LTTng Documentation on the
<ulink url='https://lttng.org/docs/'>LTTng Documentation</ulink>
site.
The documentation on this site is appropriate for intermediate to
advanced software developers who are working in a Linux environment
and are interested in efficient software tracing.

</para>

<para>

For information on LTTng in general, visit the
<ulink url='http://lttng.org/lttng2.0'>LTTng Project</ulink>
site.
You can find a "Getting Started" link on this site that takes
you to an LTTng Quick Start.

</para>

<para>

Finally, you can access extensive help information on how to use
the LTTng plug-in to search and analyze captured traces via the
Eclipse help system:
<literallayout class='monospaced'>
Help | Help Contents | LTTng Plug-in User Guide
</literallayout>

</para>

</section>

</section>

<title>blktrace</title>

<para>
blktrace is a tool for tracing and reporting low-level disk I/O.
blktrace provides the tracing half of the equation; its output can
be piped into the blkparse program, which renders the data in a
human-readable form and does some basic analysis:

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the
basic setup outlined in the
"<link linkend='profile-manual-general-setup'>General Setup</link>"
section.

</para>

<para>

blktrace is an application that runs on the target system.
You can run the entire blktrace and blkparse pipeline on the
target, or you can run blktrace in 'listen' mode on the target
and have blktrace and blkparse collect and analyze the data on
the host (see the
"<link linkend='using-blktrace-remotely'>Using blktrace Remotely</link>"
section below).
For the rest of this section we assume you've ssh'ed to the
target and will be running blktrace there.

</para>

</section>

<title>Basic Usage</title>

<para>
To record a trace, simply run the 'blktrace' command, giving it
the name of the block device you want to trace activity on:
<literallayout class='monospaced'>
root@crownbay:~# blktrace /dev/sdc
</literallayout>
In another shell, execute a workload you want to trace.
<literallayout class='monospaced'>
root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
</literallayout>
Press Ctrl-C in the blktrace shell to stop the trace. It will
display how many events were logged, along with the per-cpu file
sizes (blktrace records traces in per-cpu kernel buffers and
simply dumps them to userspace for blkparse to merge and sort
later).
<literallayout class='monospaced'>
^C=== sdc ===
CPU 0: 7082 events, 332 KiB data
CPU 1: 1578 events, 74 KiB data
Total: 8660 events (dropped 0), 406 KiB data
</literallayout>
If you examine the files saved to disk, you see multiple files,
one per CPU and with the device name as the first part of the
filename:
<literallayout class='monospaced'>
root@crownbay:~# ls -al
drwxr-xr-x 6 root root 1024 Oct 27 22:39 .
drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
-rw-r--r-- 1 root root 339938 Oct 27 22:40 sdc.blktrace.0
-rw-r--r-- 1 root root 75753 Oct 27 22:40 sdc.blktrace.1
</literallayout>

To view the trace events, simply invoke 'blkparse' in the
directory containing the trace files, giving it the device name
that forms the first part of the filenames:
<literallayout class='monospaced'>
root@crownbay:~# blkparse sdc

8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 2 0.000025213 1225 G WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 3 0.000033384 1225 P N [jbd2/sdc-8]
8,32 1 4 0.000043301 1225 I WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 0 0.000057270 0 m N cfq1225 insert_request
8,32 1 0 0.000064813 0 m N cfq1225 add_to_rr
8,32 1 5 0.000076336 1225 U N [jbd2/sdc-8] 1
8,32 1 0 0.000088559 0 m N cfq workload slice:150
8,32 1 0 0.000097359 0 m N cfq1225 set_active wl_prio:0 wl_type:1
8,32 1 0 0.000104063 0 m N cfq1225 Not idling. st->count:1
8,32 1 0 0.000112584 0 m N cfq1225 fifo= (null)
8,32 1 0 0.000118730 0 m N cfq1225 dispatch_insert
8,32 1 0 0.000127390 0 m N cfq1225 dispatched a request
8,32 1 0 0.000133536 0 m N cfq1225 activate rq, drv=1
8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 7 0.000360381 1225 Q WS 3417056 + 8 [jbd2/sdc-8]
8,32 1 8 0.000377422 1225 G WS 3417056 + 8 [jbd2/sdc-8]
8,32 1 9 0.000388876 1225 P N [jbd2/sdc-8]
8,32 1 10 0.000397886 1225 Q WS 3417064 + 8 [jbd2/sdc-8]
8,32 1 11 0.000404800 1225 M WS 3417064 + 8 [jbd2/sdc-8]
8,32 1 12 0.000412343 1225 Q WS 3417072 + 8 [jbd2/sdc-8]
8,32 1 13 0.000416533 1225 M WS 3417072 + 8 [jbd2/sdc-8]
8,32 1 14 0.000422121 1225 Q WS 3417080 + 8 [jbd2/sdc-8]
8,32 1 15 0.000425194 1225 M WS 3417080 + 8 [jbd2/sdc-8]
8,32 1 16 0.000431968 1225 Q WS 3417088 + 8 [jbd2/sdc-8]
8,32 1 17 0.000435251 1225 M WS 3417088 + 8 [jbd2/sdc-8]
8,32 1 18 0.000440279 1225 Q WS 3417096 + 8 [jbd2/sdc-8]
8,32 1 19 0.000443911 1225 M WS 3417096 + 8 [jbd2/sdc-8]
8,32 1 20 0.000450336 1225 Q WS 3417104 + 8 [jbd2/sdc-8]
8,32 1 21 0.000454038 1225 M WS 3417104 + 8 [jbd2/sdc-8]
8,32 1 22 0.000462070 1225 Q WS 3417112 + 8 [jbd2/sdc-8]
8,32 1 23 0.000465422 1225 M WS 3417112 + 8 [jbd2/sdc-8]
8,32 1 24 0.000474222 1225 I WS 3417056 + 64 [jbd2/sdc-8]
8,32 1 0 0.000483022 0 m N cfq1225 insert_request
8,32 1 25 0.000489727 1225 U N [jbd2/sdc-8] 1
8,32 1 0 0.000498457 0 m N cfq1225 Not idling. st->count:1
8,32 1 0 0.000503765 0 m N cfq1225 dispatch_insert
8,32 1 0 0.000512914 0 m N cfq1225 dispatched a request
8,32 1 0 0.000518851 0 m N cfq1225 activate rq, drv=2
.
.
.
8,32 0 0 58.515006138 0 m N cfq3551 complete rqnoidle 1
8,32 0 2024 58.516603269 3 C WS 3156992 + 16 [0]
8,32 0 0 58.516626736 0 m N cfq3551 complete rqnoidle 1
8,32 0 0 58.516634558 0 m N cfq3551 arm_idle: 8 group_idle: 0
8,32 0 0 58.516636933 0 m N cfq schedule dispatch
8,32 1 0 58.516971613 0 m N cfq3551 slice expired t=0
8,32 1 0 58.516982089 0 m N cfq3551 sl_used=13 disp=6 charge=13 iops=0 sect=80
8,32 1 0 58.516985511 0 m N cfq3551 del_from_rr
8,32 1 0 58.516990819 0 m N cfq3551 put_queue

CPU0 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 331, 26,284KiB
Read Dispatches: 0, 0KiB Write Dispatches: 485, 40,484KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 511, 41,000KiB
Read Merges: 0, 0KiB Write Merges: 13, 160KiB
Read depth: 0 Write depth: 2
IO unplugs: 23 Timer unplugs: 0
CPU1 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 249, 15,800KiB
Read Dispatches: 0, 0KiB Write Dispatches: 42, 1,600KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 16, 1,084KiB
Read Merges: 0, 0KiB Write Merges: 40, 276KiB
Read depth: 0 Write depth: 2
IO unplugs: 30 Timer unplugs: 1

Total (sdc):
Reads Queued: 0, 0KiB Writes Queued: 580, 42,084KiB
Read Dispatches: 0, 0KiB Write Dispatches: 527, 42,084KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 527, 42,084KiB
Read Merges: 0, 0KiB Write Merges: 53, 436KiB
IO unplugs: 53 Timer unplugs: 1

Throughput (R/W): 0KiB/s / 719KiB/s
Events (sdc): 6,592 entries
Skips: 0 forward (0 - 0.0%)
Input file sdc.blktrace.0 added
Input file sdc.blktrace.1 added
</literallayout>

The report shows each event that was found in the blktrace data,
along with a summary of the overall block I/O traffic during
the run. You can look at the
<ulink url='http://linux.die.net/man/1/blkparse'>blkparse</ulink>
manpage to learn the
meaning of each field displayed in the trace listing.

</para>
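<para>
For a quick overview of a long listing, the per-event lines can be
tallied by their action code (the sixth column in the default
output shown above, e.g. Q, G, I, D, C, M, or m for messages).
The following is a minimal sketch, assuming that default column
layout; the function name 'count_actions' is hypothetical.
</para>

```shell
#!/bin/sh
# count_actions (hypothetical helper): read blkparse output on stdin
# and print one "ACTION COUNT" line per action code, sorted by code.
count_actions() {
    awk '
    # Per-event lines begin with "major,minor"; the trailing per-CPU
    # summary lines do not, so they are skipped.
    $1 ~ /^[0-9]+,[0-9]+$/ && NF >= 6 { count[$6]++ }
    END { for (a in count) print a, count[a] }
    ' | LC_ALL=C sort
}
```

<para>
On the target this could be used as 'blkparse sdc | count_actions'.
See the blkparse manpage for the meaning of each action code.
</para>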

<para>

blktrace and blkparse are designed from the ground up to
be able to operate together in a 'pipe mode' where the
stdout of blktrace can be fed directly into the stdin of
blkparse:
<literallayout class='monospaced'>
root@crownbay:~# blktrace /dev/sdc -o - | blkparse -i -
</literallayout>

This enables long-lived tracing sessions to run without
writing anything to disk, and allows the user to look for
certain conditions in the trace data in 'real-time' by
viewing the trace output as it scrolls by on the screen or
by passing it along to yet another program in the pipeline,
such as grep, which can be used to identify and capture
conditions of interest.

</para>
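<para>
As an illustration of adding one more stage to the pipeline, the
following minimal sketch keeps only completion ('C') events. It
uses awk rather than grep so the match is restricted to the action
column; the function name 'completions_only' is hypothetical and
the default blkparse column layout is assumed.
</para>

```shell
#!/bin/sh
# completions_only (hypothetical helper): pass through only the
# per-event lines whose action code (sixth column) is "C".
completions_only() {
    awk '$1 ~ /^[0-9]+,[0-9]+$/ && $6 == "C"'
}

# Intended use on the target (needs root and a real block device):
#   blktrace /dev/sdc -o - | blkparse -i - | completions_only
```

<para>
Any other condition of interest can be expressed the same way by
changing the awk pattern.
</para>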

<para>

There's actually another command, btrace, that implements
the above pipeline as a single command, so the user doesn't
have to bother typing in the above command sequence:
<literallayout class='monospaced'>
root@crownbay:~# btrace /dev/sdc
</literallayout>

</para>

</section>

<title>Using blktrace Remotely</title>

<para>
Because blktrace traces block I/O and at the same time
normally writes its trace data to a block device, and
because in general it's not a good idea to make
the device being traced the same as the device the tracer
writes to, blktrace provides a way to trace without
perturbing the traced device at all by providing native
support for sending all trace data over the network.

</para>

<para>

To have blktrace operate in this mode, start blktrace on
the target system being traced with the -l option, along with
the device to trace:
<literallayout class='monospaced'>
root@crownbay:~# blktrace -l /dev/sdc
server: waiting for connections...
</literallayout>
On the host system, use the -h option to connect to the
target system, also passing it the device to trace:
<literallayout class='monospaced'>
$ blktrace -d /dev/sdc -h 192.168.1.43
blktrace: connecting to 192.168.1.43
blktrace: connected!
</literallayout>
On the target system, you should see this:
<literallayout class='monospaced'>
server: connection from 192.168.1.43
</literallayout>
In another shell, execute a workload you want to trace.
<literallayout class='monospaced'>
root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
</literallayout>

When it's done, do a Ctrl-C on the host system to
stop the trace:
<literallayout class='monospaced'>
^C=== sdc ===
CPU 0: 7691 events, 361 KiB data
CPU 1: 4109 events, 193 KiB data
Total: 11800 events (dropped 0), 554 KiB data
</literallayout>
On the target system, you should also see a trace
summary for the trace just ended:
<literallayout class='monospaced'>
server: end of run for 192.168.1.43:sdc
=== sdc ===
CPU 0: 7691 events, 361 KiB data
CPU 1: 4109 events, 193 KiB data
Total: 11800 events (dropped 0), 554 KiB data
</literallayout>
The blktrace instance on the host will save the target
output inside a hostname-timestamp directory:
<literallayout class='monospaced'>
$ ls -al
drwxr-xr-x 10 root root 1024 Oct 28 02:40 .
drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
drwxr-xr-x 2 root root 1024 Oct 28 02:40 192.168.1.43-2012-10-28-02:40:56
</literallayout>
cd into that directory to see the output files:
<literallayout class='monospaced'>
$ ls -l
-rw-r--r-- 1 root root 369193 Oct 28 02:44 sdc.blktrace.0
-rw-r--r-- 1 root root 197278 Oct 28 02:44 sdc.blktrace.1
</literallayout>

And run blkparse on the host system using the device name:
<literallayout class='monospaced'>
$ blkparse sdc

8,32 1 1 0.000000000 1263 Q RM 6016 + 8 [ls]
8,32 1 0 0.000036038 0 m N cfq1263 alloced
8,32 1 2 0.000039390 1263 G RM 6016 + 8 [ls]
8,32 1 3 0.000049168 1263 I RM 6016 + 8 [ls]
8,32 1 0 0.000056152 0 m N cfq1263 insert_request
8,32 1 0 0.000061600 0 m N cfq1263 add_to_rr
8,32 1 0 0.000075498 0 m N cfq workload slice:300
.
.
.
8,32 0 0 177.266385696 0 m N cfq1267 arm_idle: 8 group_idle: 0
8,32 0 0 177.266388140 0 m N cfq schedule dispatch
8,32 1 0 177.266679239 0 m N cfq1267 slice expired t=0
8,32 1 0 177.266689297 0 m N cfq1267 sl_used=9 disp=6 charge=9 iops=0 sect=56
8,32 1 0 177.266692649 0 m N cfq1267 del_from_rr
8,32 1 0 177.266696560 0 m N cfq1267 put_queue

CPU0 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 270, 21,708KiB
Read Dispatches: 59, 2,628KiB Write Dispatches: 495, 39,964KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 90, 2,752KiB Writes Completed: 543, 41,596KiB
Read Merges: 0, 0KiB Write Merges: 9, 344KiB
Read depth: 2 Write depth: 2
IO unplugs: 20 Timer unplugs: 1
CPU1 (sdc):
Reads Queued: 688, 2,752KiB Writes Queued: 381, 20,652KiB
Read Dispatches: 31, 124KiB Write Dispatches: 59, 2,396KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 11, 764KiB
Read Merges: 598, 2,392KiB Write Merges: 88, 448KiB
Read depth: 2 Write depth: 2
IO unplugs: 52 Timer unplugs: 0

Total (sdc):
Reads Queued: 688, 2,752KiB Writes Queued: 651, 42,360KiB
Read Dispatches: 90, 2,752KiB Write Dispatches: 554, 42,360KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 90, 2,752KiB Writes Completed: 554, 42,360KiB
Read Merges: 598, 2,392KiB Write Merges: 97, 792KiB
IO unplugs: 72 Timer unplugs: 1

Throughput (R/W): 15KiB/s / 238KiB/s
Events (sdc): 9,301 entries

3065

Skips: 0 forward (0 - 0.0%)

3066

</literallayout>


You should see the trace events and summary just as


you would have if you'd run the same command on the target.
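If you only want a quick tally of the event types in a trace,
a small filter over the blkparse text output is enough.
The following is a minimal sketch rather than part of the blktrace tools:
count_actions is a hypothetical helper name, and it assumes the default
blkparse output format shown above, in which the first field is the
major,minor device number and the sixth field is the action code
(Q, G, I, m, and so on):

```shell
# Hypothetical helper: tally blkparse events by action code (field 6).
# Reads default-format blkparse text on stdin. Summary lines are skipped
# because they do not begin with a "major,minor" device number.
count_actions() {
    awk 'NF >= 6 && $1 ~ /^[0-9]+,[0-9]+$/ { n[$6]++ }
         END { for (a in n) print a, n[a] }' | LC_ALL=C sort
}
```

For example, 'blkparse sdc | count_actions' prints one line per action
code followed by the number of events seen for it.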

</para>

</section>

<title>Tracing Block I/O via 'ftrace'</title>

<para>
It's also possible to trace block I/O using only 'ftrace',
which can be useful for casual tracing
if you don't want to bother dealing with the userspace tools.
</para>

<para>

To enable tracing for a given device, write 1 to
/sys/block/xxx/trace/enable, where xxx is the device name.
For example, this enables tracing for /dev/sdc:
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing# echo 1 > /sys/block/sdc/trace/enable
</literallayout>

Once you've enabled tracing for the device(s) you want to trace,
select the 'blk' tracer to start collecting events:
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph function nop

root@crownbay:/sys/kernel/debug/tracing# echo blk > current_tracer
</literallayout>

Execute the workload you're interested in:
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing# cat /media/sdc/testfile.txt
</literallayout>

And look at the output (note that we're using
'trace_pipe' instead of 'trace' to capture this trace,
which lets us wait on the pipe for data to appear):
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing# cat trace_pipe
cat-3587 [001] d..1 3023.276361: 8,32 Q R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276410: 8,32 m N cfq3587 alloced
cat-3587 [001] d..1 3023.276415: 8,32 G R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276424: 8,32 P N [cat]
cat-3587 [001] d..2 3023.276432: 8,32 I R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276439: 8,32 m N cfq3587 insert_request
cat-3587 [001] d..1 3023.276445: 8,32 m N cfq3587 add_to_rr
cat-3587 [001] d..2 3023.276454: 8,32 U N [cat] 1
cat-3587 [001] d..1 3023.276464: 8,32 m N cfq workload slice:150
cat-3587 [001] d..1 3023.276471: 8,32 m N cfq3587 set_active wl_prio:0 wl_type:2
cat-3587 [001] d..1 3023.276478: 8,32 m N cfq3587 fifo= (null)
cat-3587 [001] d..1 3023.276483: 8,32 m N cfq3587 dispatch_insert
cat-3587 [001] d..1 3023.276490: 8,32 m N cfq3587 dispatched a request
cat-3587 [001] d..1 3023.276497: 8,32 m N cfq3587 activate rq, drv=1
cat-3587 [001] d..2 3023.276500: 8,32 D R 1699848 + 8 [cat]
</literallayout>

And this turns off tracing for the specified device:
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing# echo 0 > /sys/block/sdc/trace/enable
</literallayout>
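The enable and disable steps above can be wrapped in a couple of
shell functions if you find yourself repeating them.
This is only a sketch: blk_ftrace_start and blk_ftrace_stop are
hypothetical names, and the SYSFS_BLOCK and TRACING_DIR variables are
assumptions introduced here so the paths can be overridden.
They default to the locations used in the examples above:

```shell
# Hypothetical wrappers around the sysfs/debugfs writes shown above.
# SYSFS_BLOCK and TRACING_DIR are illustrative variables, not kernel or
# blktrace settings; override them to dry-run against a scratch directory.
SYSFS_BLOCK=${SYSFS_BLOCK:-/sys/block}
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

# usage: blk_ftrace_start sdc
blk_ftrace_start() {
    # Enable block tracing for the device, then select the blk tracer.
    echo 1 > "$SYSFS_BLOCK/$1/trace/enable" || return 1
    echo blk > "$TRACING_DIR/current_tracer"
}

# usage: blk_ftrace_stop sdc
blk_ftrace_stop() {
    # Disable block tracing for the device; the tracer selection is left as-is.
    echo 0 > "$SYSFS_BLOCK/$1/trace/enable"
}
```

With these defined, 'blk_ftrace_start sdc' followed by reading
trace_pipe corresponds to the session above, and 'blk_ftrace_stop sdc'
turns tracing for the device back off. Like the original commands,
the functions need to be run as root.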

</para>

</section>

</section>

<title>Documentation</title>

<para>
Online versions of the man pages for the commands discussed
in this section can be found here:
<itemizedlist>
<listitem><para><ulink url='http://linux.die.net/man/8/blktrace'>http://linux.die.net/man/8/blktrace</ulink>
</para></listitem>
<listitem><para><ulink url='http://linux.die.net/man/1/blkparse'>http://linux.die.net/man/1/blkparse</ulink>
</para></listitem>
<listitem><para><ulink url='http://linux.die.net/man/8/btrace'>http://linux.die.net/man/8/btrace</ulink>
</para></listitem>
</itemizedlist>
</para>

<para>

The above man pages, along with man pages for the other
blktrace utilities (btt, blkiomon, etc.), can be found in the
/doc directory of the blktrace tools git repo:
<literallayout class='monospaced'>
$ git clone git://git.kernel.dk/blktrace.git
</literallayout>

</para>

</section>

</section>

</chapter>

<!--

vim: expandtab tw=80 ts=4


-->