
wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>


</literallayout>

The quickest and easiest way to get some basic overall data about
what's going on for a particular workload is to profile it using
'perf stat'. 'perf stat' profiles using a few default
counters and displays the summed counts at the end of the run:
<literallayout class='monospaced'>

root@crownbay:~# perf stat wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA

Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

       4597.223902 task-clock                #    0.077 CPUs utilized
             23568 context-switches          #    0.005 M/sec
                68 CPU-migrations            #    0.015 K/sec
               241 page-faults               #    0.052 K/sec
        3045817293 cycles                    #    0.663 GHz
   &lt;not supported&gt; stalled-cycles-frontend
   &lt;not supported&gt; stalled-cycles-backend
         858909167 instructions              #    0.28  insns per cycle
         165441165 branches                  #   35.987 M/sec
          19550329 branch-misses             #   11.82% of all branches

      59.836627620 seconds time elapsed

</literallayout>

Often such a simple test doesn't yield much of
interest, but sometimes it does (see Real-world Yocto bug
(slow loop-mounted write speed)).

</para>
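<para>
    As a rough illustration of where the annotations after the '#' in the
    listing above come from, the following Python sketch recomputes them
    from the raw counts. The dictionary keys and the
    <filename>derived_metrics</filename> helper are names invented for this
    example and are not part of perf; the counts themselves are copied
    from the run shown above.
</para>

```python
# Raw counts copied from the 'perf stat' run shown above.
# The key names are illustrative, not perf identifiers.
counts = {
    "task-clock-msec": 4597.223902,
    "cycles": 3045817293,
    "instructions": 858909167,
    "branches": 165441165,
    "branch-misses": 19550329,
}
elapsed_sec = 59.836627620


def derived_metrics(counts, elapsed_sec):
    """Recompute the ratios perf prints after the '#' on each line."""
    task_sec = counts["task-clock-msec"] / 1000.0
    return {
        # CPU time consumed divided by wall-clock time.
        "cpus_utilized": task_sec / elapsed_sec,
        # Cycles per second of CPU time, in GHz.
        "ghz": counts["cycles"] / task_sec / 1e9,
        "insns_per_cycle": counts["instructions"] / counts["cycles"],
        "branch_miss_pct": 100.0 * counts["branch-misses"] / counts["branches"],
    }


metrics = derived_metrics(counts, elapsed_sec)
for name, value in metrics.items():
    print("%-16s %.3f" % (name, value))
```

<para>
    The low 'CPUs utilized' figure suggests that this workload spends most
    of its wall-clock time waiting rather than computing, which is what one
    would expect of a network download.
</para>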

<para>

Also, note that 'perf stat' isn't restricted to a fixed set of
counters - any event listed in the output of 'perf list'
can be tallied by 'perf stat'. For example, suppose we wanted to
see a summary of all the events related to kernel memory
allocation/freeing along with cache hits and misses:
<literallayout class='monospaced'>

root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA

Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

              5566 kmem:kmalloc
            125517 kmem:kmem_cache_alloc
                 0 kmem:kmalloc_node
                 0 kmem:kmem_cache_alloc_node
             34401 kmem:kfree
             69920 kmem:kmem_cache_free
               133 kmem:mm_page_free
                41 kmem:mm_page_free_batched
             11502 kmem:mm_page_alloc
             11375 kmem:mm_page_alloc_zone_locked
                 0 kmem:mm_page_pcpu_drain
                 0 kmem:mm_page_alloc_extfrag
          66848602 cache-references
           2917740 cache-misses              #    4.365 % of all cache refs

      44.831023415 seconds time elapsed

</literallayout>

So 'perf stat' gives us a nice easy way to get a quick overview of
what might be happening for a set of events, but normally we'd
need a little more detail in order to understand what's going on
well enough to act on it.

</para>
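<para>
    If you want to post-process counts like these rather than read them by
    eye, a small parser is usually enough. The sketch below assumes the
    plain 'count event-name' line layout shown in the listing above (no
    thousands separators); <filename>parse_perf_stat</filename> is a
    hypothetical helper written for this example, not a perf feature.
</para>

```python
import re

# Trimmed copy of the 'perf stat' output shown above.
PERF_STAT_OUTPUT = """
              5566 kmem:kmalloc
            125517 kmem:kmem_cache_alloc
             34401 kmem:kfree
             69920 kmem:kmem_cache_free
          66848602 cache-references
           2917740 cache-misses              #    4.365 % of all cache refs
"""

# An integer count, then an event name; anything after the name
# (such as perf's own '#' annotation) is ignored.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)")


def parse_perf_stat(text):
    """Return {event_name: count} for every 'count event' line in text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_perf_stat(PERF_STAT_OUTPUT)
miss_pct = 100.0 * counts["cache-misses"] / counts["cache-references"]
allocs = counts["kmem:kmalloc"] + counts["kmem:kmem_cache_alloc"]
frees = counts["kmem:kfree"] + counts["kmem:kmem_cache_free"]
print("cache miss rate: %.3f%%" % miss_pct)
print("allocations: %d, frees: %d" % (allocs, frees))
```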

<para>

To dive down into the next level of detail, we can use 'perf
record'/'perf report', which will collect profiling data and
present it to us using an interactive text-based UI (or
simply as text if we specify --stdio to 'perf report').

</para>

<para>

As our first attempt at profiling this workload, we'll simply
run 'perf record', handing it the workload we want to profile
(everything after 'perf record' and any perf options we hand
it - here none - will be executed in a new shell). perf collects
samples until the process exits and records them in a file named
'perf.data' in the current working directory.
<literallayout class='monospaced'>

root@crownbay:~# perf record wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>


Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ]

</literallayout>

To see the results in a 'text-based UI' (tui), simply run
'perf report', which will read the perf.data file in the current
working directory and display the results in an interactive UI:
<literallayout class='monospaced'>

root@crownbay:~# perf report

</literallayout>

</para>

<para>

</para>

<para>

The above screenshot displays a 'flat' profile, with one entry for
each 'bucket' corresponding to the functions that were profiled
during the profiling run, ordered from the most popular to the
least (perf has options to sort by various orders and keys, as
well as to display only entries above a certain threshold, and so
on - see the perf documentation for details). Note that this
includes both userspace functions (entries containing a [.]) and
kernel functions accounted to the process (entries containing
a [k]). (perf has command-line modifiers that can be used to
restrict the profiling to kernel or userspace, among others.)

</para>
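<para>
    The [.] and [k] markers make it easy to split a text report into
    userspace and kernel totals. The following sketch assumes rows in the
    'Overhead, Command, Shared Object, Symbol' layout that
    'perf report --stdio' prints; the sample rows are hypothetical and
    written by hand for illustration (only the 1.77% figure for
    __copy_to_user_ll() appears in this section), and
    <filename>overhead_by_space</filename> is a name invented here.
</para>

```python
import re

# HYPOTHETICAL rows, hand-written for illustration only. They mimic the
# 'Overhead  Command  Shared Object  Symbol' layout of a text report.
SAMPLE_REPORT = """
     4.00%  wget  busybox            [.] 0x000b4a20
     1.77%  wget  [kernel.kallsyms]  [k] __copy_to_user_ll
     1.50%  wget  [kernel.kallsyms]  [k] tcp_recvmsg
     0.75%  wget  libc-2.16.so       [.] memcpy
"""

# percentage, command, shared object, one-character marker, symbol
ROW_RE = re.compile(r"^\s*([\d.]+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)")


def overhead_by_space(report_text):
    """Sum the overhead percentages of [.] (user) and [k] (kernel) rows."""
    totals = {"user": 0.0, "kernel": 0.0}
    for line in report_text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue
        marker = match.group(4)
        if marker == ".":
            totals["user"] += float(match.group(1))
        elif marker == "k":
            totals["kernel"] += float(match.group(1))
    return totals


totals = overhead_by_space(SAMPLE_REPORT)
print("user %.2f%%, kernel %.2f%%" % (totals["user"], totals["kernel"]))
```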

<para>

Notice also that the above report shows an entry for 'busybox',
which is the executable that implements 'wget' in Yocto, but that
instead of a useful function name, that entry displays
a not-so-friendly hex value. The steps below will show
how to fix that problem.

</para>

<para>

Before we do that, however, let's try running a different profile,
one which shows something a little more interesting. The only
difference between the new profile and the previous one is that
we'll add the -g option, which will record not just the address
of a sampled function, but the entire callchain to the sampled
function as well:
<literallayout class='monospaced'>

root@crownbay:~# perf record -g wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ]


root@crownbay:~# perf report

</literallayout>

</para>

<para>

</para>

<para>

Using the callgraph view, we can see not only which
functions took the most time, but also a summary of
how those functions were called, and learn something about how the
program interacts with the kernel in the process.

</para>

<para>

Notice that each entry in the above screenshot now contains a '+'
on the left-hand side. This means that we can expand the entry and
drill down into the callchains that feed into that entry.
Pressing 'enter' on any one of them will expand the callchain
(you can also press 'E' to expand them all at the same time or 'C'
to collapse them all).

</para>

<para>

In the screenshot above, we've toggled the __copy_to_user_ll()
entry and several subnodes all the way down. This lets us see
which callchains contributed to the profiled __copy_to_user_ll()
function, which accounted for 1.77% of the total profile.

</para>

<para>

As a bit of background explanation for these callchains, think
about what happens at a high level when you run wget to get a file
out on the network. Basically what happens is that the data comes
into the kernel via the network connection (socket) and is passed
to the userspace program 'wget' (which is actually a part of
busybox, but that's not important for now), which takes the buffers
the kernel passes to it and writes them to a disk file.

</para>

<para>

The part of this process that we're looking at in the above call
stacks is the part where the kernel passes the data it has read from
the socket down to wget, i.e. a copy-to-user.

</para>

<para>

Notice also that there's a case here where a hex value
is displayed in the callstack, in the expanded
sys_clock_gettime() function. Later we'll see it resolve to a
userspace function call in busybox.

</para>

<para>

</para>

<para>

The above screenshot shows the other half of the journey for the
data - from the wget program's userspace buffers to disk. To get
the buffers to disk, the wget program issues a write(2), which
does a copy-from-user to the kernel, which then takes care, via
some circuitous path (probably also present somewhere in the
profile data), of getting the data safely to disk.

</para>

<para>

Now that we've seen the basic layout of the profile data and the
basics of how to extract useful information out of it, let's get
back to the task at hand and see if we can get some basic idea
about where the time is spent in the program we're profiling,
wget. Remember that wget is actually implemented as an applet
in busybox, so while the process name is 'wget', the executable
we're actually interested in is busybox. So let's expand the
first entry containing busybox:

</para>

<para>

</para>

<para>

Again, before we expanded the entry, we saw that the function was labeled
with a hex value instead of a symbol, unlike most of the kernel
entries. Expanding the busybox entry doesn't make it any better.

</para>

<para>

The problem is that perf can't find the symbol information for the
busybox binary, because the Yocto build system strips it out.

</para>

<para>

One way around that is to put the following in your
<filename>local.conf</filename> file when you build the image:
<literallayout class='monospaced'>
<ulink url='&YOCTO_DOCS_REF_URL;#var-INHIBIT_PACKAGE_STRIP'>INHIBIT_PACKAGE_STRIP</ulink> = "1"

</literallayout>

However, we already have an image with the binaries stripped,
so what can we do to get perf to resolve the symbols? Basically
we need to install the debuginfo for the busybox package.

</para>

<para>

To generate the debug info for the packages in the image, we can
add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
<literallayout class='monospaced'>
EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"

</literallayout>

Additionally, in order to generate the type of debuginfo that
perf understands, we also need to set
<ulink url='&YOCTO_DOCS_REF_URL;#var-PACKAGE_DEBUG_SPLIT_STYLE'><filename>PACKAGE_DEBUG_SPLIT_STYLE</filename></ulink>
in the <filename>local.conf</filename> file:
<literallayout class='monospaced'>
PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'

</literallayout>

Once we've done that, we can install the debuginfo for busybox.
Once built, the debug packages can be found in
build/tmp/deploy/rpm/* on the host system. Find the
busybox-dbg-...rpm file and copy it to the target. For example:
<literallayout class='monospaced'>
[trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2_32/busybox-dbg-1.20.2-r2.core2_32.rpm root@192.168.1.31:
root@192.168.1.31's password:
busybox-dbg-1.20.2-r2.core2_32.rpm 100% 1826KB 1.8MB/s 00:01

</literallayout>

Now install the debug rpm on the target:
<literallayout class='monospaced'>
root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2_32.rpm

</literallayout>

Now that the debuginfo is installed, we see that the busybox
entries display their functions symbolically:

</para>

<para>

</para>

<para>

If we expand one of the entries and press 'enter' on a leaf node,
we're presented with a menu of actions we can take to get more
information related to that entry:

</para>

<para>

</para>

<para>

One of these actions allows us to display a
busybox-centric view of the profiled functions (in this case we've
also expanded all the nodes using the 'E' key):

</para>

<para>

</para>

<para>

Finally, we can see that now that the busybox debuginfo is
installed, the previously unresolved symbol in the
sys_clock_gettime() entry mentioned earlier is resolved,
and shows that the sys_clock_gettime system call that was the
source of 6.75% of the copy-to-user overhead was initiated by
the handle_input() busybox function:

</para>

<para>

</para>

<para>

At the lowest level of detail, we can dive down to the assembly
level and see which instructions caused the most overhead in a
function. Pressing 'enter' on the 'udhcpc_main' function, we're
again presented with a menu:

</para>

<para>

</para>

<para>

Selecting 'Annotate udhcpc_main', we get a detailed listing of
percentages by instruction for the udhcpc_main function. From the
display, we can see that over 50% of the time spent in this
function is taken up by a couple of tests and the move of a
constant (1) to a register:

</para>

<para>

</para>

<para>

As a segue into tracing, let's try another profile using a
different counter, something other than the default 'cycles'.

</para>

<para>

The tracing and profiling infrastructure in Linux has become
unified in a way that allows us to use the same tool with a
completely different set of counters, not just the standard
hardware counters that traditional tools have had to restrict
themselves to (of course the traditional tools can also make use
of the expanded possibilities now available to them, and in some
cases have, as mentioned previously).

</para>

<para>

We can get a list of the available events that can be used to
profile a workload via 'perf list':
<literallayout class='monospaced'>

root@crownbay:~# perf list


List of pre-defined events (to be used in -e):
 cpu-cycles OR cycles                               [Hardware event]
 stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
 stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
 instructions                                       [Hardware event]
 cache-references                                   [Hardware event]
 cache-misses                                       [Hardware event]
 branch-instructions OR branches                    [Hardware event]
 branch-misses                                      [Hardware event]
 bus-cycles                                         [Hardware event]
 ref-cycles                                         [Hardware event]

 cpu-clock                                          [Software event]
 task-clock                                         [Software event]
 page-faults OR faults                              [Software event]
 minor-faults                                       [Software event]
 major-faults                                       [Software event]
 context-switches OR cs                             [Software event]
 cpu-migrations OR migrations                       [Software event]
 alignment-faults                                   [Software event]
 emulation-faults                                   [Software event]

 L1-dcache-loads                                    [Hardware cache event]
 L1-dcache-load-misses                              [Hardware cache event]
 L1-dcache-prefetch-misses                          [Hardware cache event]
 L1-icache-loads                                    [Hardware cache event]
 L1-icache-load-misses                              [Hardware cache event]
 .
 .
 .
 rNNN                                               [Raw hardware event descriptor]
 cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
  (see 'perf list --help' on how to encode it)

 mem:&lt;addr&gt;[:access]                          [Hardware breakpoint]

 sunrpc:rpc_call_status                             [Tracepoint event]
 sunrpc:rpc_bind_status                             [Tracepoint event]
 sunrpc:rpc_connect_status                          [Tracepoint event]
 sunrpc:rpc_task_begin                              [Tracepoint event]
 skb:kfree_skb                                      [Tracepoint event]
 skb:consume_skb                                    [Tracepoint event]
 skb:skb_copy_datagram_iovec                        [Tracepoint event]
 net:net_dev_xmit                                   [Tracepoint event]
 net:net_dev_queue                                  [Tracepoint event]
 net:netif_receive_skb                              [Tracepoint event]
 net:netif_rx                                       [Tracepoint event]
 napi:napi_poll                                     [Tracepoint event]
 sock:sock_rcvqueue_full                            [Tracepoint event]
 sock:sock_exceed_buf_limit                         [Tracepoint event]
 udp:udp_fail_queue_rcv_skb                         [Tracepoint event]
 hda:hda_send_cmd                                   [Tracepoint event]
 hda:hda_get_response                               [Tracepoint event]
 hda:hda_bus_reset                                  [Tracepoint event]
 scsi:scsi_dispatch_cmd_start                       [Tracepoint event]
 scsi:scsi_dispatch_cmd_error                       [Tracepoint event]
 scsi:scsi_eh_wakeup                                [Tracepoint event]
 drm:drm_vblank_event                               [Tracepoint event]
 drm:drm_vblank_event_queued                        [Tracepoint event]
 drm:drm_vblank_event_delivered                     [Tracepoint event]
 random:mix_pool_bytes                              [Tracepoint event]
 random:mix_pool_bytes_nolock                       [Tracepoint event]
 random:credit_entropy_bits                         [Tracepoint event]
 gpio:gpio_direction                                [Tracepoint event]
 gpio:gpio_value                                    [Tracepoint event]
 block:block_rq_abort                               [Tracepoint event]
 block:block_rq_requeue                             [Tracepoint event]
 block:block_rq_issue                               [Tracepoint event]
 block:block_bio_bounce                             [Tracepoint event]
 block:block_bio_complete                           [Tracepoint event]
 block:block_bio_backmerge                          [Tracepoint event]
 .
 .
 writeback:writeback_wake_thread                    [Tracepoint event]
 writeback:writeback_wake_forker_thread             [Tracepoint event]
 writeback:writeback_bdi_register                   [Tracepoint event]
 .
 .
 writeback:writeback_single_inode_requeue           [Tracepoint event]
 writeback:writeback_single_inode                   [Tracepoint event]
 kmem:kmalloc                                       [Tracepoint event]
 kmem:kmem_cache_alloc                              [Tracepoint event]
 kmem:mm_page_alloc                                 [Tracepoint event]
 kmem:mm_page_alloc_zone_locked                     [Tracepoint event]
 kmem:mm_page_pcpu_drain                            [Tracepoint event]
 kmem:mm_page_alloc_extfrag                         [Tracepoint event]
 vmscan:mm_vmscan_kswapd_sleep                      [Tracepoint event]
 vmscan:mm_vmscan_kswapd_wake                       [Tracepoint event]
 vmscan:mm_vmscan_wakeup_kswapd                     [Tracepoint event]
 vmscan:mm_vmscan_direct_reclaim_begin              [Tracepoint event]
 .
 .
 module:module_get                                  [Tracepoint event]
 module:module_put                                  [Tracepoint event]
 module:module_request                              [Tracepoint event]
 sched:sched_kthread_stop                           [Tracepoint event]
 sched:sched_wakeup                                 [Tracepoint event]
 sched:sched_wakeup_new                             [Tracepoint event]
 sched:sched_process_fork                           [Tracepoint event]
 sched:sched_process_exec                           [Tracepoint event]
 sched:sched_stat_runtime                           [Tracepoint event]
 rcu:rcu_utilization                                [Tracepoint event]
 workqueue:workqueue_queue_work                     [Tracepoint event]
 workqueue:workqueue_execute_end                    [Tracepoint event]
 signal:signal_generate                             [Tracepoint event]
 signal:signal_deliver                              [Tracepoint event]
 timer:timer_init                                   [Tracepoint event]
 timer:timer_start                                  [Tracepoint event]
 timer:hrtimer_cancel                               [Tracepoint event]
 timer:itimer_state                                 [Tracepoint event]
 timer:itimer_expire                                [Tracepoint event]
 irq:irq_handler_entry                              [Tracepoint event]
 irq:irq_handler_exit                               [Tracepoint event]
 irq:softirq_entry                                  [Tracepoint event]
 irq:softirq_exit                                   [Tracepoint event]
 irq:softirq_raise                                  [Tracepoint event]
 printk:console                                     [Tracepoint event]
 task:task_newtask                                  [Tracepoint event]
 task:task_rename                                   [Tracepoint event]
 syscalls:sys_enter_socketcall                      [Tracepoint event]
 syscalls:sys_exit_socketcall                       [Tracepoint event]
 .
 .
 .
 syscalls:sys_enter_unshare                         [Tracepoint event]
 syscalls:sys_exit_unshare                          [Tracepoint event]
 raw_syscalls:sys_enter                             [Tracepoint event]
 raw_syscalls:sys_exit                              [Tracepoint event]

</literallayout>

</para>
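<para>
    Tracepoint names in the 'perf list' output above follow a
    'subsystem:event' pattern, and a whole subsystem can be selected with
    a wildcard such as 'kmem:*'. As a minimal sketch, the Python below
    groups a handful of names from that listing by subsystem and builds
    the matching '-e' arguments. The helper names
    <filename>group_by_subsystem</filename> and
    <filename>wildcard_args</filename> are made up for this example.
</para>

```python
from collections import defaultdict

# A few tracepoint names copied from the 'perf list' output above.
TRACEPOINTS = [
    "skb:kfree_skb", "skb:consume_skb", "net:net_dev_xmit", "net:netif_rx",
    "napi:napi_poll", "sched:sched_wakeup", "sched:sched_process_exec",
    "irq:softirq_entry", "syscalls:sys_enter_socketcall",
]


def group_by_subsystem(names):
    """Map each subsystem (the text before the colon) to its event names."""
    groups = defaultdict(list)
    for name in names:
        subsystem, _, event = name.partition(":")
        groups[subsystem].append(event)
    return dict(groups)


def wildcard_args(subsystems):
    """Build the '-e subsystem:*' arguments for the given subsystems."""
    args = []
    for subsystem in subsystems:
        args += ["-e", subsystem + ":*"]
    return args


groups = group_by_subsystem(TRACEPOINTS)
for subsystem in sorted(groups):
    print("%-10s %s" % (subsystem, ", ".join(groups[subsystem])))
print(" ".join(["perf", "stat"] + wildcard_args(["skb", "net"])))
```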

<informalexample>
<emphasis>Tying it Together:</emphasis> These are exactly the same set of events defined
by the trace event subsystem and exposed by
ftrace/tracecmd/kernelshark as files in
/sys/kernel/debug/tracing/events, by SystemTap as
kernel.trace("tracepoint_name") and (partially) accessed by LTTng.

</informalexample>

<para>

Only a subset of these would be of interest to us when looking at
this workload, so let's choose the most likely subsystems
(identified by the string before the colon in the Tracepoint events)
and do a 'perf stat' run using only those wildcarded subsystems:
<literallayout class='monospaced'>

root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

Performance counter stats for 'wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>':

            23323 skb:kfree_skb
                0 skb:consume_skb
            49897 skb:skb_copy_datagram_iovec
             6217 net:net_dev_xmit
             6217 net:net_dev_queue
             7962 net:netif_receive_skb
                2 net:netif_rx
             8340 napi:napi_poll
                0 sched:sched_kthread_stop
                0 sched:sched_kthread_stop_ret
             3749 sched:sched_wakeup
                0 sched:sched_wakeup_new
                0 sched:sched_switch
               29 sched:sched_migrate_task
                0 sched:sched_process_free
                1 sched:sched_process_exit
                0 sched:sched_wait_task
                0 sched:sched_process_wait
                0 sched:sched_process_fork
                1 sched:sched_process_exec
                0 sched:sched_stat_wait
    2106519415641 sched:sched_stat_sleep
                0 sched:sched_stat_iowait
        147453613 sched:sched_stat_blocked
      12903026955 sched:sched_stat_runtime
                0 sched:sched_pi_setprio
             3574 workqueue:workqueue_queue_work
             3574 workqueue:workqueue_activate_work
                0 workqueue:workqueue_execute_start
                0 workqueue:workqueue_execute_end
            16631 irq:irq_handler_entry
            16631 irq:irq_handler_exit
            28521 irq:softirq_entry
            28521 irq:softirq_exit
            28728 irq:softirq_raise
                1 syscalls:sys_enter_sendmmsg
                1 syscalls:sys_exit_sendmmsg
                0 syscalls:sys_enter_recvmmsg
                0 syscalls:sys_exit_recvmmsg
               14 syscalls:sys_enter_socketcall
               14 syscalls:sys_exit_socketcall
                  .
                  .
                  .
            16965 syscalls:sys_enter_read
            16965 syscalls:sys_exit_read
            12854 syscalls:sys_enter_write
            12854 syscalls:sys_exit_write
                  .
                  .
                  .

     58.029710972 seconds time elapsed

</literallayout>

Let's pick one of these tracepoints and tell perf to do a profile
using it as the sampling event:
<literallayout class='monospaced'>

root@crownbay:~# perf record -g -e sched:sched_wakeup wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

</literallayout>

</para>

<para>

</para>

<para>

The screenshot above shows the results of running a profile using
the sched:sched_wakeup tracepoint, which shows the relative costs of
various paths to sched_wakeup (note that sched_wakeup is the
name of the tracepoint - it's actually defined just inside
ttwu_do_wakeup(), which accounts for the function name actually
displayed in the profile):
<literallayout class='monospaced'>
/*
 * Mark the task runnable and perform wakeup-preemption.
 */
static void
ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
     trace_sched_wakeup(p, true);
     .
     .
     .
}

</literallayout>

A couple of the more interesting callchains are expanded and
displayed above, basically some network receive paths that
presumably end up waking up wget (busybox) when network data is
ready.

</para>

<para>

Note that because tracepoints are normally used for tracing,
the default sampling period for tracepoints is 1, i.e. for
tracepoints perf will sample on every event occurrence (this
can be changed using the -c option). This is in contrast to
hardware counters such as the default 'cycles'
hardware counter used for normal profiling, where sampling
periods are much higher (in the thousands) because profiling should
have as low an overhead as possible and sampling on every cycle
would be prohibitively expensive.

</para>
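<para>
    To get a feel for what the sampling period means in practice, the
    sketch below estimates how many samples a run would record for a given
    event count and period. It assumes a fixed period, as set with -c;
    the period values used are hypothetical, and
    <filename>expected_samples</filename> is a helper invented for this
    example. The event counts come from the 'perf stat' runs shown earlier,
    so they will differ somewhat from run to run.
</para>

```python
def expected_samples(event_count, period=1):
    """Samples recorded when one sample is taken every `period` occurrences."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return event_count // period


# Counts taken from the 'perf stat' runs shown earlier in this section.
wakeups = 3749          # sched:sched_wakeup
cycles = 3045817293     # cycles

# Tracepoint default: period 1, so every occurrence becomes a sample.
print("sched_wakeup, period 1:", expected_samples(wakeups))
# Hypothetical '-c 10': only every tenth wakeup is sampled.
print("sched_wakeup, period 10:", expected_samples(wakeups, 10))
# Hypothetical period of one million cycles for the hardware counter.
print("cycles, period 1000000:", expected_samples(cycles, 1000000))
```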

</section>

<title>Using perf to do Basic Tracing</title>

<para>
Profiling is a great tool for solving many problems or for
getting a high-level view of what's going on with a workload or
across the system. It is however by definition an approximation,
as suggested by the most prominent word associated with it,
'sampling'. On the one hand, it allows a representative picture of
what's going on in the system to be cheaply taken, but on the other
hand, that cheapness limits its utility when that data suggests a
need to 'dive down' more deeply to discover what's really going
on. In such cases, the only way to see what's really going on is
to be able to look at (or summarize more intelligently) the
individual steps that go into the higher-level behavior exposed
by the coarse-grained profiling data.

</para>

<para>

As a concrete example, we can trace all the events we think might
be applicable to our workload:
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* \
 -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
 wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>

</literallayout>

We can look at the raw trace output using 'perf script' with no
arguments:
<literallayout class='monospaced'>

root@crownbay:~# perf script


perf 1262 [000] 11624.857082: sys_exit_read: 0x0
perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000
wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER]
wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER]
wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER]
wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
wget 1262 [001] 11624.858177: sys_exit_read: 0x200
wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER]
wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER]
wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER]
wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859233: sys_exit_read: 0x0
wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200
wget 1262 [001] 11624.859584: sys_exit_read: 0x200
wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859888: sys_exit_read: 0x400
wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859944: sys_exit_read: 0x400

</literallayout>

This gives us a detailed timestamped sequence of the selected events
as they occurred within the workload.

</para>
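<para>
    Because each line of the 'perf script' output above has a regular
    shape (command, pid, CPU, timestamp, event name, payload), it can be
    picked apart with a short script. The following sketch is one possible
    way to do that, assuming command names without spaces and successful
    (non-negative) read return values; <filename>parse_trace</filename>
    and <filename>bytes_read</filename> are hypothetical helpers, and the
    input is a subset of the lines shown above.
</para>

```python
import re

# A few lines copied from the 'perf script' output above.
TRACE = """\
perf 1262 [000] 11624.857082: sys_exit_read: 0x0
wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
wget 1262 [001] 11624.858177: sys_exit_read: 0x200
wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
"""

# comm, pid, [cpu], timestamp, event name, then the event-specific payload.
LINE_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<time>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<payload>.*)$"
)


def parse_trace(text):
    """Turn each recognizable trace line into a dict of its fields."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rec = match.groupdict()
            rec["pid"] = int(rec["pid"])
            rec["cpu"] = int(rec["cpu"])
            rec["time"] = float(rec["time"])
            records.append(rec)
    return records


def bytes_read(records, comm):
    """Sum the positive sys_exit_read return values for one command name."""
    total = 0
    for rec in records:
        if rec["comm"] == comm and rec["event"] == "sys_exit_read":
            ret = int(rec["payload"], 16)   # payload is a hex string like 0x200
            if ret > 0:
                total += ret
    return total


records = parse_trace(TRACE)
print("wget read %d bytes in this excerpt" % bytes_read(records, "wget"))
```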

<para>

In many ways, profiling can be viewed as a subset of tracing -
theoretically, if you have a set of trace events that's sufficient
to capture all the important aspects of a workload, you can derive
any of the results or views that a profiling run can.

</para>

<para>

Another aspect of traditional profiling is that while powerful in
many ways, it's limited by the granularity of the underlying data.
Profiling tools offer various ways of sorting and presenting the
sample data, which make it much more useful and amenable to user
experimentation, but in the end it can't be used in an open-ended
way to extract data that just isn't present as a consequence of
the fact that conceptually, most of it has been thrown away.

</para>

<para>

Full-blown detailed tracing data does however offer the opportunity
to manipulate and present the information collected during a
tracing run in an infinite variety of ways.

</para>

<para>

Another way to look at it is that there are only so many ways that
the 'primitive' counters can be used on their own to generate
interesting output; to get anything more complicated than simple
counts requires some amount of additional logic, which is typically
very specific to the problem at hand. For example, if we wanted to
make use of a 'counter' that maps to the value of the time
difference between when a process was scheduled to run on a
processor and the time it actually ran, we wouldn't expect such
a counter to exist on its own, but we could derive one called, say,
'wakeup_latency' and use it to extract a useful view of that metric
from trace data. Likewise, we really can't figure out from standard
profiling tools how much data every process on the system reads and
writes, along with how many of those reads and writes fail
completely. If we have sufficient trace data, however, we could
with the right tools easily extract and present that information,
but we'd need something other than pre-canned profiling tools to
do that.

</para>
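<para>
    As a rough sketch of how a derived 'wakeup_latency' metric like the
    one described above could be computed, the Python below pairs each
    sched_wakeup for a pid with the next sched_switch that switches that
    pid in. The event tuples are hypothetical, hand-written data rather
    than real perf output, and <filename>wakeup_latencies</filename> is a
    name invented for this example.
</para>

```python
# HYPOTHETICAL, hand-written events: (timestamp_sec, event, pid).
# For 'sched_wakeup' the pid is the task being woken; for 'sched_switch'
# it is the task being switched in.
EVENTS = [
    (10.000100, "sched_wakeup", 1262),
    (10.000150, "sched_wakeup", 6),
    (10.000180, "sched_switch", 1262),
    (10.000400, "sched_switch", 6),
    (10.002000, "sched_wakeup", 1262),
    (10.002030, "sched_switch", 1262),
]


def wakeup_latencies(events):
    """Return {pid: [latency_sec, ...]}, pairing wakeups with switch-ins."""
    pending = {}     # pid -> timestamp of a wakeup not yet matched
    latencies = {}
    for timestamp, event, pid in sorted(events):
        if event == "sched_wakeup":
            # Keep the earliest unmatched wakeup for this pid.
            pending.setdefault(pid, timestamp)
        elif event == "sched_switch" and pid in pending:
            latency = timestamp - pending.pop(pid)
            latencies.setdefault(pid, []).append(latency)
    return latencies


for pid, values in sorted(wakeup_latencies(EVENTS).items()):
    print("pid %d: max wakeup latency %.0f usec" % (pid, max(values) * 1e6))
```

<para>
    The same pattern, a small amount of state carried between event
    handlers, applies to most metrics derived from trace data.
</para>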

<para>

Luckily, there is a general-purpose way to handle such needs,
called 'programming languages'. Making programming languages
easily available to apply to such problems given the specific
format of data is called a 'programming language binding' for
that data and language. Perf supports two programming language
bindings, one for Python and one for Perl.

</para>

<informalexample>
<emphasis>Tying it Together:</emphasis> Language bindings for manipulating and
aggregating trace data are of course not a new
idea. One of the first projects to do this was IBM's DProbes
dpcc compiler, an ANSI C compiler which targeted a low-level
assembly language running on an in-kernel interpreter on the
target system. This is exactly analogous to what Sun's DTrace
did, except that DTrace invented its own language for the purpose.
SystemTap, heavily inspired by DTrace, also created its own
one-off language, but rather than running the product on an
in-kernel interpreter, created an elaborate compiler-based
machinery to translate its language into kernel modules written
in C.

</informalexample>

<para>

Now that we have the trace data in perf.data, we can use
'perf script -g' to generate a skeleton script with handlers
for the events we recorded, including the read/write entry/exit events:
<literallayout class='monospaced'>

root@crownbay:~# perf script -g python

generated Python script: perf-script.py

</literallayout>

The skeleton script simply creates a python function for each
event type in the perf.data file. The body of each function simply
prints the event name along with its parameters. For example:
<literallayout class='monospaced'>

def net__netif_rx(event_name, context, common_cpu,

873

common_secs, common_nsecs, common_pid, common_comm,

874

skbaddr, len, name):

875

print_header(event_name, common_cpu, common_secs, common_nsecs,

876

common_pid, common_comm)

877

878

print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),

879

</literallayout>

880

We can run that script directly to print all of the events
contained in the perf.data file:

root@crownbay:~# perf script -s perf-script.py

in trace_begin
syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0
sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0
irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER
irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER
irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER
syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512
syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512
skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0
skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0
irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER
irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER
irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER
syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471
syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0
syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512
syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512
syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024
syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
</literallayout>

That in itself isn't very useful; after all, we can accomplish
pretty much the same thing by simply running 'perf script'
without arguments in the same directory as the perf.data file.

</para>

<para>

We can, however, replace the print statements in the generated
function bodies with whatever we want, and thereby make the
script far more useful.

</para>

<para>

As a simple example, let's replace the print statements in
the function bodies with a call to a function that does nothing but
increment a per-event count. When the program is run against a
perf.data file, each time a particular event is encountered,
a tally is incremented for that event. For example:

def net__netif_rx(event_name, context, common_cpu,
    common_secs, common_nsecs, common_pid, common_comm,
    skbaddr, len, name):
        inc_counts(event_name)
</literallayout>

Each event handler function in the generated code is modified
to do this. For convenience, we define a common function called
inc_counts() that each handler calls; inc_counts() simply tallies
a count for each event using the 'counts' hash, which is a
specialized hash (dictionary) that does Perl-like autovivification, a
capability that's extremely useful for the kinds of multi-level
aggregation commonly used in processing traces (see perf's
documentation on the Python language binding for details):

counts = autodict()

def inc_counts(event_name):
    try:
        counts[event_name] += 1
    except TypeError:
        counts[event_name] = 1
</literallayout>

Finally, at the end of the trace processing run, we want to
print the result of all the per-event tallies. For that, we
use the special 'trace_end()' function:

def trace_end():
    for event_name, count in counts.iteritems():
        print "%-40s %10s\n" % (event_name, count)
</literallayout>

The end result is a summary of all the events recorded in the
trace:

skb__skb_copy_datagram_iovec                  13148
irq__softirq_entry                             4796
irq__irq_handler_exit                          3805
irq__softirq_exit                              4795
syscalls__sys_enter_write                      8990
net__net_dev_xmit                               652
skb__kfree_skb                                 4047
sched__sched_wakeup                            1155
irq__irq_handler_entry                         3804
irq__softirq_raise                             4799
net__net_dev_queue                              652
syscalls__sys_enter_read                      17599
net__netif_receive_skb                         1743
syscalls__sys_exit_read                       17598
net__netif_rx                                     2
napi__napi_poll                                1877
syscalls__sys_exit_write                       8990
</literallayout>

Note that this is pretty much exactly the same information we get
from 'perf stat', which goes a little way to support the idea
mentioned previously that given the right kind of trace data,
higher-level profiling-type summaries can be derived from it.

</para>
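<para>
The tallying approach described above can also be tried without
perf. The following is a minimal sketch written for Python 3
(the listings above use the Python 2 syntax of the generated
script). The autodict() helper here is our own two-line stand-in
modelled on the autovivifying hash described above, and
SAMPLE_EVENTS and tally() are hypothetical names used only for
illustration. It goes one level deeper than the per-event tally and
counts events per process, which is where autovivification starts
to pay off.
</para>

```python
from collections import defaultdict


def autodict():
    """A dict whose missing keys spring into existence as nested autodicts."""
    return defaultdict(autodict)


# Hypothetical (process name, event name) pairs standing in for trace events.
SAMPLE_EVENTS = [
    ("wget", "syscalls__sys_enter_read"),
    ("wget", "syscalls__sys_exit_read"),
    ("wget", "syscalls__sys_enter_read"),
    ("perf", "sched__sched_wakeup"),
]


def tally(events):
    """Count events per process, creating both dictionary levels on demand."""
    counts = autodict()
    for comm, event_name in events:
        try:
            counts[comm][event_name] += 1
        except TypeError:
            # First sighting: the slot holds an empty autodict, not a number.
            counts[comm][event_name] = 1
    return counts


if __name__ == "__main__":
    for comm, per_event in sorted(tally(SAMPLE_EVENTS).items()):
        for event_name, count in sorted(per_event.items()):
            print("%-8s %-40s %10d" % (comm, event_name, count))
```

<para>
Note that neither level of the dictionary has to be initialized
before it is used; the first access creates it.
</para>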

<para>

For more details, see the documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.

</para>

</section>

<title>System-Wide Tracing and Profiling</title>

<para>
The examples so far have focused on tracing a particular program or
workload - in other words, every profiling run has specified the
program to profile on the command line, e.g. 'perf record wget ...'.

</para>

<para>

It's also possible, and more interesting in many cases, to run a
system-wide profile or trace while running the workload in a
separate shell.

</para>

<para>

To do system-wide profiling or tracing, you typically use
the -a flag to 'perf record'.

</para>

<para>

To demonstrate this, open up one window and start the profile
using the -a flag (press Ctrl-C to stop tracing):

root@crownbay:~# perf record -g -a
^C[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ]
</literallayout>
In another window, run the wget test:

root@crownbay:~# wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
</literallayout>
Here we see entries not only for our wget load, but for other
processes running on the system as well:

</para>

<para>

</para>

<para>

In the snapshot above, we can see callchains that originate in
libc, and a callchain from Xorg that demonstrates that we're
using a proprietary X driver in userspace (notice the presence
of 'PVR' and some other unresolvable symbols in the expanded
Xorg callchain).

</para>

<para>

Note also that we have both kernel and userspace entries in the
above snapshot. We can also tell perf to focus on userspace by
providing a modifier, in this case 'u', to the 'cycles' hardware
counter when we record a profile:

root@crownbay:~# perf record -g -a -e cycles:u
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ]

</literallayout>

</para>

<para>

</para>

<para>

Notice that in the screenshot above we see only userspace entries ([.]).

</para>

<para>

Finally, we can press 'enter' on a leaf node and select the 'Zoom
into DSO' menu item to show only entries associated with a
specific DSO. In the screenshot below, we've zoomed into the
'libc' DSO which shows all the entries associated with the
libc-xxx.so DSO.

</para>

<para>

</para>

<para>

We can also use the system-wide -a switch to do system-wide
tracing. Here we'll trace a couple of scheduler events:

root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ]
</literallayout>
We can look at the raw output using 'perf script' with no
arguments:

root@crownbay:~# perf script

perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
perf 1383 [001] 6171.470039: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001

</literallayout>

</para>

<title>Filtering</title>

<para>
Notice that there are a lot of events that don't really have
anything to do with what we're interested in, namely events
that schedule 'perf' itself in and out or that wake perf up.
We can get rid of those by using the '--filter' option -
for each event we specify using -e, we can add a --filter
after that to filter out trace events that contain fields
with specific values:

root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf && prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf'
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ]


root@crownbay:~# perf script

swapper 0 [000] 7932.162180: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 7932.162236: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
perf 1407 [001] 7932.170048: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.180044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.190038: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.200044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.210044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.220044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
swapper 0 [001] 7932.230111: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
swapper 0 [001] 7932.230146: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 7932.230205: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
</literallayout>

In this case, we've filtered out all events that have 'perf'
in their 'comm', 'prev_comm' or 'next_comm' fields. Notice
that there are still events recorded for perf, but that
those events don't have values of 'perf' for the filtered
fields. To completely filter out anything from perf will
require a bit more work, but for the purpose of demonstrating
how to use filters, it's close enough.

</para>
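<para>
To see why some lines attributed to perf survive the filters above,
it can help to replay the same predicates in user space. The
following Python 3 sketch is only an illustration: the real
filtering is done by the kernel at record time, and parse_line(),
keep() and filter_lines() are hypothetical helpers that operate on
text copied from the unfiltered 'perf script' output shown earlier.
It assumes command names contain no spaces.
</para>

```python
import re

# Matches key=value pairs such as "prev_comm=perf" or "next_pid=21".
FIELD_RE = re.compile(r"(\w+)=(\S+)")

# A few lines copied from the unfiltered 'perf script' output.
SAMPLE_LINES = [
    "perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001",
    "perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120",
    "kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120",
    "swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000",
]


def parse_line(line):
    """Split one line into (running task, event name, event fields)."""
    head, event, payload = line.split(": ", 2)
    task = head.split()[0]
    return task, event, dict(FIELD_RE.findall(payload))


def keep(event, fields):
    """Mirror the two --filter expressions from the perf record command."""
    if event == "sched_switch":
        return fields["next_comm"] != "perf" and fields["prev_comm"] != "perf"
    if event == "sched_wakeup":
        return fields["comm"] != "perf"
    return True


def filter_lines(lines):
    """Return the lines whose event fields pass the filters."""
    return [line for line in lines if keep(*parse_line(line)[1:])]


if __name__ == "__main__":
    for line in filter_lines(SAMPLE_LINES):
        print(line)
```

<para>
The first sample line is kept even though perf was the running task,
because the filter only tests the event's own 'comm' field, which
names the task being woken (kworker/1:1).
</para>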

<emphasis>Tying it Together:</emphasis> These are exactly the same set of event
filters defined by the trace event subsystem. See the
ftrace/tracecmd/kernelshark section for more discussion about
these event filters.

</informalexample>

<emphasis>Tying it Together:</emphasis> These event filters are implemented by a
special-purpose pseudo-interpreter in the kernel and are an
integral and indispensable part of the perf design as it
relates to tracing. Kernel-based event filters provide a
mechanism to precisely throttle the event stream that appears
in user space, where it makes sense to provide bindings to real
programming languages for postprocessing the event stream.
This architecture allows for the intelligent and flexible
partitioning of processing between the kernel and user space.
Contrast this with other tools such as SystemTap, which does
all of its processing in the kernel and as such requires a
special project-defined language in order to accommodate that
design, or LTTng, where everything is sent to userspace and
as such requires a super-efficient kernel-to-userspace
transport mechanism in order to function properly. While
perf certainly can benefit from, for instance, advances in
the design of the transport, it doesn't fundamentally depend
on them. Basically, if you find that your perf tracing
application is causing buffer I/O overruns, it probably
means that you aren't taking enough advantage of the
kernel filtering engine.

</informalexample>

</section>

</section>

<title>Using Dynamic Tracepoints</title>

<para>
perf isn't restricted to the fixed set of static tracepoints
listed by 'perf list'. Users can also add their own 'dynamic'
tracepoints anywhere in the kernel. For instance, suppose we
want to define our own tracepoint on do_fork(). We can do that
using the 'perf probe' perf subcommand:

root@crownbay:~# perf probe do_fork
Added new event:
  probe:do_fork (on do_fork)

You can now use it in all perf tools, such as:

  perf record -e probe:do_fork -aR sleep 1
</literallayout>

Adding a new tracepoint via 'perf probe' results in an event
with all the expected files and format in
/sys/kernel/debug/tracing/events, just the same as for static
tracepoints (as discussed in more detail in the trace events
subsystem section):

root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al
drwxr-xr-x 2 root root 0 Oct 28 11:42 .
drwxr-xr-x 3 root root 0 Oct 28 11:42 ..
-rw-r--r-- 1 root root 0 Oct 28 11:42 enable
-rw-r--r-- 1 root root 0 Oct 28 11:42 filter
-r--r--r-- 1 root root 0 Oct 28 11:42 format
-r--r--r-- 1 root root 0 Oct 28 11:42 id

root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# cat format
name: do_fork
ID: 944
format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;
        field:int common_padding; offset:8; size:4; signed:1;

        field:unsigned long __probe_ip; offset:12; size:4; signed:0;

print fmt: "(%lx)", REC->__probe_ip
</literallayout>

We can list all dynamic tracepoints currently in existence:

root@crownbay:~# perf probe -l
  probe:do_fork (on do_fork)
  probe:schedule (on schedule)
</literallayout>
Let's record system-wide ('sleep 30' is a trick for recording
system-wide: the command itself does basically nothing and
simply ends the recording when it wakes up after 30 seconds):

root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ]
</literallayout>

Using 'perf script' we can see each do_fork event that fired:

root@crownbay:~# perf script

# ========
# captured on: Sun Oct 28 11:55:18 2012
# hostname : crownbay
# os release : 3.4.11-yocto-standard
# perf version : 3.4.11
# arch : i686
# nrcpus online : 2
# nrcpus avail : 2
# cpudesc : Intel(R) Atom(TM) CPU E660 @ 1.30GHz
# cpuid : GenuineIntel,6,38,1
# total memory : 1017184 kB
# cmdline : /usr/bin/perf record -g -a -e probe:do_fork sleep 30
# event : name = probe:do_fork, type = 2, config = 0x3b0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = { 5, 6 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# ========
#
matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.639917: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34217.541603: do_fork: (c1028460)
matchbox-deskto 1299 [001] 34217.543584: do_fork: (c1028460)
gthumb 1300 [001] 34217.697451: do_fork: (c1028460)
gthumb 1300 [001] 34219.085734: do_fork: (c1028460)
gthumb 1300 [000] 34219.121351: do_fork: (c1028460)
gthumb 1300 [001] 34219.264551: do_fork: (c1028460)
pcmanfm 1296 [000] 34219.590380: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34224.955965: do_fork: (c1028460)
matchbox-deskto 1306 [001] 34224.957972: do_fork: (c1028460)
matchbox-termin 1307 [000] 34225.038214: do_fork: (c1028460)
matchbox-termin 1307 [001] 34225.044218: do_fork: (c1028460)
matchbox-termin 1307 [000] 34225.046442: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34237.112138: do_fork: (c1028460)
matchbox-deskto 1311 [001] 34237.114106: do_fork: (c1028460)
gaku 1312 [000] 34237.202388: do_fork: (c1028460)
</literallayout>
And using 'perf report' on the same file, we can see the
callgraphs from starting a few programs during those 30 seconds:

</para>

<para>

</para>
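<para>
The textual 'perf script' output for the do_fork probe is also easy
to post-process. As a small sketch, assuming the output has been
saved as text, the Python 3 function below counts how many times
each command hit the probe. The name forks_per_command() and the
SAMPLE_OUTPUT lines (a subset of the listing above) are
illustrative only, and the parsing assumes command names without
spaces.
</para>

```python
from collections import Counter

# A subset of the 'perf script' output for probe:do_fork, header included.
SAMPLE_OUTPUT = """\
# ========
# hostname : crownbay
# ========
#
matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.639917: do_fork: (c1028460)
gthumb 1300 [001] 34217.697451: do_fork: (c1028460)
"""


def forks_per_command(lines, event="do_fork"):
    """Count occurrences of 'event' per command name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and header comments
        parts = line.split(": ", 2)
        if len(parts) < 2 or parts[1] != event:
            continue                      # not the event we are counting
        comm = parts[0].split()[0]        # first column is the command name
        counts[comm] += 1
    return counts


if __name__ == "__main__":
    for comm, n in forks_per_command(SAMPLE_OUTPUT.splitlines()).most_common():
        print("%-16s %d" % (comm, n))
```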

<emphasis>Tying it Together:</emphasis> The trace events subsystem accommodates static
and dynamic tracepoints in exactly the same way - there's no
difference as far as the infrastructure is concerned. See the
ftrace section for more details on the trace event subsystem.

</informalexample>

<emphasis>Tying it Together:</emphasis> Dynamic tracepoints are implemented under the
covers by kprobes and uprobes. kprobes and uprobes are also used
by and in fact are the main focus of SystemTap.

</informalexample>

</section>

</section>

<title>Documentation</title>

<para>
Online versions of the man pages for the commands discussed in this
section can be found here:

<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-stat'>'perf stat' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-record'>'perf record' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-report'>'perf report' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-probe'>'perf probe' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-script'>'perf script' manpage</ulink>.
</para></listitem>
<listitem><para>Documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.
</para></listitem>
<listitem><para>The top-level
<ulink url='http://linux.die.net/man/1/perf'>perf(1) manpage</ulink>.
</para></listitem>

</itemizedlist>

</para>

<para>

Normally, you should be able to invoke the man pages via perf
itself e.g. 'perf help' or 'perf help record'.

</para>

<para>

However, perf relies on the man pages for most of its help
functionality, and by default Yocto doesn't install them. This is
a known problem and is tracked by a Yocto bug:
<ulink url='https://bugzilla.yoctoproject.org/show_bug.cgi?id=3388'>Bug 3388 - perf: enable man pages for basic 'help' functionality</ulink>.

</para>

<para>

The man pages in text form, along with some other files, such as
a set of examples, can be found in the 'perf' directory of the
kernel tree:

tools/perf/Documentation
</literallayout>
There's also a nice perf tutorial on the perf wiki that goes
into more detail than we do here in certain areas:
<ulink url='https://perf.wiki.kernel.org/index.php/Tutorial'>Perf Tutorial</ulink>

</para>

</section>

</section>

<title>ftrace</title>

<para>
'ftrace' literally refers to the 'ftrace function tracer' but in
reality this encompasses a number of related tracers along with
the infrastructure that they all make use of.

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the basic
setup outlined in the General Setup section.

</para>

<para>

ftrace, trace-cmd, and kernelshark run on the target system,
and are ready to go out-of-the-box - no additional setup is
necessary. For the rest of this section we assume you've ssh'ed
from the host to the target and will be running ftrace on the
target. kernelshark is a GUI application and if you use the '-X'
option to ssh you can have the kernelshark GUI run on the target
but display remotely on the host if you want.

</para>

</section>

<title>Basic ftrace usage</title>

<para>
'ftrace' essentially refers to everything included in
the /tracing directory of the mounted debugfs filesystem
(Yocto follows the standard convention and mounts it
at /sys/kernel/debug). Here's a listing of all the files
found in /sys/kernel/debug/tracing on a Yocto system:

root@sugarbay:/sys/kernel/debug/tracing# ls
README                      kprobe_events       trace
available_events            kprobe_profile      trace_clock
available_filter_functions  options             trace_marker
available_tracers           per_cpu             trace_options
buffer_size_kb              printk_formats      trace_pipe
buffer_total_size_kb        saved_cmdlines      tracing_cpumask
current_tracer              set_event           tracing_enabled
dyn_ftrace_total_info       set_ftrace_filter   tracing_on
enabled_functions           set_ftrace_notrace  tracing_thresh
events                      set_ftrace_pid
free_buffer                 set_graph_function
</literallayout>
The files listed above are used for various purposes -
some relate directly to the tracers themselves, others are
used to set tracing options, and yet others actually contain
the tracing output when a tracer is in effect. Some of the
functions can be guessed from their names, others need
explanation; in any case, we'll cover some of the files we
see here below but for an explanation of the others, please
see the ftrace documentation.

</para>

<para>

We'll start by looking at some of the available built-in

tracers.

</para>

<para>

cat'ing the 'available_tracers' file lists the set of
available tracers:

root@sugarbay:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph function nop
</literallayout>
The 'current_tracer' file contains the tracer currently in
effect:

root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
nop
</literallayout>
The above listing of current_tracer shows that
the 'nop' tracer is in effect, which is just another
way of saying that there's actually no tracer
currently in effect.

</para>

<para>

echo'ing one of the available_tracers into current_tracer
makes the specified tracer the current tracer:

root@sugarbay:/sys/kernel/debug/tracing# echo function > current_tracer
root@sugarbay:/sys/kernel/debug/tracing# cat current_tracer
function
</literallayout>
The above sets the current tracer to be the
'function tracer'. This tracer traces every function
call in the kernel and makes it available as the
contents of the 'trace' file. Reading the 'trace' file
lists the currently buffered function calls that have been
traced by the function tracer:

root@sugarbay:/sys/kernel/debug/tracing# cat trace | less

# tracer: function
#
# entries-in-buffer/entries-written: 310629/766471   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
         <idle>-0     [004] d..1   470.867169: ktime_get_real <-intel_idle
         <idle>-0     [004] d..1   470.867170: getnstimeofday <-ktime_get_real
         <idle>-0     [004] d..1   470.867171: ns_to_timeval <-intel_idle
         <idle>-0     [004] d..1   470.867171: ns_to_timespec <-ns_to_timeval
         <idle>-0     [004] d..1   470.867172: smp_apic_timer_interrupt <-apic_timer_interrupt
         <idle>-0     [004] d..1   470.867172: native_apic_mem_write <-smp_apic_timer_interrupt
         <idle>-0     [004] d..1   470.867172: irq_enter <-smp_apic_timer_interrupt
         <idle>-0     [004] d..1   470.867172: rcu_irq_enter <-irq_enter
         <idle>-0     [004] d..1   470.867173: rcu_idle_exit_common.isra.33 <-rcu_irq_enter
         <idle>-0     [004] d..1   470.867173: local_bh_disable <-irq_enter
         <idle>-0     [004] d..1   470.867173: add_preempt_count <-local_bh_disable
         <idle>-0     [004] d.s1   470.867174: tick_check_idle <-irq_enter
         <idle>-0     [004] d.s1   470.867174: tick_check_oneshot_broadcast <-tick_check_idle
         <idle>-0     [004] d.s1   470.867174: ktime_get <-tick_check_idle
         <idle>-0     [004] d.s1   470.867174: tick_nohz_stop_idle <-tick_check_idle
         <idle>-0     [004] d.s1   470.867175: update_ts_time_stats <-tick_nohz_stop_idle
         <idle>-0     [004] d.s1   470.867175: nr_iowait_cpu <-update_ts_time_stats
         <idle>-0     [004] d.s1   470.867175: tick_do_update_jiffies64 <-tick_check_idle
         <idle>-0     [004] d.s1   470.867175: _raw_spin_lock <-tick_do_update_jiffies64
         <idle>-0     [004] d.s1   470.867176: add_preempt_count <-_raw_spin_lock
         <idle>-0     [004] d.s2   470.867176: do_timer <-tick_do_update_jiffies64
         <idle>-0     [004] d.s2   470.867176: _raw_spin_lock <-do_timer
         <idle>-0     [004] d.s2   470.867176: add_preempt_count <-_raw_spin_lock
         <idle>-0     [004] d.s3   470.867177: ntp_tick_length <-do_timer
         <idle>-0     [004] d.s3   470.867177: _raw_spin_lock_irqsave <-ntp_tick_length
         .
         .
         .
</literallayout>

Each line in the trace above shows what was happening in
the kernel on a given cpu, to the level of detail of
function calls. Each entry shows the function called,
followed by its caller (after the arrow).

</para>
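<para>
The same file-based interface can be driven from a script as easily
as from the shell. The following Python 3 sketch wraps the
'available_tracers' and 'current_tracer' files described above.
The function names are our own, and the tracing directory is passed
in as a parameter so that the code can be exercised against a
scratch directory; pointing it at /sys/kernel/debug/tracing assumes
root privileges and a mounted debugfs.
</para>

```python
import os


def available_tracers(tracing_dir):
    """Return the tracer names listed in 'available_tracers'."""
    with open(os.path.join(tracing_dir, "available_tracers")) as f:
        return f.read().split()


def current_tracer(tracing_dir):
    """Return the name of the tracer currently in effect."""
    with open(os.path.join(tracing_dir, "current_tracer")) as f:
        return f.read().strip()


def set_tracer(tracing_dir, name):
    """Make 'name' the current tracer, refusing names that are not available."""
    if name not in available_tracers(tracing_dir):
        raise ValueError("tracer %r is not available" % name)
    # Equivalent to: echo <name> > current_tracer
    with open(os.path.join(tracing_dir, "current_tracer"), "w") as f:
        f.write(name)


if __name__ == "__main__":
    tracing = "/sys/kernel/debug/tracing"   # standard debugfs location
    print("available:", " ".join(available_tracers(tracing)))
    print("current:  ", current_tracer(tracing))
```

<para>
Checking the name against 'available_tracers' first is a design
choice of this sketch; it turns a typo into a clear error before
anything is written.
</para>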

<para>

The function tracer gives you an extremely detailed idea
of what the kernel was doing at the point in time the trace
was taken, and is a great way to learn about how the kernel
code works in a dynamic sense.

</para>

<emphasis>Tying it Together:</emphasis> The ftrace function tracer is also
available from within perf, as the ftrace:function tracepoint.

</informalexample>

<para>

It is a little more difficult to follow the call chains than
it needs to be - luckily there's a variant of the function
tracer that displays the callchains explicitly, called the
'function_graph' tracer:

root@sugarbay:/sys/kernel/debug/tracing# echo function_graph > current_tracer
root@sugarbay:/sys/kernel/debug/tracing# cat trace | less

 tracer: function_graph

 CPU  DURATION                  FUNCTION CALLS
 |     |   |                     |   |   |   |
 7)   0.046 us    |      pick_next_task_fair();
 7)   0.043 us    |      pick_next_task_stop();
 7)   0.042 us    |      pick_next_task_rt();
 7)   0.032 us    |      pick_next_task_fair();
 7)   0.030 us    |      pick_next_task_idle();
 7)               |      _raw_spin_unlock_irq() {
 7)   0.033 us    |        sub_preempt_count();
 7)   0.258 us    |      }
 7)   0.032 us    |      sub_preempt_count();
 7) + 13.341 us   |    } /* __schedule */
 7)   0.095 us    |  } /* sub_preempt_count */
 7)               |  schedule() {
 7)               |    __schedule() {
 7)   0.060 us    |      add_preempt_count();
 7)   0.044 us    |      rcu_note_context_switch();
 7)               |      _raw_spin_lock_irq() {
 7)   0.033 us    |        add_preempt_count();
 7)   0.247 us    |      }
 7)               |      idle_balance() {
 7)               |        _raw_spin_unlock() {
 7)   0.031 us    |          sub_preempt_count();
 7)   0.246 us    |        }
 7)               |        update_shares() {
 7)   0.030 us    |          __rcu_read_lock();
 7)   0.029 us    |          __rcu_read_unlock();
 7)   0.484 us    |        }
 7)   0.030 us    |        __rcu_read_lock();
 7)               |        load_balance() {
 7)               |          find_busiest_group() {
 7)   0.031 us    |            idle_cpu();
 7)   0.029 us    |            idle_cpu();
 7)   0.035 us    |            idle_cpu();
 7)   0.906 us    |          }
 7)   1.141 us    |        }
 7)   0.022 us    |        msecs_to_jiffies();
 7)               |        load_balance() {
 7)               |          find_busiest_group() {
 7)   0.031 us    |            idle_cpu();
 .
 .
 .
 4)   0.062 us    |        msecs_to_jiffies();
 4)   0.062 us    |        __rcu_read_unlock();
 4)               |        _raw_spin_lock() {
 4)   0.073 us    |          add_preempt_count();
 4)   0.562 us    |        }
 4) + 17.452 us   |      }
 4)   0.108 us    |      put_prev_task_fair();
 4)   0.102 us    |      pick_next_task_fair();
 4)   0.084 us    |      pick_next_task_stop();
 4)   0.075 us    |      pick_next_task_rt();
 4)   0.062 us    |      pick_next_task_fair();
 4)   0.066 us    |      pick_next_task_idle();
 ------------------------------------------
 4)  kworker-74   =>    <idle>-0
 ------------------------------------------

 4)               |      finish_task_switch() {
 4)               |        _raw_spin_unlock_irq() {
 4)   0.100 us    |          sub_preempt_count();
 4)   0.582 us    |        }
 4)   1.105 us    |      }
 4)   0.088 us    |      sub_preempt_count();
 4) ! 100.066 us  |    }
 .
 .
 .
 3)               |  sys_ioctl() {
 3)   0.083 us    |    fget_light();
 3)               |    security_file_ioctl() {
 3)   0.066 us    |      cap_file_ioctl();
 3)   0.562 us    |    }
 3)               |    do_vfs_ioctl() {
 3)               |      drm_ioctl() {
 3)   0.075 us    |        drm_ut_debug_printk();
 3)               |        i915_gem_pwrite_ioctl() {
 3)               |          i915_mutex_lock_interruptible() {
 3)   0.070 us    |            mutex_lock_interruptible();
 3)   0.570 us    |          }
 3)               |          drm_gem_object_lookup() {
 3)               |            _raw_spin_lock() {
 3)   0.080 us    |              add_preempt_count();
 3)   0.620 us    |            }
 3)               |            _raw_spin_unlock() {
 3)   0.085 us    |              sub_preempt_count();
 3)   0.562 us    |            }
 3)   2.149 us    |          }
 3)   0.133 us    |          i915_gem_object_pin();
 3)               |          i915_gem_object_set_to_gtt_domain() {
 3)   0.065 us    |            i915_gem_object_flush_gpu_write_domain();
 3)   0.065 us    |            i915_gem_object_wait_rendering();
 3)   0.062 us    |            i915_gem_object_flush_cpu_write_domain();
 3)   1.612 us    |          }
 3)               |          i915_gem_object_put_fence() {
 3)   0.097 us    |            i915_gem_object_flush_fence.constprop.36();
 3)   0.645 us    |          }
 3)   0.070 us    |          add_preempt_count();
 3)   0.070 us    |          sub_preempt_count();
 3)   0.073 us    |          i915_gem_object_unpin();
 3)   0.068 us    |          mutex_unlock();
 3)   9.924 us    |        }
 3) + 11.236 us   |      }
 3) + 11.770 us   |    }
 3) + 13.784 us   |  }
 3)               |  sys_ioctl() {
</literallayout>

As you can see, the function_graph display is much easier to
follow. Also note that in addition to the function calls and
associated braces, other events such as scheduler events
are displayed in context. In fact, you can freely include
any tracepoint available in the trace events subsystem described
in the next section by simply enabling those events, and they'll
appear in context in the function graph display. This makes it
quite a powerful tool for understanding kernel dynamics.

</para>

<para>

Also notice that there are various annotations on the left-hand
side of the display. For example, if the total time it
took for a given function to execute is above a certain
threshold, an exclamation point or plus sign appears on the
left-hand side. Please see the ftrace documentation for
details on all these fields.

</para>
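<para>
As a rough illustration of how those duration markers can be put
to use, the following shell sketch filters saved function_graph
output down to the lines that carry a '+' or '!' marker.
According to the ftrace documentation, '+' flags durations above
10 microseconds and '!' flags durations above 100 microseconds.
The function name slow_calls is made up for this example, and
the sketch assumes the marker shows up as the second
whitespace-separated field, as it does in the listing above:

```shell
# Keep only function_graph lines whose duration is flagged with '+' or '!'.
# Input is read from stdin, e.g. a saved copy of the 'trace' file.
slow_calls() {
    # Field 1 is the CPU ("4)"); a flagged line has the marker as field 2.
    awk '$2 == "+" || $2 == "!"'
}
```

For example, 'cat trace | slow_calls' would reduce a long trace
to just the calls that crossed one of those thresholds, which is
often a reasonable place to start looking.
</para>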

</section>

<title>The 'trace events' Subsystem</title>

<para>
One especially important directory contained within
the /sys/kernel/debug/tracing directory is the 'events'
subdirectory, which contains representations of every
tracepoint in the system. Listing out the contents of
the 'events' subdirectory, we see mainly another set of
subdirectories:

root@sugarbay:/sys/kernel/debug/tracing# cd events
root@sugarbay:/sys/kernel/debug/tracing/events# ls -al
drwxr-xr-x  38 root root 0 Nov 14 23:19 .
drwxr-xr-x   5 root root 0 Nov 14 23:19 ..
drwxr-xr-x  19 root root 0 Nov 14 23:19 block
drwxr-xr-x  32 root root 0 Nov 14 23:19 btrfs
drwxr-xr-x   5 root root 0 Nov 14 23:19 drm
-rw-r--r--   1 root root 0 Nov 14 23:19 enable
drwxr-xr-x  40 root root 0 Nov 14 23:19 ext3
drwxr-xr-x  79 root root 0 Nov 14 23:19 ext4
drwxr-xr-x  14 root root 0 Nov 14 23:19 ftrace
drwxr-xr-x   8 root root 0 Nov 14 23:19 hda
-r--r--r--   1 root root 0 Nov 14 23:19 header_event
-r--r--r--   1 root root 0 Nov 14 23:19 header_page
drwxr-xr-x  25 root root 0 Nov 14 23:19 i915
drwxr-xr-x   7 root root 0 Nov 14 23:19 irq
drwxr-xr-x  12 root root 0 Nov 14 23:19 jbd
drwxr-xr-x  14 root root 0 Nov 14 23:19 jbd2
drwxr-xr-x  14 root root 0 Nov 14 23:19 kmem
drwxr-xr-x   7 root root 0 Nov 14 23:19 module
drwxr-xr-x   3 root root 0 Nov 14 23:19 napi
drwxr-xr-x   6 root root 0 Nov 14 23:19 net
drwxr-xr-x   3 root root 0 Nov 14 23:19 oom
drwxr-xr-x  12 root root 0 Nov 14 23:19 power
drwxr-xr-x   3 root root 0 Nov 14 23:19 printk
drwxr-xr-x   8 root root 0 Nov 14 23:19 random
drwxr-xr-x   4 root root 0 Nov 14 23:19 raw_syscalls
drwxr-xr-x   3 root root 0 Nov 14 23:19 rcu
drwxr-xr-x   6 root root 0 Nov 14 23:19 rpm
drwxr-xr-x  20 root root 0 Nov 14 23:19 sched
drwxr-xr-x   7 root root 0 Nov 14 23:19 scsi
drwxr-xr-x   4 root root 0 Nov 14 23:19 signal
drwxr-xr-x   5 root root 0 Nov 14 23:19 skb
drwxr-xr-x   4 root root 0 Nov 14 23:19 sock
drwxr-xr-x  10 root root 0 Nov 14 23:19 sunrpc
drwxr-xr-x 538 root root 0 Nov 14 23:19 syscalls
drwxr-xr-x   4 root root 0 Nov 14 23:19 task
drwxr-xr-x  14 root root 0 Nov 14 23:19 timer
drwxr-xr-x   3 root root 0 Nov 14 23:19 udp
drwxr-xr-x  21 root root 0 Nov 14 23:19 vmscan
drwxr-xr-x   3 root root 0 Nov 14 23:19 vsyscall
drwxr-xr-x   6 root root 0 Nov 14 23:19 workqueue
drwxr-xr-x  26 root root 0 Nov 14 23:19 writeback

</literallayout>

Each one of these subdirectories corresponds to a
'subsystem' and contains yet again more subdirectories,
each one of those finally corresponding to a tracepoint.
For example, here are the contents of the 'kmem' subsystem:

root@sugarbay:/sys/kernel/debug/tracing/events# cd kmem
root@sugarbay:/sys/kernel/debug/tracing/events/kmem# ls -al
drwxr-xr-x 14 root root 0 Nov 14 23:19 .
drwxr-xr-x 38 root root 0 Nov 14 23:19 ..
-rw-r--r--  1 root root 0 Nov 14 23:19 enable
-rw-r--r--  1 root root 0 Nov 14 23:19 filter
drwxr-xr-x  2 root root 0 Nov 14 23:19 kfree
drwxr-xr-x  2 root root 0 Nov 14 23:19 kmalloc
drwxr-xr-x  2 root root 0 Nov 14 23:19 kmalloc_node
drwxr-xr-x  2 root root 0 Nov 14 23:19 kmem_cache_alloc
drwxr-xr-x  2 root root 0 Nov 14 23:19 kmem_cache_alloc_node
drwxr-xr-x  2 root root 0 Nov 14 23:19 kmem_cache_free
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_alloc
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_alloc_extfrag
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_alloc_zone_locked
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_free
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_free_batched
drwxr-xr-x  2 root root 0 Nov 14 23:19 mm_page_pcpu_drain

</literallayout>

Let's see what's inside the subdirectory for a specific
tracepoint, in this case the one for kmalloc:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem# cd kmalloc
root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# ls -al
drwxr-xr-x  2 root root 0 Nov 14 23:19 .
drwxr-xr-x 14 root root 0 Nov 14 23:19 ..
-rw-r--r--  1 root root 0 Nov 14 23:19 enable
-rw-r--r--  1 root root 0 Nov 14 23:19 filter
-r--r--r--  1 root root 0 Nov 14 23:19 format
-r--r--r--  1 root root 0 Nov 14 23:19 id

</literallayout>

The 'format' file for the tracepoint describes the layout of the
event in memory, which the various tracing tools that make use
of these tracepoints rely on to parse the event and make sense
of it. It also contains a 'print fmt' field that allows tools
like ftrace to display the event as text.
Here's what the format of the kmalloc event looks like:

root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# cat format
name: kmalloc
ID: 313
format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;
        field:int common_padding; offset:8; size:4; signed:1;

        field:unsigned long call_site; offset:16; size:8; signed:0;
        field:const void * ptr; offset:24; size:8; signed:0;
        field:size_t bytes_req; offset:32; size:8; signed:0;
        field:size_t bytes_alloc; offset:40; size:8; signed:0;
        field:gfp_t gfp_flags; offset:48; size:4; signed:0;

print fmt: "call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s", REC->call_site, REC->ptr, REC->bytes_req, REC->bytes_alloc,
(REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {(unsigned long)(((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x400000u)), "GFP_TRANSHUGE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x20000u) | ((
gfp_t)0x02u) | (( gfp_t)0x08u)), "GFP_HIGHUSER_MOVABLE"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x20000u) | (( gfp_t)0x02u)), "GFP_HIGHUSER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | ((
gfp_t)0x20000u)), "GFP_USER"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x80000u)), "GFP_TEMPORARY"},
{(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u) | (( gfp_t)0x80u)), "GFP_KERNEL"}, {(unsigned long)((( gfp_t)0x10u) | (( gfp_t)0x40u)),
"GFP_NOFS"}, {(unsigned long)((( gfp_t)0x20u)), "GFP_ATOMIC"}, {(unsigned long)((( gfp_t)0x10u)), "GFP_NOIO"}, {(unsigned long)((
gfp_t)0x20u), "GFP_HIGH"}, {(unsigned long)(( gfp_t)0x10u), "GFP_WAIT"}, {(unsigned long)(( gfp_t)0x40u), "GFP_IO"}, {(unsigned long)((
gfp_t)0x100u), "GFP_COLD"}, {(unsigned long)(( gfp_t)0x200u), "GFP_NOWARN"}, {(unsigned long)(( gfp_t)0x400u), "GFP_REPEAT"}, {(unsigned
long)(( gfp_t)0x800u), "GFP_NOFAIL"}, {(unsigned long)(( gfp_t)0x1000u), "GFP_NORETRY"}, {(unsigned long)(( gfp_t)0x4000u), "GFP_COMP"},
{(unsigned long)(( gfp_t)0x8000u), "GFP_ZERO"}, {(unsigned long)(( gfp_t)0x10000u), "GFP_NOMEMALLOC"}, {(unsigned long)(( gfp_t)0x20000u),
"GFP_HARDWALL"}, {(unsigned long)(( gfp_t)0x40000u), "GFP_THISNODE"}, {(unsigned long)(( gfp_t)0x80000u), "GFP_RECLAIMABLE"}, {(unsigned
long)(( gfp_t)0x08u), "GFP_MOVABLE"}, {(unsigned long)(( gfp_t)0), "GFP_NOTRACK"}, {(unsigned long)(( gfp_t)0x400000u), "GFP_NO_KSWAPD"},
{(unsigned long)(( gfp_t)0x800000u), "GFP_OTHER_NODE"} ) : "GFP_NOWAIT"

</literallayout>
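Because the field descriptions follow a regular
'field:...; offset:...; size:...;' pattern, they're easy to pick
apart with standard tools. The following is only a sketch: the
function name format_fields is our own invention, and the parsing
assumes the layout shown above. It prints the name, offset, and
size of each field found in a 'format' file read from stdin:

```shell
# List "name offset size" for every field in a tracepoint 'format' file.
format_fields() {
    awk -F';' '/^[ \t]*field:/ {
        # The last word of the declaration is the field name.
        n = split($1, decl, /[ \t]+/)
        # Strip the "offset:" and "size:" labels, leaving the numbers.
        sub(/^[ \t]*offset:/, "", $2)
        sub(/^[ \t]*size:/, "", $3)
        print decl[n], $2, $3
    }'
}
```

Run as 'format_fields < format' from within a tracepoint
directory, this gives a quick overview of the event's record
layout without having to wade through the 'print fmt' text.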

The 'enable' file in the tracepoint directory is what allows
the user (or tools such as trace-cmd) to actually turn the
tracepoint on and off. When enabled, the corresponding
tracepoint will start appearing in the ftrace 'trace'
file described previously. For example, this turns on the
kmalloc tracepoint:
root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 1 > enable

</literallayout>

At the moment, we're not interested in the function tracer or
any other tracer that might be in effect, so we first turn
it off by selecting the 'nop' tracer. Having done that, we
still need to turn tracing on in order to see the events in
the output buffer:
root@sugarbay:/sys/kernel/debug/tracing# echo nop > current_tracer
root@sugarbay:/sys/kernel/debug/tracing# echo 1 > tracing_on

</literallayout>

Now, if we look at the 'trace' file, we see nothing
but the kmalloc events we just turned on:

root@sugarbay:/sys/kernel/debug/tracing# cat trace | less
# tracer: nop
#
# entries-in-buffer/entries-written: 1897/1897   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
        dropbear-1465  [000] ...1 18154.620753: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18154.621640: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
          <idle>-0     [000] ..s3 18154.621656: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
 matchbox-termin-1361  [001] ...1 18154.755472: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f0e00 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
            Xorg-1264  [002] ...1 18154.755581: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
            Xorg-1264  [002] ...1 18154.755583: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
            Xorg-1264  [002] ...1 18154.755589: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
 matchbox-termin-1361  [001] ...1 18155.354594: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db35400 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT
            Xorg-1264  [002] ...1 18155.354703: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
            Xorg-1264  [002] ...1 18155.354705: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
            Xorg-1264  [002] ...1 18155.354711: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
          <idle>-0     [000] ..s3 18155.673319: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
        dropbear-1465  [000] ...1 18155.673525: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18155.674821: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
          <idle>-0     [000] ..s3 18155.793014: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
        dropbear-1465  [000] ...1 18155.793219: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18155.794147: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
          <idle>-0     [000] ..s3 18155.936705: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
        dropbear-1465  [000] ...1 18155.936910: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18155.937869: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
 matchbox-termin-1361  [001] ...1 18155.953667: kmalloc: call_site=ffffffff81614050 ptr=ffff88006d5f2000 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_KERNEL|GFP_REPEAT
            Xorg-1264  [002] ...1 18155.953775: kmalloc: call_site=ffffffff8141abe8 ptr=ffff8800734f4cc0 bytes_req=168 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_NOWARN|GFP_NORETRY
            Xorg-1264  [002] ...1 18155.953777: kmalloc: call_site=ffffffff814192a3 ptr=ffff88001f822520 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|GFP_ZERO
            Xorg-1264  [002] ...1 18155.953783: kmalloc: call_site=ffffffff81419edb ptr=ffff8800721a2f00 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|GFP_ZERO
          <idle>-0     [000] ..s3 18156.176053: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
        dropbear-1465  [000] ...1 18156.176257: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18156.177717: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
          <idle>-0     [000] ..s3 18156.399229: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d555800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
        dropbear-1465  [000] ...1 18156.399434: kmalloc: call_site=ffffffff816650d4 ptr=ffff8800729c3000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_KERNEL
          <idle>-0     [000] ..s3 18156.400660: kmalloc: call_site=ffffffff81619b36 ptr=ffff88006d554800 bytes_req=512 bytes_alloc=512 gfp_flags=GFP_ATOMIC
 matchbox-termin-1361  [001] ...1 18156.552800: kmalloc: call_site=ffffffff81614050 ptr=ffff88006db34800 bytes_req=576 bytes_alloc=1024 gfp_flags=GFP_KERNEL|GFP_REPEAT

</literallayout>
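Since each event is printed as plain 'key=value' text, the trace
output lends itself to simple post-processing. As a minimal
sketch (the function name kmalloc_slack is hypothetical, and the
parsing assumes the kmalloc 'print fmt' shown earlier), the
following sums the bytes_req and bytes_alloc fields of every
kmalloc event read from stdin, giving a rough idea of how much
memory is lost to allocator rounding:

```shell
# Sum requested vs. allocated bytes over all kmalloc events on stdin.
kmalloc_slack() {
    awk '/ kmalloc: / {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^bytes_req=/)   { split($i, kv, "="); req   += kv[2] }
            if ($i ~ /^bytes_alloc=/) { split($i, kv, "="); alloc += kv[2] }
        }
    }
    END {
        printf "requested=%d allocated=%d slack=%d\n", req, alloc, alloc - req
    }'
}
```

Something like 'cat trace | kmalloc_slack' is enough to try it
out. Comment lines and events other than kmalloc are skipped.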

To disable the kmalloc event again, we need to write 0 to the
'enable' file:
root@sugarbay:/sys/kernel/debug/tracing/events/kmem/kmalloc# echo 0 > enable

</literallayout>

You can enable any number of events or complete subsystems
(by using the 'enable' file in the subsystem directory) and
get an arbitrarily fine-grained idea of what's going on in the
system by enabling as many of the appropriate tracepoints
as applicable.

</para>
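<para>
The enable and disable steps shown above are easy to wrap in a
small helper. The sketch below is not part of ftrace: the
function name set_trace_event and the TRACING_DIR override are
assumptions made for this example. It writes 1 or 0 to the
'enable' file of a single tracepoint, or of a whole subsystem
when no event name is given:

```shell
# Usage: set_trace_event 1|0 SUBSYSTEM [EVENT]
# TRACING_DIR defaults to the debugfs location used in this section.
set_trace_event() {
    state=$1 subsystem=$2 event=${3:-}
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}/events/$subsystem
    # With an event name, target that tracepoint; otherwise the subsystem.
    [ -n "$event" ] && dir=$dir/$event
    echo "$state" > "$dir/enable"
}
```

For example, 'set_trace_event 1 kmem kmalloc' turns on just the
kmalloc tracepoint, while 'set_trace_event 1 sched' turns on
every tracepoint in the 'sched' subsystem.
</para>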

<para>

A number of the tools described in this HOWTO do just that,
including trace-cmd and kernelshark in the next section.

</para>

<emphasis>Tying it Together:</emphasis> These tracepoints and their representation
are used not only by ftrace, but by many of the other tools
covered in this document, and they form a central point of
integration for the various tracers available in Linux.
They form a central part of the instrumentation for the
following tools: perf, lttng, ftrace, blktrace and SystemTap.

</informalexample>

<emphasis>Tying it Together:</emphasis> Eventually all the special-purpose tracers
currently available in /sys/kernel/debug/tracing will be
removed and replaced with equivalent tracers based on the
'trace events' subsystem.

</informalexample>

</section>

<title>trace-cmd/kernelshark</title>

<para>
trace-cmd is essentially an extensive command-line 'wrapper'
interface that hides the details of all the individual files
in /sys/kernel/debug/tracing, allowing users to specify
particular events within the
/sys/kernel/debug/tracing/events/ subdirectory and to collect
traces without having to deal with those details directly.

</para>
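<para>
To give a feel for what that looks like in practice, here is a
small sketch that records a set of events for a fixed number of
seconds using 'trace-cmd record' and its '-e' option, which
accepts either an event or a whole subsystem name. The wrapper
function record_events and the RUN variable are inventions of
this example; setting RUN=echo prints the command instead of
running it:

```shell
# Usage: record_events SECONDS EVENT_OR_SUBSYSTEM...
# RUN is a dry-run hook for this sketch (e.g. RUN=echo).
record_events() {
    duration=$1
    shift
    args=""
    for ev in "$@"; do
        args="$args -e $ev"      # one -e option per event or subsystem
    done
    # $args is left unquoted on purpose so it splits into separate options.
    ${RUN:-} trace-cmd record $args sleep "$duration"
}
```

For example, 'record_events 5 i915 drm' would trace the 'i915'
and 'drm' subsystems for five seconds. The resulting trace can
then be displayed with 'trace-cmd report'.
</para>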

<para>

As yet another layer on top of that, kernelshark provides a GUI
that allows users to start and stop traces and specify sets
of events using an intuitive interface, and view the
output as both trace events and as a per-CPU graphical
display. It directly uses 'trace-cmd' as the plumbing
that accomplishes all that underneath the covers (and
actually displays the trace-cmd command it uses, as we'll see).

</para>

<para>

To start a trace using kernelshark, first start kernelshark:
root@sugarbay:~# kernelshark

</literallayout>

Then bring up the 'Capture' dialog by choosing from the
kernelshark menu:

Capture | Record

</literallayout>

That will display the following dialog, which allows you to
choose one or more events (or even one or more complete
subsystems) to trace:

</para>

<para>

</para>

<para>

Note that these are exactly the same sets of events described
in the previous trace events subsystem section, and in fact
that is where trace-cmd gets them for kernelshark.

</para>

<para>

In the above screenshot, we've decided to explore the
graphics subsystem a bit and so have chosen to trace all
the tracepoints contained within the 'i915' and 'drm'
subsystems.

</para>

<para>

After doing that, we can start and stop the trace using
the 'Run' button in the lower right corner of
the dialog (the same button turns into the 'Stop'
button after the trace has started):

</para>

<para>

</para>

<para>

Notice that the right-hand pane shows the exact trace-cmd
command-line that's used to run the trace, along with the
results of the trace-cmd run.

</para>

<para>

Once the 'Stop' button is pressed, the graphical view
fills up with a colorful per-CPU display of the trace data,
along with the detailed event listing below that:

</para>

<para>

</para>

<para>

Here's another example, this time a display resulting
from tracing 'all events':

</para>

<para>

</para>

<para>

The tool is pretty self-explanatory, but for more detailed
information on navigating through the data, see the
<ulink url='http://rostedt.homelinux.com/kernelshark/'>kernelshark website</ulink>.

</para>

</section>

<title>Documentation</title>

<para>
The documentation for ftrace can be found in the kernel
Documentation directory:
Documentation/trace/ftrace.txt

</literallayout>

The documentation for the trace event subsystem can also
be found in the kernel Documentation directory:
Documentation/trace/events.txt

</literallayout>

There is a nice series of articles on using
ftrace and trace-cmd at LWN:
<listitem><para><ulink url='http://lwn.net/Articles/365835/'>Debugging the kernel using Ftrace - part 1</ulink>
</para></listitem>
<listitem><para><ulink url='http://lwn.net/Articles/366796/'>Debugging the kernel using Ftrace - part 2</ulink>
</para></listitem>
<listitem><para><ulink url='http://lwn.net/Articles/370423/'>Secrets of the Ftrace function tracer</ulink>
</para></listitem>
<listitem><para><ulink url='https://lwn.net/Articles/410200/'>trace-cmd: A front-end for Ftrace</ulink>
</para></listitem>

</itemizedlist>

</para>

<para>

There's more detailed documentation on kernelshark usage here:
<ulink url='http://rostedt.homelinux.com/kernelshark/'>KernelShark</ulink>

</para>

<para>

An amusing yet useful README (a tracing mini-HOWTO) can be
found in /sys/kernel/debug/tracing/README.

</para>

</section>

</section>

<title>systemtap</title>

<para>
SystemTap is a system-wide script-based tracing and profiling tool.

</para>

<para>

SystemTap scripts are C-like programs that are executed in the
kernel to gather/print/aggregate data extracted from the context
they end up being invoked under.

</para>

<para>

For example, this probe from the
<ulink url='http://sourceware.org/systemtap/tutorial/'>SystemTap tutorial</ulink>
simply prints a line every time any process on the system open()s
a file. For each line, it prints the executable name of the
program that opened the file, along with its PID, and the name
of the file it opened (or tried to open), which it extracts
from the open syscall's argstr.
probe syscall.open
{
        printf ("%s(%d) open (%s)\n", execname(), pid(), argstr)
}

probe timer.ms(4000) # after 4 seconds
{
        exit ()
}

</literallayout>

Normally, to execute this probe, you'd simply install
systemtap on the system you want to probe, and directly run
the probe on that system, e.g. assuming the name of the file
containing the above text is trace_open.stp:
# stap trace_open.stp

</literallayout>

What systemtap does under the covers to run this probe is 1)
parse and convert the probe to an equivalent 'C' form, 2)
compile the 'C' form into a kernel module, 3) insert the
module into the kernel, which arms it, and 4) collect the data
generated by the probe and display it to the user.

</para>

<para>

In order to accomplish steps 1 and 2, the 'stap' program needs
access to the kernel build system that produced the kernel
that the probed system is running. In the case of a typical
embedded system (the 'target'), the kernel build system
unfortunately isn't typically part of the image running on
the target. It is, however, normally available on the 'host'
system that produced the target image; in such cases,
steps 1 and 2 are executed on the host system, and steps
3 and 4 are executed on the target system, using only the
systemtap 'runtime'.

</para>

<para>

The systemtap support in Yocto assumes that only steps
3 and 4 are run on the target; it is possible to do
everything on the target, but this section assumes only
the typical embedded use-case.

</para>

<para>

So basically what you need to do in order to run a systemtap
script on the target is to 1) on the host system, compile the
probe into a kernel module that makes sense to the target, 2)
copy the module onto the target system, 3) insert the
module into the target kernel, which arms it, and 4) collect
the data generated by the probe and display it to the user.

</para>
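<para>
To make that host/target split a little more concrete, the steps
above can be sketched roughly as follows. This is only an
illustration and not necessarily how any particular helper script
does it: the function name run_probe_on_target, the RUN dry-run
hook, and the ARCH, CROSS_PREFIX and KERNEL_BUILD variables are
all assumptions of this example, and a real cross build may well
need additional 'stap' options (for instance to locate the
tapsets and runtime that match the target):

```shell
# Usage: run_probe_on_target SCRIPT.stp USER@TARGET
# Assumed to be set by the caller (placeholders for your own build):
#   ARCH          target kernel architecture, e.g. x86_64
#   CROSS_PREFIX  cross-toolchain prefix
#   KERNEL_BUILD  path to the target kernel's build tree
# Set RUN=echo to print the commands instead of executing them.
run_probe_on_target() {
    script=$1 target=$2
    module=$(basename "$script" .stp)

    # Host: parse the script, convert it to C and build the module,
    # stopping after the compile pass (-p4).
    ${RUN:-} stap -p4 -m "$module" -a "$ARCH" \
        -B "CROSS_COMPILE=$CROSS_PREFIX" -r "$KERNEL_BUILD" "$script" || return

    # Copy the resulting module onto the target.
    ${RUN:-} scp "$module.ko" "$target:/tmp/" || return

    # Target: insert the module with the systemtap runtime and
    # stream the probe's output back over ssh.
    ${RUN:-} ssh "$target" staprun "/tmp/$module.ko"
}
```

Running it with RUN=echo is a harmless way to see the three
commands that would be issued for a given script and target.
</para>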

<title>Setup</title>

<para>

Those are a lot of steps and a lot of details, but
fortunately Yocto includes a script called 'crosstap'
that will take care of those details, allowing you to
simply execute a systemtap script on the remote target,
with arguments if necessary.

</para>

<para>

In order to do this from a remote host, however, you
need to have access to the build for the image you
booted. The 'crosstap' script provides details on how
to do this if you run the script on the host without having
done a build:

<note>

SystemTap, which uses 'crosstap', assumes you can establish an
ssh connection to the remote target.
Please refer to the crosstap wiki page for details on verifying
ssh connections at
<ulink url='https://wiki.yoctoproject.org/wiki/Tracing_and_Profiling#systemtap'></ulink>.
Also, the ability to ssh into the target system is not enabled
by default in *-minimal images.

</note>

$ crosstap root@192.168.1.88 trace_open.stp

Error: No target kernel build found.
Did you forget to create a local build of your image?

'crosstap' requires a local sdk build of the target system
(or a build that includes 'tools-profile') in order to build
kernel modules that can probe the target system.

Practically speaking, that means you need to do the following:
 - If you're running a pre-built image, download the release
   and/or BSP tarballs used to build the image.
 - If you're working from git sources, just clone the metadata
   and BSP layers needed to build the image you'll be booting.
 - Make sure you're properly set up to build a new image (see
   the BSP README and/or the widely available basic documentation
   that discusses how to build images).
 - Build an -sdk version of the image e.g.:
     $ bitbake core-image-sato-sdk
 OR
 - Build a non-sdk image but include the profiling tools:
     [ edit local.conf and add 'tools-profile' to the end of
       the EXTRA_IMAGE_FEATURES variable ]
     $ bitbake core-image-sato

Once you've build the image on the host system, you're ready to
boot it (or the equivalent pre-built image) and use 'crosstap'
to probe it (you need to source the environment as usual first):

   $ source oe-init-build-env
   $ cd ~/my/systemtap/scripts
   $ crosstap root@192.168.1.xxx myscript.stp

</literallayout>

So essentially what you need to do is build an SDK image or
an image with 'tools-profile' as detailed in the
"<link linkend='profile-manual-general-setup'>General Setup</link>"
section of this manual, and boot the resulting target image.

</para>

<note>

If you have a build directory containing multiple machines,
you need to have the MACHINE you're connecting to selected
in local.conf, and the kernel in that machine's build
directory must match the kernel on the booted system exactly,
or you'll get the above 'crosstap' message when you try to
invoke a script.

</note>

</section>

<title>Running a Script on a Target</title>

<para>
Once you've done that, you should be able to run a systemtap
script on the target:
$ cd /path/to/yocto
$ source oe-init-build-env

### Shell environment set up for builds. ###


You can now run 'bitbake <target>'


Common targets are:


core-image-minimal

core-image-sato

meta-toolchain

meta-ide-support


You can also run generated qemu images with a command like 'runqemu qemux86-64'


</literallayout>

Once you've done that, you can cd to whatever directory
contains your scripts and use 'crosstap' to run the script:
$ cd /path/to/my/systemtap/script
$ crosstap root@192.168.7.2 trace_open.stp

</literallayout>

If you get an error connecting to the target, e.g.:
$ crosstap root@192.168.7.2 trace_open.stp
error establishing ssh connection on remote 'root@192.168.7.2'

</literallayout>

try ssh'ing to the target and see what happens:
$ ssh root@192.168.7.2

</literallayout>

A lot of the time, connection problems are due to specifying a
wrong IP address or having a 'host key verification error'.

</para>

<para>

If everything worked as planned, you should see something
like this (enter the password when prompted, or press enter
if it's set up to use no password):
$ crosstap root@192.168.7.2 trace_open.stp
root@192.168.7.2's password:
matchbox-termin(1036) open ("/tmp/vte3FS2LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)
matchbox-termin(1036) open ("/tmp/vteJMC7LW", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0600)

</literallayout>

</para>

</section>

<title>Documentation</title>

<para>
The SystemTap language reference can be found here:
<ulink url='http://sourceware.org/systemtap/langref/'>SystemTap Language Reference</ulink>

</para>

<para>

Links to other SystemTap documents, tutorials, and examples can be
found here:
<ulink url='http://sourceware.org/systemtap/documentation.html'>SystemTap documentation page</ulink>

</para>

</section>

</section>


<title>Sysprof</title>

<para>
Sysprof is a very easy-to-use system-wide profiler that consists
of a single window with three panes and a few buttons which allow
you to start, stop, and view the profile from one place.

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the
basic setup outlined in the General Setup section.

</para>

<para>

Sysprof is a GUI-based application that runs on the target
system. For the rest of this document we assume you've
ssh'ed to the target from the host and will be running Sysprof
on the target (you can use the '-X' option to ssh and have the
Sysprof GUI run on the target but display remotely on the
host if you want).

</para>

</section>

<title>Basic Usage</title>

<para>
To start profiling the system, simply press the 'Start'
button. To stop profiling and start viewing the profile data
in one easy step, press the 'Profile' button.

</para>

<para>

Once you've pressed the 'Profile' button, the three panes will
fill up with profiling data:

</para>

<para>

</para>

<para>

The left pane shows a list of functions and processes.
Selecting one of those expands that function in the right
pane, showing all its callees. Note that this caller-oriented
display is essentially the inverse of perf's default
callee-oriented callchain display.

</para>

<para>

In the screenshot above, we're focusing on __copy_to_user_ll().
Looking up the callchain, we can see that one of the callers
of __copy_to_user_ll() is sys_read(), along with the complete
callpath between them. Notice that this is essentially a portion
of the same information we saw in the perf display shown in the
perf section of this page.

</para>

<para>

</para>

<para>

Similarly, the above is a snapshot of the Sysprof display of a copy-from-user callchain.

</para>

<para>

Finally, looking at the third Sysprof pane in the lower left, we can see a list of all the callers of a particular function selected in the top left pane. In this case, the lower pane is showing all the callers of __mark_inode_dirty:

</para>

<para>

</para>

<para>

Double-clicking on one of those functions will in turn change the focus to the selected function, and so on.

</para>

<emphasis>Tying it Together:</emphasis> If you like Sysprof's 'caller-oriented' display, you may be able to approximate it in other tools as well. For example, 'perf report' has the -g (--call-graph) option that you can experiment with; one of the options is 'caller' for an inverted caller-based callgraph display.

</informalexample>

</section>

<title>Documentation</title>

<para>
There doesn't seem to be any documentation for Sysprof, but maybe that's because it's pretty self-explanatory. The Sysprof website, however, is here:

2340

<ulink url='http://sysprof.com/'>Sysprof, System-wide Performance Profiler for Linux</ulink>

</para>

</section>

</section>

<title>LTTng (Linux Trace Toolkit, next generation)</title>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the basic setup outlined in the General Setup section. LTTng is run on the target system by ssh'ing to it.

</para>


</section>

<title>Collecting and Viewing Traces</title>

<para>
Once you've applied the above commits and built and booted your image (you need to build the core-image-sato-sdk image or use one of the other methods described in the General Setup section), you're ready to start tracing.

</para>

<title>Collecting and viewing a trace on the target (inside a shell)</title>

<para>
First, from the host, ssh to the target:
$ ssh -l root 192.168.1.47
The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
root@192.168.1.47's password:

</literallayout>

Once on the target, use these steps to create a trace:
root@crownbay:~# lttng create
Spawning a session daemon
Session auto-20121015-232120 created.
Traces will be written in /home/root/lttng-traces/auto-20121015-232120

</literallayout>

Enable the events you want to trace (in this case all kernel events):
root@crownbay:~# lttng enable-event --kernel --all
All kernel events are enabled in channel channel0

</literallayout>

Start the trace:

root@crownbay:~# lttng start
Tracing started for session auto-20121015-232120

</literallayout>

Then stop the trace after a while, or after running the particular workload that you want to trace:
root@crownbay:~# lttng stop
Tracing stopped for session auto-20121015-232120

</literallayout>

You can now view the trace in text form on the target:
root@crownbay:~# lttng view
[23:21:56.989270399] (+?.?????????) sys_geteuid: { 1 }, { }
[23:21:56.989278081] (+0.000007682) exit_syscall: { 1 }, { ret = 0 }
[23:21:56.989286043] (+0.000007962) sys_pipe: { 1 }, { fildes = 0xB77B9E8C }
[23:21:56.989321802] (+0.000035759) exit_syscall: { 1 }, { ret = 0 }
[23:21:56.989329345] (+0.000007543) sys_mmap_pgoff: { 1 }, { addr = 0x0, len = 10485760, prot = 3, flags = 131362, fd = 4294967295, pgoff = 0 }
[23:21:56.989351694] (+0.000022349) exit_syscall: { 1 }, { ret = -1247805440 }
[23:21:56.989432989] (+0.000081295) sys_clone: { 1 }, { clone_flags = 0x411, newsp = 0xB5EFFFE4, parent_tid = 0xFFFFFFFF, child_tid = 0x0 }
[23:21:56.989477129] (+0.000044140) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 681660, vruntime = 43367983388 }
[23:21:56.989486697] (+0.000009568) sched_migrate_task: { 1 }, { comm = "lttng-consumerd", tid = 1193, prio = 20, orig_cpu = 1, dest_cpu = 1 }
[23:21:56.989508418] (+0.000021721) hrtimer_init: { 1 }, { hrtimer = 3970832076, clockid = 1, mode = 1 }
[23:21:56.989770462] (+0.000262044) hrtimer_cancel: { 1 }, { hrtimer = 3993865440 }
[23:21:56.989771580] (+0.000001118) hrtimer_cancel: { 0 }, { hrtimer = 3993812192 }
[23:21:56.989776957] (+0.000005377) hrtimer_expire_entry: { 1 }, { hrtimer = 3993865440, now = 79815980007057, function = 3238465232 }
[23:21:56.989778145] (+0.000001188) hrtimer_expire_entry: { 0 }, { hrtimer = 3993812192, now = 79815980008174, function = 3238465232 }
[23:21:56.989791695] (+0.000013550) softirq_raise: { 1 }, { vec = 1 }
[23:21:56.989795396] (+0.000003701) softirq_raise: { 0 }, { vec = 1 }
[23:21:56.989800635] (+0.000005239) softirq_raise: { 0 }, { vec = 9 }
[23:21:56.989807130] (+0.000006495) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 330710, vruntime = 43368314098 }
[23:21:56.989809993] (+0.000002863) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 1015313, vruntime = 36976733240 }
[23:21:56.989818514] (+0.000008521) hrtimer_expire_exit: { 0 }, { hrtimer = 3993812192 }
[23:21:56.989819631] (+0.000001117) hrtimer_expire_exit: { 1 }, { hrtimer = 3993865440 }
[23:21:56.989821866] (+0.000002235) hrtimer_start: { 0 }, { hrtimer = 3993812192, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
[23:21:56.989822984] (+0.000001118) hrtimer_start: { 1 }, { hrtimer = 3993865440, function = 3238465232, expires = 79815981000000, softexpires = 79815981000000 }
[23:21:56.989832762] (+0.000009778) softirq_entry: { 1 }, { vec = 1 }
[23:21:56.989833879] (+0.000001117) softirq_entry: { 0 }, { vec = 1 }
[23:21:56.989838069] (+0.000004190) timer_cancel: { 1 }, { timer = 3993871956 }
[23:21:56.989839187] (+0.000001118) timer_cancel: { 0 }, { timer = 3993818708 }
[23:21:56.989841492] (+0.000002305) timer_expire_entry: { 1 }, { timer = 3993871956, now = 79515980, function = 3238277552 }
[23:21:56.989842819] (+0.000001327) timer_expire_entry: { 0 }, { timer = 3993818708, now = 79515980, function = 3238277552 }
[23:21:56.989854831] (+0.000012012) sched_stat_runtime: { 1 }, { comm = "lttng-consumerd", tid = 1193, runtime = 49237, vruntime = 43368363335 }
[23:21:56.989855949] (+0.000001118) sched_stat_runtime: { 0 }, { comm = "lttng-sessiond", tid = 1181, runtime = 45121, vruntime = 36976778361 }
[23:21:56.989861257] (+0.000005308) sched_stat_sleep: { 1 }, { comm = "kworker/1:1", tid = 21, delay = 9451318 }
[23:21:56.989862374] (+0.000001117) sched_stat_sleep: { 0 }, { comm = "kworker/0:0", tid = 4, delay = 9958820 }
[23:21:56.989868241] (+0.000005867) sched_wakeup: { 0 }, { comm = "kworker/0:0", tid = 4, prio = 120, success = 1, target_cpu = 0 }
[23:21:56.989869358] (+0.000001117) sched_wakeup: { 1 }, { comm = "kworker/1:1", tid = 21, prio = 120, success = 1, target_cpu = 1 }
[23:21:56.989877460] (+0.000008102) timer_expire_exit: { 1 }, { timer = 3993871956 }
[23:21:56.989878577] (+0.000001117) timer_expire_exit: { 0 }, { timer = 3993818708 }

.

.

.

</literallayout>

You can now safely destroy the trace session (note that this doesn't delete the trace - it's still there in ~/lttng-traces):
root@crownbay:~# lttng destroy
Session auto-20121015-232120 destroyed at /home/root

</literallayout>

The trace is saved under the ~/lttng-traces directory, in a subdirectory named after the session that 'lttng create' reported (you can change this name by supplying your own to 'lttng create'):
root@crownbay:~# ls -al ~/lttng-traces
drwxrwx--- 3 root root 1024 Oct 15 23:21 .
drwxr-xr-x 5 root root 1024 Oct 15 23:57 ..
drwxrwx--- 3 root root 1024 Oct 15 23:21 auto-20121015-232120

</literallayout>

</para>
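<para>
The text that 'lttng view' prints is regular enough to post-process with a short script. The following is a minimal Python sketch, not part of LTTng itself, that tallies events by name. It assumes the line layout shown in the listing above, and it assumes that the first braced group holds the CPU id. The names parse_line, count_events, and SAMPLE are hypothetical.
</para>

```python
import re
from collections import Counter

# Assumed layout, taken from the listing above:
#   [time] (+delta) event_name: { cpu }, { payload }
LINE_RE = re.compile(
    r"^\[([\d:.]+)\]\s+"      # 1: wall-clock timestamp
    r"\(\+([\d.?]+)\)\s+"     # 2: delta from the previous event ('?' on the first line)
    r"(\w+):\s+"              # 3: event name
    r"\{\s*(\d+)\s*\},\s+"    # 4: CPU id (assumption)
    r"\{(.*)\}\s*$"           # 5: event payload
)


def parse_line(line):
    """Return a dict for one event line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    delta = m.group(2)
    return {
        "time": m.group(1),
        "delta": None if "?" in delta else float(delta),
        "event": m.group(3),
        "cpu": int(m.group(4)),
        "payload": m.group(5).strip(),
    }


def count_events(lines):
    """Count how often each event name appears in an iterable of lines."""
    parsed = (parse_line(line) for line in lines)
    return Counter(rec["event"] for rec in parsed if rec is not None)


# The first four lines of the listing above.
SAMPLE = """\
[23:21:56.989270399] (+?.?????????) sys_geteuid: { 1 }, { }
[23:21:56.989278081] (+0.000007682) exit_syscall: { 1 }, { ret = 0 }
[23:21:56.989286043] (+0.000007962) sys_pipe: { 1 }, { fildes = 0xB77B9E8C }
[23:21:56.989321802] (+0.000035759) exit_syscall: { 1 }, { ret = 0 }
"""

if __name__ == "__main__":
    for name, count in count_events(SAMPLE.splitlines()).most_common():
        print(name, count)
```

<para>
Lines that don't match the assumed layout are skipped rather than treated as errors, so the same function can be fed the complete output of 'lttng view' saved to a file.
</para>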

</section>

<title>Collecting and viewing a userspace trace on the target (inside a shell)</title>

<para>
For LTTng userspace tracing, you need to have a properly instrumented userspace program. For this example, we'll use the 'hello' test program generated by the lttng-ust build.

</para>

<para>

The 'hello' test program isn't installed on the rootfs by the lttng-ust build, so we need to copy it over manually. First cd into the build directory that contains the hello executable:
$ cd build/tmp/work/core2_32-poky-linux/lttng-ust/2.0.5-r0/git/tests/hello/.libs

</literallayout>

Copy that over to the target machine:
$ scp hello root@192.168.1.20:

</literallayout>

You now have the instrumented lttng 'hello world' test program on the target, ready to test.

</para>

<para>

First, from the host, ssh to the target:
$ ssh -l root 192.168.1.47
The authenticity of host '192.168.1.47 (192.168.1.47)' can't be established.
RSA key fingerprint is 23:bd:c8:b1:a8:71:52:00:ee:00:4f:64:9e:10:b9:7e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.47' (RSA) to the list of known hosts.
root@192.168.1.47's password:

</literallayout>

Once on the target, use these steps to create a trace:
root@crownbay:~# lttng create
Session auto-20190303-021943 created.
Traces will be written in /home/root/lttng-traces/auto-20190303-021943

</literallayout>

Enable the events you want to trace (in this case all userspace events):
root@crownbay:~# lttng enable-event --userspace --all
All UST events are enabled in channel channel0

</literallayout>

Start the trace:

root@crownbay:~# lttng start
Tracing started for session auto-20190303-021943

</literallayout>

Run the instrumented hello world program:
root@crownbay:~# ./hello

Hello, World!

Tracing... done.

</literallayout>

Then stop the trace once the workload you want to trace has finished:
root@crownbay:~# lttng stop
Tracing stopped for session auto-20190303-021943

</literallayout>

You can now view the trace in text form on the target:
root@crownbay:~# lttng view
[02:31:14.906146544] (+?.?????????) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 0, intfield2 = 0x0, longfield = 0, netintfield = 0, netintfieldhex = 0x0, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906170360] (+0.000023816) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 1, intfield2 = 0x1, longfield = 1, netintfield = 1, netintfieldhex = 0x1, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906183140] (+0.000012780) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 2, intfield2 = 0x2, longfield = 2, netintfield = 2, netintfieldhex = 0x2, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }
[02:31:14.906194385] (+0.000011245) hello:1424 ust_tests_hello:tptest: { cpu_id = 1 }, { intfield = 3, intfield2 = 0x3, longfield = 3, netintfield = 3, netintfieldhex = 0x3, arrfield1 = [ [0] = 1, [1] = 2, [2] = 3 ], arrfield2 = "test", _seqfield1_length = 4, seqfield1 = [ [0] = 116, [1] = 101, [2] = 115, [3] = 116 ], _seqfield2_length = 4, seqfield2 = "test", stringfield = "test", floatfield = 2222, doublefield = 2, boolfield = 1 }

.

.

.

</literallayout>

You can now safely destroy the trace session (note that this doesn't delete the trace - it's still there in ~/lttng-traces):
root@crownbay:~# lttng destroy
Session auto-20190303-021943 destroyed at /home/root

</literallayout>

</para>
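<para>
Userspace event lines carry a little more header information than kernel ones: the process name and pid, and the tracepoint name in provider:event form. As a rough illustration, the Python sketch below pulls those header fields and one integer payload field out of a line. It assumes the layout shown in the listing above; parse_ust_header and int_field are hypothetical helper names, and the sample payload is abbreviated from the listing.
</para>

```python
import re

# Assumed layout, taken from the listing above:
#   [time] (+delta) procname:pid provider:event: { cpu_id = N }, { payload }
UST_RE = re.compile(
    r"^\[([\d:.]+)\]\s+"              # 1: wall-clock timestamp
    r"\(\+([\d.?]+)\)\s+"             # 2: delta from the previous event
    r"(\S+):(\d+)\s+"                 # 3: process name, 4: pid
    r"(\w+):(\w+):\s+"                # 5: tracepoint provider, 6: event name
    r"\{\s*cpu_id = (\d+)\s*\}"       # 7: CPU id
)


def parse_ust_header(line):
    """Return the header fields of one userspace event line, or None."""
    m = UST_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": m.group(1),
        "procname": m.group(3),
        "pid": int(m.group(4)),
        "provider": m.group(5),
        "event": m.group(6),
        "cpu": int(m.group(7)),
    }


def int_field(line, name):
    """Return the first 'name = value' integer (decimal or 0x hex), or None."""
    m = re.search(r"\b" + re.escape(name) + r" = (0x[0-9A-Fa-f]+|-?\d+)", line)
    return None if m is None else int(m.group(1), 0)


# Second event of the listing above, with the payload abbreviated.
SAMPLE = (
    "[02:31:14.906170360] (+0.000023816) hello:1424 ust_tests_hello:tptest: "
    "{ cpu_id = 1 }, { intfield = 1, intfield2 = 0x1, longfield = 1 }"
)

if __name__ == "__main__":
    print(parse_ust_header(SAMPLE))
    print(int_field(SAMPLE, "intfield2"))
```

<para>
The word boundary in int_field keeps a lookup of 'intfield' from matching inside 'netintfield'. Array and string fields are deliberately left alone, since parsing them robustly needs more than a regular expression.
</para>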

</section>


</section>

<title>Documentation</title>

<para>
You can find the primary LTTng documentation on the
<ulink url='https://lttng.org/docs/'>LTTng Documentation</ulink>
site. The documentation on this site is appropriate for intermediate to advanced software developers who are working in a Linux environment and are interested in efficient software tracing.

</para>

<para>

For information on LTTng in general, visit the
<ulink url='http://lttng.org/lttng2.0'>LTTng Project</ulink>
site. You can find a "Getting Started" link on this site that takes you to an LTTng Quick Start.

</para>


</section>

</section>

<title>blktrace</title>

<para>
blktrace is a tool for tracing and reporting low-level disk I/O. blktrace provides the tracing half of the equation; its output can be piped into the blkparse program, which renders the data in a human-readable form and does some basic analysis.

</para>

<title>Setup</title>

<para>

For this section, we'll assume you've already performed the basic setup outlined in the
"<link linkend='profile-manual-general-setup'>General Setup</link>"
section.

</para>

<para>

blktrace is an application that runs on the target system. You can run the entire blktrace and blkparse pipeline on the target, or you can run blktrace in 'listen' mode on the target and have blktrace and blkparse collect and analyze the data on the host (see the
"<link linkend='using-blktrace-remotely'>Using blktrace Remotely</link>"
section below). For the rest of this section we assume you've ssh'ed from the host to the target and will be running blktrace on the target.

</para>

</section>

<title>Basic Usage</title>

<para>
To record a trace, simply run the 'blktrace' command, giving it the name of the block device you want to trace activity on:
root@crownbay:~# blktrace /dev/sdc

</literallayout>

In another shell, execute a workload you want to trace.
root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA

</literallayout>

Press Ctrl-C in the blktrace shell to stop the trace. It will display how many events were logged, along with the per-cpu file sizes (blktrace records traces in per-cpu kernel buffers and simply dumps them to userspace for blkparse to merge and sort later).
^C=== sdc ===
CPU 0: 7082 events, 332 KiB data
CPU 1: 1578 events, 74 KiB data
Total: 8660 events (dropped 0), 406 KiB data

</literallayout>

If you examine the files saved to disk, you see multiple files, one per CPU and with the device name as the first part of the filename:
root@crownbay:~# ls -al
drwxr-xr-x 6 root root 1024 Oct 27 22:39 .
drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
-rw-r--r-- 1 root root 339938 Oct 27 22:40 sdc.blktrace.0
-rw-r--r-- 1 root root 75753 Oct 27 22:40 sdc.blktrace.1

</literallayout>

To view the trace events, simply invoke 'blkparse' in the directory containing the trace files, giving it the device name that forms the first part of the filenames:
root@crownbay:~# blkparse sdc

8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 2 0.000025213 1225 G WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 3 0.000033384 1225 P N [jbd2/sdc-8]
8,32 1 4 0.000043301 1225 I WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 0 0.000057270 0 m N cfq1225 insert_request
8,32 1 0 0.000064813 0 m N cfq1225 add_to_rr
8,32 1 5 0.000076336 1225 U N [jbd2/sdc-8] 1
8,32 1 0 0.000088559 0 m N cfq workload slice:150
8,32 1 0 0.000097359 0 m N cfq1225 set_active wl_prio:0 wl_type:1
8,32 1 0 0.000104063 0 m N cfq1225 Not idling. st->count:1
8,32 1 0 0.000112584 0 m N cfq1225 fifo= (null)
8,32 1 0 0.000118730 0 m N cfq1225 dispatch_insert
8,32 1 0 0.000127390 0 m N cfq1225 dispatched a request
8,32 1 0 0.000133536 0 m N cfq1225 activate rq, drv=1
8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 7 0.000360381 1225 Q WS 3417056 + 8 [jbd2/sdc-8]
8,32 1 8 0.000377422 1225 G WS 3417056 + 8 [jbd2/sdc-8]
8,32 1 9 0.000388876 1225 P N [jbd2/sdc-8]
8,32 1 10 0.000397886 1225 Q WS 3417064 + 8 [jbd2/sdc-8]
8,32 1 11 0.000404800 1225 M WS 3417064 + 8 [jbd2/sdc-8]
8,32 1 12 0.000412343 1225 Q WS 3417072 + 8 [jbd2/sdc-8]
8,32 1 13 0.000416533 1225 M WS 3417072 + 8 [jbd2/sdc-8]
8,32 1 14 0.000422121 1225 Q WS 3417080 + 8 [jbd2/sdc-8]
8,32 1 15 0.000425194 1225 M WS 3417080 + 8 [jbd2/sdc-8]
8,32 1 16 0.000431968 1225 Q WS 3417088 + 8 [jbd2/sdc-8]
8,32 1 17 0.000435251 1225 M WS 3417088 + 8 [jbd2/sdc-8]
8,32 1 18 0.000440279 1225 Q WS 3417096 + 8 [jbd2/sdc-8]
8,32 1 19 0.000443911 1225 M WS 3417096 + 8 [jbd2/sdc-8]
8,32 1 20 0.000450336 1225 Q WS 3417104 + 8 [jbd2/sdc-8]
8,32 1 21 0.000454038 1225 M WS 3417104 + 8 [jbd2/sdc-8]
8,32 1 22 0.000462070 1225 Q WS 3417112 + 8 [jbd2/sdc-8]
8,32 1 23 0.000465422 1225 M WS 3417112 + 8 [jbd2/sdc-8]
8,32 1 24 0.000474222 1225 I WS 3417056 + 64 [jbd2/sdc-8]
8,32 1 0 0.000483022 0 m N cfq1225 insert_request
8,32 1 25 0.000489727 1225 U N [jbd2/sdc-8] 1
8,32 1 0 0.000498457 0 m N cfq1225 Not idling. st->count:1
8,32 1 0 0.000503765 0 m N cfq1225 dispatch_insert
8,32 1 0 0.000512914 0 m N cfq1225 dispatched a request
8,32 1 0 0.000518851 0 m N cfq1225 activate rq, drv=2

.

.

.

8,32 0 0 58.515006138 0 m N cfq3551 complete rqnoidle 1
8,32 0 2024 58.516603269 3 C WS 3156992 + 16 [0]
8,32 0 0 58.516626736 0 m N cfq3551 complete rqnoidle 1
8,32 0 0 58.516634558 0 m N cfq3551 arm_idle: 8 group_idle: 0
8,32 0 0 58.516636933 0 m N cfq schedule dispatch
8,32 1 0 58.516971613 0 m N cfq3551 slice expired t=0
8,32 1 0 58.516982089 0 m N cfq3551 sl_used=13 disp=6 charge=13 iops=0 sect=80
8,32 1 0 58.516985511 0 m N cfq3551 del_from_rr
8,32 1 0 58.516990819 0 m N cfq3551 put_queue

CPU0 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 331, 26,284KiB
Read Dispatches: 0, 0KiB Write Dispatches: 485, 40,484KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 511, 41,000KiB
Read Merges: 0, 0KiB Write Merges: 13, 160KiB
Read depth: 0 Write depth: 2
IO unplugs: 23 Timer unplugs: 0
CPU1 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 249, 15,800KiB
Read Dispatches: 0, 0KiB Write Dispatches: 42, 1,600KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 16, 1,084KiB
Read Merges: 0, 0KiB Write Merges: 40, 276KiB
Read depth: 0 Write depth: 2
IO unplugs: 30 Timer unplugs: 1

Total (sdc):
Reads Queued: 0, 0KiB Writes Queued: 580, 42,084KiB
Read Dispatches: 0, 0KiB Write Dispatches: 527, 42,084KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 527, 42,084KiB
Read Merges: 0, 0KiB Write Merges: 53, 436KiB
IO unplugs: 53 Timer unplugs: 1

Throughput (R/W): 0KiB/s / 719KiB/s
Events (sdc): 6,592 entries
Skips: 0 forward (0 - 0.0%)
Input file sdc.blktrace.0 added
Input file sdc.blktrace.1 added

</literallayout>

The report shows each event that was found in the blktrace data, along with a summary of the overall block I/O traffic during the run. You can look at the
<ulink url='http://linux.die.net/man/1/blkparse'>blkparse</ulink>
manpage to learn the meaning of each field displayed in the trace listing.

</para>
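<para>
To make the field layout of the per-event lines concrete, here is a minimal Python sketch that splits a line of blkparse's default output into named fields. The field order (device, CPU, sequence number, timestamp, pid, action, RWBS flags, then an action-specific remainder) is read off the listing above; the names BlkEvent and parse_event are hypothetical, and a custom blkparse output format would need a different parser.
</para>

```python
import re
from collections import Counter, namedtuple

BlkEvent = namedtuple(
    "BlkEvent", "device cpu seq time pid action rwbs sector blocks rest"
)

# Matches the "start-sector + block-count" pair that follows the RWBS field
# on I/O events such as Q, G, I, D, M, and C.
SECTOR_RE = re.compile(r"^(\d+) \+ (\d+)")


def parse_event(line):
    """Return a BlkEvent for a per-event line, or None for anything else."""
    parts = line.split(None, 7)
    # Event lines start with "major,minor"; summary lines do not.
    if len(parts) not in (7, 8) or "," not in parts[0]:
        return None
    rest = parts[7] if len(parts) == 8 else ""
    try:
        cpu, seq, pid = int(parts[1]), int(parts[2]), int(parts[4])
        time = float(parts[3])
    except ValueError:
        return None
    m = SECTOR_RE.match(rest)
    sector, blocks = (int(m.group(1)), int(m.group(2))) if m else (None, None)
    return BlkEvent(parts[0], cpu, seq, time, pid, parts[5], parts[6],
                    sector, blocks, rest)


# Four lines copied from the listing above.
SAMPLE = """\
8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8]
8,32 1 3 0.000033384 1225 P N [jbd2/sdc-8]
8,32 1 0 0.000057270 0 m N cfq1225 insert_request
8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8]
"""

if __name__ == "__main__":
    events = [e for e in map(parse_event, SAMPLE.splitlines()) if e is not None]
    print(Counter(e.action for e in events))
```

<para>
Returning None for non-event lines lets the same function run over a complete blkparse report, including the per-CPU and total summaries at the end.
</para>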

<para>

blktrace and blkparse are designed from the ground up to be able to operate together in a 'pipe mode' where the stdout of blktrace can be fed directly into the stdin of blkparse:
root@crownbay:~# blktrace /dev/sdc -o - | blkparse -i -

</literallayout>

This enables long-lived tracing sessions to run without writing anything to disk. It also lets you look for certain conditions in the trace data in 'real-time', either by viewing the trace output as it scrolls by on the screen or by passing it along to yet another program in the pipeline, such as grep, to identify and capture conditions of interest.

</para>
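<para>
The last stage of such a pipeline doesn't have to be grep. As one possible example, the Python sketch below reports how long each request waited between being queued (a Q event) and being issued to the driver (a D event), matching the two by start sector. This is an illustrative simplification, not a blktrace feature: requests that were merged into another one are dispatched under that request's start sector, so they are simply not reported. The name queue_to_dispatch is hypothetical.
</para>

```python
import re

# device cpu seq time pid action rwbs sector + blocks
IO_RE = re.compile(
    r"^\S+\s+\d+\s+\d+\s+(\d+\.\d+)\s+\d+\s+([A-Z])\s+\S+\s+(\d+) \+ (\d+)"
)


def queue_to_dispatch(lines):
    """Yield (sector, seconds) for each D event whose start sector was queued."""
    queued = {}
    for line in lines:
        m = IO_RE.match(line.strip())
        if m is None:
            continue
        time, action, sector = float(m.group(1)), m.group(2), int(m.group(3))
        if action == "Q":
            # Remember the first time this start sector was queued.
            queued.setdefault(sector, time)
        elif action == "D" and sector in queued:
            yield sector, time - queued.pop(sector)


# Lines copied from the blkparse listing earlier in this section.
SAMPLE = [
    "8,32 1 1 0.000000000 1225 Q WS 3417048 + 8 [jbd2/sdc-8]",
    "8,32 1 0 0.000057270 0 m N cfq1225 insert_request",
    "8,32 1 6 0.000136889 1225 D WS 3417048 + 8 [jbd2/sdc-8]",
]

if __name__ == "__main__":
    for sector, delay in queue_to_dispatch(SAMPLE):
        print(f"sector {sector}: {delay * 1e6:.1f} us from queue to dispatch")
```

<para>
Because queue_to_dispatch is a generator over any iterable of lines, passing it sys.stdin instead of SAMPLE would turn the script into a filter that can sit at the end of the blktrace and blkparse pipeline shown above.
</para>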

<para>

There's also a separate command, btrace, that wraps the above pipeline in a single command, so you don't have to type the whole sequence:
root@crownbay:~# btrace /dev/sdc

</literallayout>

</para>

</section>

<title>Using blktrace Remotely</title>

<para>
blktrace traces block I/O while normally writing its own trace data to a block device, and it's generally not a good idea for the device being traced to be the same device the tracer writes to. For that reason, blktrace provides native support for sending all trace data over the network, which lets you trace without perturbing the traced device at all.

</para>

<para>

To have blktrace operate in this mode, start blktrace on the target system being traced with the -l option, along with the device to trace:
root@crownbay:~# blktrace -l /dev/sdc
server: waiting for connections...

</literallayout>

On the host system, use the -h option to connect to the target system, also passing it the device to trace:
$ blktrace -d /dev/sdc -h 192.168.1.43
blktrace: connecting to 192.168.1.43
blktrace: connected!

</literallayout>

On the target system, you should see this:
server: connection from 192.168.1.43

</literallayout>

In another shell, execute a workload you want to trace.
root@crownbay:/media/sdc# rm linux-2.6.19.2.tar.bz2; wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>; sync
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA

</literallayout>

When it's done, press Ctrl-C on the host system to stop the trace:
^C=== sdc ===
CPU 0: 7691 events, 361 KiB data
CPU 1: 4109 events, 193 KiB data
Total: 11800 events (dropped 0), 554 KiB data

</literallayout>

On the target system, you should also see a trace summary for the trace just ended:
server: end of run for 192.168.1.43:sdc
=== sdc ===
CPU 0: 7691 events, 361 KiB data
CPU 1: 4109 events, 193 KiB data
Total: 11800 events (dropped 0), 554 KiB data

</literallayout>

The blktrace instance on the host will save the target output inside a hostname-timestamp directory:
$ ls -al
drwxr-xr-x 10 root root 1024 Oct 28 02:40 .
drwxr-sr-x 4 root root 1024 Oct 26 18:24 ..
drwxr-xr-x 2 root root 1024 Oct 28 02:40 192.168.1.43-2012-10-28-02:40:56

</literallayout>

cd into that directory to see the output files:
$ ls -l
-rw-r--r-- 1 root root 369193 Oct 28 02:44 sdc.blktrace.0
-rw-r--r-- 1 root root 197278 Oct 28 02:44 sdc.blktrace.1

</literallayout>

And run blkparse on the host system using the device name:
$ blkparse sdc
8,32 1 1 0.000000000 1263 Q RM 6016 + 8 [ls]
8,32 1 0 0.000036038 0 m N cfq1263 alloced
8,32 1 2 0.000039390 1263 G RM 6016 + 8 [ls]
8,32 1 3 0.000049168 1263 I RM 6016 + 8 [ls]
8,32 1 0 0.000056152 0 m N cfq1263 insert_request
8,32 1 0 0.000061600 0 m N cfq1263 add_to_rr
8,32 1 0 0.000075498 0 m N cfq workload slice:300

.

.

.

8,32 0 0 177.266385696 0 m N cfq1267 arm_idle: 8 group_idle: 0
8,32 0 0 177.266388140 0 m N cfq schedule dispatch
8,32 1 0 177.266679239 0 m N cfq1267 slice expired t=0
8,32 1 0 177.266689297 0 m N cfq1267 sl_used=9 disp=6 charge=9 iops=0 sect=56
8,32 1 0 177.266692649 0 m N cfq1267 del_from_rr
8,32 1 0 177.266696560 0 m N cfq1267 put_queue

CPU0 (sdc):
Reads Queued: 0, 0KiB Writes Queued: 270, 21,708KiB
Read Dispatches: 59, 2,628KiB Write Dispatches: 495, 39,964KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 90, 2,752KiB Writes Completed: 543, 41,596KiB
Read Merges: 0, 0KiB Write Merges: 9, 344KiB
Read depth: 2 Write depth: 2
IO unplugs: 20 Timer unplugs: 1
CPU1 (sdc):
Reads Queued: 688, 2,752KiB Writes Queued: 381, 20,652KiB
Read Dispatches: 31, 124KiB Write Dispatches: 59, 2,396KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 11, 764KiB
Read Merges: 598, 2,392KiB Write Merges: 88, 448KiB
Read depth: 2 Write depth: 2
IO unplugs: 52 Timer unplugs: 0

Total (sdc):
Reads Queued: 688, 2,752KiB Writes Queued: 651, 42,360KiB
Read Dispatches: 90, 2,752KiB Write Dispatches: 554, 42,360KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 90, 2,752KiB Writes Completed: 554, 42,360KiB
Read Merges: 598, 2,392KiB Write Merges: 97, 792KiB
IO unplugs: 72 Timer unplugs: 1

Throughput (R/W): 15KiB/s / 238KiB/s
Events (sdc): 9,301 entries
Skips: 0 forward (0 - 0.0%)

</literallayout>

You should see the trace events and summary just as you would have if you'd run the same command on the target.

</para>

</section>

<title>Tracing Block I/O via 'ftrace'</title>

<para>
It's also possible to trace block I/O using only ftrace, which can be useful for casual tracing if you don't want to bother dealing with the userspace tools.

</para>

<para>

To enable tracing for a given device, use /sys/block/xxx/trace/enable, where xxx is the device name. For example, this enables tracing for /dev/sdc:
root@crownbay:/sys/kernel/debug/tracing# echo 1 > /sys/block/sdc/trace/enable

</literallayout>

Once you've selected the device(s) you want to trace, turn the tracer on by selecting 'blk' as the current tracer:
root@crownbay:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph function nop

root@crownbay:/sys/kernel/debug/tracing# echo blk > current_tracer

</literallayout>

Execute the workload you're interested in:
root@crownbay:/sys/kernel/debug/tracing# cat /media/sdc/testfile.txt

</literallayout>

And look at the output (note that we're reading 'trace_pipe' instead of 'trace' to capture this trace, which allows us to wait around on the pipe for data to appear):
root@crownbay:/sys/kernel/debug/tracing# cat trace_pipe
cat-3587 [001] d..1 3023.276361: 8,32 Q R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276410: 8,32 m N cfq3587 alloced
cat-3587 [001] d..1 3023.276415: 8,32 G R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276424: 8,32 P N [cat]
cat-3587 [001] d..2 3023.276432: 8,32 I R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276439: 8,32 m N cfq3587 insert_request
cat-3587 [001] d..1 3023.276445: 8,32 m N cfq3587 add_to_rr
cat-3587 [001] d..2 3023.276454: 8,32 U N [cat] 1
cat-3587 [001] d..1 3023.276464: 8,32 m N cfq workload slice:150
cat-3587 [001] d..1 3023.276471: 8,32 m N cfq3587 set_active wl_prio:0 wl_type:2
cat-3587 [001] d..1 3023.276478: 8,32 m N cfq3587 fifo= (null)
cat-3587 [001] d..1 3023.276483: 8,32 m N cfq3587 dispatch_insert
cat-3587 [001] d..1 3023.276490: 8,32 m N cfq3587 dispatched a request
cat-3587 [001] d..1 3023.276497: 8,32 m N cfq3587 activate rq, drv=1
cat-3587 [001] d..2 3023.276500: 8,32 D R 1699848 + 8 [cat]

</literallayout>

And this turns off tracing for the specified device:
root@crownbay:/sys/kernel/debug/tracing# echo 0 > /sys/block/sdc/trace/enable

</literallayout>

</para>
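<para>
The lines that the blk tracer writes to trace_pipe put the usual ftrace prefix (task, CPU, flags, timestamp) in front of the same device, action, and RWBS fields that blkparse prints. A minimal Python sketch for splitting such a line follows. It assumes the layout shown in the listing above, and the names parse_trace_line and SAMPLE are hypothetical.
</para>

```python
import re
from collections import Counter

# Assumed layout, taken from the listing above:
#   task-pid [cpu] flags timestamp: major,minor action rwbs rest
TRACE_RE = re.compile(
    r"^\s*(.+)-(\d+)\s+"     # 1: task name, 2: pid
    r"\[(\d+)\]\s+"          # 3: CPU number
    r"(\S+)\s+"              # 4: irq/preempt flags
    r"(\d+\.\d+):\s+"        # 5: timestamp in seconds
    r"(\d+,\d+)\s+"          # 6: device as major,minor
    r"(\S+)\s+(\S+)\s*"      # 7: action, 8: RWBS flags
    r"(.*)$"                 # 9: action-specific remainder
)


def parse_trace_line(line):
    """Return a dict for one blk tracer line, or None if it does not match."""
    m = TRACE_RE.match(line)
    if m is None:
        return None
    return {
        "task": m.group(1),
        "pid": int(m.group(2)),
        "cpu": int(m.group(3)),
        "flags": m.group(4),
        "time": float(m.group(5)),
        "device": m.group(6),
        "action": m.group(7),
        "rwbs": m.group(8),
        "rest": m.group(9),
    }


# Three lines copied from the listing above.
SAMPLE = """\
cat-3587 [001] d..1 3023.276361: 8,32 Q R 1699848 + 8 [cat]
cat-3587 [001] d..1 3023.276424: 8,32 P N [cat]
cat-3587 [001] d..2 3023.276500: 8,32 D R 1699848 + 8 [cat]
"""

if __name__ == "__main__":
    records = [parse_trace_line(line) for line in SAMPLE.splitlines()]
    print(Counter(r["action"] for r in records if r is not None))
```

<para>
Reading from trace_pipe blocks until data arrives, so a script built on this would typically iterate over the open file line by line rather than read it all at once.
</para>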

</section>

</section>

<title>Documentation</title>

<para>
Online versions of the man pages for the commands discussed in this section can be found here:
<listitem><para><ulink url='http://linux.die.net/man/8/blktrace'>http://linux.die.net/man/8/blktrace</ulink>
</para></listitem>
<listitem><para><ulink url='http://linux.die.net/man/1/blkparse'>http://linux.die.net/man/1/blkparse</ulink>
</para></listitem>
<listitem><para><ulink url='http://linux.die.net/man/8/btrace'>http://linux.die.net/man/8/btrace</ulink>
</para></listitem>

</itemizedlist>

</para>

<para>

The above manpages, along with manpages for the other blktrace utilities (btt, blkiomon, etc.), can be found in the /doc directory of the blktrace tools git repo:
$ git clone git://git.kernel.dk/blktrace.git

</literallayout>

</para>

</section>

</section>

</chapter>

<!--

vim: expandtab tw=80 ts=4


-->