blob: c81feb6ae1f3df2e0d8ee58f8ce508cbe0934375 [file] [log] [blame]
Austin Schuh745610d2015-09-06 18:19:50 -07001<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
2<HTML>
3
4<HEAD>
5 <link rel="stylesheet" href="designstyle.css">
6 <title>Gperftools CPU Profiler</title>
7</HEAD>
8
9<BODY>
10
11<p align=right>
12 <i>Last modified
13 <script type=text/javascript>
14 var lm = new Date(document.lastModified);
15 document.write(lm.toDateString());
16 </script></i>
17</p>
18
19<p>This is the CPU profiler we use at Google. There are three parts
20to using it: linking the library into an application, running the
21code, and analyzing the output.</p>
22
23<p>On the off-chance that you should need to understand it, the CPU
24profiler data file format is documented separately,
25<a href="cpuprofile-fileformat.html">here</a>.
26
27
28<H1>Linking in the Library</H1>
29
30<p>To install the CPU profiler into your executable, add
31<code>-lprofiler</code> to the link-time step for your executable.
32(It's also probably possible to add in the profiler at run-time using
33<code>LD_PRELOAD</code>, e.g.
34<code>% env LD_PRELOAD="/usr/lib/libprofiler.so" &lt;binary&gt;</code>,
35but this isn't necessarily recommended.)</p>
36
37<p>This does <i>not</i> turn on CPU profiling; it just inserts the
38code. For that reason, it's practical to just always link
39<code>-lprofiler</code> into a binary while developing; that's what we
40do at Google. (However, since any user can turn on the profiler by
41setting an environment variable, it's not necessarily recommended to
42install profiler-linked binaries into a production, running
43system.)</p>
44
45
46<H1>Running the Code</H1>
47
48<p>There are several alternatives to actually turn on CPU profiling
49for a given run of an executable:</p>
50
51<ol>
52 <li> <p>Define the environment variable CPUPROFILE to the filename
53 to dump the profile to. For instance, if you had a version of
54 <code>/bin/ls</code> that had been linked against libprofiler,
55 you could run:</p>
56 <pre>% env CPUPROFILE=ls.prof /bin/ls</pre>
57 </li>
58 <li> <p>In addition to defining the environment variable CPUPROFILE
59 you can also define CPUPROFILESIGNAL. This allows profiling to be
60 controlled via the signal number that you specify. The signal number
61 must be unused by the program under normal operation. Internally it
62 acts as a switch, triggered by the signal, which is off by default.
63 For instance, if you had a copy of <code>/bin/chrome</code> that had been
64 been linked against libprofiler, you could run:</p>
65 <pre>% env CPUPROFILE=chrome.prof CPUPROFILESIGNAL=12 /bin/chrome &</pre>
66 <p>You can then trigger profiling to start:</p>
67 <pre>% killall -12 chrome</pre>
68 <p>Then after a period of time you can tell it to stop which will
69 generate the profile:</p>
70 <pre>% killall -12 chrome</pre>
71 </li>
72 <li> <p>In your code, bracket the code you want profiled in calls to
73 <code>ProfilerStart()</code> and <code>ProfilerStop()</code>.
74 (These functions are declared in <code>&lt;gperftools/profiler.h&gt;</code>.)
75 <code>ProfilerStart()</code> will take
76 the profile-filename as an argument.</p>
77 </li>
78</ol>
79
80<p>In Linux 2.6 and above, profiling works correctly with threads,
81automatically profiling all threads. In Linux 2.4, profiling only
82profiles the main thread (due to a kernel bug involving itimers and
83threads). Profiling works correctly with sub-processes: each child
84process gets its own profile with its own name (generated by combining
85CPUPROFILE with the child's process id).</p>
86
87<p>For security reasons, CPU profiling will not write to a file -- and
88is thus not usable -- for setuid programs.</p>
89
90<p>See the include-file <code>gperftools/profiler.h</code> for
91advanced-use functions, including <code>ProfilerFlush()</code> and
92<code>ProfilerStartWithOptions()</code>.</p>
93
94
95<H2>Modifying Runtime Behavior</H2>
96
97<p>You can more finely control the behavior of the CPU profiler via
98environment variables.</p>
99
100<table frame=box rules=sides cellpadding=5 width=100%>
101
102<tr valign=top>
103 <td><code>CPUPROFILE_FREQUENCY=<i>x</i></code></td>
104 <td>default: 100</td>
105 <td>
106 How many interrupts/second the cpu-profiler samples.
107 </td>
108</tr>
109
110<tr valign=top>
111 <td><code>CPUPROFILE_REALTIME=1</code></td>
112 <td>default: [not set]</td>
113 <td>
114 If set to any value (including 0 or the empty string), use
115 ITIMER_REAL instead of ITIMER_PROF to gather profiles. In
116 general, ITIMER_REAL is not as accurate as ITIMER_PROF, and also
117 interacts badly with use of alarm(), so prefer ITIMER_PROF unless
118 you have a reason prefer ITIMER_REAL.
119 </td>
120</tr>
121
122</table>
123
124
125<h1><a name="pprof">Analyzing the Output</a></h1>
126
127<p><code>pprof</code> is the script used to analyze a profile. It has
128many output modes, both textual and graphical. Some give just raw
129numbers, much like the <code>-pg</code> output of <code>gcc</code>,
130and others show the data in the form of a dependency graph.</p>
131
132<p>pprof <b>requires</b> <code>perl5</code> to be installed to run.
133It also requires <code>dot</code> to be installed for any of the
134graphical output routines, and <code>gv</code> to be installed for
135<code>--gv</code> mode (described below).
136</p>
137
138<p>Here are some ways to call pprof. These are described in more
139detail below.</p>
140
141<pre>
142% pprof /bin/ls ls.prof
143 Enters "interactive" mode
144% pprof --text /bin/ls ls.prof
145 Outputs one line per procedure
146% pprof --gv /bin/ls ls.prof
147 Displays annotated call-graph via 'gv'
148% pprof --gv --focus=Mutex /bin/ls ls.prof
149 Restricts to code paths including a .*Mutex.* entry
150% pprof --gv --focus=Mutex --ignore=string /bin/ls ls.prof
151 Code paths including Mutex but not string
152% pprof --list=getdir /bin/ls ls.prof
153 (Per-line) annotated source listing for getdir()
154% pprof --disasm=getdir /bin/ls ls.prof
155 (Per-PC) annotated disassembly for getdir()
156% pprof --text localhost:1234
157 Outputs one line per procedure for localhost:1234
158% pprof --callgrind /bin/ls ls.prof
159 Outputs the call information in callgrind format
160</pre>
161
162
163<h3>Analyzing Text Output</h3>
164
165<p>Text mode has lines of output that look like this:</p>
166<pre>
167 14 2.1% 17.2% 58 8.7% std::_Rb_tree::find
168</pre>
169
170<p>Here is how to interpret the columns:</p>
171<ol>
172 <li> Number of profiling samples in this function
173 <li> Percentage of profiling samples in this function
174 <li> Percentage of profiling samples in the functions printed so far
175 <li> Number of profiling samples in this function and its callees
176 <li> Percentage of profiling samples in this function and its callees
177 <li> Function name
178</ol>
179
180<h3>Analyzing Callgrind Output</h3>
181
182<p>Use <a href="http://kcachegrind.sourceforge.net">kcachegrind</a> to
183analyze your callgrind output:</p>
184<pre>
185% pprof --callgrind /bin/ls ls.prof > ls.callgrind
186% kcachegrind ls.callgrind
187</pre>
188
189<p>The cost is specified in 'hits', i.e. how many times a function
190appears in the recorded call stack information. The 'calls' from
191function a to b record how many times function b was found in the
192stack traces directly below function a.</p>
193
194<p>Tip: if you use a debug build the output will include file and line
195number information and kcachegrind will show an annotated source
196code view.</p>
197
198<h3>Node Information</h3>
199
200<p>In the various graphical modes of pprof, the output is a call graph
201annotated with timing information, like so:</p>
202
203<A HREF="pprof-test-big.gif">
204<center><table><tr><td>
205 <img src="pprof-test.gif">
206</td></tr></table></center>
207</A>
208
209<p>Each node represents a procedure. The directed edges indicate
210caller to callee relations. Each node is formatted as follows:</p>
211
212<center><pre>
213Class Name
214Method Name
215local (percentage)
216<b>of</b> cumulative (percentage)
217</pre></center>
218
219<p>The last one or two lines contains the timing information. (The
220profiling is done via a sampling method, where by default we take 100
221samples a second. Therefor one unit of time in the output corresponds
222to about 10 milliseconds of execution time.) The "local" time is the
223time spent executing the instructions directly contained in the
224procedure (and in any other procedures that were inlined into the
225procedure). The "cumulative" time is the sum of the "local" time and
226the time spent in any callees. If the cumulative time is the same as
227the local time, it is not printed.</p>
228
229<p>For instance, the timing information for test_main_thread()
230indicates that 155 units (about 1.55 seconds) were spent executing the
231code in <code>test_main_thread()</code> and 200 units were spent while
232executing <code>test_main_thread()</code> and its callees such as
233<code>snprintf()</code>.</p>
234
235<p>The size of the node is proportional to the local count. The
236percentage displayed in the node corresponds to the count divided by
237the total run time of the program (that is, the cumulative count for
238<code>main()</code>).</p>
239
240<h3>Edge Information</h3>
241
242<p>An edge from one node to another indicates a caller to callee
243relationship. Each edge is labelled with the time spent by the callee
244on behalf of the caller. E.g, the edge from
245<code>test_main_thread()</code> to <code>snprintf()</code> indicates
246that of the 200 samples in <code>test_main_thread()</code>, 37 are
247because of calls to <code>snprintf()</code>.</p>
248
249<p>Note that <code>test_main_thread()</code> has an edge to
250<code>vsnprintf()</code>, even though <code>test_main_thread()</code>
251doesn't call that function directly. This is because the code was
252compiled with <code>-O2</code>; the profile reflects the optimized
253control flow.</p>
254
255<h3>Meta Information</h3>
256
257<p>The top of the display should contain some meta information
258like:</p>
259<pre>
260 /tmp/profiler2_unittest
261 Total samples: 202
262 Focusing on: 202
263 Dropped nodes with &lt;= 1 abs(samples)
264 Dropped edges with &lt;= 0 samples
265</pre>
266
267<p>This section contains the name of the program, and the total
268samples collected during the profiling run. If the
269<code>--focus</code> option is on (see the <a href="#focus">Focus</a>
270section below), the legend also contains the number of samples being
271shown in the focused display. Furthermore, some unimportant nodes and
272edges are dropped to reduce clutter. The characteristics of the
273dropped nodes and edges are also displayed in the legend.</p>
274
275<h3><a name=focus>Focus and Ignore</a></h3>
276
277<p>You can ask pprof to generate a display focused on a particular
278piece of the program. You specify a regular expression. Any portion
279of the call-graph that is on a path which contains at least one node
280matching the regular expression is preserved. The rest of the
281call-graph is dropped on the floor. For example, you can focus on the
282<code>vsnprintf()</code> libc call in <code>profiler2_unittest</code>
283as follows:</p>
284
285<pre>
286% pprof --gv --focus=vsnprintf /tmp/profiler2_unittest test.prof
287</pre>
288<A HREF="pprof-vsnprintf-big.gif">
289<center><table><tr><td>
290 <img src="pprof-vsnprintf.gif">
291</td></tr></table></center>
292</A>
293
294<p>Similarly, you can supply the <code>--ignore</code> option to
295ignore samples that match a specified regular expression. E.g., if
296you are interested in everything except calls to
297<code>snprintf()</code>, you can say:</p>
298<pre>
299% pprof --gv --ignore=snprintf /tmp/profiler2_unittest test.prof
300</pre>
301
302
303<h3>Interactive mode</a></h3>
304
305<p>By default -- if you don't specify any flags to the contrary --
306pprof runs in interactive mode. At the <code>(pprof)</code> prompt,
307you can run many of the commands described above. You can type
308<code>help</code> for a list of what commands are available in
309interactive mode.</p>
310
311<h3><a name=options>pprof Options</a></h3>
312
313For a complete list of pprof options, you can run <code>pprof
314--help</code>.
315
316<h4>Output Type</h4>
317
318<p>
319<center>
320<table frame=box rules=sides cellpadding=5 width=100%>
321<tr valign=top>
322 <td><code>--text</code></td>
323 <td>
324 Produces a textual listing. (Note: If you have an X display, and
325 <code>dot</code> and <code>gv</code> installed, you will probably
326 be happier with the <code>--gv</code> output.)
327 </td>
328</tr>
329<tr valign=top>
330 <td><code>--gv</code></td>
331 <td>
332 Generates annotated call-graph, converts to postscript, and
333 displays via gv (requres <code>dot</code> and <code>gv</code> be
334 installed).
335 </td>
336</tr>
337<tr valign=top>
338 <td><code>--dot</code></td>
339 <td>
340 Generates the annotated call-graph in dot format and
341 emits to stdout (requres <code>dot</code> be installed).
342 </td>
343</tr>
344<tr valign=top>
345 <td><code>--ps</code></td>
346 <td>
347 Generates the annotated call-graph in Postscript format and
348 emits to stdout (requres <code>dot</code> be installed).
349 </td>
350</tr>
351<tr valign=top>
352 <td><code>--pdf</code></td>
353 <td>
354 Generates the annotated call-graph in PDF format and emits to
355 stdout (requires <code>dot</code> and <code>ps2pdf</code> be
356 installed).
357 </td>
358</tr>
359<tr valign=top>
360 <td><code>--gif</code></td>
361 <td>
362 Generates the annotated call-graph in GIF format and
363 emits to stdout (requres <code>dot</code> be installed).
364 </td>
365</tr>
366<tr valign=top>
367 <td><code>--list=&lt;<i>regexp</i>&gt;</code></td>
368 <td>
369 <p>Outputs source-code listing of routines whose
370 name matches &lt;regexp&gt;. Each line
371 in the listing is annotated with flat and cumulative
372 sample counts.</p>
373
374 <p>In the presence of inlined calls, the samples
375 associated with inlined code tend to get assigned
376 to a line that follows the location of the
377 inlined call. A more precise accounting can be
378 obtained by disassembling the routine using the
379 --disasm flag.</p>
380 </td>
381</tr>
382<tr valign=top>
383 <td><code>--disasm=&lt;<i>regexp</i>&gt;</code></td>
384 <td>
385 Generates disassembly of routines that match
386 &lt;regexp&gt;, annotated with flat and
387 cumulative sample counts and emits to stdout.
388 </td>
389</tr>
390</table>
391</center>
392
393<h4>Reporting Granularity</h4>
394
395<p>By default, pprof produces one entry per procedure. However you can
396use one of the following options to change the granularity of the
397output. The <code>--files</code> option seems to be particularly
398useless, and may be removed eventually.</p>
399
400<center>
401<table frame=box rules=sides cellpadding=5 width=100%>
402<tr valign=top>
403 <td><code>--addresses</code></td>
404 <td>
405 Produce one node per program address.
406 </td>
407</tr>
408 <td><code>--lines</code></td>
409 <td>
410 Produce one node per source line.
411 </td>
412</tr>
413 <td><code>--functions</code></td>
414 <td>
415 Produce one node per function (this is the default).
416 </td>
417</tr>
418 <td><code>--files</code></td>
419 <td>
420 Produce one node per source file.
421 </td>
422</tr>
423</table>
424</center>
425
426<h4>Controlling the Call Graph Display</h4>
427
428<p>Some nodes and edges are dropped to reduce clutter in the output
429display. The following options control this effect:</p>
430
431<center>
432<table frame=box rules=sides cellpadding=5 width=100%>
433<tr valign=top>
434 <td><code>--nodecount=&lt;n&gt;</code></td>
435 <td>
436 This option controls the number of displayed nodes. The nodes
437 are first sorted by decreasing cumulative count, and then only
438 the top N nodes are kept. The default value is 80.
439 </td>
440</tr>
441<tr valign=top>
442 <td><code>--nodefraction=&lt;f&gt;</code></td>
443 <td>
444 This option provides another mechanism for discarding nodes
445 from the display. If the cumulative count for a node is
446 less than this option's value multiplied by the total count
447 for the profile, the node is dropped. The default value
448 is 0.005; i.e. nodes that account for less than
449 half a percent of the total time are dropped. A node
450 is dropped if either this condition is satisfied, or the
451 --nodecount condition is satisfied.
452 </td>
453</tr>
454<tr valign=top>
455 <td><code>--edgefraction=&lt;f&gt;</code></td>
456 <td>
457 This option controls the number of displayed edges. First of all,
458 an edge is dropped if either its source or destination node is
459 dropped. Otherwise, the edge is dropped if the sample
460 count along the edge is less than this option's value multiplied
461 by the total count for the profile. The default value is
462 0.001; i.e., edges that account for less than
463 0.1% of the total time are dropped.
464 </td>
465</tr>
466<tr valign=top>
467 <td><code>--focus=&lt;re&gt;</code></td>
468 <td>
469 This option controls what region of the graph is displayed
470 based on the regular expression supplied with the option.
471 For any path in the callgraph, we check all nodes in the path
472 against the supplied regular expression. If none of the nodes
473 match, the path is dropped from the output.
474 </td>
475</tr>
476<tr valign=top>
477 <td><code>--ignore=&lt;re&gt;</code></td>
478 <td>
479 This option controls what region of the graph is displayed
480 based on the regular expression supplied with the option.
481 For any path in the callgraph, we check all nodes in the path
482 against the supplied regular expression. If any of the nodes
483 match, the path is dropped from the output.
484 </td>
485</tr>
486</table>
487</center>
488
489<p>The dropped edges and nodes account for some count mismatches in
490the display. For example, the cumulative count for
491<code>snprintf()</code> in the first diagram above was 41. However
492the local count (1) and the count along the outgoing edges (12+1+20+6)
493add up to only 40.</p>
494
495
496<h1>Caveats</h1>
497
498<ul>
499 <li> If the program exits because of a signal, the generated profile
500 will be <font color=red>incomplete, and may perhaps be
501 completely empty</font>.
502 <li> The displayed graph may have disconnected regions because
503 of the edge-dropping heuristics described above.
504 <li> If the program linked in a library that was not compiled
505 with enough symbolic information, all samples associated
506 with the library may be charged to the last symbol found
507 in the program before the library. This will artificially
508 inflate the count for that symbol.
509 <li> If you run the program on one machine, and profile it on
510 another, and the shared libraries are different on the two
511 machines, the profiling output may be confusing: samples that
512 fall within shared libaries may be assigned to arbitrary
513 procedures.
514 <li> If your program forks, the children will also be profiled
515 (since they inherit the same CPUPROFILE setting). Each process
516 is profiled separately; to distinguish the child profiles from
517 the parent profile and from each other, all children will have
518 their process-id appended to the CPUPROFILE name.
519 <li> Due to a hack we make to work around a possible gcc bug, your
520 profiles may end up named strangely if the first character of
521 your CPUPROFILE variable has ascii value greater than 127.
522 This should be exceedingly rare, but if you need to use such a
523 name, just set prepend <code>./</code> to your filename:
524 <code>CPUPROFILE=./&Auml;gypten</code>.
525</ul>
526
527
528<hr>
529<address>Sanjay Ghemawat<br>
530<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
531<!-- hhmts start -->
532Last modified: Fri May 9 14:41:29 PDT 2008
533<!-- hhmts end -->
534</address>
535</BODY>
536</HTML>