<!doctype html public "-//w3c//dtd html 4.01 transitional//en">
<html>
<head>
<title>TCMalloc : Thread-Caching Malloc</title>
<link rel="stylesheet" href="designstyle.css">
<style type="text/css">
  em {
    color: red;
    font-style: normal;
  }
</style>
</head>
<body>

<h1>TCMalloc : Thread-Caching Malloc</h1>

<address>Sanjay Ghemawat</address>

<h2><A name=motivation>Motivation</A></h2>

21
22<p>TCMalloc is faster than the glibc 2.3 malloc (available as a
23separate library called ptmalloc2) and other mallocs that I have
24tested. ptmalloc2 takes approximately 300 nanoseconds to execute a
25malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc
26implementation takes approximately 50 nanoseconds for the same
27operation pair. Speed is important for a malloc implementation
28because if malloc is not fast enough, application writers are inclined
29to write their own custom free lists on top of malloc. This can lead
30to extra complexity, and more memory usage unless the application
31writer is very careful to appropriately size the free lists and
32scavenge idle objects out of the free list.</p>
33
34<p>TCMalloc also reduces lock contention for multi-threaded programs.
35For small objects, there is virtually zero contention. For large
36objects, TCMalloc tries to use fine grained and efficient spinlocks.
37ptmalloc2 also reduces lock contention by using per-thread arenas but
38there is a big problem with ptmalloc2's use of per-thread arenas. In
39ptmalloc2 memory can never move from one arena to another. This can
40lead to huge amounts of wasted space. For example, in one Google
41application, the first phase would allocate approximately 300MB of
42memory for its URL canonicalization data structures. When the first
43phase finished, a second phase would be started in the same address
44space. If this second phase was assigned a different arena than the
45one used by the first phase, this phase would not reuse any of the
46memory left after the first phase and would add another 300MB to the
47address space. Similar memory blowup problems were also noticed in
48other applications.</p>
49
50<p>Another benefit of TCMalloc is space-efficient representation of
51small objects. For example, N 8-byte objects can be allocated while
52using space approximately <code>8N * 1.01</code> bytes. I.e., a
53one-percent space overhead. ptmalloc2 uses a four-byte header for
54each object and (I think) rounds up the size to a multiple of 8 bytes
55and ends up using <code>16N</code> bytes.</p>
56
57
<h2><A NAME="Usage">Usage</A></h2>

<p>To use TCMalloc, just link TCMalloc into your application via the
"-ltcmalloc" linker flag.</p>

<p>You can use TCMalloc in applications you didn't compile yourself,
by using LD_PRELOAD:</p>
<pre>
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p>LD_PRELOAD is tricky, and we don't necessarily recommend this mode
of usage.</p>

<p>TCMalloc includes a <A HREF="heap_checker.html">heap checker</A>
and <A HREF="heapprofile.html">heap profiler</A> as well.</p>

<p>If you'd rather link in a version of TCMalloc that does not include
the heap profiler and checker (perhaps to reduce binary size for a
static binary), you can link in <code>libtcmalloc_minimal</code>
instead.</p>

<h2><A NAME="Overview">Overview</A></h2>

<p>TCMalloc assigns each thread a thread-local cache. Small
allocations are satisfied from the thread-local cache. Objects are
moved from central data structures into a thread-local cache as
needed, and periodic garbage collections are used to migrate memory
back from a thread-local cache into the central data structures.</p>
<center><img src="overview.gif"></center>

<p>TCMalloc treats objects with size &lt;= 256K ("small" objects)
differently from larger objects. Large objects are allocated directly
from the central heap using a page-level allocator (a page is an 8K
aligned region of memory). I.e., a large object is always
page-aligned and occupies an integral number of pages.</p>

<p>A run of pages can be carved up into a sequence of small objects,
each equally sized. For example, a run of one page (8K) can be carved
up into 64 objects of size 128 bytes each.</p>


<h2><A NAME="Small_Object_Allocation">Small Object Allocation</A></h2>

<p>Each small object size maps to one of approximately 88 allocatable
size-classes. For example, all allocations in the range 961 to 1024
bytes are rounded up to 1024. The size-classes are spaced so that
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even
larger sizes by 32 bytes, and so forth. The maximal spacing is
controlled so that not too much space is wasted when an allocation
request falls just past the end of a size class and has to be rounded
up to the next class.</p>
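<p>The rounding can be sketched as follows. The band boundaries and
spacings below are invented for illustration (chosen only so that the
961-to-1024 example above holds); the real TCMalloc size-class table is
generated differently:</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;

// Hypothetical spacing bands: small sizes separated by 8 bytes,
// larger sizes by 16, 32, 64 bytes, and so forth. These exact
// breakpoints are an assumption for the sketch.
size_t SpacingFor(size_t size) {
  if (size &lt;= 128) return 8;
  if (size &lt;= 256) return 16;
  if (size &lt;= 512) return 32;
  return 64;  // up through 1K in this sketch
}

// Round a request up to the next size-class boundary in its band.
size_t RoundToSizeClass(size_t size) {
  size_t spacing = SpacingFor(size);
  return (size + spacing - 1) / spacing * spacing;
}
</pre>
<p>With these bands, a 961-byte request rounds up to 1024 and a 9-byte
request rounds up to 16.</p>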

<p>A thread cache contains a singly linked list of free objects per
size-class.</p>
<center><img src="threadheap.gif"></center>
<p>When allocating a small object: (1) We map its size to the
corresponding size-class. (2) We look in the corresponding free list
in the thread cache for the current thread. (3) If the free list is
not empty, we remove the first object from the list and return it.
When following this fast path, TCMalloc acquires no locks at all.
This speeds up allocation significantly, because a lock/unlock pair
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</p>
<p>If the free list is empty: (1) We fetch a bunch of objects from a
central free list for this size-class (the central free list is shared
by all threads). (2) Place them in the thread-local free list. (3)
Return one of the newly fetched objects to the application.</p>

<p>If the central free list is also empty: (1) We allocate a run of
pages from the central page allocator. (2) Split the run into a set
of objects of this size-class. (3) Place the new objects on the
central free list. (4) As before, move some of these objects to the
thread-local free list.</p>
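<p>The refill path above can be sketched for a single size-class.
<code>ThreadCache</code>, <code>CentralList</code>, and
<code>kBatch</code> are illustrative names, not TCMalloc's own; the
sketch ignores locking and deliberately never frees its "page runs":</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;deque&gt;

// Stand-in for the central free list of one size-class.
struct CentralList {
  std::deque&lt;void*&gt; objects;  // objects carved from page runs
  int PageRunsAllocated = 0;

  // When empty, carve a fresh "page run" into objects (leaked here;
  // this is only a sketch of the control flow).
  void Refill(size_t count) {
    ++PageRunsAllocated;
    for (size_t i = 0; i &lt; count; ++i) objects.push_back(new char[64]);
  }
};

struct ThreadCache {
  std::deque&lt;void*&gt; free_list;         // one size-class only, for brevity
  static constexpr size_t kBatch = 4;  // stands in for num_objects_to_move

  void* Allocate(CentralList&amp; central) {
    if (free_list.empty()) {           // slow path: fetch a batch
      if (central.objects.size() &lt; kBatch) central.Refill(32);
      for (size_t i = 0; i &lt; kBatch; ++i) {
        free_list.push_back(central.objects.front());
        central.objects.pop_front();
      }
    }
    void* obj = free_list.front();     // fast path: pop the first object
    free_list.pop_front();
    return obj;
  }
};
</pre>
<p>A first allocation takes the slow path (one batch fetched from the
central list, which in turn carves one page run); subsequent
allocations pop from the thread-local list with no central access.</p>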

<h3><A NAME="Sizing_Thread_Cache_Free_Lists">
  Sizing Thread Cache Free Lists</A></h3>

<p>It is important to size the thread cache free lists correctly. If
the free list is too small, we'll need to go to the central free list
too often. If the free list is too big, we'll waste memory as objects
sit idle in the free list.</p>

<p>Note that the thread caches are just as important for deallocation
as they are for allocation. Without a cache, each deallocation would
require moving the memory to the central free list. Also, some threads
have asymmetric alloc/free behavior (e.g. producer and consumer threads),
so sizing the free list correctly gets trickier.</p>

<p>To size the free lists appropriately, we use a slow-start algorithm
to determine the maximum length of each individual free list. As the
free list is used more frequently, its maximum length grows. However,
if a free list is used more for deallocation than allocation, its
maximum length will grow only up to a point where the whole list can
be efficiently moved to the central free list at once.</p>
<p>The pseudo-code below illustrates this slow-start algorithm. Note
that <code>num_objects_to_move</code> is specific to each size class.
By moving a list of objects with a well-known length, the central
cache can efficiently pass these lists between thread caches. If
a thread cache wants fewer than <code>num_objects_to_move</code>,
the operation on the central free list has linear time complexity.
The downside of always using <code>num_objects_to_move</code> as
the number of objects to transfer to and from the central cache is
that it wastes memory in threads that don't need all of those
objects.</p>
<pre>
Start each freelist max_length at 1.

Allocation
  if freelist empty {
    fetch min(max_length, num_objects_to_move) from central list;
    if max_length &lt; num_objects_to_move {  // slow-start
      max_length++;
    } else {
      max_length += num_objects_to_move;
    }
  }

Deallocation
  if length &gt; max_length {
    // Don't try to release num_objects_to_move if we don't have that many.
    release min(max_length, num_objects_to_move) objects to central list
    if max_length &lt; num_objects_to_move {
      // Slow-start up to num_objects_to_move.
      max_length++;
    } else if max_length &gt; num_objects_to_move {
      // If we consistently go over max_length, shrink max_length.
      overages++;
      if overages &gt; kMaxOverages {
        max_length -= num_objects_to_move;
        overages = 0;
      }
    }
  }
</pre>

<p>See also the section on <a href="#Garbage_Collection">Garbage
Collection</a> for how it affects <code>max_length</code>.</p>

<h2><A NAME="Medium_Object_Allocation">Medium Object Allocation</A></h2>

<p>A medium object size (256K &lt; size &le; 1MB) is rounded up to a
multiple of the page size (8K) and is handled by a central page heap.
The central page heap includes an array of 128 free lists. The
<code>k</code>th entry is a free list of runs that consist of
<code>k + 1</code> pages:</p>
<center><img src="pageheap.gif"></center>

<p>An allocation for <code>k</code> pages is satisfied by looking in
the <code>k</code>th free list. If that free list is empty, we look
in the next free list, and so forth. If no medium-object free list
can satisfy the allocation, the allocation is treated as a large
object.</p>

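<p>The search over the free-list array can be sketched like this, with
a plain array of run counts standing in for the real free lists
(<code>FindFreeList</code> is an illustrative name):</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;

// Entry k of the 128-entry array holds runs of (k + 1) pages. Returns
// the index of the first non-empty list whose runs are big enough for
// a request of `pages` pages, or -1 to signal fallback to the
// large-object path. `list_sizes` counts the runs on each list.
int FindFreeList(const size_t (&amp;list_sizes)[128], size_t pages) {
  for (size_t k = pages - 1; k &lt; 128; ++k)  // entry k holds (k+1)-page runs
    if (list_sizes[k] &gt; 0) return static_cast&lt;int&gt;(k);
  return -1;  // no medium list can satisfy it: treat as a large object
}
</pre>
<p>For example, a 5-page request first checks entry 4 (5-page runs) and
walks upward until it finds a non-empty list.</p>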

<h2><A NAME="Large_Object_Allocation">Large Object Allocation</A></h2>

<p>Allocations of 1MB or more are considered large allocations. Spans
of free memory which can satisfy these allocations are tracked in
a red-black tree sorted by size. Allocations follow the <em>best-fit</em>
algorithm: the tree is searched to find the smallest span of free
space which is larger than the requested allocation. The allocation
is carved out of that span, and the remaining space is reinserted
either into the large object tree or possibly into one of the smaller
free lists as appropriate.</p>

<p>If no span of free memory is located that can fit the requested
allocation, we fetch memory from the system (using <code>sbrk</code>,
<code>mmap</code>, or by mapping in portions of
<code>/dev/mem</code>).</p>

<p>If an allocation for <code>k</code> pages is satisfied by a run
of pages of length &gt; <code>k</code>, the remainder of the
run is re-inserted back into the appropriate free list in the
page heap.</p>


<h2><A NAME="Spans">Spans</A></h2>

<p>The heap managed by TCMalloc consists of a set of pages. A run of
contiguous pages is represented by a <code>Span</code> object. A span
can either be <em>allocated</em> or <em>free</em>. If free, the span
is one of the entries in a page heap linked-list. If allocated, it is
either a large object that has been handed off to the application, or
a run of pages that have been split up into a sequence of small
objects. If split into small objects, the size-class of the objects
is recorded in the span.</p>

<p>A central array indexed by page number can be used to find the span to
which a page belongs. For example, span <em>a</em> below occupies 2
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5
pages and span <em>d</em> occupies 3 pages.</p>
<center><img src="spanmap.gif"></center>

<p>In a 32-bit address space, the central array is represented by a
2-level radix tree where the root contains 32 entries and each leaf
contains 2^14 entries (a 32-bit address space has 2^19 8K pages, and
the first level of the tree divides the 2^19 pages by 2^5). This leads
to a starting memory usage of 64KB (2^14 entries * 4 bytes) for the
central array, which seems acceptable.</p>

<p>On 64-bit machines, we use a 3-level radix tree.</p>
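<p>A minimal sketch of the 2-level lookup, splitting the 19-bit page
number into 5 root bits and 14 leaf bits; <code>PageMap2</code> is an
illustrative name, the span is reduced to an opaque pointer, and leaves
are allocated lazily (and leaked, in this sketch):</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstdint&gt;

struct PageMap2 {
  static constexpr int kLeafBits = 14;
  static constexpr int kRootBits = 5;
  void** root_[1 &lt;&lt; kRootBits] = {};  // 32 entries; leaves made on demand

  void Set(uint32_t page, void* span) {
    uint32_t i = page &gt;&gt; kLeafBits;               // top 5 bits
    uint32_t j = page &amp; ((1 &lt;&lt; kLeafBits) - 1);   // low 14 bits
    if (root_[i] == nullptr) root_[i] = new void*[1 &lt;&lt; kLeafBits]();
    root_[i][j] = span;
  }

  void* Get(uint32_t page) const {
    uint32_t i = page &gt;&gt; kLeafBits;
    if (root_[i] == nullptr) return nullptr;      // no span recorded here
    return root_[i][page &amp; ((1 &lt;&lt; kLeafBits) - 1)];
  }
};
</pre>
<p>Only leaves that actually hold mappings are allocated, which is why
the starting memory usage is a single 64KB leaf rather than the full
2^19-entry flat array.</p>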

<h2><A NAME="Deallocation">Deallocation</A></h2>

<p>When an object is deallocated, we compute its page number and look
it up in the central array to find the corresponding span object. The
span tells us whether or not the object is small, and its size-class
if it is small. If the object is small, we insert it into the
appropriate free list in the current thread's thread cache. If the
thread cache now exceeds a predetermined size (2MB by default), we run
a garbage collector that moves unused objects from the thread cache
into central free lists.</p>

<p>If the object is large, the span tells us the range of pages covered
by the object. Suppose this range is <code>[p,q]</code>. We also
look up the spans for pages <code>p-1</code> and <code>q+1</code>. If
either of these neighboring spans is free, we coalesce it with the
<code>[p,q]</code> span. The resulting span is inserted into the
appropriate free list in the page heap.</p>
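<p>The coalescing step can be sketched with a map from starting page to
span metadata standing in for the real span and pagemap machinery
(<code>SpanInfo</code> and <code>FreeAndCoalesce</code> are
illustrative names):</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;iterator&gt;
#include &lt;map&gt;

struct SpanInfo { size_t start; size_t len; bool free; };

// Frees the span starting at page p, merging it with a free right
// neighbor (starting at p + len) and a free left neighbor (ending
// exactly at p). Returns the start page of the resulting free span.
size_t FreeAndCoalesce(std::map&lt;size_t, SpanInfo&gt;&amp; spans, size_t p) {
  SpanInfo cur = spans.at(p);
  cur.free = true;
  // Absorb a free right neighbor.
  auto right = spans.find(cur.start + cur.len);
  if (right != spans.end() &amp;&amp; right-&gt;second.free) {
    cur.len += right-&gt;second.len;
    spans.erase(right);
  }
  // Merge into a free left neighbor, if it ends exactly at p.
  auto it = spans.find(p);
  if (it != spans.begin()) {
    auto left = std::prev(it);
    if (left-&gt;second.free &amp;&amp; left-&gt;second.start + left-&gt;second.len == p) {
      left-&gt;second.len += cur.len;
      spans.erase(it);
      return left-&gt;second.start;
    }
  }
  it-&gt;second = cur;  // no left merge: keep the (possibly right-merged) span
  return cur.start;
}
</pre>
<p>Freeing a span flanked by two free neighbors collapses all three into
one span, which would then go onto the appropriate page-heap free list.</p>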

<h2>Central Free Lists for Small Objects</h2>

<p>As mentioned before, we keep a central free list for each
size-class. Each central free list is organized as a two-level data
structure: a set of spans, and a linked list of free objects per
span.</p>

<p>An object is allocated from a central free list by removing the
first entry from the linked list of some span. (If all spans have
empty linked lists, a suitably sized span is first allocated from the
central page heap.)</p>

<p>An object is returned to a central free list by adding it to the
linked list of its containing span. If the linked list length now
equals the total number of small objects in the span, this span is now
completely free and is returned to the page heap.</p>

<h2><A NAME="Garbage_Collection">Garbage Collection of Thread Caches</A></h2>

<p>Garbage collecting objects from a thread cache keeps the size of
the cache under control and returns unused objects to the central free
lists. Some threads need large caches to perform well while others
can get by with little or no cache at all. When a thread cache goes
over its <code>max_size</code>, garbage collection kicks in and then the
thread competes with the other threads for a larger cache.</p>

<p>Garbage collection is run only during a deallocation. We walk over
all free lists in the cache and move some number of objects from the
free list to the corresponding central list.</p>

<p>The number of objects to be moved from a free list is determined
using a per-list low-water-mark <code>L</code>. <code>L</code>
records the minimum length of the list since the last garbage
collection. Note that we could have shortened the list by
<code>L</code> objects at the last garbage collection without
requiring any extra accesses to the central list. We use this past
history as a predictor of future accesses and move <code>L/2</code>
objects from the thread cache free list to the corresponding central
free list. This algorithm has the nice property that if a thread
stops using a particular size, all objects of that size will quickly
move from the thread cache to the central free list where they can be
used by other threads.</p>
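<p>A sketch of the low-water-mark rule for one free list;
<code>FreeList</code> and <code>Scavenge</code> are illustrative names,
the central list is reduced to a deque, and the caller is assumed to
set the initial mark after filling the list:</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;deque&gt;

struct FreeList {
  std::deque&lt;void*&gt; objects;
  size_t low_water = 0;  // minimum length since the last collection

  void Push(void* obj) { objects.push_back(obj); }

  void* Pop() {
    void* obj = objects.front();
    objects.pop_front();
    if (objects.size() &lt; low_water) low_water = objects.size();
    return obj;
  }

  // Garbage collection: release low_water/2 objects to the central
  // list, then reset the mark to the current length.
  void Scavenge(std::deque&lt;void*&gt;&amp; central) {
    for (size_t i = 0; i &lt; low_water / 2; ++i) {
      central.push_back(objects.front());
      objects.pop_front();
    }
    low_water = objects.size();
  }
};
</pre>
<p>If a list of 8 objects dips to a minimum of 6 between collections,
the next collection releases 3 objects; a list the thread has stopped
using drains toward empty over successive collections.</p>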

<p>If a thread consistently deallocates more objects of a certain size
than it allocates, this <code>L/2</code> behavior will cause at least
<code>L/2</code> objects to always sit in the free list. To avoid
wasting memory this way, we shrink the maximum length of the freelist
to converge on <code>num_objects_to_move</code> (see also
<a href="#Sizing_Thread_Cache_Free_Lists">Sizing Thread Cache Free
Lists</a>).</p>

<pre>
Garbage Collection
  if (L != 0 &amp;&amp; max_length &gt; num_objects_to_move) {
    max_length = max(max_length - num_objects_to_move, num_objects_to_move)
  }
</pre>

<p>The fact that the thread cache went over its <code>max_size</code> is
an indication that the thread would benefit from a larger cache. Simply
increasing <code>max_size</code> would use an inordinate amount of memory
in programs that have lots of active threads. Developers can bound the
memory used with the flag
<code>--tcmalloc_max_total_thread_cache_bytes</code>.</p>

<p>Each thread cache starts with a small <code>max_size</code>
(e.g. 64KB) so that idle threads won't pre-allocate memory they don't
need. Each time the cache runs a garbage collection, it will also try
to grow its <code>max_size</code>. If the sum of the thread cache
sizes is less than <code>--tcmalloc_max_total_thread_cache_bytes</code>,
<code>max_size</code> grows easily. If not, thread cache 1 will try
to steal from thread cache 2 (picked round-robin) by decreasing thread
cache 2's <code>max_size</code>. In this way, threads that are more
active will steal memory from other threads more often than they have
memory stolen from them. Mostly idle threads end up with small caches
and active threads end up with big caches. Note that this stealing can
cause the sum of the thread cache sizes to be greater than
<code>--tcmalloc_max_total_thread_cache_bytes</code> until thread
cache 2 deallocates some memory to trigger a garbage collection.</p>
<h2><A NAME="performance">Performance Notes</A></h2>

<h3>PTMalloc2 unittest</h3>

<p>The PTMalloc2 package (now part of glibc) contains a unittest
program <code>t-test1.c</code>. This forks a number of threads and
performs a series of allocations and deallocations in each thread; the
threads do not communicate other than by synchronization in the memory
allocator.</p>

<p><code>t-test1</code> (included in
<code>tests/tcmalloc/</code>, and compiled as
<code>ptmalloc_unittest1</code>) was run with varying numbers of
threads (1-20) and maximum allocation sizes (64 bytes -
32Kbytes). These tests were run on a 2.4GHz dual Xeon system with
hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with
one million operations per thread in each test. In each case, the test
was run once normally, and once with
<code>LD_PRELOAD=libtcmalloc.so</code>.</p>
<p>The graphs below show the performance of TCMalloc vs PTMalloc2 for
several different metrics. Firstly, total operations (millions) per
elapsed second vs max allocation size, for varying numbers of
threads. The raw data used to generate these graphs (the output of the
<code>time</code> utility) is available in
<code>t-test1.times.txt</code>.</p>
<table>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.1.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.2.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.3.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.4.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.5.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.8.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.12.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.16.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.20.threads.png"></td>
</tr>
</table>


<ul>
 <li> TCMalloc is much more consistently scalable than PTMalloc2: for
      all thread counts &gt;1 it achieves ~7-9 million ops/sec for small
      allocations, falling to ~2 million ops/sec for larger
      allocations. The single-thread case is an obvious outlier,
      since it is only able to keep a single processor busy and hence
      can achieve fewer ops/sec. PTMalloc2 has a much higher variance
      in operations/sec, peaking somewhere around 4 million ops/sec
      for small allocations and falling to &lt;1 million ops/sec for
      larger allocations.

 <li> TCMalloc is faster than PTMalloc2 in the vast majority of
      cases, and particularly for small allocations. Contention
      between threads is less of a problem in TCMalloc.

 <li> TCMalloc's performance drops off as the allocation size
      increases. This is because the per-thread cache is
      garbage-collected when it hits a threshold (defaulting to
      2MB). With larger allocation sizes, fewer objects can be stored
      in the cache before it is garbage-collected.

 <li> There is a noticeable drop in TCMalloc's performance at ~32K
      maximum allocation size; at larger sizes performance drops less
      quickly. This is due to the 32K maximum size of objects in the
      per-thread caches; for objects larger than this TCMalloc
      allocates from the central page heap.
</ul>

<p>Next, operations (millions) per second of CPU time vs number of
threads, for max allocation size 64 bytes - 128 Kbytes.</p>

<table>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.64.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.256.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.1024.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.4096.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.8192.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.16384.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.32768.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.65536.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.131072.bytes.png"></td>
</tr>
</table>

<p>Here we see again that TCMalloc is both more consistent and more
efficient than PTMalloc2. For max allocation sizes &lt;32K, TCMalloc
typically achieves ~2-2.5 million ops per second of CPU time with a
large number of threads, whereas PTMalloc achieves generally 0.5-1
million ops per second of CPU time, with a lot of cases achieving much
less than this figure. Above 32K max allocation size, TCMalloc drops
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU
time is being burned spinning waiting for locks in the heavily
multi-threaded case).</p>


<H2><A NAME="runtime">Modifying Runtime Behavior</A></H2>

<p>You can more finely control the behavior of TCMalloc via
environment variables.</p>

<p>Generally useful flags:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SAMPLE_PARAMETER</code></td>
  <td>default: 0</td>
  <td>
    The approximate gap between sampling actions. That is, we
    take one sample approximately once every
    <code>tcmalloc_sample_parameter</code> bytes of allocation.
    This sampled heap information is available via
    <code>MallocExtension::GetHeapSample()</code> or
    <code>MallocExtension::ReadStackTraces()</code>. A reasonable
    value is 524288.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_RELEASE_RATE</code></td>
  <td>default: 1.0</td>
  <td>
    Rate at which we release unused memory to the system, via
    <code>madvise(MADV_DONTNEED)</code>, on systems that support
    it. Zero means we never release memory back to the system.
    Increase this flag to return memory faster; decrease it
    to return memory slower. Reasonable rates are in the
    range [0,10].
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD</code></td>
  <td>default: 1073741824</td>
  <td>
    Allocations larger than this value cause a stack trace to be
    dumped to stderr. The threshold for dumping stack traces is
    increased by a factor of 1.125 every time we print a message so
    that the threshold automatically goes up by a factor of ~1000
    every 60 messages. This bounds the amount of extra logging
    generated by this flag. The default value of this flag is very
    large, so you should see no extra logging unless the flag is
    overridden.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES</code></td>
  <td>default: 16777216</td>
  <td>
    Bound on the total amount of bytes allocated to thread caches. This
    bound is not strict, so it is possible for the cache to go over this
    bound in certain circumstances. This value defaults to 16MB. For
    applications with many threads, this may not be a large enough cache,
    which can affect performance. If you suspect your application is not
    scaling to many threads due to lock contention in TCMalloc, you can
    try increasing this value. This may improve performance, at a cost
    of extra memory use by TCMalloc. See <a href="#Garbage_Collection">
    Garbage Collection</a> for more details.
  </td>
</tr>

</table>

<p>Advanced "tweaking" flags that control more precisely how TCMalloc
tries to allocate memory from the kernel.</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_MMAP</code></td>
  <td>default: false</td>
  <td>
    If true, do not try to use <code>mmap</code> to obtain memory
    from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_SBRK</code></td>
  <td>default: false</td>
  <td>
    If true, do not try to use <code>sbrk</code> to obtain memory
    from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_START</code></td>
  <td>default: 0</td>
  <td>
    Physical memory starting location in MB for <code>/dev/mem</code>
    allocation. Setting this to 0 disables <code>/dev/mem</code>
    allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_LIMIT</code></td>
  <td>default: 0</td>
  <td>
    Physical memory limit location in MB for <code>/dev/mem</code>
    allocation. Setting this to 0 means no limit.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_DEVICE</code></td>
  <td>default: /dev/mem</td>
  <td>
    Device to use for allocating unmanaged memory.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MALLOC_PATH</code></td>
  <td>default: ""</td>
  <td>
    If set, specify a path where hugetlbfs or tmpfs is mounted.
    This may allow for speedier allocations.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_LIMIT_MB</code></td>
  <td>default: 0</td>
  <td>
    Limit total memfs allocation size to specified number of MB.
    0 means "no limit".
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_ABORT_ON_FAIL</code></td>
  <td>default: false</td>
  <td>
    If true, abort() whenever memfs_malloc fails to satisfy an allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_IGNORE_MMAP_FAIL</code></td>
  <td>default: false</td>
  <td>
    If true, ignore failures from mmap.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MAP_PRIVATE</code></td>
  <td>default: false</td>
  <td>
    If true, use MAP_PRIVATE when mapping via memfs, not MAP_SHARED.
  </td>
</tr>

</table>


<H2><A NAME="compiletime">Modifying Behavior In Code</A></H2>

<p>The <code>MallocExtension</code> class, in
<code>malloc_extension.h</code>, provides a few knobs that you can
tweak in your program, to affect TCMalloc's behavior.</p>

<h3>Releasing Memory Back to the System</h3>

<p>By default, TCMalloc will release no-longer-used memory back to the
kernel gradually, over time. The <a
href="#runtime">tcmalloc_release_rate</a> flag controls how quickly
this happens. You can also force a release at a given point in the
program's execution like so:</p>
<pre>
   MallocExtension::instance()->ReleaseFreeMemory();
</pre>

<p>You can also call <code>SetMemoryReleaseRate()</code> to change the
<code>tcmalloc_release_rate</code> value at runtime, or
<code>GetMemoryReleaseRate</code> to see what the current release rate
is.</p>

<h3>Memory Introspection</h3>

<p>There are several routines for getting a human-readable form of the
current memory usage:</p>
<pre>
   MallocExtension::instance()->GetStats(buffer, buffer_length);
   MallocExtension::instance()->GetHeapSample(&amp;string);
   MallocExtension::instance()->GetHeapGrowthStacks(&amp;string);
</pre>

<p>The last two create files in the same format as the heap-profiler,
and can be passed as data files to pprof. The first is human-readable
and is meant for debugging.</p>

<h3>Generic TCMalloc Status</h3>

<p>TCMalloc has support for setting and retrieving arbitrary
'properties':</p>
<pre>
   MallocExtension::instance()->SetNumericProperty(property_name, value);
   MallocExtension::instance()->GetNumericProperty(property_name, &amp;value);
</pre>

<p>It is possible for an application to set and get these properties,
but the most useful case is when a library sets the properties so the
application can read them. Here are the properties TCMalloc defines;
you can access them with a call like
<code>MallocExtension::instance()->GetNumericProperty("generic.heap_size",
&amp;value);</code>:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>generic.current_allocated_bytes</code></td>
  <td>
    Number of bytes used by the application. This will not typically
    match the memory use reported by the OS, because it does not
    include TCMalloc overhead or memory fragmentation.
  </td>
</tr>

<tr valign=top>
  <td><code>generic.heap_size</code></td>
  <td>
    Bytes of system memory reserved by TCMalloc.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_free_bytes</code></td>
  <td>
    Number of bytes in free, mapped pages in page heap. These bytes
    can be used to fulfill allocation requests. They always count
    towards virtual memory usage, and unless the underlying memory is
    swapped out by the OS, they also count towards physical memory
    usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_unmapped_bytes</code></td>
  <td>
    Number of bytes in free, unmapped pages in page heap. These are
    bytes that have been released back to the OS, possibly by one of
    the MallocExtension "Release" calls. They can be used to fulfill
    allocation requests, but typically incur a page fault. They
    always count towards virtual memory usage, and depending on the
    OS, typically do not count towards physical memory usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.slack_bytes</code></td>
  <td>
    Sum of pageheap_free_bytes and pageheap_unmapped_bytes. Provided
    for backwards compatibility only. Do not use.
  </td>
</tr>


<tr valign=top>
  <td><code>tcmalloc.max_total_thread_cache_bytes</code></td>
  <td>
    A limit to how much memory TCMalloc dedicates for small objects.
    Higher values trade more memory use for, in some situations,
    improved efficiency.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.current_total_thread_cache_bytes</code></td>
  <td>
    A measure of some of the memory TCMalloc is using (for
    small objects).
  </td>
</tr>

</table>

<h2><A NAME="caveats">Caveats</A></h2>

<p>For some systems, TCMalloc may not work correctly with
applications that aren't linked against <code>libpthread.so</code> (or
the equivalent on your OS). It should work on Linux using glibc 2.3,
but other OS/libc combinations have not been tested.</p>

<p>TCMalloc may be somewhat more memory hungry than other mallocs
(but tends not to have the huge blowups that can happen with other
mallocs). In particular, at startup TCMalloc allocates approximately
240KB of internal memory.</p>

<p>Don't try to load TCMalloc into a running binary (e.g., using JNI
in Java programs). The binary will have allocated some objects using
the system malloc, and may try to pass them to TCMalloc for
deallocation. TCMalloc will not be able to handle such objects.</p>

<hr>

<address>Sanjay Ghemawat, Paul Menage<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Sat Feb 24 13:11:38 PST 2007 (csilvers)
<!-- hhmts end -->
</address>

</body>
</html>