<!doctype html public "-//w3c//dtd html 4.01 transitional//en">
<html>
<head>
<title>TCMalloc : Thread-Caching Malloc</title>
<link rel="stylesheet" href="designstyle.css">
<style type="text/css">
  em {
    color: red;
    font-style: normal;
  }
</style>
</head>
<body>

<h1>TCMalloc : Thread-Caching Malloc</h1>

<address>Sanjay Ghemawat</address>

<h2><A name=motivation>Motivation</A></h2>

<p>TCMalloc is faster than the glibc 2.3 malloc (available as a
separate library called ptmalloc2) and other mallocs that I have
tested.  ptmalloc2 takes approximately 300 nanoseconds to execute a
malloc/free pair on a 2.8 GHz P4 (for small objects).  The TCMalloc
implementation takes approximately 50 nanoseconds for the same
operation pair.  Speed is important for a malloc implementation
because if malloc is not fast enough, application writers are inclined
to write their own custom free lists on top of malloc.  This can lead
to extra complexity and more memory usage unless the application
writer is very careful to appropriately size the free lists and
scavenge idle objects out of the free list.</p>

<p>TCMalloc also reduces lock contention for multi-threaded programs.
For small objects, there is virtually zero contention.  For large
objects, TCMalloc tries to use fine-grained and efficient spinlocks.
ptmalloc2 also reduces lock contention by using per-thread arenas, but
there is a big problem with ptmalloc2's use of per-thread arenas: in
ptmalloc2, memory can never move from one arena to another.  This can
lead to huge amounts of wasted space.  For example, in one Google
application, the first phase would allocate approximately 300MB of
memory for its URL canonicalization data structures.  When the first
phase finished, a second phase would be started in the same address
space.  If this second phase was assigned a different arena than the
one used by the first phase, this phase would not reuse any of the
memory left after the first phase and would add another 300MB to the
address space.  Similar memory blowup problems were also noticed in
other applications.</p>

<p>Another benefit of TCMalloc is space-efficient representation of
small objects.  For example, N 8-byte objects can be allocated while
using space of approximately <code>8N * 1.01</code> bytes, i.e., a
one-percent space overhead.  ptmalloc2 uses a four-byte header for
each object and (I think) rounds up the size to a multiple of 8 bytes
and ends up using <code>16N</code> bytes.</p>


<h2><A NAME="Usage">Usage</A></h2>

<p>To use TCMalloc, just link TCMalloc into your application via the
"-ltcmalloc" linker flag.</p>

<p>You can use TCMalloc in applications you didn't compile yourself,
by using LD_PRELOAD:</p>
<pre>
   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p>LD_PRELOAD is tricky, and we don't necessarily recommend this mode
of usage.</p>

<p>TCMalloc includes a <A HREF="heap_checker.html">heap checker</A>
and <A HREF="heapprofile.html">heap profiler</A> as well.</p>

<p>If you'd rather link in a version of TCMalloc that does not include
the heap profiler and checker (perhaps to reduce binary size for a
static binary), you can link in <code>libtcmalloc_minimal</code>
instead.</p>


<h2><A NAME="Overview">Overview</A></h2>

<p>TCMalloc assigns each thread a thread-local cache.  Small
allocations are satisfied from the thread-local cache.  Objects are
moved from central data structures into a thread-local cache as
needed, and periodic garbage collections are used to migrate memory
back from a thread-local cache into the central data structures.</p>
<center><img src="overview.gif"></center>

<p>TCMalloc treats objects with size &lt;= 256K ("small" objects)
differently from larger objects.  Large objects are allocated directly
from the central heap using a page-level allocator (a page is an 8K-aligned
region of memory).  I.e., a large object is always
page-aligned and occupies an integral number of pages.</p>

<p>A run of pages can be carved up into a sequence of small objects,
each equally sized.  For example, a run of one page (8K) can be carved
up into 64 objects of size 128 bytes each.</p>


<h2><A NAME="Small_Object_Allocation">Small Object Allocation</A></h2>

<p>Each small object size maps to one of approximately 88 allocatable
size-classes.  For example, all allocations in the range 961 to 1024
bytes are rounded up to 1024.  The size-classes are spaced so that
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even
larger sizes by 32 bytes, and so forth.  The maximal spacing is
controlled so that not too much space is wasted when an allocation
request falls just past the end of a size class and has to be rounded
up to the next class.</p>

<p>A thread cache contains a singly linked list of free objects per
size-class.</p>
<center><img src="threadheap.gif"></center>

<p>When allocating a small object: (1) We map its size to the
corresponding size-class.  (2) Look in the corresponding free list in
the thread cache for the current thread.  (3) If the free list is not
empty, we remove the first object from the list and return it.  When
following this fast path, TCMalloc acquires no locks at all.  This
speeds up allocation significantly because a lock/unlock pair
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</p>

<p>If the free list is empty: (1) We fetch a bunch of objects from a
central free list for this size-class (the central free list is shared
by all threads).  (2) Place them in the thread-local free list.  (3)
Return one of the newly fetched objects to the application.</p>

<p>If the central free list is also empty: (1) We allocate a run of
pages from the central page allocator.  (2) Split the run into a set
of objects of this size-class.  (3) Place the new objects on the
central free list.  (4) As before, move some of these objects to the
thread-local free list.</p>

<h3><A NAME="Sizing_Thread_Cache_Free_Lists">
Sizing Thread Cache Free Lists</A></h3>

<p>It is important to size the thread cache free lists correctly.  If
the free list is too small, we'll need to go to the central free list
too often.  If the free list is too big, we'll waste memory as objects
sit idle in the free list.</p>

<p>Note that the thread caches are just as important for deallocation
as they are for allocation.  Without a cache, each deallocation would
require moving the memory to the central free list.  Also, some threads
have asymmetric alloc/free behavior (e.g. producer and consumer threads),
so sizing the free list correctly gets trickier.</p>

<p>To size the free lists appropriately, we use a slow-start algorithm
to determine the maximum length of each individual free list.  As the
free list is used more frequently, its maximum length grows.  However,
if a free list is used more for deallocation than allocation, its
maximum length will grow only up to a point where the whole list can
be efficiently moved to the central free list at once.</p>

<p>The pseudo-code below illustrates this slow-start algorithm.  Note
that <code>num_objects_to_move</code> is specific to each size class.
By moving a list of objects with a well-known length, the central
cache can efficiently pass these lists between thread caches.  If
a thread cache wants fewer than <code>num_objects_to_move</code>,
the operation on the central free list has linear time complexity.
The downside of always using <code>num_objects_to_move</code> as
the number of objects to transfer to and from the central cache is
that it wastes memory in threads that don't need all of those
objects.</p>

<pre>
Start each freelist max_length at 1.

Allocation
  if freelist empty {
    fetch min(max_length, num_objects_to_move) from central list;
    if max_length &lt; num_objects_to_move {  // slow-start
      max_length++;
    } else {
      max_length += num_objects_to_move;
    }
  }

Deallocation
  if length &gt; max_length {
    // Don't try to release num_objects_to_move if we don't have that many.
    release min(max_length, num_objects_to_move) objects to central list
    if max_length &lt; num_objects_to_move {
      // Slow-start up to num_objects_to_move.
      max_length++;
    } else if max_length &gt; num_objects_to_move {
      // If we consistently go over max_length, shrink max_length.
      overages++;
      if overages &gt; kMaxOverages {
        max_length -= num_objects_to_move;
        overages = 0;
      }
    }
  }
</pre>

<p>See also the section on <a href="#Garbage_Collection">Garbage Collection</a>
to see how it affects <code>max_length</code>.</p>

<h2><A NAME="Large_Object_Allocation">Large Object Allocation</A></h2>

<p>A large object size (&gt; 256K) is rounded up to a page size (8K)
and is handled by a central page heap.  The central page heap is again
an array of free lists.  For <code>k &lt; 128</code>, the
<code>k</code>th entry is a free list of runs that consist of
<code>k</code> pages.  The <code>128</code>th entry is a free list of
runs that have length <code>&gt;= 128</code> pages: </p>
<center><img src="pageheap.gif"></center>

<p>An allocation for <code>k</code> pages is satisfied by looking in
the <code>k</code>th free list.  If that free list is empty, we look
in the next free list, and so forth.  Eventually, we look in the last
free list if necessary.  If that fails, we fetch memory from the
system (using <code>sbrk</code>, <code>mmap</code>, or by mapping in
portions of <code>/dev/mem</code>).</p>

<p>If an allocation for <code>k</code> pages is satisfied by a run
of pages of length &gt; <code>k</code>, the remainder of the
run is re-inserted into the appropriate free list in the
page heap.</p>


<h2><A NAME="Spans">Spans</A></h2>

<p>The heap managed by TCMalloc consists of a set of pages.  A run of
contiguous pages is represented by a <code>Span</code> object.  A span
can either be <em>allocated</em> or <em>free</em>.  If free, the span
is one of the entries in a page heap linked-list.  If allocated, it is
either a large object that has been handed off to the application, or
a run of pages that have been split up into a sequence of small
objects.  If split into small objects, the size-class of the objects
is recorded in the span.</p>

<p>A central array indexed by page number can be used to find the span to
which a page belongs.  For example, span <em>a</em> below occupies 2
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5
pages and span <em>d</em> occupies 3 pages.</p>
<center><img src="spanmap.gif"></center>

<p>In a 32-bit address space, the central array is represented by a
2-level radix tree where the root contains 32 entries and each leaf
contains 2^14 entries (a 32-bit address space has 2^19 8K pages, and
the first level of the tree divides the 2^19 pages by 2^5).  This leads to
a starting memory usage of 64KB of space (2^14*4 bytes) for the
central array, which seems acceptable.</p>

<p>On 64-bit machines, we use a 3-level radix tree.</p>


<h2><A NAME="Deallocation">Deallocation</A></h2>

<p>When an object is deallocated, we compute its page number and look
it up in the central array to find the corresponding span object.  The
span tells us whether or not the object is small, and its size-class
if it is small.  If the object is small, we insert it into the
appropriate free list in the current thread's thread cache.  If the
thread cache now exceeds a predetermined size (2MB by default), we run
a garbage collector that moves unused objects from the thread cache
into central free lists.</p>

<p>If the object is large, the span tells us the range of pages covered
by the object.  Suppose this range is <code>[p,q]</code>.  We also
look up the spans for pages <code>p-1</code> and <code>q+1</code>.  If
either of these neighboring spans is free, we coalesce it with the
<code>[p,q]</code> span.  The resulting span is inserted into the
appropriate free list in the page heap.</p>


<h2>Central Free Lists for Small Objects</h2>

<p>As mentioned before, we keep a central free list for each
size-class.  Each central free list is organized as a two-level data
structure: a set of spans, and a linked list of free objects per
span.</p>

<p>An object is allocated from a central free list by removing the
first entry from the linked list of some span.  (If all spans have
empty linked lists, a suitably sized span is first allocated from the
central page heap.)</p>

<p>An object is returned to a central free list by adding it to the
linked list of its containing span.  If the linked list length now
equals the total number of small objects in the span, this span is now
completely free and is returned to the page heap.</p>


<h2><A NAME="Garbage_Collection">Garbage Collection of Thread Caches</A></h2>

<p>Garbage collecting objects from a thread cache keeps the size of
the cache under control and returns unused objects to the central free
lists.  Some threads need large caches to perform well while others
can get by with little or no cache at all.  When a thread cache goes
over its <code>max_size</code>, garbage collection kicks in and then the
thread competes with the other threads for a larger cache.</p>

<p>Garbage collection is run only during a deallocation.  We walk over
all free lists in the cache and move some number of objects from the
free list to the corresponding central list.</p>

<p>The number of objects to be moved from a free list is determined
using a per-list low-water-mark <code>L</code>.  <code>L</code>
records the minimum length of the list since the last garbage
collection.  Note that we could have shortened the list by
<code>L</code> objects at the last garbage collection without
requiring any extra accesses to the central list.  We use this past
history as a predictor of future accesses and move <code>L/2</code>
objects from the thread cache free list to the corresponding central
free list.  This algorithm has the nice property that if a thread
stops using a particular size, all objects of that size will quickly
move from the thread cache to the central free list where they can be
used by other threads.</p>

<p>If a thread consistently deallocates more objects of a certain size
than it allocates, this <code>L/2</code> behavior will cause at least
<code>L/2</code> objects to always sit in the free list.  To avoid
wasting memory this way, we shrink the maximum length of the freelist
to converge on <code>num_objects_to_move</code> (see also
<a href="#Sizing_Thread_Cache_Free_Lists">Sizing Thread Cache Free Lists</a>).</p>

<pre>
Garbage Collection
  if (L != 0 &amp;&amp; max_length &gt; num_objects_to_move) {
    max_length = max(max_length - num_objects_to_move, num_objects_to_move)
  }
</pre>

<p>The fact that the thread cache went over its <code>max_size</code> is
an indication that the thread would benefit from a larger cache.  Simply
increasing <code>max_size</code> would use an inordinate amount of memory
in programs that have lots of active threads.  Developers can bound the
memory used with the flag <code>--tcmalloc_max_total_thread_cache_bytes</code>.</p>

<p>Each thread cache starts with a small <code>max_size</code>
(e.g. 64KB) so that idle threads won't pre-allocate memory they don't
need.  Each time the cache runs a garbage collection, it will also try
to grow its <code>max_size</code>.  If the sum of the thread cache
sizes is less than <code>--tcmalloc_max_total_thread_cache_bytes</code>,
<code>max_size</code> grows easily.  If not, thread cache 1 will try
to steal from thread cache 2 (picked round-robin) by decreasing thread
cache 2's <code>max_size</code>.  In this way, threads that are more
active will steal memory from other threads more often than they
have memory stolen from themselves.  Mostly idle threads end up with
small caches and active threads end up with big caches.  Note that
this stealing can cause the sum of the thread cache sizes to be
greater than <code>--tcmalloc_max_total_thread_cache_bytes</code> until thread
cache 2 deallocates some memory to trigger a garbage collection.</p>

<h2><A NAME="performance">Performance Notes</A></h2>

<h3>PTMalloc2 unittest</h3>

<p>The PTMalloc2 package (now part of glibc) contains a unittest
program <code>t-test1.c</code>.  This forks a number of threads and
performs a series of allocations and deallocations in each thread; the
threads do not communicate other than by synchronization in the memory
allocator.</p>

<p><code>t-test1</code> (included in
<code>tests/tcmalloc/</code>, and compiled as
<code>ptmalloc_unittest1</code>) was run with varying numbers of
threads (1-20) and maximum allocation sizes (64 bytes -
32Kbytes).  These tests were run on a 2.4GHz dual Xeon system with
hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with
one million operations per thread in each test.  In each case, the test
was run once normally, and once with
<code>LD_PRELOAD=libtcmalloc.so</code>.</p>

<p>The graphs below show the performance of TCMalloc vs PTMalloc2 for
several different metrics.  Firstly, total operations (millions) per
elapsed second vs max allocation size, for varying numbers of
threads.  The raw data used to generate these graphs (the output of the
<code>time</code> utility) is available in
<code>t-test1.times.txt</code>.</p>

<table>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.1.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.2.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.3.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.4.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.5.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.8.threads.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspersec.vs.size.12.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.16.threads.png"></td>
  <td><img src="tcmalloc-opspersec.vs.size.20.threads.png"></td>
</tr>
</table>


<ul>
  <li> TCMalloc is much more consistently scalable than PTMalloc2 - for
       all thread counts &gt;1 it achieves ~7-9 million ops/sec for small
       allocations, falling to ~2 million ops/sec for larger
       allocations.  The single-thread case is an obvious outlier,
       since it is only able to keep a single processor busy and hence
       can achieve fewer ops/sec.  PTMalloc2 has a much higher variance
       on operations/sec - peaking somewhere around 4 million ops/sec
       for small allocations and falling to &lt;1 million ops/sec for
       larger allocations.

  <li> TCMalloc is faster than PTMalloc2 in the vast majority of
       cases, and particularly for small allocations.  Contention
       between threads is less of a problem in TCMalloc.

  <li> TCMalloc's performance drops off as the allocation size
       increases.  This is because the per-thread cache is
       garbage-collected when it hits a threshold (defaulting to
       2MB).  With larger allocation sizes, fewer objects can be stored
       in the cache before it is garbage-collected.

  <li> There is a noticeable drop in TCMalloc's performance at ~32K
       maximum allocation size; at larger sizes performance drops less
       quickly.  This is due to the 32K maximum size of objects in the
       per-thread caches; for objects larger than this TCMalloc
       allocates from the central page heap.
</ul>

<p>Next, operations (millions) per second of CPU time vs number of
threads, for max allocation size 64 bytes - 128 Kbytes.</p>

<table>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.64.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.256.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.1024.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.4096.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.8192.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.16384.bytes.png"></td>
</tr>
<tr>
  <td><img src="tcmalloc-opspercpusec.vs.threads.32768.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.65536.bytes.png"></td>
  <td><img src="tcmalloc-opspercpusec.vs.threads.131072.bytes.png"></td>
</tr>
</table>

<p>Here we see again that TCMalloc is both more consistent and more
efficient than PTMalloc2.  For max allocation sizes &lt;32K, TCMalloc
typically achieves ~2-2.5 million ops per second of CPU time with a
large number of threads, whereas PTMalloc achieves generally 0.5-1
million ops per second of CPU time, with a lot of cases achieving much
less than this figure.  Above 32K max allocation size, TCMalloc drops
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU
time is being burned spinning waiting for locks in the heavily
multi-threaded case).</p>


<H2><A NAME="runtime">Modifying Runtime Behavior</A></H2>

<p>You can more finely control the behavior of tcmalloc via
environment variables.</p>

<p>Generally useful flags:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SAMPLE_PARAMETER</code></td>
  <td>default: 0</td>
  <td>
    The approximate gap between sampling actions.  That is, we
    take one sample approximately once every
    <code>TCMALLOC_SAMPLE_PARAMETER</code> bytes of allocation.
    This sampled heap information is available via
    <code>MallocExtension::GetHeapSample()</code> or
    <code>MallocExtension::ReadStackTraces()</code>.  A reasonable
    value is 524288.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_RELEASE_RATE</code></td>
  <td>default: 1.0</td>
  <td>
    Rate at which we release unused memory to the system, via
    <code>madvise(MADV_DONTNEED)</code>, on systems that support
    it.  Zero means we never release memory back to the system.
    Increase this flag to return memory faster; decrease it
    to return memory slower.  Reasonable rates are in the
    range [0,10].
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD</code></td>
  <td>default: 1073741824</td>
  <td>
    Allocations larger than this value cause a stack trace to be
    dumped to stderr.  The threshold for dumping stack traces is
    increased by a factor of 1.125 every time we print a message so
    that the threshold automatically goes up by a factor of ~1000
    every 60 messages.  This bounds the amount of extra logging
    generated by this flag.  The default value of this flag is very
    large, so you should see no extra logging unless the flag is
    overridden.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES</code></td>
  <td>default: 16777216</td>
  <td>
    Bound on the total amount of bytes allocated to thread caches.  This
    bound is not strict, so it is possible for the cache to go over this
    bound in certain circumstances.  This value defaults to 16MB.  For
    applications with many threads, this may not be a large enough cache,
    which can affect performance.  If you suspect your application is not
    scaling to many threads due to lock contention in TCMalloc, you can
    try increasing this value.  This may improve performance, at a cost
    of extra memory use by TCMalloc.  See <a href="#Garbage_Collection">
    Garbage Collection</a> for more details.
  </td>
</tr>

</table>

<p>Advanced "tweaking" flags that control more precisely how tcmalloc
tries to allocate memory from the kernel:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_MMAP</code></td>
  <td>default: false</td>
  <td>
    If true, do not try to use <code>mmap</code> to obtain memory
    from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_SKIP_SBRK</code></td>
  <td>default: false</td>
  <td>
    If true, do not try to use <code>sbrk</code> to obtain memory
    from the kernel.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_START</code></td>
  <td>default: 0</td>
  <td>
    Physical memory starting location in MB for <code>/dev/mem</code>
    allocation.  Setting this to 0 disables <code>/dev/mem</code>
    allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_LIMIT</code></td>
  <td>default: 0</td>
  <td>
    Physical memory limit location in MB for <code>/dev/mem</code>
    allocation.  Setting this to 0 means no limit.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_DEVMEM_DEVICE</code></td>
  <td>default: /dev/mem</td>
  <td>
    Device to use for allocating unmanaged memory.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MALLOC_PATH</code></td>
  <td>default: ""</td>
  <td>
    If set, specify a path where hugetlbfs or tmpfs is mounted.
    This may allow for speedier allocations.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_LIMIT_MB</code></td>
  <td>default: 0</td>
  <td>
    Limit total memfs allocation size to specified number of MB.
    0 means "no limit".
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_ABORT_ON_FAIL</code></td>
  <td>default: false</td>
  <td>
    If true, abort() whenever memfs_malloc fails to satisfy an allocation.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_IGNORE_MMAP_FAIL</code></td>
  <td>default: false</td>
  <td>
    If true, ignore failures from mmap.
  </td>
</tr>

<tr valign=top>
  <td><code>TCMALLOC_MEMFS_MAP_PRIVATE</code></td>
  <td>default: false</td>
  <td>
    If true, use MAP_PRIVATE when mapping via memfs, not MAP_SHARED.
  </td>
</tr>

</table>


<H2><A NAME="compiletime">Modifying Behavior In Code</A></H2>

<p>The <code>MallocExtension</code> class, in
<code>malloc_extension.h</code>, provides a few knobs that you can
tweak in your program, to affect tcmalloc's behavior.</p>

<h3>Releasing Memory Back to the System</h3>

<p>By default, tcmalloc will release no-longer-used memory back to the
kernel gradually, over time.  The <a
href="#runtime">tcmalloc_release_rate</a> flag controls how quickly
this happens.  You can also force a release at a given point in the
program execution like so:</p>
<pre>
   MallocExtension::instance()->ReleaseFreeMemory();
</pre>

<p>You can also call <code>SetMemoryReleaseRate()</code> to change the
<code>tcmalloc_release_rate</code> value at runtime, or
<code>GetMemoryReleaseRate</code> to see what the current release rate
is.</p>

<h3>Memory Introspection</h3>

<p>There are several routines for getting a human-readable form of the
current memory usage:</p>
<pre>
MallocExtension::instance()->GetStats(buffer, buffer_length);
MallocExtension::instance()->GetHeapSample(&string);
MallocExtension::instance()->GetHeapGrowthStacks(&string);
</pre>

<p>The last two produce output in the same format as the
heap-profiler; saved to files, it can be passed as data files to
pprof. The first is human-readable and is meant for debugging.</p>

<h3>Generic TCMalloc Status</h3>

<p>TCMalloc has support for setting and retrieving arbitrary
'properties':</p>
<pre>
MallocExtension::instance()->SetNumericProperty(property_name, value);
MallocExtension::instance()->GetNumericProperty(property_name, &value);
</pre>

<p>An application can both set and get these properties, but the most
useful pattern is for a library to set properties that the application
then reads. Here are the properties TCMalloc defines; you can access
them with a call like
<code>MallocExtension::instance()->GetNumericProperty("generic.heap_size",
&value);</code>:</p>

<table frame=box rules=sides cellpadding=5 width=100%>

<tr valign=top>
  <td><code>generic.current_allocated_bytes</code></td>
  <td>
    Number of bytes used by the application. This will not typically
    match the memory use reported by the OS, because it does not
    include TCMalloc overhead or memory fragmentation.
  </td>
</tr>

<tr valign=top>
  <td><code>generic.heap_size</code></td>
  <td>
    Bytes of system memory reserved by TCMalloc.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_free_bytes</code></td>
  <td>
    Number of bytes in free, mapped pages in page heap. These bytes
    can be used to fulfill allocation requests. They always count
    towards virtual memory usage, and unless the underlying memory is
    swapped out by the OS, they also count towards physical memory
    usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.pageheap_unmapped_bytes</code></td>
  <td>
    Number of bytes in free, unmapped pages in page heap. These are
    bytes that have been released back to the OS, possibly by one of
    the MallocExtension "Release" calls. They can be used to fulfill
    allocation requests, but typically incur a page fault. They
    always count towards virtual memory usage, and depending on the
    OS, typically do not count towards physical memory usage.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.slack_bytes</code></td>
  <td>
    Sum of pageheap_free_bytes and pageheap_unmapped_bytes. Provided
    for backwards compatibility only. Do not use.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.max_total_thread_cache_bytes</code></td>
  <td>
    An upper bound on how much memory TCMalloc dedicates to thread
    caches for small objects. Higher values trade increased memory
    use for improved allocation efficiency in some situations.
  </td>
</tr>

<tr valign=top>
  <td><code>tcmalloc.current_total_thread_cache_bytes</code></td>
  <td>
    The number of bytes TCMalloc currently holds in thread caches for
    small objects.
  </td>
</tr>

</table>

<h2><A NAME="caveats">Caveats</A></h2>

<p>For some systems, TCMalloc may not work correctly with
applications that aren't linked against <code>libpthread.so</code> (or
the equivalent on your OS). It should work on Linux using glibc 2.3,
but other OS/libc combinations have not been tested.</p>

<p>TCMalloc may be somewhat more memory hungry than other mallocs
(though it tends not to have the huge blowups that can happen with
other mallocs). In particular, at startup TCMalloc allocates
approximately 240KB of internal memory.</p>

<p>Don't try to load TCMalloc into a running binary (e.g., using JNI
in Java programs). The binary will have allocated some objects using
the system malloc, and may try to pass them to TCMalloc for
deallocation. TCMalloc will not be able to handle such objects.</p>

<hr>

<address>Sanjay Ghemawat, Paul Menage<br>
<!-- Created: Tue Dec 19 10:43:14 PST 2000 -->
<!-- hhmts start -->
Last modified: Sat Feb 24 13:11:38 PST 2007 (csilvers)
<!-- hhmts end -->
</address>

</body>
</html>