blob: 20db66c1f26b3f1850d7105344bc43216cbda7fb [file] [log] [blame]
James Kuszmaul48dd4c82021-10-27 20:04:08 -07001Snappy compressed format description
2Last revised: 2011-10-05
3
4
5This is not a formal specification, but should suffice to explain most
6relevant parts of how the Snappy format works. It is originally based on
7text by Zeev Tarantov.
8
9Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding.
10There is no entropy encoder backend nor framing layer -- the latter is
11assumed to be handled by other parts of the system.
12
13This document only describes the format, not how the Snappy compressor nor
14decompressor actually works. The correctness of the decompressor should not
15depend on implementation details of the compressor, and vice versa.
16
17
181. Preamble
19
20The stream starts with the uncompressed length (up to a maximum of 2^32 - 1),
21stored as a little-endian varint. Varints consist of a series of bytes,
22where the lower 7 bits are data and the upper bit is set iff there are
23more bytes to be read. In other words, an uncompressed length of 64 would
24be stored as 0x40, and an uncompressed length of 2097150 (0x1FFFFE)
25would be stored as 0xFE 0xFF 0x7F.
26
27
282. The compressed stream itself
29
30There are two types of elements in a Snappy stream: Literals and
31copies (backreferences). There is no restriction on the order of elements,
32except that the stream naturally cannot start with a copy. (Having
33two literals in a row is never optimal from a compression point of
34view, but nevertheless fully permitted.) Each element starts with a tag byte,
35and the lower two bits of this tag byte signal what type of element will
36follow:
37
38 00: Literal
39 01: Copy with 1-byte offset
40 10: Copy with 2-byte offset
41 11: Copy with 4-byte offset
42
43The interpretation of the upper six bits are element-dependent.
44
45
462.1. Literals (00)
47
48Literals are uncompressed data stored directly in the byte stream.
49The literal length is stored differently depending on the length
50of the literal:
51
52 - For literals up to and including 60 bytes in length, the upper
53 six bits of the tag byte contain (len-1). The literal follows
54 immediately thereafter in the bytestream.
55 - For longer literals, the (len-1) value is stored after the tag byte,
56 little-endian. The upper six bits of the tag byte describe how
57 many bytes are used for the length; 60, 61, 62 or 63 for
58 1-4 bytes, respectively. The literal itself follows after the
59 length.
60
61
622.2. Copies
63
64Copies are references back into previous decompressed data, telling
65the decompressor to reuse data it has previously decoded.
66They encode two values: The _offset_, saying how many bytes back
67from the current position to read, and the _length_, how many bytes
68to copy. Offsets of zero can be encoded, but are not legal;
69similarly, it is possible to encode backreferences that would
70go past the end of the block (offset > current decompressed position),
71which is also nonsensical and thus not allowed.
72
73As in most LZ77-based compressors, the length can be larger than the offset,
74yielding a form of run-length encoding (RLE). For instance,
75"xababab" could be encoded as
76
77 <literal: "xab"> <copy: offset=2 length=4>
78
79Note that since the current Snappy compressor works in 32 kB
80blocks and does not do matching across blocks, it will never produce
81a bitstream with offsets larger than about 32768. However, the
82decompressor should not rely on this, as it may change in the future.
83
84There are several different kinds of copy elements, depending on
85the amount of bytes to be copied (length), and how far back the
86data to be copied is (offset).
87
88
892.2.1. Copy with 1-byte offset (01)
90
91These elements can encode lengths between [4..11] bytes and offsets
92between [0..2047] bytes. (len-4) occupies three bits and is stored
93in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the
94upper three are stored in the upper three bits ([5..7]) of the tag byte,
95and the lower eight are stored in a byte following the tag byte.
96
97
982.2.2. Copy with 2-byte offset (10)
99
100These elements can encode lengths between [1..64] and offsets from
101[0..65535]. (len-1) occupies six bits and is stored in the upper
102six bits ([2..7]) of the tag byte. The offset is stored as a
103little-endian 16-bit integer in the two bytes following the tag byte.
104
105
1062.2.3. Copy with 4-byte offset (11)
107
108These are like the copies with 2-byte offsets (see previous subsection),
109except that the offset is stored as a 32-bit integer instead of a
11016-bit integer (and thus will occupy four bytes).