Snappy framing format description
Last revised: 2013-10-25

This document describes a framing format for Snappy, which allows
compression to files or streams that can then more easily be decompressed
without having to hold the entire stream in memory. It also provides data
checksums to help verify integrity. It does not provide metadata checksums,
so it does not protect against, e.g., all forms of truncation.

Implementation of the framing format is optional for Snappy compressors and
decompressors; it is not part of the Snappy core specification.


1. General structure

The file consists solely of chunks, lying back-to-back with no padding
in between. Each chunk consists first of a single byte of chunk identifier,
then a three-byte little-endian length of the chunk in bytes (from 0 to
16777215, inclusive), and then the data, if any. The four bytes of chunk
header are not counted in the data length.
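
The header layout above can be sketched in C as follows. This is only an
illustration: the struct chunk_header type and the parse_chunk_header
function are hypothetical names introduced here, not part of the format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type; the names are illustrative only. */
struct chunk_header {
  uint8_t type;          /* chunk identifier byte */
  uint32_t data_length;  /* 0 to 16777215; excludes the 4 header bytes */
};

/* Decodes the four-byte chunk header at buf.
   Returns 0 on success, or -1 if fewer than four bytes are available. */
int parse_chunk_header(const uint8_t *buf, size_t len,
                       struct chunk_header *out) {
  if (len < 4) return -1;
  out->type = buf[0];
  /* Three-byte little-endian length: least significant byte first. */
  out->data_length = (uint32_t)buf[1] |
                     ((uint32_t)buf[2] << 8) |
                     ((uint32_t)buf[3] << 16);
  return 0;
}
```

A reader would typically call this once per chunk, then consume exactly
data_length further bytes before looking for the next header.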

The different chunk types are listed below. The first chunk must always
be the stream identifier chunk (see section 4.1, below). The stream
ends when the file ends -- there is no explicit end-of-file marker.


2. File type identification

The following identifiers for this format are recommended where appropriate.
However, note that none have been registered officially, so this is only to
be taken as a guideline. We use "Snappy framed" to distinguish between this
format and raw Snappy data.

    File extension:         .sz
    MIME type:              application/x-snappy-framed
    HTTP Content-Encoding:  x-snappy-framed


3. Checksum format

Some chunks have data protected by a checksum (the ones that do will say so
explicitly). The checksums are always masked CRC-32Cs.

A description of CRC-32C can be found in RFC 3720, section 12.1, with
examples in section B.4.
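
For readers without the RFC at hand, a minimal bit-at-a-time CRC-32C might
look like the sketch below. It assumes the usual reflected formulation
(reflected polynomial 0x82f63b78, initial value and final XOR of
0xffffffff). The function name crc32c is an illustrative choice, and
production code would normally use a table-driven or hardware-assisted
version instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli). Slow but compact; for illustration only. */
uint32_t crc32c(const uint8_t *data, size_t len) {
  uint32_t crc = 0xffffffffu;  /* initial value */
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      /* Shift right; XOR in the polynomial if the dropped bit was set. */
      crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xffffffffu;    /* final XOR */
}
```

A common sanity check is the nine ASCII bytes "123456789", for which
CRC-32C is 0xe3069283.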

Checksums are not stored directly, but masked, as checksumming data and
then its own checksum can be problematic. The masking is the same as used
in Apache Hadoop: Rotate the checksum right by 15 bits, then add the
constant 0xa282ead8 (using wraparound as normal for unsigned integers).
This is equivalent to the following C code:

  uint32_t mask_checksum(uint32_t x) {
    return ((x >> 15) | (x << 17)) + 0xa282ead8;
  }

Note that the masking is reversible.
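
To illustrate the reversibility, the inverse can be written by undoing the
two steps in the opposite order: subtract the constant, then rotate left by
15 bits. The name unmask_checksum is an assumption made for this sketch;
mask_checksum is repeated only so that the block compiles on its own.

```c
#include <stdint.h>

/* Masking as given in this section. */
uint32_t mask_checksum(uint32_t x) {
  return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Hypothetical inverse: subtract the constant (with unsigned wraparound),
   then rotate left by 15 bits to undo the right rotation. */
uint32_t unmask_checksum(uint32_t masked) {
  uint32_t rot = masked - 0xa282ead8u;
  return (rot << 15) | (rot >> 17);
}
```

A decoder only needs one direction: it can either unmask the stored value
or mask its own freshly computed CRC-32C before comparing.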

The checksum is always stored as a four-byte integer, in little-endian
byte order.


4. Chunk types

The currently supported chunk types are described below. The list may
be extended in the future.


4.1. Stream identifier (chunk type 0xff)

The stream identifier is always the first element in the stream.
It is exactly six bytes long and contains "sNaPpY" in ASCII. This means that
a valid Snappy framed stream always starts with the bytes

  0xff 0x06 0x00 0x00 0x73 0x4e 0x61 0x50 0x70 0x59

The stream identifier chunk can come multiple times in the stream besides
the first; if such a chunk shows up, it should simply be ignored, assuming
it has the right length and contents. This allows for easy concatenation of
compressed files without the need for re-framing.
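
A reader might check for the ten leading bytes along these lines. The
identifiers kStreamIdentifier and has_stream_identifier are hypothetical
names chosen for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Chunk type 0xff, three-byte length 6, then "sNaPpY" in ASCII. */
static const uint8_t kStreamIdentifier[10] = {
  0xff, 0x06, 0x00, 0x00, 0x73, 0x4e, 0x61, 0x50, 0x70, 0x59
};

/* Returns 1 if buf begins with the stream identifier chunk, else 0. */
int has_stream_identifier(const uint8_t *buf, size_t len) {
  return len >= sizeof(kStreamIdentifier) &&
         memcmp(buf, kStreamIdentifier, sizeof(kStreamIdentifier)) == 0;
}
```

The same comparison can be reused when a 0xff chunk appears later in the
stream, to confirm it has the right length and contents before ignoring it.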


4.2. Compressed data (chunk type 0x00)

Compressed data chunks contain a normal Snappy compressed bitstream;
see the compressed format specification. The compressed data is preceded by
the masked CRC-32C (see section 3) of the _uncompressed_ data.

Note that the data portion of the chunk, i.e., the compressed contents,
can be at most 16777211 bytes (2^24 - 1, minus the four-byte checksum).
However, we place an additional restriction that the uncompressed data
in a chunk must be no longer than 65536 bytes. This allows consumers to
easily use small fixed-size buffers.


4.3. Uncompressed data (chunk type 0x01)

Uncompressed data chunks allow a compressor to send uncompressed,
raw data; this is useful if, for instance, incompressible or
near-incompressible data is detected, and faster decompression is desired.

As in the compressed chunks, the data is preceded by its own masked
CRC-32C (see section 3).

An uncompressed data chunk, like compressed data chunks, should contain
no more than 65536 data bytes, so the maximum legal chunk length with the
checksum is 65540.
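
Putting the pieces together, an encoder could emit one uncompressed data
chunk roughly as sketched below. Everything here is illustrative: the
function names (crc32c, put_le32, write_uncompressed_chunk) are assumptions,
and the bitwise CRC-32C is a compact stand-in for whatever implementation a
real encoder would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (reflected polynomial 0x82f63b78); illustration only. */
uint32_t crc32c(const uint8_t *data, size_t len) {
  uint32_t crc = 0xffffffffu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1u)));
  }
  return crc ^ 0xffffffffu;
}

/* Masking from section 3. */
uint32_t mask_checksum(uint32_t x) {
  return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Stores v at p as four little-endian bytes. */
void put_le32(uint8_t *p, uint32_t v) {
  p[0] = (uint8_t)(v & 0xff);
  p[1] = (uint8_t)((v >> 8) & 0xff);
  p[2] = (uint8_t)((v >> 16) & 0xff);
  p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Writes one chunk of type 0x01 holding data[0..len) into out.
   Returns the bytes written (len + 8), or 0 if len exceeds 65536 or
   out_cap is too small. */
size_t write_uncompressed_chunk(const uint8_t *data, size_t len,
                                uint8_t *out, size_t out_cap) {
  if (len > 65536u || out_cap < len + 8) return 0;
  uint32_t chunk_len = (uint32_t)len + 4;  /* checksum plus data */
  out[0] = 0x01;                           /* chunk type */
  out[1] = (uint8_t)(chunk_len & 0xff);    /* three-byte LE length */
  out[2] = (uint8_t)((chunk_len >> 8) & 0xff);
  out[3] = (uint8_t)((chunk_len >> 16) & 0xff);
  put_le32(out + 4, mask_checksum(crc32c(data, len)));
  if (len > 0) memcpy(out + 8, data, len);
  return len + 8;
}
```

Inputs longer than 65536 bytes would be split by the caller into several
chunks, each carrying its own checksum.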


4.4. Padding (chunk type 0xfe)

Padding chunks allow a compressor to increase the size of the data stream
so that it complies with external demands, e.g. that the total number of
bytes is a multiple of some value.

All bytes of the padding chunk, except the chunk byte itself and the length,
should be zero, but decompressors must not try to interpret or verify the
padding data in any way.
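
As a worked example of the alignment use case, the sketch below computes
how large a single padding chunk must be to bring a stream up to a multiple
of some alignment. The one subtlety is that a chunk can never be smaller
than its four-byte header, so a gap of one to three bytes has to be widened
by a full alignment step. The function name padding_chunk_size is
hypothetical.

```c
#include <stdint.h>

/* Returns the total size in bytes (header included) of the smallest single
   padding chunk that makes stream_len a multiple of alignment, or 0 if the
   stream is already aligned. Returns -1 if alignment is 0 or the gap does
   not fit in one chunk. */
int64_t padding_chunk_size(uint64_t stream_len, uint32_t alignment) {
  if (alignment == 0) return -1;
  uint64_t gap = (alignment - stream_len % alignment) % alignment;
  if (gap == 0) return 0;
  /* A chunk occupies at least 4 bytes, so widen gaps of 1 to 3 bytes. */
  while (gap < 4) gap += alignment;
  /* The data part (gap - 4) must fit the three-byte length field. */
  if (gap - 4 > 16777215u) return -1;
  return (int64_t)gap;
}
```

For instance, a 15-byte stream aligned to 8 needs a 9-byte padding chunk
(four header bytes plus five zero bytes), giving 24 bytes in total.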


4.5. Reserved unskippable chunks (chunk types 0x02-0x7f)

These are reserved for future expansion. A decoder that sees such a chunk
should immediately return an error, as it must assume it cannot decode the
stream correctly.

Future versions of this specification may define meanings for these chunks.


4.6. Reserved skippable chunks (chunk types 0x80-0xfd)

These are also reserved for future expansion, but unlike the chunks
described in 4.5, a decoder seeing these must skip them and continue
decoding.

Future versions of this specification may define meanings for these chunks.
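
The chunk type ranges from sections 4.1 through 4.6 can be summarized in a
small dispatch function. This is a sketch only; the enum chunk_kind and the
function classify_chunk are hypothetical names, not part of the format.

```c
#include <stdint.h>

/* Hypothetical classification of the chunk identifier byte. */
enum chunk_kind {
  CHUNK_COMPRESSED,            /* 0x00, section 4.2 */
  CHUNK_UNCOMPRESSED,          /* 0x01, section 4.3 */
  CHUNK_RESERVED_UNSKIPPABLE,  /* 0x02-0x7f, section 4.5: return an error */
  CHUNK_RESERVED_SKIPPABLE,    /* 0x80-0xfd, section 4.6: skip the chunk */
  CHUNK_PADDING,               /* 0xfe, section 4.4: skip without checking */
  CHUNK_STREAM_IDENTIFIER      /* 0xff, section 4.1 */
};

/* Maps a chunk identifier byte to the handling a decoder should apply. */
enum chunk_kind classify_chunk(uint8_t type) {
  switch (type) {
    case 0x00: return CHUNK_COMPRESSED;
    case 0x01: return CHUNK_UNCOMPRESSED;
    case 0xfe: return CHUNK_PADDING;
    case 0xff: return CHUNK_STREAM_IDENTIFIER;
    default:
      /* Remaining values: the top bit decides whether skipping is allowed. */
      return type < 0x80 ? CHUNK_RESERVED_UNSKIPPABLE
                         : CHUNK_RESERVED_SKIPPABLE;
  }
}
```

A decoder loop would read a chunk header, call this function on the
identifier byte, and then decode, skip, or fail accordingly.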