Blame - format_description.txt - RealtimeRoboticsGroup/test

blob: 20db66c1f26b3f1850d7105344bc43216cbda7fb [file] [log] [blame]

James Kuszmaul	48dd4c8	2021-10-27 20:04:08 -0700	[diff] [blame^]	1	Snappy compressed format description
				2	Last revised: 2011-10-05
				3
				4
				5	This is not a formal specification, but should suffice to explain most
				6	relevant parts of how the Snappy format works. It is originally based on
				7	text by Zeev Tarantov.
				8
				9	Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding.
				10	There is no entropy encoder backend nor framing layer -- the latter is
				11	assumed to be handled by other parts of the system.
				12
				13	This document only describes the format, not how the Snappy compressor nor
				14	decompressor actually works. The correctness of the decompressor should not
				15	depend on implementation details of the compressor, and vice versa.
				16
				17
				18	1. Preamble
				19
				20	The stream starts with the uncompressed length (up to a maximum of 2^32 - 1),
				21	stored as a little-endian varint. Varints consist of a series of bytes,
				22	where the lower 7 bits are data and the upper bit is set iff there are
				23	more bytes to be read. In other words, an uncompressed length of 64 would
				24	be stored as 0x40, and an uncompressed length of 2097150 (0x1FFFFE)
				25	would be stored as 0xFE 0xFF 0x7F.
				26
				27
				28	2. The compressed stream itself
				29
				30	There are two types of elements in a Snappy stream: Literals and
				31	copies (backreferences). There is no restriction on the order of elements,
				32	except that the stream naturally cannot start with a copy. (Having
				33	two literals in a row is never optimal from a compression point of
				34	view, but nevertheless fully permitted.) Each element starts with a tag byte,
				35	and the lower two bits of this tag byte signal what type of element will
				36	follow:
				37
				38	00: Literal
				39	01: Copy with 1-byte offset
				40	10: Copy with 2-byte offset
				41	11: Copy with 4-byte offset
				42
				43	The interpretation of the upper six bits are element-dependent.
				44
				45
				46	2.1. Literals (00)
				47
				48	Literals are uncompressed data stored directly in the byte stream.
				49	The literal length is stored differently depending on the length
				50	of the literal:
				51
				52	- For literals up to and including 60 bytes in length, the upper
				53	six bits of the tag byte contain (len-1). The literal follows
				54	immediately thereafter in the bytestream.
				55	- For longer literals, the (len-1) value is stored after the tag byte,
				56	little-endian. The upper six bits of the tag byte describe how
				57	many bytes are used for the length; 60, 61, 62 or 63 for
				58	1-4 bytes, respectively. The literal itself follows after the
				59	length.
				60
				61
				62	2.2. Copies
				63
				64	Copies are references back into previous decompressed data, telling
				65	the decompressor to reuse data it has previously decoded.
				66	They encode two values: The _offset_, saying how many bytes back
				67	from the current position to read, and the _length_, how many bytes
				68	to copy. Offsets of zero can be encoded, but are not legal;
				69	similarly, it is possible to encode backreferences that would
				70	go past the end of the block (offset > current decompressed position),
				71	which is also nonsensical and thus not allowed.
				72
				73	As in most LZ77-based compressors, the length can be larger than the offset,
				74	yielding a form of run-length encoding (RLE). For instance,
				75	"xababab" could be encoded as
				76
				77	<literal: "xab"> <copy: offset=2 length=4>
				78
				79	Note that since the current Snappy compressor works in 32 kB
				80	blocks and does not do matching across blocks, it will never produce
				81	a bitstream with offsets larger than about 32768. However, the
				82	decompressor should not rely on this, as it may change in the future.
				83
				84	There are several different kinds of copy elements, depending on
				85	the amount of bytes to be copied (length), and how far back the
				86	data to be copied is (offset).
				87
				88
				89	2.2.1. Copy with 1-byte offset (01)
				90
				91	These elements can encode lengths between [4..11] bytes and offsets
				92	between [0..2047] bytes. (len-4) occupies three bits and is stored
				93	in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the
				94	upper three are stored in the upper three bits ([5..7]) of the tag byte,
				95	and the lower eight are stored in a byte following the tag byte.
				96
				97
				98	2.2.2. Copy with 2-byte offset (10)
				99
				100	These elements can encode lengths between [1..64] and offsets from
				101	[0..65535]. (len-1) occupies six bits and is stored in the upper
				102	six bits ([2..7]) of the tag byte. The offset is stored as a
				103	little-endian 16-bit integer in the two bytes following the tag byte.
				104
				105
				106	2.2.3. Copy with 4-byte offset (11)
				107
				108	These are like the copies with 2-byte offsets (see previous subsection),
				109	except that the offset is stored as a 32-bit integer instead of a
				110	16-bit integer (and thus will occupy four bytes).