mirror of
https://github.com/google/leveldb.git
synced 2025-01-21 16:43:16 +08:00
The log file contents are a sequence of 32KB blocks.  The only
exception is that the tail of the file may contain a partial block.

Each block consists of a sequence of records:
   block := record* trailer?
   record :=
      checksum: uint32     // crc32c of type and data[]
      length: uint16
      type: uint8          // One of FULL, FIRST, MIDDLE, LAST
      data: uint8[length]

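The record layout can be sketched in Python as follows.  This is a
minimal illustration, not the LevelDB implementation: the names
HEADER, crc32c, encode_record and decode_record are hypothetical, the
fields are assumed to be packed little-endian with no padding, and the
checksum is assumed to be a plain CRC-32C over the type byte followed
by the data, as the comment in the grammar describes.

```python
import struct

# checksum: uint32, length: uint16, type: uint8 (assumed little-endian, unpadded)
HEADER = struct.Struct("<IHB")


def _make_crc32c_table():
    # Table for CRC-32C using the reflected Castagnoli polynomial.
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Byte-at-a-time CRC-32C; slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def encode_record(record_type: int, data: bytes) -> bytes:
    """Serialize one record: 7-byte header followed by the data."""
    checksum = crc32c(bytes([record_type]) + data)
    return HEADER.pack(checksum, len(data), record_type) + data


def decode_record(buf: bytes, offset: int = 0):
    """Parse one record at offset; return (type, data, next_offset)."""
    checksum, length, record_type = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("truncated record")
    if crc32c(bytes([record_type]) + data) != checksum:
        raise ValueError("checksum mismatch")
    return record_type, data, start + length
```

The header is always 4 + 2 + 1 = 7 bytes, so a record carrying n bytes
of data occupies n + 7 bytes of its block.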

A record never starts within the last seven bytes of a block.  Any
leftover bytes here form the trailer, which must consist entirely of
zero bytes and must be skipped by readers.  In particular, even if
there are exactly seven bytes left in the block and a zero-length
user record is added (which would fit in these seven bytes), the writer
must skip these trailer bytes and add the record to the next block.

More types may be added in the future.  Some readers may skip record
types they do not understand; others may report that some data was
skipped.

FULL == 1
FIRST == 2
MIDDLE == 3
LAST == 4

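A reader for a single block, following the trailer rule and the
type values above, might look like the sketch below.  The names
read_block and TRAILER_LIMIT are hypothetical, the header is assumed
to be little-endian, and checksum verification is left out to keep the
example short.  Unknown record types are skipped and the number of
skipped data bytes is reported, which covers both reader behaviors
mentioned above.

```python
import struct

BLOCK_SIZE = 32 * 1024
HEADER = struct.Struct("<IHB")   # checksum, length, type (assumed little-endian)
TRAILER_LIMIT = 7                # no record starts within the last seven bytes
KNOWN_TYPES = {1: "FULL", 2: "FIRST", 3: "MIDDLE", 4: "LAST"}


def read_block(block: bytes):
    """Parse one block (possibly a partial tail block).

    Returns (records, skipped) where records is a list of
    (type_name, data) and skipped counts data bytes that belonged to
    record types this reader does not understand.
    """
    records = []
    skipped = 0
    pos = 0
    # Stop at the end of the data or once only trailer bytes can remain.
    while pos < len(block) and BLOCK_SIZE - pos > TRAILER_LIMIT:
        if len(block) - pos < HEADER.size:
            break                      # partial tail block ends mid-header
        _checksum, length, rtype = HEADER.unpack_from(block, pos)
        start = pos + HEADER.size
        end = start + length
        if end > len(block):
            break                      # partial tail block ends mid-record
        if rtype in KNOWN_TYPES:
            records.append((KNOWN_TYPES[rtype], block[start:end]))
        else:
            skipped += length          # unknown type: skip, but keep count
        pos = end
    return records, skipped
```

Because the trailer is never parsed, the reader does not depend on the
trailer bytes actually being zero.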

The FULL record contains the contents of an entire user record.

FIRST, MIDDLE, LAST are types used for user records that have been
split into multiple fragments (typically because of block boundaries).
FIRST is the type of the first fragment of a user record, LAST is the
type of the last fragment of a user record, and MIDDLE is the type of
all interior fragments of a user record.

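Reassembling user records from a stream of fragments could be sketched
as below.  The function name reassemble is hypothetical, and raising
ValueError on an out-of-order fragment is a choice made for this
example; a real reader might instead drop the damaged record and
continue.

```python
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def reassemble(fragments):
    """Yield user records from an iterable of (type, data) records."""
    pending = None   # list of fragment payloads while inside a split record
    for rtype, data in fragments:
        if rtype == FULL:
            if pending is not None:
                raise ValueError("FULL record inside a fragmented record")
            yield data
        elif rtype == FIRST:
            if pending is not None:
                raise ValueError("FIRST record inside a fragmented record")
            pending = [data]
        elif rtype in (MIDDLE, LAST):
            if pending is None:
                raise ValueError("fragment without a preceding FIRST")
            pending.append(data)
            if rtype == LAST:
                yield b"".join(pending)
                pending = None
        else:
            raise ValueError("unknown record type %d" % rtype)
    if pending is not None:
        raise ValueError("log ended in the middle of a fragmented record")
```

For example, the sequence FIRST, MIDDLE, LAST with payloads b"b1",
b"b2", b"b3" yields the single user record b"b1b2b3".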

Example: consider a sequence of user records:
   A: length 1000
   B: length 97270
   C: length 8000
A will be stored as a FULL record in the first block.

B will be split into three fragments: first fragment occupies the rest
of the first block, second fragment occupies the entirety of the
second block, and the third fragment occupies a prefix of the third
block.  This will leave six bytes free in the third block, which will
be left empty as the trailer.

C will be stored as a FULL record in the fourth block.

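The arithmetic behind this example can be checked with a short
sketch.  The function name layout and the constant TRAILER_LIMIT are
hypothetical; the code simply applies the seven-byte header size and
the trailer rule as stated in this document to a list of user record
lengths.

```python
BLOCK_SIZE = 32 * 1024
HEADER_SIZE = 7      # checksum (4) + length (2) + type (1)
TRAILER_LIMIT = 7    # no record starts within the last seven bytes of a block

_NAMES = {
    (True, True): "FULL",
    (True, False): "FIRST",
    (False, False): "MIDDLE",
    (False, True): "LAST",
}


def layout(user_record_lengths):
    """Return a list of (block_index, type_name, data_length) fragments."""
    fragments = []
    offset = 0                                   # absolute offset in the log
    for length in user_record_lengths:
        remaining = length
        first = True
        while True:
            leftover = BLOCK_SIZE - offset % BLOCK_SIZE
            if leftover <= TRAILER_LIMIT:
                offset += leftover               # zero-filled trailer
                leftover = BLOCK_SIZE
            chunk = min(remaining, leftover - HEADER_SIZE)
            last = chunk == remaining
            fragments.append((offset // BLOCK_SIZE, _NAMES[(first, last)], chunk))
            offset += HEADER_SIZE + chunk
            remaining -= chunk
            first = False
            if last:
                break
    return fragments


for fragment in layout([1000, 97270, 8000]):
    print(fragment)
# (0, 'FULL', 1000)
# (0, 'FIRST', 31754)
# (1, 'MIDDLE', 32761)
# (2, 'LAST', 32755)
# (3, 'FULL', 8000)
```

The third block holds 7 + 32755 = 32762 bytes, which leaves
32768 - 32762 = 6 bytes for the trailer, so C starts in the fourth
block (block index 3).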

===================

Some benefits over the recordio format:

(1) We do not need any heuristics for resyncing - just go to the next
block boundary and scan.  If there is a corruption, skip to the next
block.  As a side-benefit, we do not get confused when part of the
contents of one log file are embedded as a record inside another log
file.

(2) Splitting at approximate boundaries (e.g., for mapreduce) is
simple: find the next block boundary and skip records until we
hit a FULL or FIRST record.

(3) We do not need extra buffering for large records.

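The procedure in (2) could be sketched as follows.  The function name
find_record_start is hypothetical, the header is assumed to be
little-endian, and checksums are not verified; the sketch only shows
how block alignment makes it possible to locate the start of a user
record from an arbitrary offset.

```python
import struct

BLOCK_SIZE = 32 * 1024
HEADER = struct.Struct("<IHB")   # checksum, length, type (assumed little-endian)
TRAILER_LIMIT = 7                # no record starts within the last seven bytes
FULL, FIRST = 1, 2


def find_record_start(log: bytes, approx_offset: int):
    """Return the offset of the first FULL or FIRST record at or after
    the next block boundary, or None if the log ends first."""
    # Round the approximate offset up to a block boundary.
    pos = -(-approx_offset // BLOCK_SIZE) * BLOCK_SIZE
    while pos + HEADER.size <= len(log):
        leftover = BLOCK_SIZE - pos % BLOCK_SIZE
        if leftover <= TRAILER_LIMIT:
            pos += leftover              # skip the trailer
            continue
        _checksum, length, rtype = HEADER.unpack_from(log, pos)
        if rtype in (FULL, FIRST):
            return pos
        pos += HEADER.size + length      # MIDDLE/LAST of an earlier record
    return None
```

With the A, B, C example above, starting from any offset inside the
first block lands on the second block, skips the MIDDLE and LAST
fragments of B and the six-byte trailer, and returns the start of the
fourth block, where C begins.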

Some downsides compared to recordio format:

(1) No packing of tiny records.  This could be fixed by adding a new
record type, so it is a shortcoming of the current implementation,
not necessarily the format.

(2) No compression.  Again, this could be fixed by adding new record types.