Commit Graph

137 Commits

Author SHA1 Message Date
costan
8415f00eee leveldb: Report missing CURRENT manifest file as database corruption.
BTRFS reorders rename and write operations, so it is possible that a filesystem crash and recovery results in a situation where the file pointed to by CURRENT does not exist. DB::Open currently reports an I/O error in this case. Reporting database corruption is a better hint to the caller, which can attempt to recover the database or erase it and start over.

This issue is not merely theoretical. It was reported as having showed up in the wild at https://github.com/google/leveldb/issues/195 and at https://crbug.com/738961. Also, asides from the BTRFS case described above, incorrect data in CURRENT seems like a possible corruption case that should be handled gracefully.

The Env API changes here can be considered backwards compatible, because an implementation that returns Status::IOError instead of Status::NotFound will still get the same functionality as before.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=161432630
2017-07-10 14:14:00 -07:00
costan
69e2bd224b LevelDB: Add WriteBatch::ApproximateSize().
This can be used to report metrics on LevelDB usage.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=156934930
2017-07-10 14:13:30 -07:00
costan
a53934a3ae Increase leveldb version to 1.20.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148937577
2017-03-01 16:08:02 -08:00
costan
f3f139737c Separate Env tests from PosixEnv tests.
env_test.cc defines EnvPosixTest which tests the Env implementation returned by Env::Default(). The naming is a bit unfortunate, as the tests in env_test.cc are written against the Env contract, and therefore are applicable to any Env implementation. An instance of the confusion caused by the naming is [] which added a dependency from env_test.cc to EnvPosixTestHelper, which is closely coupled to EnvPosix.

This change disentangles EnvPosix-specific test code into a env_posix_test.cc file. The code there uses EnvPosixTestHelper and specifically targets the EnvPosix implementation. env_test.cc now implements EnvTest, and contains tests that are also applicable to other ports, which may define their own Env implementation.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148914642
2017-03-01 13:53:23 -08:00
costan
eb4f0972fd leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit).
LE_LOAD64 is only used when _mm_crc32_u64 is available, on 64-bit x86 processors.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148906169
2017-03-01 11:37:43 -08:00
cmumford
d0883b6006 Fixed path to doc file: index.md.
Prior index.html was using rawgit.com which doesn't process
Markdown and therefore only serves the markdown source.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148902180
2017-03-01 10:28:56 -08:00
cmumford
7fa20948d5 Convert documentation to markdown.
Markdown is more readable in a text editor and when hosted
on GitHub is more readable than HTML.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148830423
2017-03-01 09:42:25 -08:00
costan
ea175e28f8 Implement support for Intel crc32 instruction (SSE 4.2)
This change authored by vadimskipin and submitted via:

    https://github.com/google/leveldb/pull/309

Changes made to support iOS builds and other architectures
without support for SSE 4.2.

db_bench reports original crc32 speed at:

    crc32c : 3.610 micros/op; 1082.0 MB/s (4K per op)

with this change performance has increased to:

    crc32c : 0.843 micros/op; 4633.6 MB/s (4K per op)

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=148694935
2017-02-28 14:08:46 -08:00
cmumford
95cd743e5e Including <limits> for std::numeric_limits.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=146841327
2017-02-09 14:09:51 -08:00
cmumford
646c3588de Limit the number of read-only files the POSIX Env will have open.
Background compaction can create an unbounded number of
leveldb::RandomAccessFile instances. On 64-bit systems mmap is used and
file descriptors are only used beyond a certain number of mmap's.
32-bit systems to not use mmap at all. leveldb::RandomAccessFile does not
observe Options.max_open_files so compaction could exhaust the file
descriptor limit.

This change uses getrlimit to determine the maximum number of open
files and limits RandomAccessFile to approximately 20% of that value.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=143505556
2017-01-04 09:13:20 -08:00
corrado
a2fb086d07 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=134391640
2016-09-28 10:52:24 -07:00
cmumford
3080a45b62 Increase leveldb version to 1.19.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=129930720
2016-08-11 07:33:30 -07:00
sanjay
fa6dc010a2 A zippy change broke test assumptions about the size of compressed output.
Fix the tests by allowing more slop in zippy's behavior.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=123432472
2016-07-06 09:16:11 -07:00
m3b
06a191b8de fix problems in LevelDB's caching code
Background:

LevelDB uses a cache (util/cache.h, util/cache.cc) of (key,value)
pairs for two purposes:
- a cache of (table, file handle) pairs
- a cache of blocks

The cache places the (key,value) pairs in a reference-counted
wrapper.  When it returns a value, it returns a reference to this
wrapper.  When the client has finished using the reference and
its enclosed (key,value), it calls Release() to decrement the
reference count.

Each (key,value) pair has an associated resource usage.  The
cache maintains the sum of the usages of the elements it holds,
and removes values as needed to keep the sum below a capacity
threshold.  It maintains an LRU list so that it will remove the
least-recently used elements first.

The max_open_files option to LevelDB sets the size of the cache
of (table, file handle) pairs.  The option is not used in any
other way.

The observed behaviour:

If LevelDB at any time used more file handles concurrently than
the cache size set via max_open_files, it attempted to reduce the
number by evicting entries from the table cache.  This could
happen most easily during compaction, and if max_open_files was
low.  Because the handles were in use, their reference count did
not drop to zero, and so the usage sum in the cache was not
modified by the evictions.  Subsequent Insert() calls returned
valid handles, but their entries were immediately evicted from
the cache, which though empty still acted as though full.  As a
result, there was effectively no caching, and the number of open
file handles rose []ly until it hit system-imposed limits and
the process died.

If one set max_open_files lower, the cache was more likely to
exhibit this beahviour, and cause the process to run out of file
descriptors.  That is, max_open_files acted in almost exactly the
opposite manner from what was intended.

The problems:

1. The cache kept all elements on its LRU list eligible for capacity
   eviction---even those with outstanding references from clients.  This was
   ineffective in reducing resource consumption because there was an
   outstanding reference, guaranteeing that the items remained.  A secondary
   issue was that there is no guarantee that these in-use items will be the
   last things reached in the LRU chain, which actually recorded
   "least-recently requested" rather than "least-recently used".

2. The sum of usages was decremented not when a (key,value) was evicted from
   the cache, but when its reference count went to zero.  Thus, when things
   were removed from the cache, either by garbage collection or via Erase(),
   the usage sum was not necessarily decreased.  This allowed the cache to act
   as though full when it was in fact not, reducing caching effectiveness, and
   leading to more resources being consumed---the opposite of what the
   evictions were intended to achieve.

3. (minor) The cache's clients insert items into it by first looking up the
   key, and inserting only if no value is found.  Although the cache has an
   internal lock, the clients use no locking to ensure atomicity of the
   Lookup/Insert pair.  (see table/table.cc:  block_cache->Insert() and
   db/table_cache.cc:  cache_->Insert()).  Thus, if two threads Insert() at
   about the same time, they can both Lookup(), find nothing, and both
   Insert().  The second Insert() would evict the first value, leaving each
   thread with a handle on its own version of the data, and with the second
   version in the cache.  It would be better if both threads ended up with a
   handle on the same (key,value) pair, which implies it must be the first item
   inserted.  This suggests that Insert() should not replace an existing value.

   This can be made safe with current usage inside LeveDB itself, but this is
   not easy to change first because Cache is a public interface, so to change
   the semantics of an existing call might break things, second because Cache
   is an abstract virtual class, so adding a new abstract virtual method may
   break other implementations, and third, the new method "insert without
   replacing" cannot be implemented in terms of the existing methods, so cannot
   be implemented with a non-abstract default.   But fortunately, the effects
   of this issue are minor, so this issue is not fixed by this change.

The changes:

The assumption in the fixes is that it is always better to cache
entries unless removal from the cache would lead to deallocation.

Cache entries now have an "in_cache" boolean indicating whether
the cache has a reference on the entry.  The only ways that this can
become false without the entry being passed to its "deleter" are via
Erase(), via Insert() when an element with a duplicate key is inserted,
or on destruction of the cache.

The cache now keeps two linked lists instead of one.  All items
in the cache are in one list or the other, and never both.  Items
still referenced by clients but erased from the cache are in
neither list.  The lists are:
- in-use:  contains the items currently referenced by clients, in no particular
  order.  (This list is used for invariant checking.  If we removed the check,
  elements that would otherwise be on this list could be left as disconnected
  singleton lists.)
- LRU:  contains the items not currently referenced by clients, in LRU order

A new internal Ref() method increments the reference count.  If
incrementing from 1 to 2 for an item in the cache, it is moved
from the LRU list to the in-use list.

The Unref() call now moves things from the in-use list to the LRU
list if the reference count falls to 1, and the item is in the
cache.  It no longer adjusts the usage sum.  The usage sum now
reflects only what is in the cache, rather than including
still-referenced items that have been evicted.

The LRU_Append() now takes a "list" parameter so that it can be
used to append either to the LRU list or the in-use list.

Lookup() is modified to use the new Ref() call, rather than
adjusting the reference count and LRU chain directly.

Insert() eviction code is also modified to adjust the usage sum and the
in_cache boolean of the evicted elements.  Some LevelDB tests assume that there
will be no caching whatsoever if the cache size is set to zero, so this is
handled as a special case.

A new private method FinishErase() is factored out
with the common code from where items are removed from the cache.

Erase() is modified to adjust the usage sum and the in_cache
boolean of the erased elements, and to use FinishErase().

Prune() is modified to use FinishErase() also, and to make use of the fact that
the lru_ list now contains only items with reference count 1.

- EvictionPolicy is modified to test that an entry with an
outstanding handle is not evicted.  This test fails with the old cache.cc.

- A new test case UseExceedsCacheSize verifies that even when the
cache is overfull of entries with outstanding handles, none are
evicted.  This test fails with the old cache.cc, and is the key
issue that causes file descriptors to run out when the cache
size is set too small.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=123247237
2016-07-06 09:15:53 -07:00
John Abd-El-Malek
a7bff697ba Fix LevelDB build when asserts are enabled in release builds. (#367)
* Fix LevelDB build when asserts are enabled in release builds.

BUG=https://bugs.chromium.org/p/chromium/issues/detail?id=603166

* fix

* Add comment
2016-04-15 10:58:27 -07:00
Nicholas Westlake
ea992b467b Change std::uint64_t to uint64_t (#354)
-This fixes compile errors with default setup on RHEL 6 systems.
2016-04-12 15:38:09 -07:00
mjwiacek
e84b5bdb5a This CL fixes a bug encountered when reading records from leveldb files that have been split, as in a [] input task split.
Detailed description:

Suppose an input split is generated between two leveldb record blocks and the preceding block ends with null padding.

A reader that previously read at least 1 record within the first block (before encountering the padding) upon trying to read the next record, will successfully and correctly read the next logical record from the subsequent block, but will return a last record offset pointing to the padding in the first block.

When this happened in a [], it resulted in duplicate records being handled at what appeared to be different offsets that were separated by only a few bytes.

This behavior is only observed when at least 1 record was read from the first block before encountering the padding. If the initial offset for a reader was within the padding, the correct record offset would be reported, namely the offset within the second block.

The tests failed to catch this scenario/bug, because each read test only read a single record with an initial offset. This CL adds an explicit test case for this scenario, and modifies the test structure to read all remaining records in the test case after an initial offset is specified.  Thus an initial offset that jumps to record #3, with 5 total records in the test file, will result in reading 2 records, and validating the offset of each of them in order to pass successfully.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=115338487
2016-03-31 15:53:34 -07:00
cmumford
3211343909 Deleted redundant null ptr check prior to delete.
Fixes issue #338.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=113439460
2016-03-31 15:53:30 -07:00
Chris Mumford
7306ef856a Merge pull request #348 from randomascii/master
Fix signed/unsigned mismatch on VC++ builds
2016-02-24 14:39:03 -08:00
Bruce Dawson
6b18316d01 Fix signed/unsigned mismatch on VC++ builds 2016-02-19 13:59:19 -08:00
cmumford
adbe3eb073 Putting build artifacts in subdirectory.
1. Object files, libraries, and compiled executables are put
   into subdirectories.
2. The shared library is linked from individual object files.
   This provides for greater parallelism on large desktops
   while at the same time making for easier builds on small
   (i.e. embedded) systems. Fixes issue #279.
3. One program, db_bench, is compiled using the shared library.
4. The source file for "leveldbutil" was renamed from
   leveldb_main.cc to leveldbutil.cc. This provides for simpler
   makefile rules.
5. Because all targets placed the library (libleveldb.a) at the top
   level, the last platform built (desktop/device) always overwrote
   any prior artifact.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=113407013
2016-01-29 16:10:00 -08:00
Chris Mumford
2d0320a458 Merge pull request #329 from ralphtheninja/travis-badge
Add travis build badge to README
2016-01-15 11:17:41 -08:00
Lars-Magnus Skog
dd1c3c3572 add travis build badge 2016-01-15 18:29:01 +01:00
Chris Mumford
43fcf23af0 Merge pull request #328 from cmumford/master
Added a Travis CI build file.
2016-01-14 21:17:21 -08:00
Chris Mumford
9fcae61641 Added a Travis CI build file.
This allows for continuous integration builds by travis-ci.org.
More information at https://docs.travis-ci.com/user/languages/cpp
2016-01-14 17:41:48 -08:00
Chris Mumford
dac40d25f6 Merge pull request #284 from ideawu/master
log compaction output file's level along with number
2016-01-12 11:30:32 -08:00
Chris Mumford
8ec241a3b0 Merge pull request #317 from falvojr/patch-1
Update README.md
2016-01-12 10:52:33 -08:00
Chris Mumford
5d36bedd1c Merge pull request #272 from vapier/master
Fix Android/MIPS build.
2016-01-12 10:47:33 -08:00
cmumford
4753c9b617 Added a contributors section to README.md
In preparation for accepting GitHub pull requests this new README
section outlines the general criteria that the leveldb project owners
will use when accepting external (and internal) project contributions.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=111349899
2016-01-04 13:29:41 -08:00
Chris Mumford
e2446d0848 Merge pull request #275 from paulirish/patch-1
readme: improved documentation link
2015-12-09 15:04:30 -08:00
ssid
706b7f8d43 Resolve race when getting approximate-memory-usage property
The write operations in the table happens without holding the mutex
lock, but concurrent writes are avoided using "writers_" queue.
The Arena::MemoryUsage could access the blocks when write happens.
So, the memory usage is cached in atomic word and can be loaded
from any thread safely.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=107573379
2015-12-09 11:27:50 -08:00
cmumford
3c9ff3c691 Only compiling TrimSpace on linux.
Incorporated change by zmodem at https://github.com/google/leveldb/pull/310
to fix issue #310.

This change will only build TrimSace on linux to avoid unused function
warning/error.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=105323419
2015-12-09 10:35:07 -08:00
cmumford
f8d205cf89 Including atomic_pointer.h in port_posix
A recent CL (104348226) created the port_posix library, but omitted: port/atomic_pointer.h.

And when:

    [] test third_party/leveldb:all

was run this error was reported:

    //third_party/leveldb:port_posix does not depend on a
    module exporting 'third_party/leveldb/port/atomic_pointer.h'
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=105243399
2015-12-09 10:35:07 -08:00
ndmatthews
889de31a5a Let LevelDB use xcrun to determine Xcode.app path instead of using a hardcoded path.
This allows build agents to select from multiple Xcode installations.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=104859097
2015-12-09 10:34:58 -08:00
ssid
528c2bc6ad Add "approximate-memory-usage" property to leveldb::DB::GetProperty
The approximate RAM usage of the database is calculated from the memory
allocated for write buffers and the block cache. This is to give an
estimate of memory usage to leveldb clients.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=104222307
2015-12-09 10:34:58 -08:00
tzik
359b6bcec2 Add leveldb::Cache::Prune
Prune() drops on-memory read cache of the database, so that the client can
relief its memory shortage.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=101335710
2015-12-09 10:34:58 -08:00
pkasting
50e77a8263 Fix size_t/int comparison/conversion issues in leveldb.
The create function took |num_keys| as an int, but callers and implementers wanted it to function as a size_t (e.g. passing std::vector::size() in, passing it to vector constructors as a size arg, indexing containers by it, etc.).  This resulted in implicit conversions between the two types as well as warnings (found with Chromium's external copy of these sources, built with MSVC) about signed vs. unsigned comparisons.

The leveldb sources were already widely using size_t elsewhere, e.g. for key and filter lengths, so using size_t here is not inconsistent with the existing code.  However, it does change the public C API.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=101074871
2015-12-09 10:34:58 -08:00
cmumford
5208e7952d Added leveldb::Status::IsInvalidArgument() method.
All other Status::Code enum values have an Is**() method with the one
exception of InvalidArgument.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=97166441
2015-12-09 10:34:57 -08:00
Mike Wiacek
ce45404bba Suppress error reporting after seeking but before a valid First or Full record is encountered.
Fix a spelling mistake.
2015-12-09 10:34:57 -08:00
Chris Mumford
b9afa1f2e7 include <assert> -> <cassert>
Fixes reported public issue #280.
2015-12-09 10:34:57 -08:00
Venilton FalvoJr
edf2939c0d Update README.md 2015-11-23 16:24:16 -02:00
Chris Mumford
65190ac48b Will not reuse manifest if reuse_logs options is false.
Prior implementation would always try to reuse the manifest, even if reuse_logs
was false (the default). This was missed because the stock
Env::NewAppendableFile implementation returns false forcing the creation of a
new log.
2015-08-11 14:59:48 -07:00
Sanjay Ghemawat
ac1d69da31 LevelDB now attempts to reuse the preceding MANIFEST and log file when re-opened.
(Based on a suggestion by cmumford.)

"open" benchmark on my workstation speeds up significantly since we
can now avoid three fdatasync calls and a compaction per open:

  Before: ~80000 microseconds
  After:    ~130 microseconds

Details:

(1) Added Options::reuse_logs (currently defaults to false) to control
new behavior.  The intention is to change the default to true after some
baking.

(2) Added Env::NewAppendableFile() whose default implementation returns
a not-supported error.

(3) VersionSet::Recovery attempts to reuse the MANIFEST from which
it is recovering.

(4) DBImpl recovery attempts to reuse the last log file and memtable.

(5) db_test.cc now tests a new configuration that sets reuse_logs to true.

(6) fault_injection_test also tests a reuse_logs==true config.

(7) Added a new recovery_test.
2015-08-11 14:56:39 -07:00
ideawu
76bba139c0 fix indent 2015-04-20 12:41:01 +08:00
ideawu
8fcceb2a6f log compaction output file's level along with number 2015-04-20 12:39:14 +08:00
Paul Irish
0e0f07417c documentation. improved link 2015-02-17 09:55:40 -08:00
Paul Irish
c85addcdf3 readme: improved documentation link
This replaces htmlpreview with [rawgit](http://rawgit.com/).
rawgit is faster and doesnt use JS to render the page, making it SEO friendly.
2015-01-10 17:36:40 -08:00
David Turner
ceff6f1215 Fix Android/MIPS build.
port/atomic_pointer.h was missing an implementation for
MemoryBarrier() for this platform.
2014-12-17 14:18:54 -05:00
Sanjay Ghemawat
77948e7eec Add benchmark that measures cost of repeatedly opening the database. 2014-12-11 08:08:57 -08:00
Chris Mumford
34ad72e3e9 Move header guard below copyright banner. 2014-12-11 08:04:40 -08:00