[LUCENE-7976] Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.5, 8.0
Component/s: None
Labels:
None

Lucene Fields:

New

Description

We're seeing situations "in the wild" where there are very large indexes (on disk) handled quite easily in a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents as per a dev list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).

Especially in the current era of very large aggregate indexes (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue: once you form, say, a 100G segment (by optimizing/forceMerging), it is not eligible for merging until 97.5G of the docs in it are deleted (given the current default 5G max segment size).
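The 97.5G figure follows from simple arithmetic if, as that number implies, a segment only becomes a merge candidate once its live data drops below half the max segment size. A minimal sketch of that calculation in plain Java follows; the class and method names are hypothetical, and the half-of-max threshold is an assumption inferred from the numbers above rather than taken from the TieredMergePolicy source:

```java
// Hypothetical helper, not part of Lucene: illustrates the arithmetic only.
class MergeEligibility {

    /**
     * Returns how many GB of a segment must be deleted before it becomes
     * eligible for merging, ASSUMING eligibility starts once the live data
     * falls below half of the max segment size.
     */
    static double gbToDeleteBeforeEligible(double segmentGb, double maxSegmentGb) {
        double eligibleLiveGb = maxSegmentGb / 2.0; // assumed threshold
        return Math.max(0.0, segmentGb - eligibleLiveGb);
    }

    public static void main(String[] args) {
        // 100G force-merged segment, default 5G max segment size
        System.out.println(gbToDeleteBeforeEligible(100.0, 5.0)); // prints 97.5
    }
}
```

Under that assumption, a segment that is already at or below 2.5G needs no deletions at all, which is why the problem only shows up for segments far larger than the configured maximum.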

The proposal here is to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name; suggestions welcome), which would default to 100 (i.e. the same behavior we have now).

So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:

> Any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE. There are two cases:
>> The segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
>> The segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment, removing all deleted docs, no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.
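The two cases above can be sketched as a small decision function. This is only an illustration of the proposal as described here, written in plain Java with no Lucene dependency; the class, enum, and parameter names are hypothetical, and real merge selection would also have to weigh the other segments in the index:

```java
// Hypothetical sketch of the proposed rule; not the TieredMergePolicy implementation.
class BigSegmentRule {

    enum Action { LEAVE, MERGE_WITH_SMALLER, REWRITE_SINGLETON }

    /** Size of the live (non-deleted) data, using sizes pro-rated by deleted percentage. */
    static double liveGb(double totalGb, double pctDeleted) {
        return totalGb * (100.0 - pctDeleted) / 100.0;
    }

    /**
     * Decides what the proposed rule would do with one segment.
     * maxAllowedPctDeleted = 100 reproduces the current behavior, because
     * no segment can have more than 100% deleted documents.
     */
    static Action decide(double totalGb, double pctDeleted, double maxSegmentGb,
                         double maxAllowedPctDeleted, boolean smallerSegmentsExist) {
        if (pctDeleted <= maxAllowedPctDeleted) {
            return Action.LEAVE; // the new rule does not apply to this segment
        }
        if (liveGb(totalGb, pctDeleted) < maxSegmentGb) {
            // Case 1: top the segment up toward the max size if possible
            return smallerSegmentsExist ? Action.MERGE_WITH_SMALLER : Action.REWRITE_SINGLETON;
        }
        // Case 2: oversized segment, rewrite it alone to purge deleted docs
        return Action.REWRITE_SINGLETON;
    }

    public static void main(String[] args) {
        // 100G segment, 25% deleted, 5G max, 20% threshold
        System.out.println(decide(100.0, 25.0, 5.0, 20.0, true));  // REWRITE_SINGLETON
        // Same segment with the default threshold of 100
        System.out.println(decide(100.0, 25.0, 5.0, 100.0, true)); // LEAVE
    }
}
```

Note that the sketch uses a strict "greater than" comparison, matching the "> 20%" wording, so a segment sitting at exactly the threshold is left alone.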

Of course this would lead to potentially much more I/O, which is why the default would be the same behavior we see now. As it stands now, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes at this point "in the wild" with tens of shards replicated 3 or more times. And that doesn't even include having these over HDFS.

Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.

Attachments

- LUCENE-7976.patch, 15/Jun/18 19:32, 57 kB, Erick Erickson
- LUCENE-7976.patch, 13/Jun/18 17:02, 54 kB, Erick Erickson
- LUCENE-7976.patch, 08/Jun/18 19:51, 53 kB, Erick Erickson
- LUCENE-7976.patch, 30/May/18 19:21, 53 kB, Erick Erickson
- LUCENE-7976.patch, 30/May/18 19:12, 53 kB, Erick Erickson
- LUCENE-7976.patch, 10/May/18 04:59, 51 kB, Erick Erickson
- LUCENE-7976.patch, 01/May/18 05:09, 36 kB, Erick Erickson
- LUCENE-7976.patch, 24/Apr/18 04:29, 29 kB, Erick Erickson
- LUCENE-7976.patch, 23/Apr/18 03:41, 29 kB, Erick Erickson
- LUCENE-7976.patch, 19/Apr/18 02:59, 26 kB, Erick Erickson
- LUCENE-7976.patch, 16/Apr/18 16:49, 44 kB, Erick Erickson
- LUCENE-7976.patch, 05/Apr/18 21:39, 29 kB, Erick Erickson
- LUCENE-7976.patch, 23/Oct/17 20:43, 1 kB, Michael McCandless
- SOLR-7976.patch, 08/Jun/18 21:36, 53 kB, Erick Erickson

Issue Links

causes

LUCENE-8370 Reproducing TestLucene{54,70}DocValuesFormat.testSortedSetVariableLengthBigVsStoredFields() failures

Closed

SOLR-12513 Reproducing TestCodecSupport.testMixedCompressionMode failure

Closed

contains

SOLR-8839 Angular admin/segments display: display of deleted docs not proportional

Closed

is depended upon by

SOLR-7733 remove "optimize" from the UI.

Closed

is related to

LUCENE-8263 Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

Closed

relates to

LUCENE-8004 IndexUpgraderTool should rewrite segments rather than forceMerge

Resolved

SOLR-12259 Robustly upgrade indexes

Resolved

SOLR-7733 remove "optimize" from the UI.

Closed

Activity

People

Assignee: Erick Erickson

Reporter: Erick Erickson

Votes: 7

Watchers: 25

Dates

Created: 25/Sep/17 20:01

Updated: 28/Aug/22 15:19

Resolved: 17/Jun/18 01:08