[LUCENE-2810] Explore Alternate Stored Field approaches for highly redundant data - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: core/store
Labels: None

Description

In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for documents contain a lot of redundant information and end up wasting a lot of space across a large collection of documents. For instance, simply compressing a typical log file often yields compression rates above 75%. We should explore mechanisms for applying compression across all the documents for a field (or fields) while still maintaining relatively fast lookup (that said, in most logging applications, fast retrieval of a given event is not always critical). For instance, one part of storage could hold the set of unique values for all the fields, and each document's field value would then contain only a reference to that value (possibly as small as a few bits, depending on the number of unique items) instead of a full copy. Extending this, perhaps we can also leverage some of the existing compression capabilities in Java.
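To make the unique-value idea concrete, here is a minimal in-memory sketch in plain Java. It is not Lucene code and does not use any Lucene API: the class `DictionaryStoredField` and its methods are hypothetical names invented for illustration, and a real implementation would have to write the value table and the per-document references to disk, packing the references down to the bit width computed below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: store each distinct field value once, plus one small reference per document. */
final class DictionaryStoredField {
    // Each distinct value is stored once; its position in this list is its ordinal.
    private final List<String> uniqueValues = new ArrayList<>();
    // Reverse lookup used only while adding documents.
    private final Map<String, Integer> ordinalByValue = new HashMap<>();
    // One ordinal per document, in docID order.
    private final List<Integer> ordinalByDoc = new ArrayList<>();

    /** Records the field value for the next document and returns its docID. */
    int addDocument(String value) {
        Integer ord = ordinalByValue.get(value);
        if (ord == null) {
            ord = uniqueValues.size();
            uniqueValues.add(value);
            ordinalByValue.put(value, ord);
        }
        ordinalByDoc.add(ord);
        return ordinalByDoc.size() - 1;
    }

    /** Resolves a document's reference back to the full value. */
    String getValue(int docId) {
        return uniqueValues.get(ordinalByDoc.get(docId));
    }

    int uniqueCount() {
        return uniqueValues.size();
    }

    /** Minimum number of bits needed to address every unique value (at least 1). */
    int bitsPerReference() {
        int n = uniqueValues.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

Lookup stays a constant-time indirection per document. The saving depends entirely on how often whole field values repeat; fields whose values are similar but never identical gain nothing from this scheme.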

It may make sense to implement this as a Directory, but it might also make sense as a Codec, if and when we have support for changing storage Codecs.
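Whichever layer ends up hosting it, the "existing compression capabilities in Java" mentioned in the description could be applied to a block of documents at a time rather than to each document separately. The following is only a sketch under that assumption, using `java.util.zip.Deflater` and `Inflater` from the JDK. The class `BlockCompressionSketch`, its method names, and the newline-joined block layout are hypothetical and are not part of Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: compress many documents' stored values as one block. */
final class BlockCompressionSketch {

    /** Compresses a byte array with DEFLATE (zlib format). */
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Reverses compress(); rejects truncated or corrupt input. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid input");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }

    /** Joins the documents' stored values into one block, newline separated (assumed layout). */
    static byte[] joinAsBlock(List<String> docs) {
        return String.join("\n", docs).getBytes(StandardCharsets.UTF_8);
    }

    /** Total size when every document is compressed on its own, for comparison. */
    static int perDocumentCompressedSize(List<String> docs) {
        int total = 0;
        for (String doc : docs) {
            total += compress(doc.getBytes(StandardCharsets.UTF_8)).length;
        }
        return total;
    }
}
```

The trade-off is the one the description already hints at: compressing a block lets redundancy shared across documents be exploited, but reading a single document then requires decompressing the block that contains it, so block size directly trades space against retrieval latency.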

Issue Links

relates to: LUCENE-4226 Efficient compression of small to medium stored fields (Closed)

People

Assignee: Grant Ingersoll
Reporter: Grant Ingersoll
Votes: 0
Watchers: 6

Dates

Created: 13/Dec/10 01:14
Updated: 28/Aug/22 12:37
Resolved: 08/Oct/12 10:14