  Lucene - Core / LUCENE-5854

BytesRefHash can be simplified (and made faster)

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Currently BytesRefHash stores the length of each byte sequence as either one or two bytes inside the byte pool. This is redundant: it slows down the add operation and increases the required memory.
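      Roughly, the current scheme looks like the following sketch (illustrative only; the exact BytesRefHash encoding may differ in details): a 1- or 2-byte length prefix is written into the pool before each sequence's bytes and has to be decoded again on every read or comparison.

        // Illustrative sketch of a 1- or 2-byte length prefix (not the actual
        // BytesRefHash code): short lengths take one byte, longer ones two.
        int writeLengthPrefix(byte[] pool, int offset, int length) {
          if (length < 128) {
            pool[offset] = (byte) length;                        // 1-byte prefix
            return offset + 1;
          } else {
            pool[offset]     = (byte) (0x80 | (length & 0x7f));  // 2-byte prefix
            pool[offset + 1] = (byte) (length >>> 7);
            return offset + 2;
          }
        }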

      Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique byte sequence on input. So what's really needed are two data structures:

      • a compact list of byte sequences (indexed 0...n),
      • a hash table lookup of hash(byteSequence) -> ID.

      The first item is already implemented efficiently in BytesRefArray. Note that the length of each byte sequence is stored implicitly, as the difference between the next sequence's start offset and the current sequence's start offset (clever!). This doesn't allow for removals, but it saves on encoding and representation.
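      A minimal sketch of that idea (hypothetical names, not the actual BytesRefArray code):

        // With only each sequence's start offset recorded, length(i) is simply
        // the distance to the next sequence's start offset, or to the end of
        // the used buffer for the last sequence.
        int length(int[] startOffsets, int bytesUsed, int i) {
          int end = (i + 1 < startOffsets.length) ? startOffsets[i + 1] : bytesUsed;
          return end - startOffsets[i];
        }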

      The second item is trivial: a linear-probing hash table of IDs, with -1 indicating empty slots.
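      A sketch of that table (the names are assumptions mirroring the put() excerpt below, not verbatim code):

        // The ID table is a plain int[]: each slot holds a sequence ID, or
        // EMPTY_SLOT (-1) when unused. The table length is a power of two so
        // slots can be computed with a bit mask.
        static final int EMPTY_SLOT = -1;

        int[] ordByHash;   // hash slot -> sequence ID, or EMPTY_SLOT
        int mask;          // table length - 1
        int resizeAt;      // grow once this many sequences are stored

        void allocateTable(int powerOfTwoSize) {
          ordByHash = new int[powerOfTwoSize];
          java.util.Arrays.fill(ordByHash, EMPTY_SLOT);
          mask = powerOfTwoSize - 1;
        }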

      I have a clean-room implementation of the above (based on HPPC data structures, though) and it does show some performance improvement (on simple randomized-data benchmarks).

      BenchmarkByteBlockOrdSet.testBytesRefHash: [measured 5 out of 7 rounds, threads: 1 (sequential)]
       round: 2.62 [+- 0.03], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 5, GC.time: 0.20, time.total: 18.43, time.warmup: 5.35, time.bench: 13.08
      BenchmarkByteBlockOrdSet.testByteBlockOrdSet: [measured 5 out of 7 rounds, threads: 1 (sequential)]
       round: 1.91 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 8, GC.time: 0.20, time.total: 13.37, time.warmup: 3.83, time.bench: 9.54
      

      But I think the reason it'd be worth looking at this in Lucene is making BytesRefHash simpler to understand. For example, the put operation in my code looks like this:

        public int put(ByteBlockRef ref) {
          assert ref != null;

          // Probe the hash table linearly for an existing entry.
          int slot = hash(ref) & mask;
          for (int id = ordByHash[slot];
                   id != EMPTY_SLOT;
                   id = ordByHash[slot]) {
            if (blocks.compareTo(id, ref) == 0) {
              return id; // already present: return its existing ID
            }
            slot = (slot + 1) & mask;
          }

          // Not found: the new ID is the sequence's position in the list.
          int newId = ordByHash[slot] = blocks.size();
          blocks.add(ref);

          if (size() == resizeAt) {
            grow();
          }

          // A negative return value signals a newly added sequence.
          return -newId - 1;
        }
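      For clarity, the return convention in this sketch is: a non-negative value is the ID of an already-present sequence, while a negative value encodes a freshly assigned ID. A hypothetical caller would decode it like this:

        // Hypothetical usage (set is an instance of the class sketched above):
        int result = set.put(ref);
        int id = (result >= 0) ? result : -result - 1;
        boolean wasNew = result < 0;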
      

      The get operation simply delegates to the list of byte sequences:

        public ByteBlockRef get(int index, ByteBlockRef ref) {
          return blocks.get(index, ref);
        }
      

      What makes this refactoring slightly more complicated is that there is a fair bit of hairy stuff in BytesRefHash that whoever does the refactoring would have to look at. The craziest parts, to me, are:

      • private void rehash(final int newSize, boolean hashOnData) -> when hashOnData is true this class becomes insane :) (see the sketch after this list)
      • Shrinking is not really used anyway; neither are element removals (and they're not really needed, I think).
      • Compacting and in-memory sorting should be moved outside of this class and placed somewhere else. You could still grab the internal storage buffers, but do the compacting/in-memory sorting logic somewhere else and thus not leave the BytesRefHash object in an invalid/crazy state.
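      In the simplified design, the rehash special cases largely disappear: every unique sequence already lives in the byte-sequence list, so growing the table is just re-probing each ID into a larger int[]. A minimal sketch (hypothetical; scratch and LOAD_FACTOR are assumed fields/constants, the other names follow the put() excerpt above):

        private void grow() {
          mask = (mask << 1) | 1;                    // double the table size
          int[] newOrdByHash = new int[mask + 1];
          java.util.Arrays.fill(newOrdByHash, EMPTY_SLOT);
          for (int id = 0; id < blocks.size(); id++) {
            // Recompute the slot from the stored sequence; no hashOnData flag,
            // no stored-hash bookkeeping.
            int slot = hash(blocks.get(id, scratch)) & mask;
            while (newOrdByHash[slot] != EMPTY_SLOT) {
              slot = (slot + 1) & mask;              // linear probing
            }
            newOrdByHash[slot] = id;
          }
          ordByHash = newOrdByHash;
          resizeAt = (int) ((mask + 1) * LOAD_FACTOR);
        }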
      


    People

      Assignee: Unassigned
      Reporter: Dawid Weiss (dweiss)
      Votes: 0
      Watchers: 6
