While making the above changes I did some measurements which I feel are also useful to share.

PackedInts trades CPU for better CPU-cache and memory usage. PackedInts gives you an acceptableOverheadRatio parameter to control the trade-off, but it is not exposed in the AbstractAppendingLongBuffer family, which is built on those structures. This is especially important when you do not rely on AbstractAppendingLongBuffer.iterator() to extract your data. Here are some experiments I ran on my laptop using BenchmarkAppendLongBufferRead, which is included in the patch. The program allows you to play with different read strategies and data sizes and measures reading times.
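To illustrate the trade-off that acceptableOverheadRatio controls, here is a minimal, self-contained sketch (not the actual Lucene classes) of the two storage layouts the benchmark below ends up comparing: a Packed64-style layout where 3-bit values share long words, and a Direct8-style layout that spends a whole byte per value in exchange for a single aligned load per get. With ratio 0.0 Lucene picks the compact layout; with ratio 7.0 it is allowed to round 3 bits up to a byte-aligned format.

```java
// Sketch of the two layouts compared below (hypothetical class, not Lucene code).
public class PackedTradeoff {
    static final int BITS = 3;               // bits per value
    static final long MASK = (1L << BITS) - 1;

    // Packed64-style: values share long words, minimal memory,
    // but every get pays shifts/masks and may span two longs.
    static long packedGet(long[] blocks, int index) {
        long bitPos = (long) index * BITS;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + BITS > 64) {             // value spans two longs
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & MASK;
    }

    static void packedSet(long[] blocks, int index, long value) {
        long bitPos = (long) index * BITS;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= (value & MASK) << shift;
        if (shift + BITS > 64) {
            blocks[block + 1] |= (value & MASK) >>> (64 - shift);
        }
    }

    // Direct8-style: one byte per value (~2.7x more memory for 3-bit data)
    // but a get is a single array load.
    static long directGet(byte[] bytes, int index) {
        return bytes[index];
    }
}
```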

This is the result of using AppendingDeltaPackedLongBuffer (previously called AppendingLongBuffer) to sequentially read an array of 500,000 elements using its get method. The data was uniformly distributed numbers between 0 and 7. The program measures 10,000 such reads; the total time is the time it took to perform all of them. The output also shows the number of bits used to store the elements and the storage class used.

------- Storage: DELTA_PACKED, Read: SEQUENTIAL, Read size: 1

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 22.18s avg: 2.22ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 19.14s avg: 1.91ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)

As you can see, when retrieving elements one by one, the byte-based implementation is slightly faster. For comparison, here is the new AppendingPackedLongBuffer with the same setup:

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 1

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 16.69s avg: 1.67ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 13.47s avg: 1.35ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

Besides being faster, it shows the same behavior.
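The gap between the two buffer types comes down to what each get has to do. As a rough sketch (assumed structure, not the Lucene code), the delta-packed variant stores a per-page minimum and packs only the offsets from it, so every get pays one extra add compared to the plain packed variant:

```java
// Hypothetical sketch of a delta-packed page; the real implementation
// bit-packs the deltas rather than keeping a long[].
public class DeltaPackedPage {
    final long min;       // page minimum, stored once
    final long[] deltas;  // offsets from the minimum

    DeltaPackedPage(long[] values) {
        long m = Long.MAX_VALUE;
        for (long v : values) m = Math.min(m, v);
        min = m;
        deltas = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - min;
        }
    }

    long get(int index) {
        return min + deltas[index];  // one extra add vs. plain packed storage
    }
}
```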

For random reads, the classes display similar behavior:

------- Storage: DELTA_PACKED, Read: RANDOM, Read size: 1

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 23.13s avg: 2.31ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 19.38s avg: 1.94ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)

------- Storage: PACKED, Read: RANDOM, Read size: 1

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 19.23s avg: 1.92ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 15.95s avg: 1.60ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

Next I looked at the effect of exposing the bulk reads offered by the PackedInts structures in the AppendingLongBuffer family. Here are some results from the new packed implementation, this time reading 4 and 16 consecutive elements in a single read.

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 4

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 11.16s avg: 1.12ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

BULK GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 24.22s avg: 2.42ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 8.35s avg: 0.84ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

BULK GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 8.44s avg: 0.84ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 16

SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 9.63s avg: 0.96ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

BULK GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 12.52s avg: 1.25ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)

SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 7.46s avg: 0.75ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

BULK GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 3.22s avg: 0.32ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

As you can see, the bulk read API for the Direct8 class starts to pay off when reading 4 elements or more. For Packed64 it is still slower when reading 16 elements at a time; on my MacBook Air it only started paying off at around 32 elements. This may be different on systems with larger CPU caches.
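This pattern makes sense given what a bulk get can amortize in each layout. A sketch (hypothetical methods, continuing the illustration above rather than the Lucene API): for byte-aligned Direct8-style storage a bulk get degenerates to a simple widening copy loop that the JIT handles very well, while a Packed64-style bulk get still pays a shift/mask per value, so the saved per-call overhead only dominates for longer runs.

```java
public class BulkGetSketch {
    // Direct8-style bulk get: byte-aligned values reduce to a plain copy
    // loop, which is why bulk reads pay off after only a few elements.
    static void directBulkGet(byte[] bytes, int index, long[] out, int off, int len) {
        for (int i = 0; i < len; i++) {
            out[off + i] = bytes[index + i];
        }
    }

    // Packed64-style bulk get: still one shift/mask (and a possible second
    // block read) per value, so the win over repeated single gets is smaller.
    static void packedBulkGet(long[] blocks, int bitsPerValue, int index,
                              long[] out, int off, int len) {
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < len; i++) {
            long bitPos = (long) (index + i) * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = blocks[block] >>> shift;
            if (shift + bitsPerValue > 64) {  // value spans two longs
                value |= blocks[block + 1] << (64 - shift);
            }
            out[off + i] = value & mask;
        }
    }
}
```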

Those tests (and many more I played with) are of course very synthetic. I also ran the new classes under our implementation of faceted search, and the results were similar.
