Is this patch for flex, as it contains CodecUtils and so on?
Actually it's intended for trunk; I was thinking this should land
before flex (it's a much smaller change, and it's "isolated" from
flex), and so I wrote the CodecUtil/BytesRef basic infrastructure,
thinking flex would then cutover to them.
Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?
The intention here is for this ("index values") to replace field
cache, but not aim (initially at least) to do much more. Ie, it's
"meant" to be a RAM resident (either via explicit slurping-into-RAM or
via MMAP). So the SSD or spinning magnets should not be hit on each access.
If we add an iterator API, I think it should be simpler than the
postings API (ie, no seeking; dense, in that every doc is visited).
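For instance, both views over the same dense per-doc values could look roughly like this (a hypothetical sketch only; the class and method names are illustrative, not the patch's actual API):

```java
// Hypothetical sketch: a random-access view vs. a dense, forward-only
// iterator over per-document values. Not the patch's actual API.
final class DenseValues {
    private final long[] values; // one value per docID, dense

    DenseValues(long[] values) { this.values = values; }

    // Random-access view: what a field-cache replacement needs.
    long get(int docID) { return values[docID]; }

    // Iterator view: simpler than postings -- no seeking/advance,
    // every doc is visited exactly once, in order.
    final class Iterator {
        private int doc = -1;
        boolean next() { return ++doc < values.length; }
        int docID() { return doc; }
        long value() { return values[doc]; }
    }

    Iterator iterator() { return new Iterator(); }
}
```

The iterator carries no skip data and no sparse advance(target), which is what makes it so much simpler than a postings enum.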
It looks like BytesRef is very similar to Payload? Could you use that instead
and extend it with the new String constructor and compare methods?
Good point! I agree. Also, we should use BytesRef when reading the
payload from TermsEnum. Actually I think Payload, BytesRef, TermRef
(in flex) should all eventually be merged; of the three names, I think
I like BytesRef the best. With *Enum in flex we can switch to
BytesRef. For analysis we should switch PayloadAttribute to BytesRef,
and deprecate the methods using Payload? Hmmm... but PayloadAttribute
is an interface.
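For reference, a stripped-down BytesRef along these lines (the UTF-8 String constructor plus a byte-wise compare method, as discussed above; simplified relative to whatever actually lands in flex):

```java
import java.nio.charset.StandardCharsets;

// Simplified sketch of a BytesRef-like slice: bytes + offset + length,
// with the String constructor and compare method discussed above.
final class BytesRefSketch implements Comparable<BytesRefSketch> {
    byte[] bytes;
    int offset;
    int length;

    BytesRefSketch(byte[] bytes, int offset, int length) {
        this.bytes = bytes; this.offset = offset; this.length = length;
    }

    // Convenience constructor: encode a String as UTF-8.
    BytesRefSketch(String text) {
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
        this.offset = 0;
        this.length = this.bytes.length;
    }

    // Unsigned byte-wise comparison (matches UTF-8 code point order).
    @Override
    public int compareTo(BytesRefSketch other) {
        int len = Math.min(length, other.length);
        for (int i = 0; i < len; i++) {
            int a = bytes[offset + i] & 0xFF;
            int b = other.bytes[other.offset + i] & 0xFF;
            if (a != b) return a - b;
        }
        return length - other.length;
    }
}
```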
So it looks like with your approach you want to support certain
"primitive" types out of the box, such as byte, float, int, String?
Actually, all "primitive" types (ie, byte/short/int/long are
"included" under int, as well as arbitrary bit precision "between"
those primitive types). Because the API uses a method invocation (eg
IntSource.get) instead of direct array access, we can "hide" how many
bits are actually used, under the impl. Same is true for float/double
(except we can't [easily] do arbitrary bit precision here... just 4 or 8 bytes).
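To make the "hide the bit width behind the impl" point concrete, here is a hypothetical sketch (illustrative names, not the patch's classes): values that only need 5 bits are packed into a long[], but callers just see get(docID), exactly as if a full int[] were stored.

```java
// Hypothetical sketch of hiding bit width behind a method call: an
// IntSource-style getter over bit-packed storage. Callers never see
// whether 5, 13, or 32 bits are used per value.
final class PackedIntSource {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedIntSource(int[] values, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(values.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= ((long) values[i] & mask) << shift;
            if (shift + bitsPerValue > 64) { // value spans two blocks
                blocks[block + 1] |= ((long) values[i] & mask) >>> (64 - shift);
            }
        }
    }

    long get(int docID) {
        long bitPos = (long) docID * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) { // reassemble the spanning value
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & mask;
    }
}
```

At 5 bits per value this stores roughly 6.4x more values per byte than an int[], at the cost of a couple of shifts per get().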
If someone has custom data types, do they then use the byte
indirection, similar to payloads today?
Right, byte is for String, but also for arbitrary (opaque to Lucene)
extensibility. The six concrete impls (separate package-private
classes) should give good efficiency across the different use cases.
The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work together?
This is compelling (letting Attrs read/write directly), but I have some questions:
- How would the random-access API work? (Attrs are designed for
iteration). Eg, just providing IndexInput/Output to the Attr
isn't quite enough – the encoding is sometimes context dependent
(like frq writes the delta between docIDs, the symbol table needed
when reading/writing deref/sorted). How would I build a random
access API on top of that? captureState-per-doc is too costly.
What API would be used to write the shared state, ie, to tell the
Attr "we are now writing the segment, so you need to dump the
shared state"?
- How would the packed ints work? Eg, say my ints only need 5 bits.
(Attrs are sort of designed for one-value-at-once).
- How would the "symbol table" based encodings (deref, sorted) work?
I guess the attr would need to have some state associated with
it, and when I first create the attr I need to pass it segment
name, Directory, etc, so it opens the right files?
- I'm thinking we should still directly support native types, ie,
Attrs are there for extensibility beyond native types?
- Exposing a single attr across a multi reader sounds tricky – see
LUCENE-2154 (and, we need this for flex, which is worrying me!).
But it sounds like you and Uwe are making some progress on that
(using some under-the-hood Java reflection magic)... and this
doesn't directly affect this issue, assuming we don't expose this
API at the MultiReader level.
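To make two of the sticking points above concrete, here is a hypothetical sketch (illustrative names, not Lucene API) of (a) why a context-dependent encoding like frq's docID deltas resists random access, and (b) what the deref/sorted "symbol table" shared state amounts to:

```java
import java.util.ArrayList;
import java.util.HashMap;

// (a) Context-dependent encoding: each stored value is the gap from the
// previous docID, so recovering docID i means replaying deltas 0..i --
// there is no O(1) random access without extra structure.
final class DeltaDocIDs {
    static int[] encode(int[] docIDs) { // docIDs must be ascending
        int[] deltas = new int[docIDs.length];
        int prev = 0;
        for (int i = 0; i < docIDs.length; i++) {
            deltas[i] = docIDs[i] - prev;
            prev = docIDs[i];
        }
        return deltas;
    }

    static int docIDAt(int[] deltas, int i) {
        int doc = 0;
        for (int j = 0; j <= i; j++) doc += deltas[j]; // sequential replay
        return doc;
    }
}

// (b) Symbol-table ("deref") encoding: each unique value is written once;
// per-doc storage is just an ordinal into the table. The table is exactly
// the shared state that has to be dumped when the segment is written. A
// "sorted" variant keeps the table ordered so ordinals compare like values.
final class DerefEncoder {
    final ArrayList<String> table = new ArrayList<>();
    private final HashMap<String, Integer> ords = new HashMap<>();

    int add(String value) { // write side: one call per doc
        Integer ord = ords.get(value);
        if (ord == null) {
            ord = table.size();
            table.add(value);
            ords.put(value, ord);
        }
        return ord;
    }

    String get(int ord) { return table.get(ord); } // read side
}
```

An Attr that only exposes per-value read/write methods sees neither the replay requirement in (a) nor the shared table in (b), which is exactly the gap in building a random-access codec out of Attrs.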
Thinking out loud: Could we have then attributes with
serialize/deserialize methods for primitive types, such as float?
Could we efficiently use such an approach all the way up to
FieldCache? It would be compelling if you could store an attribute as
CSF, or in the postinglist, retrieve it from the flex APIs, and also
from the FieldCache. All would be the same API and there would only be
one place that needs to "know" about the encoding (the attribute).
This is the grand unification of everything – I like it! But I
don't want that future utopia to stall our progress today; ie, I'd
rather do something simple yet concrete, now, and then work step by
step towards that future ("progress not perfection").
That said, if we can get some bite sized step in, today, towards that
future, that'd be good.
Eg, the current patch only supports "dense" storage, ie it's assumed
every document will have a value, because it's aiming to replace field
cache. If we wanted to add sparse storage... I think that'd
require/strongly encourage access via a postings-like iteration API,
which I don't see how to take a baby step towards.
I do think it would be compelling for an Attr to "only" have to expose
read/write methods, and then the Attr can be stored in CSF or
postings, but I don't see how to make an efficient random-access API
on top of that. I think LUCENE-2125 is where we should explore that.
Norms and deleted docs should be able to eventually switch to CSF.
In fact, norms should just be a FloatSource, with default impl being
the 1-byte float encoding we use today. This then gives apps full
flexibility to plugin their own FloatSource.
For deleted docs we should probably create a BoolSource.
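The shape of that could be something like the following sketch (hypothetical interfaces; the 256-entry decode table here is a simple placeholder mapping, NOT Lucene's actual 1-byte norm encoding):

```java
// Sketch of norms as a pluggable FloatSource: the default impl stores one
// byte per doc and decodes through a 256-entry table. The table here is a
// placeholder (value/255), not Lucene's real byte-float format.
interface FloatSource {
    float get(int docID);
}

final class OneByteFloatSource implements FloatSource {
    private static final float[] DECODE = new float[256];
    static {
        for (int i = 0; i < 256; i++) DECODE[i] = i / 255f;
    }

    private final byte[] norms; // one byte per doc

    OneByteFloatSource(byte[] norms) { this.norms = norms; }

    @Override
    public float get(int docID) {
        return DECODE[norms[docID] & 0xFF]; // table lookup, no decoding math
    }
}

// Deleted docs as a BoolSource: one bit per doc.
final class BoolSource {
    private final long[] bits;

    BoolSource(int maxDoc) { this.bits = new long[(maxDoc + 63) / 64]; }

    void set(int docID) { bits[docID >>> 6] |= 1L << (docID & 63); }

    boolean get(int docID) {
        return (bits[docID >>> 6] & (1L << (docID & 63))) != 0;
    }
}
```

An app wanting a different norm precision would then just supply its own FloatSource impl rather than being locked to the 1-byte format.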
About updating CSF: I hope we can use parallel indexing for that. In
other words: It should be possible for users to use parallel indexes
to update certain fields, and Lucene should use the same approach
internally to store different "generations" of things like norms and
deleted docs.
That sounds great, though, I think we need a more efficient way to
store the changes. Ie, norms rewrites all norms on any change, which
is costly. It'd be better to have some sort of delta format, where
you sparsely encode docID + new value, and then when loading we merge
those on the fly (and, segment merging periodically also merges &
collapses these deltas).
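One possible shape for that (a sketch under assumed names, not a committed design): the base values stay untouched on disk, an update writes only sparse (docID, newValue) pairs, and the reader overlays them at load time:

```java
// Sketch of a sparse delta overlay for updatable per-doc values: an
// update generation writes only the changed (docID, newValue) pairs,
// and the load path merges them over the unchanged base array.
final class DeltaOverlay {
    static long[] merge(long[] base, int[] deltaDocs, long[] deltaValues) {
        long[] merged = base.clone(); // base stays untouched
        for (int i = 0; i < deltaDocs.length; i++) {
            merged[deltaDocs[i]] = deltaValues[i];
        }
        return merged;
    }
}
```

Writing only the changed pairs is what avoids the rewrite-all-norms-on-any-change cost; segment merging would then fold accumulated deltas back into a new base.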
Yeah, that's where I got kind of stuck with 1231: we need to figure
out what the public API should look like, with which a user can add CSF
values to the index and retrieve them. The easiest and fastest way
would be to add a dedicated new API. The cleaner one would be to make the whole
Document/Field/FieldInfos API more flexible.
LUCENE-1597 was a first attempt.
LUCENE-1597 is another good but far-away-from-landing
goal. I think a dedicated API is fine for the atomic types. Field
cache today is a dedicated API...
I guess to sum up my thoughts now (but I'm still mulling...):
- I think the random-access-field-cache-like-API should be separate
from the designed-for-iteration-from-a-file postings API.
- Attrs for extensibility could be compelling, but I don't see how to
build an [efficient] random access API on top of Attrs. It would
be very elegant only having to add a read/write method to your
Attr, but, that's not really enough for a full codec.
- I don't think we should hold up adding direct support for atomic
types until/if we can figure out how to add Attrs. Ie I think we
should do this in two steps. The current patch is [roughly] step
1, and I think should be a compelling replacement for field cache.
Memory usage and GC cost of string sorting should be much lower
than with the field cache.
I'm also still mulling on these issues w/ the current patch:
- How could we use index values to efficiently maintain stats needed
for flexible scoring (LUCENE-2187).
- Current patch doesn't handle merging yet.
- Could norms/deleted docs "conceivably" cutover to index values?
- What "dedicated API" to expose for indexing & sorting.
- Run basic perf tests to see the cost of using a method call instead
of direct array access.