Lucene - Core
LUCENE-2186

First cut at column-stride fields (index values storage)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, CSF branch
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I created an initial basic impl for storing "index values" (ie
      column-stride value storage). This is still a work in progress... but
      the approach looks compelling. I'm posting my current status/patch
      here to get feedback/iterate, etc.

      The code is standalone now, and lives under new package
      oal.index.values (plus some util changes, refactorings) – I have yet
      to integrate into Lucene so eg you can mark that a given Field's value
      should be stored into the index values, sorting will use these values
      instead of field cache, etc.

      It handles 3 types of values:

      • Six variants of byte[] per doc, all combinations of fixed vs
        variable length, and stored either "straight" (good for eg a
        "title" field), "deref" (good when many docs share the same value,
        but you won't do any sorting) or "sorted".
      • Integers (variable bit precision used as necessary, ie this can
        store byte/short/int/long, and all precisions in between)
      • Floats (4 or 8 byte precision)

      String fields are stored as the UTF8 byte[]. This patch adds a
      BytesRef, which does the same thing as flex's TermRef (we should merge
      them).

      This patch also adds basic initial impl of PackedInts (LUCENE-1990);
      we can swap that out if/when we get a better impl.
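      The packed-ints idea can be sketched as follows (a simplified, hypothetical implementation for illustration, not the actual PackedInts code in the patch): values that only need n bits each are stored contiguously in a long[], so e.g. 5-bit ints cost 5 bits per doc instead of 32.

```java
// Hypothetical sketch of bit-packed int storage (n bits per value in a long[]).
// Not the patch's PackedInts impl; simplified for illustration.
public class PackedIntsSketch {
    private final long[] blocks;
    private final int bitsPerValue;

    public PackedIntsSketch(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.blocks = new long[(valueCount * bitsPerValue + 63) / 64];
    }

    public void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which 64-bit block
        int shift = (int) (bitPos & 63);    // bit offset within the block
        blocks[block] |= value << shift;
        if (shift + bitsPerValue > 64) {    // value straddles a block boundary
            blocks[block + 1] |= value >>> (64 - shift);
        }
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & mask;
    }

    public static void main(String[] args) {
        PackedIntsSketch ints = new PackedIntsSketch(100, 5); // 5 bits/value
        for (int i = 0; i < 100; i++) ints.set(i, i % 32);
        for (int i = 0; i < 100; i++) {
            if (ints.get(i) != i % 32) throw new AssertionError("at " + i);
        }
        System.out.println("ok");
    }
}
```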

      This storage is dense (like field cache), so it's appropriate when the
      field occurs in all/most docs. It's just like field cache, except the
      reading API is a get() method invocation, per document.
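      The get()-per-document reading API might look roughly like this (names such as IntSource are illustrative, not necessarily the patch's actual classes): hiding the storage behind a method call lets the impl use packed bits, MMAP, etc. without changing callers.

```java
// Hypothetical sketch of a get()-per-doc reading API for dense values.
public class IntSourceDemo {
    // One value per document, read via method call instead of raw array
    // access, so the impl can hide its storage format.
    interface IntSource {
        long get(int docID);
    }

    static IntSource fromArray(final long[] values) {
        return docID -> values[docID];
    }

    public static void main(String[] args) {
        IntSource source = fromArray(new long[] {7, 42, 13});
        System.out.println(source.get(1)); // prints 42
    }
}
```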

      Next step is to do basic integration with Lucene, and then compare
      sort performance of this vs field cache.

      For the "sort by String value" case, I think RAM usage & GC load of
      this index values API should be much better than field cache, since
      it does not create an object per document (instead shares big long[] and
      byte[] across all docs), and because the values are stored in RAM as
      their UTF8 bytes.
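      The shared-arrays layout for sorted string values can be sketched like so (a hypothetical, simplified layout based on the description above): one big byte[] holds all unique UTF-8 values in sorted order, an offsets array marks the value boundaries, and a per-doc ord array lets sorting compare two docs without creating any per-document objects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "sorted" byte[] storage: shared arrays, no
// per-document objects. Simplified from the description in the issue.
public class SortedBytesSketch {
    private final byte[] bytes;    // all unique UTF-8 values, concatenated
    private final int[] offsets;   // offsets[ord]..offsets[ord+1] = value #ord
    private final int[] ords;      // ords[docID] -> ord of that doc's value

    SortedBytesSketch(String[] sortedUniqueValues, int[] ords) {
        this.ords = ords;
        this.offsets = new int[sortedUniqueValues.length + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < sortedUniqueValues.length; i++) {
            byte[] utf8 = sortedUniqueValues[i].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
            offsets[i + 1] = offsets[i] + utf8.length;
        }
        this.bytes = out.toByteArray();
    }

    // Sorting only needs ord comparison -- no String materialization.
    int compareDocs(int doc1, int doc2) {
        return Integer.compare(ords[doc1], ords[doc2]);
    }

    // Materialize a doc's value only when actually needed.
    String value(int docID) {
        int ord = ords[docID];
        return new String(bytes, offsets[ord], offsets[ord + 1] - offsets[ord],
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SortedBytesSketch s = new SortedBytesSketch(
            new String[] {"apple", "pear"}, new int[] {1, 0, 1});
        System.out.println(s.value(0));          // prints pear
        System.out.println(s.compareDocs(1, 0)); // prints -1
    }
}
```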

      There are abstract Writer/Reader classes. The current reader impls
      are entirely RAM resident (like field cache), but the API is (I think)
      agnostic, ie, one could make an MMAP impl instead.

      I think this is the first baby step towards LUCENE-1231. Ie, it
      cannot yet update values, and the reading API is fully random-access
      by docID (like field cache), not like a posting list, though I
      do think we should add an iterator() api (to return flex's DocsEnum)
      – eg I think this would be a good way to track avg doc/field length
      for BM25/lnu.ltc scoring.

      1. LUCENE-2186.patch
        214 kB
        Simon Willnauer
      2. LUCENE-2186.patch
        234 kB
        Simon Willnauer
      3. LUCENE-2186.patch
        160 kB
        Simon Willnauer
      4. LUCENE-2186.patch
        404 kB
        Michael McCandless
      5. LUCENE-2186.patch
        94 kB
        Michael McCandless
      6. mem.py
        2 kB
        Michael McCandless

        Issue Links

          Activity

          Uwe Schindler added a comment -

          Is this patch for flex, as it contains CodecUtils and so on?

          If it is so we should use "affects version: flex".

          Michael Busch added a comment -

          Great to see progress here, Mike!

          String fields are stored as the UTF8 byte[]. This patch adds a
          BytesRef, which does the same thing as flex's TermRef (we should merge
          them).

          It looks like BytesRef is very similar to Payload? Could you use that instead
          and extend it with the new String constructor and compare methods?

          It handles 3 types of values:

          So it looks like with your approach you want to support certain
          "primitive" types out of the box, such as byte[], float, int, String?
          If someone has custom data types, then they have, similar as with
          payloads today, the byte[] indirection?

          The code I initially wrote for 1231 exposed IndexOutput, so that one
          can call write*() directly, without having to convert to byte[]
          first. I think we will also want to do that for 2125 (store attributes
          in the index). So I'm wondering if this and 2125 should work
          similarly?
          Thinking out loud: Could we have then attributes with
          serialize/deserialize methods for primitive types, such as float?
          Could we efficiently use such an approach all the way up to
          FieldCache? It would be compelling if you could store an attribute as
          CSF, or in the postinglist, retrieve it from the flex APIs, and also
          from the FieldCache. All would be the same API and there would only be
          one place that needs to "know" about the encoding (the attribute).

          Next step is to do basic integration with Lucene, and then compare
          sort performance of this vs field cache.

          Yeah, that's where I got kind of stuck with 1231: We need to figure
          out what the public API, with which a user can add CSF values to the
          index and retrieve them, should look like. The easiest and fastest way
          would be to add a dedicated new API. The cleaner one would be to make the whole
          Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.

          There are abstract Writer/Reader classes. The current reader impls
          are entirely RAM resident (like field cache), but the API is (I think)
          agnostic, ie, one could make an MMAP impl instead.

          I think this is the first baby step towards LUCENE-1231. Ie, it
          cannot yet update values, and the reading API is fully random-access
          by docID (like field cache), not like a posting list, though I
          do think we should add an iterator() api (to return flex's DocsEnum)

          Hmm, so random-access would obviously be the preferred approach for SSDs, but
          with conventional disks I think the performance would be poor? In 1231
          I implemented the var-sized CSF with a skip list, similar to a posting
          list. I think we should add that here too and we can still keep the
          additional index that stores the pointers? We could have two readers:
          one that allows random-access and loads the pointers into RAM (or uses
          MMAP as you mentioned), and a second one that doesn't load anything
          into RAM, uses the skip lists and only allows iterator-based access?

          About updating CSF: I hope we can use parallel indexing for that. In
          other words: It should be possible for users to use parallel indexes
          to update certain fields, and Lucene should use the same approach
          internally to store different "generations" of things like norms and CSFs.

          Michael McCandless added a comment -

          Is this patch for flex, as it contains CodecUtils and so on?

          Actually it's intended for trunk; I was thinking this should land
          before flex (it's a much smaller change, and it's "isolated" from
          flex), and so I wrote the CodecUtil/BytesRef basic infrastructure,
          thinking flex would then cutover to them.

          Hmm, so random-access would obviously be the preferred approach for SSDs, but
          with conventional disks I think the performance would be poor? In 1231
          I implemented the var-sized CSF with a skip list, similar to a posting
          list. I think we should add that here too and we can still keep the
          additional index that stores the pointers? We could have two readers:
          one that allows random-access and loads the pointers into RAM (or uses
          MMAP as you mentioned), and a second one that doesn't load anything
          into RAM, uses the skip lists and only allows iterator-based access?

          The intention here is for this ("index values") to replace field
          cache, but not aim (initially at least) to do much more. Ie, it's
          "meant" to be RAM resident (either via explicit slurping into RAM or
          via MMAP). So the SSD or spinning magnets should not be hit on
          retrieval.

          If we add an iterator API, I think it should be simpler than the
          postings API (ie, no seeking, dense (every doc is visited,
          sequentially) iteration).
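          Such a dense iterator could be as simple as this sketch (hypothetical API, deliberately simpler than the postings DocsEnum: no seeking, every doc visited in order):

```java
// Hypothetical sketch of a dense, sequential values iterator: no skipping,
// one value per doc, docIDs visited in order 0..maxDoc-1.
public class DenseValuesIterator {
    private final long[] values;
    private int doc = -1;

    DenseValuesIterator(long[] values) { this.values = values; }

    /** Advance to the next doc; returns false when exhausted. */
    boolean next() { return ++doc < values.length; }

    int docID() { return doc; }
    long value() { return values[doc]; }

    public static void main(String[] args) {
        DenseValuesIterator it = new DenseValuesIterator(new long[] {3, 1, 4});
        long sum = 0;
        while (it.next()) sum += it.value(); // e.g. summing field lengths
        System.out.println(sum); // prints 8
    }
}
```

          An iterator like this is what could feed average doc/field length tracking for BM25-style scoring, since it touches every doc exactly once.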

          It looks like ByteRef is very similar to Payload? Could you use that instead
          and extend it with the new String constructor and compare methods?

          Good point! I agree. Also, we should use BytesRef when reading the
          payload from TermsEnum. Actually I think Payload, BytesRef, TermRef
          (in flex) should all eventually be merged; of the three names, I think
          I like BytesRef the best. With *Enum in flex we can switch to
          BytesRef. For analysis we should switch PayloadAttribute to BytesRef,
          and deprecate the methods using Payload? Hmmm... but PayloadAttribute
          is an interface.

          So it looks like with your approach you want to support certain
          "primitive" types out of the box, such as byte[], float, int, String?

          Actually, all "primitive" types (ie, byte/short/int/long are
          "included" under int, as well as arbitrary bit precision "between"
          those primitive types). Because the API uses a method invocation (eg
          IntSource.get) instead of direct array access, we can "hide" how many
          bits are actually used, under the impl. Same is true for float/double
          (except we can't [easily] do arbitrary bit precision here... just 4 or
          8 bytes).

          If someone has custom data types, then they have, similar as with
          payloads today, the byte[] indirection?

          Right, byte[] is for String, but also for arbitrary (opaque to Lucene)
          extensibility. The six concrete impls (separate package-private
          classes) should give good efficiency to fit the different use
          cases.

          The code I initially wrote for 1231 exposed IndexOutput, so that one
          can call write*() directly, without having to convert to byte[]
          first. I think we will also want to do that for 2125 (store attributes
          in the index). So I'm wondering if this and 2125 should work
          similarly?

          This is compelling (letting Attrs read/write directly), but, I have
          some questions:

          • How would the random-access API work? (Attrs are designed for
            iteration). Eg, just providing IndexInput/Output to the Attr
            isn't quite enough – the encoding is sometimes context dependent
            (like frq writes the delta between docIDs, the symbol table needed
            when reading/writing deref/sorted). How would I build a random
            access API on top of that? captureState-per-doc is too costly.
            What API would be used to write the shared state, ie, to tell the
            Attr "we now are writing the segment, so you need to dump the
            symbol table".
          • How would the packed ints work? EG say my ints only need 5 bits.
            (Attrs are sort of designed for one-value-at-once).
          • How would the "symbol table" based encodings (deref, sorted) work?
            I guess the attr would need to have some state associated with
            it, and when I first create the attr I need to pass it segment
            name, Directory, etc, so it opens the right files?
          • I'm thinking we should still directly support native types, ie,
            Attrs are there for extensibility beyond native types?
          • Exposing single attr across a multi reader sounds tricky –
            LUCENE-2154 (and, we need this for flex, which is worrying me!).
            But it sounds like you and Uwe are making some progress on that
            (using some under-the-hood Java reflection magic)... and this
            doesn't directly affect this issue, assuming we don't expose this
            API at the MultiReader level.

          Thinking out loud: Could we have then attributes with
          serialize/deserialize methods for primitive types, such as float?
          Could we efficiently use such an approach all the way up to
          FieldCache? It would be compelling if you could store an attribute as
          CSF, or in the postinglist, retrieve it from the flex APIs, and also
          from the FieldCache. All would be the same API and there would only be
          one place that needs to "know" about the encoding (the attribute).

          This is the grand unification of everything – I like it! But I
          don't want that future utopia to stall our progress today... ie I'd
          rather do something simple yet concrete, now, and then work step by
          step towards that future ("progress not perfection").

          That said, if we can get some bite sized step in, today, towards that
          future, that'd be good.

          Eg, the current patch only supports "dense" storage, ie it's assumed
          every document will have a value, because it's aiming to replace field
          cache. If we wanted to add sparse storage... I think that'd
          require/strongly encourage access via a postings-like iteration API,
          which I don't see how to take a baby step towards.

          I do think it would be compelling for an Attr to "only" have to expose
          read/write methods, and then the Attr can be stored in CSF or
          postings, but I don't see how to make an efficient random-access API
          on top of that. I think it's in LUCENE-2125 where we should explore
          this.

          Norms and deleted docs should be able to eventually switch to CSF.

          In fact, norms should just be a FloatSource, with default impl being
          the 1-byte float encoding we use today. This then gives apps full
          flexibility to plugin their own FloatSource.

          For deleted docs we should probably create a BoolSource.

          About updating CSF: I hope we can use parallel indexing for that. In
          other words: It should be possible for users to use parallel indexes
          to update certain fields, and Lucene should use the same approach
          internally to store different "generations" of things like norms and
          CSFs.

          That sounds great, though, I think we need a more efficient way to
          store the changes. Ie, norms rewrites all norms on any change, which
          is costly. It'd be better to have some sort of delta format, where
          you sparsely encode docID + new value, and then when loading we merge
          those on the fly (and, segment merging periodically also merges &
          commits them).
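          The sparse delta idea might be sketched as follows (hypothetical: the base values stay untouched, updates are recorded as docID → value pairs and merged on the fly at read time):

```java
import java.util.TreeMap;

// Hypothetical sketch of sparse delta updates over a dense base: only
// changed docs are recorded; reads merge deltas with the base on the fly.
public class DeltaOverlaySketch {
    private final long[] baseValues;
    private final TreeMap<Integer, Long> deltas = new TreeMap<>();

    DeltaOverlaySketch(long[] baseValues) { this.baseValues = baseValues; }

    // Record an update without rewriting the base values.
    void update(int docID, long newValue) { deltas.put(docID, newValue); }

    // Merge on the fly: a recorded delta wins over the base value.
    long get(int docID) {
        Long updated = deltas.get(docID);
        return updated != null ? updated : baseValues[docID];
    }

    public static void main(String[] args) {
        DeltaOverlaySketch v = new DeltaOverlaySketch(new long[] {10, 20, 30});
        v.update(1, 99);
        System.out.println(v.get(1)); // prints 99
        System.out.println(v.get(2)); // prints 30
    }
}
```

          A segment merge would then be the natural point to fold the accumulated deltas back into a dense base and commit them.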

          Yeah, that's where I got kind of stuck with 1231: We need to figure
          out what the public API, with which a user can add CSF values to the
          index and retrieve them, should look like. The easiest and fastest way
          would be to add a dedicated new API. The cleaner one would be to make the whole
          Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.

          Right, but LUCENE-1597 is another good but far-away-from-landing
          goal. I think a dedicated API is fine for the atomic types. Field
          cache today is a dedicated API...

          I guess to sum up my thoughts now (but I'm still mulling...):

          • I think the random-access-field-cache-like-API should be separate
            from the designed-for-iteration-from-a-file postings API.
          • Attrs for extensibility could be compelling, but I don't see how to
            build an [efficient] random access API on top of Attrs. It would
            be very elegant only having to add a read/write method to your
            Attr, but, that's not really enough for a full codec.
          • I don't think we should hold up adding direct support for atomic
            types until/if we can figure out how to add Attrs. Ie I think we
            should do this in two steps. The current patch is [roughly] step
            1, and I think should be a compelling replacement for field cache.
            Memory usage and GC cost of string sorting should be much lower
            than field cache.

          I'm also still mulling on these issues w/ the current patch:

          • How could we use index values to efficiently maintain stats needed
            for flexible scoring (LUCENE-2187).
          • Current patch doesn't handle merging yet.
          • Could norms/deleted docs "conceivably" cutover to index values
            API?
          • What "dedicated API" for indexing & sorting.
          • Run basic perf tests to see cost of using method instead of direct
            array.
          Michael McCandless added a comment -

          Attaching my current state – things are still very very rough, and there are contrib/remote test failures.

          This patch has an initial integration with Lucene, enabling a Field to set how its values should be indexed into CSF... values are merged during indexing, and I created FieldComparators to use them for sorting.

          There are still some outright hacks in there, under nocommits (eg how SegmentInfo.files() computes the CSF files)...

          I'm now thinking we should wrap up & land flex, before going much further on this feature...

          Michael McCandless added a comment -

          I did a RAM cost estimation of FieldCache vs the approach in this
          patch (attached mem.py). I took a 5M doc Wikipedia index I have, and
          computed how much RAM is used by FieldCache for STORE ONLY
          (.getStrings) and for SORTING (.getStringIndex), vs the patch, on
          the title field. I assume all titles are unique:

          STORE ONLY
            32 bit current: 449.0 MB
            32 bit   patch: 136.0 MB [69.7% smaller]
            64 bit current: 487.1 MB
            64 bit   patch: 136.0 MB [72.1% smaller]
          
          SORTING
            32 bit current: 468.0 MB
            32 bit   patch: 149.7 MB [68.0% smaller]
            64 bit current: 506.2 MB
            64 bit   patch: 149.7 MB [70.4% smaller]
          

          This is a sizable RAM savings! Also, FieldCache creates 2X the
          objects (because, I think, String creates a separate char[] to hold
          the characters), whereas with the patch 2 or 3 [shared] arrays are
          created, so there's obviously much less GC load too.

          Simon Willnauer added a comment -

          Attaching current status. I ported Mike's patch to current trunk and added some tests here and there.
          Current status is:

          • all tests pass
          • supports CompoundFile
          • several no-commits still present

          next steps might be cleaning up most of the no-commits and maybe a first sketch of an Iterator API.

          Mike do you mind if I take this?

          Michael McCandless added a comment -

          Great – thanks for pushing this forward Simon!

          Mike do you mind if I take this?

          Please do!

          Simon Willnauer added a comment -

          Attaching my current state to start iterating over the code, as this patch contains many new things.

          • All tests are passing
          • handles merging in a generic way
          • handles deletes
          • implements a dedicated iterator API which operates on the files directly
          • unifies the Values API like Reader, Writer, Source and ValuesEnum
          • enables accessing Values via DirectoryReader (no optimized index needed)
          • adds a proposal to utilize Attributes on a per-Field basis

          I'd like to throw this out and get initial feedback, as quite a few things need to be discussed before we can proceed with this one. There are still a whole lot of nocommits in the code, and some of the lower-level changes need review and feedback by people with more experience down there (Mike?).

          I'd really appreciate any comments, especially on the API, as this is most important to me right now.

          I haven't had time to implement / propose something to use mmap for Source but this would be next on the list.
          If you have questions please join!!

          Again, this is nothing which is really close to being committable, but we are getting closer!
          I would also like to move this to a branch, as this seems to grow and a feature branch would be more convenient.

          Yonik Seeley added a comment -

          I'd really appreciate any comments, especially on the API, as this is most important to me right now.

          Could you show some examples of the most efficient way to use this API?
          i.e. an example that shows both how to index a document with a CSF, and then how to iterate over all values of a CSF (or get the value for a specific set of documents).

          Simon Willnauer added a comment - - edited

          Hey Yonik,

          Could you show some examples of the most efficient way to use this API?

          Sure! While it's already late over here, I am happy to provide you with those two examples. This is how you can index a CSF with this Attribute approach:

           
              RAMDirectory dir = new RAMDirectory();
              IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_40, new SimpleAnalyzer(Version.LUCENE_40)));
              Document doc = new Document();
              // the AttributeField carries the per-document value via its AttributeSource
              Fieldable fieldable = new AttributeField("myIntField");
              ValuesAttribute valuesAttribute = fieldable.attributes().addAttribute(ValuesAttribute.class);
              valuesAttribute.setType(Values.PACKED_INTS); // store as packed ints
              valuesAttribute.ints().set(100);             // this document's value
              doc.add(fieldable);
              writer.addDocument(doc);
              writer.close();
          

          This is how you get the values back, via Source or the ValuesEnum:

              IndexReader reader = IndexReader.open(dir);
              // this might be integrated into Fields eventually
              Reader indexValues = reader.getIndexValues("myIntField"); // can get cached version too via reader.getIndexValuesCache();
              Source load = indexValues.load();
              long value = load.ints(0);
              System.out.println(value);
          
              // or get it from the enum
              ValuesEnum intEnum = indexValues.getEnum();
              ValuesAttribute attr = intEnum.getAttribute(ValuesAttribute.class);
              while(intEnum.nextDoc() != ValuesEnum.NO_MORE_DOCS) {
                System.out.println(attr.ints().get());
              }
          

          I guess this should make it at least easier to get started. I actually expect people to say ValuesEnum looks very much like DocsEnum, which is certainly correct. Yet, I didn't integrate ValuesEnum into Fields etc. for simplicity, as changes to Fields touch a lot of code. Having ValuesEnum be a "stand-alone" API makes iterating and development easier, and a cut over to DocsEnum would be easy API-wise as it already implements DocIdSetIterator.

          With that in mind, the use of ValuesAttribute makes sense too - with a stand-alone API this would be obsolete and could be a direct part of ValuesEnum.

          What I don't like about DocsEnum is the BulkReadResult class and its relatives being first-class citizens in DocsEnum. With CSF, not every DocsEnum iterates over <id,freq>* - but maybe we can move that to an attribute and make a more general abstract class. The Attributes on those enums don't introduce real overhead but solve lots of extensibility problems though.

          Michael McCandless added a comment -

          Great work Simon!

          Now that FieldCache has cut over to shared byte[] blocks, the memory
          gains for CSF when storing byte[] data are mostly gone.

          But, there are still important benefits with CSF:

          • The full image is stored on-disk (= much faster than uninversion
            (& sometimes sorting) that FieldCache does, on startup)
          • You can specify all 6 combinations of variable/fixed length X
            straight/deref/sorted. FieldCache is either var-length X deref
            (FieldCache.getTerms) or var-length X sorted
            (FieldCache.getTermsIndex).
          • It should be more extensible, ie, you can make your own attrs to
            store whatever you want. EG we should be able to use this to
            store the flex scoring stats (LUCENE-2392).

          The end-user API is rather cumbersome now (ie, that the user must
          interact directly w/ attrs). It seems like we should have a sugar
          layer on top, eg an IntField(Type) and I can do IntField.set/get.

          Also... maybe we should use Attrs the way NumericField does. Ie, for
          CSF we'd have a TokenStream (single valued, for now anyway), and then
          attrs could be added to it. If we can get attr serialization
          (LUCENE-2125) online, then we can refactor all the read/write code in
          this issue as the default attr serializers? And, then, indexer would
          have no special code for CSF in particular. It just asks attrs to
          serialize themselves...

          Shouldn't FloatsRef be FloatRef (same for IntsRef)? It's ref'ing a
          single value right?

          Simon Willnauer added a comment -

          It should be more extensible, ie, you can make your own attrs to
          store whatever you want. EG we should be able to use this to
          store the flex scoring stats (LUCENE-2392).

          This is actually the first real use-case, together with the norms, which is kind of part of LUCENE-2392 anyway.

          The end-user API is rather cumbersome now (ie, that the user must
          interact directly w/ attrs). It seems like we should have a sugar
          layer on top, eg an IntField(Type) and I can do IntField.set/get.

          Yeah, I guess lots of users would have a rather hard time with that. I remember Grant saying that he has been explaining Document and Fields in his trainings forever, and with users in mind this should be done with the least amount of changes. Nevertheless, this is something which should be fixed outside of this particular issue; LUCENE-2310 would be one I could think of. Guess I need to talk to chrismale on Friday about that.

          Also... maybe we should use Attrs the way NumericField does. Ie, for
          CSF we'd have a TokenStream (single valued, for now anyway), and then
          attrs could be added to it. If we can get attr serialization
          (LUCENE-2125) online, then we can refactor all the read/write code in
          this issue as the default attr serializers? And, then, indexer would
          have no special code for CSF in particular. It just asks attrs to
          serialize themselves...

          LUCENE-2125 is something which would be nice to have together with CSF. Yet I don't think they depend on each other, but they should use the same or very closely related APIs eventually. LUCENE-2125 has different problems to tackle first, I guess - but I am closely following that! I will update that patch to make use of the NumericField - let's call it - work-around to make this patch "less hairy". Still hairy, but I like the idea of using a TokenStream to attach the ValuesAttribute.

          Shouldn't FloatsRef be FloatRef (same for IntsRef)? It's ref'ing a
          single value right?

          Yes and no. I was too lazy to add all the capabilities BytesRef has, but I could imagine that this can benefit from being able to hold more values - maybe an entire page when paging is used. If it only holds a single value we don't need offset and length either. I will leave it like that for now; we can still change it later if it turns out that we don't need this flexibility.

          I guess I will move the ValuesEnum down to Fields and FieldsEnum soon. I don't think we should confuse this with a DocsEnum, since DocsEnum is so closely related to Terms and has explicit getters for freq() though. DocIdSetIterator seems to be fine for that purpose - while the AttributeSource could be pulled up.

          Simon Willnauer added a comment -

          We should get the BytesHash in first to make this patch a little simpler. I don't wanna refactor TermsHashPerField in here.

          Simon Willnauer added a comment -

          I would want to move this to a branch for further development. If nobody objects I'm gonna move forward within the next days.

          simon

          Robert Muir added a comment -

          I would want to move this to a branch for further development. If nobody objects I'm gonna move forward within the next days.

          +1

          In my opinion, if it's helpful to use a branch for a feature like this, we should not hesitate!
          With a lot of development on trunk, big patches make it difficult for anyone to get involved.

          Additionally, it's extremely difficult to iterate, because it's hard to see the differences between iterations.
          But, say, with the flexible indexing branch, this history is preserved since it was done in a branch.
          So I am able to just click 'view merged revisions' in my IDE and see all that history.

          Simon Willnauer added a comment -

          Updates patch to trunk - all tests pass. Since LUCENE-1990 and LUCENE-2662 have been committed, some of the improvements / refactorings in this patch became obsolete. I updated everything to current trunk and made the Field / Values API somewhat easier to use. I would go and create a branch based on this patch in the next days.

          Any preferences on the name? I would suggest "values_branch" or "perdoc-payloads", but it really doesn't matter though. Suggestions?

          Michael McCandless added a comment -

          Updates patch to trunk - all tests pass

          Patch is looking good....

          There are still many nocommits but most look like they could become TODOs?

          Do you have a high level sense of what's missing before we can commit to trunk?

          Any preferences on the name? I would suggest "values_branch" or "perdoc-payloads", but it really doesn't matter though. Suggestions?

          How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues.

          Simon Willnauer added a comment - - edited

          There are still many nocommits but most look like they could become TODOs?
          Do you have a high level sense of what's missing before we can commit to trunk?

          Yes and no; here is my roadmap for this issue. We have 47 nocommits pending, where about half of them can become TODOs while the other half are rather easy tasks and should be fixed before we go to trunk.
          These are the major steps I would like to finish before we land this on trunk:

          • Implement bulk copies for merging where possible. Currently there are still some value types not bulk copied, and the ones which are only do so if there are no deletes. Yet, the deletes thing I would make a TODO for now - we can still make that more efficient once we are on trunk. If I recall correctly, figuring out the next deleted document is still a linear problem (I need to iterate through deletes), right? I guess it would be easier if I could figure out the next one to see if bulk copies are reasonable - maybe an invalid concern though.
          • Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR.
          • Rethink the Source API - I get the feeling that we don't really need the Source class but could rather use a random-access enum like Terms, where we can seek back and forth depending on how we loaded the field's values. We could actually unify the iterator API and random access, which would kill two birds with one stone. Internally we simply use the *Refs to set the actual values, default values would not be needed anymore (would save some code / branches internally), and the user would not have to access two different APIs. Additionally we could expose bulk reads just like BulkReadResult in DocsEnum to obtain all values in an array - maybe if we wanna populate FieldCache from it. I think we won't have perf. losses due to that, since there is not really an overhead compared to the get() call on Source. Reminds me I need to think about how we use that with sorted values.... If we keep Source we should at least make it implement ValuesEnum so we can use it as an enumeration if the values are in mem already.
          • To do merging for byte values correctly we need to figure out how to specify the comparator for each field. I don't have a concrete idea for this, but I think this should somehow go into IndexWriterConfig in a per-field map. Thoughts?

          Remaining nocommits could be converted into TODOs - I think we can do so with the following:

          • Evaluate if we can decide whether a Bytes Payload should be stored as straight or as fixed, which would make it easier for the user to use the byte variants.
          • Evaluate if we need String variants or if they can simply be solved with the byte ones.
          • We should have some kind of compatibility notion so that slightly different segments can be merged, like fixed vs. var bytes, float32 vs. float64.
          • For a cleaner transition we should create a sep. SortField that always uses index values.
          • Explore a better way to obtain all dat / idx files in SegmentInfo to do segment merges for index values.
          • BytesValueProcessor should be thread private, but I will leave that as a TODO since this code might change anyway once realtime lands on trunk. Not super urgent for now.
          • Fix some exception handling issues, especially in MultiSource & MultiValuesEnum.
          • Fix the signed / unsigned limitations in the Ints implementation.
          • Explore ways to prevent the Ints impl from doing two method calls - maybe we can expose PackedInts directly somehow.
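
          For context, the packed-ints layout several of these items touch stores each value in a fixed number of bits rather than a full byte/short/int/long. A toy sketch of the idea (not the patch's PackedInts code; function names are made up):

```python
def pack(values, bits):
    """Pack non-negative ints into one big integer, `bits` bits per value."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits), "value does not fit in the bit budget"
        word |= v << (i * bits)
    return word

def unpack(word, bits, index):
    """Read the value at `index` back out of the packed word."""
    mask = (1 << bits) - 1
    return (word >> (index * bits)) & mask

w = pack([3, 7, 1, 5], bits=3)
print([unpack(w, 3, i) for i in range(4)])  # [3, 7, 1, 5]
```

          The real implementation packs into long[] blocks, but the shift-and-mask access pattern is the same - which is also why avoiding an extra method call per get() (the last item above) matters.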

          How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues.

          OK

          Show
          Simon Willnauer added a comment - - edited There are still many nocommits but most look like they could become TODOs? Do you have a high level sense of what's missing before we can commit to trunk? Yes and No , here is my roadmap for this issue. We have 47 nocommit pending where about the half of it can be TODOs while the other half of it are rather easy task and should be fixed before we go to trunk. These are the major steps I would like to finish until we land this on trunk Implement bulk copies for merging where possible. Currently there are still some value types not bulk copied and the ones which are only do if there are no deletes. Yet, the deletes thing I would make a TODO for now - we can still make that more efficient once we are on trunk. If I recall correctly figuring out the next deleted document is still a linear problem (I need to iterate through deletes), right? I guess that would be easier if I could figure out the next one so see if bulks are reasonable - maybe an invalid concern though. Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR. Rethink the Source API - I get the feeling that we don't really need the Source class but could rather use a Random Access Enum like Terms where we can see back and forth depending on how we loaded the fields values. We could actually unify the iterator API and random access which would catch two birds with one stone. internally we simply use the *Refs to set the actual values, default values would no be needed anymore (would save some code / branches internally) and the user would not have to access two different APIs. Additionally we could expose bulk reads just like BulkReadResult in DocsEnum to obtain all values in an array. Maybe if we wanna populate FieldsCache from it. I think we won't have perf. 
losts due to that since there is not really an overhead compared to the get() call on Source. - Reminds me I need to think about how we use that with sorted values.... If we keep Source we should at least make it implement ValuesEnum so we can use it as enumeration if they are in mem already. To do merging for byte values correctly we need to figure out how to specify the comparator for each field. I don't have a concrete idea for this but I think this should somehow go into IndexWriterConfig in a per field map. Thougths? Remaining nocommits could be converted into TODOs - I think we can do so with the following Evaluate if we can decide if a Bytes Payload should be stored as straight or as fixed which would make it easier for the user to use the byte variants. Evaluate if we need String variants or if they can simple be solved with the byte ones We should have some king of compatibility notion so that slightly different segments can be merged like fixed vs. var bytes float32 vs. float64. For a cleaner transition we should create a sep. SortField that always uses index values. explore a better way to obtain all dat / idx fiels in SegmentInfo to do segment merges for index values. BytesValueProcessor should be thread private but I will leave that as a todo since this code might change anyway once realtime lands on trunk though. Not super urgent for now. Fix some exception handling issues especially in MultiSource & MultiValuesEnum Fix the singed / unsigned limitations in Ints implementation Explore ways to preven Ints impl do two method calls maybe we can expose PackedInts directly somehow How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues . OK
          Hide
          Simon Willnauer added a comment - - edited

          created branch at docvalues and committed the last patch at r1021636. I think the next steps are adding a fix version "docvalues" to JIRA and creating new issues according to the "roadmap" above. Once we are through with the mandatory stuff and documentation, we can land this on trunk. Thoughts?

          I'm not sure if we should continue on this issue or close it and create a new "top level" one and spawn issues from there.

          simon

          Michael McCandless added a comment -

          Implement bulk copies for merging where possible.

          I don't think this should block landing on trunk? (Even in the non-deletes case).

          But, yes, searching for the next del doc is a linear op, but with a very small constant in front (at least OpenBitSet.nextSetBit, though del docs are currently a BitVector), yet it is very much worth it once we get the bulk copying in, since presumably big chunks of docs can be bulk copied.

          Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR.

          Ahh that does sound like the right place.

          Maybe if we wanna populate FieldsCache from it.

          We should be careful here – it's best if things consume the docvalues instead of double-copying into the FC.

          Michael McCandless added a comment -

          I think this is very close!!

          • Using attr source as the way to specify the docValue is nice in
            that we get full extensibility, but, it's also heavyweight
            compared to a dedicated API (ie, .setIntValue, etc.). So I think
            this means apps that use doc values really must re-use their Field
            instances (if they are using doc values) else indexing performance
            will likely take a good hit.
          • ValuesField is nice sugar on top (of the attr). Can you add some
            jdocs to ValuesField? EG it's not stored/indexed. It's OK to have
            the same field name as an existing field (hmm... is it)? Etc.
          • Did you want to make FieldsConsumer.addValuesField abstract?
          • The javadoc above DocValues.Source is wrong – Source is not just
            for ints.
          • You can change jdocs like "This feature is experimental and the
            API is free to change in non-backwards-compatible ways." to
            @lucene.experimental (eg in Values.java)
          • So, you're not allowed to change the DocValues type for a field
            once you've set it the first time... and, also, segments cannot be
            merged if the same field has different value types. I'm thinking
            it's really important now to carry over the same FieldInfos from
            the last segment when opening the writer (LUCENE-1737)... because
            hitting that IllegalStateExc during merge is a trap. This would
            let us change that IllegalStateExc into an assert (in
            SegmentMerger) and also turn the assert back on in FieldsConsumer.
          • Should we rename MissingValues to MissingValue? Ie it holds the
            single value for your type that represents "missing"?
          • We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4
            heh.
          • It'd be nice to have a more approachable test case that shows the
            "simple" way to index doc values, ie using ValuesField instead of
            getting the attr, getting the intsRef, setting it, etc. I think
            such an "example" should be very compact right?
          Simon Willnauer added a comment -

          I think this is very close!!

          Heh, I strongly agree!

          Using attr source as the way to specify the docValue is nice in
          that we get full extensibility, but, it's also heavyweight
          compared to a dedicated API (ie, .setIntValue, etc.). So I think
          this means apps that use doc values really must re-use their Field
          instances (if they are using doc values) else indexing performance
          will likely take a good hit.

          Well, it is a nice way of extending Field, but I am not sure we
          should keep it since it is heavyweight. We could get rid of
          ValuesAttribute for landing on trunk and work on making Field
          extensible - which is desperately needed anyway. I was also thinking
          that the ValuesEnum doesn't need the ValuesAttribute per se; it would
          be more intuitive to have getters on ValuesEnum too. I just really
          hate those instanceof checks on fields.

          ValuesField is nice sugar on top (of the attr) Can you add some
          jdocs to ValuesField? EG it's not stored/indexed. It's OK to have
          same field name as existing field (hmm... is it)? Etc.

          Yeah - until here I haven't done much javadoc but that is on top of
          the list. I will start adding JavaDoc to main classes of the API and
          ValuesField is 100% a main class of it.
          BTW, it is ok to have the same name as an existing field.

          Did you want to make FieldsConsumer.addValuesField abstract?

          That is a leftover - I will remove it.

          The javadoc above DocValues.Source is wrong – Source is not just for ints.

          True - see above; that class had a different purpose back in the days
          when it was a patch.

          You can change jdocs like "This feature is experimental and the
          API is free to change in non-backwards-compatible ways." to
          @lucene.experimental (eg in Values.java)

          yeah - it's good to have stuff like that left!!!!! yay!

          So, you're not allowed to change the DocValues type for a field
          once you've set it the first time... and, also, segments cannot be
          merged if the same field has different value types. I'm thinking
          it's really important now to carry over the same FieldInfos from
          the last segment when opening the writer (LUCENE-1737)... because
          hitting that IllegalStateExc during merge is a trap. This would
          let us change that IllegalStateExc into an assert (in
          SegmentMerger) and also turn the assert back on in FieldsConsumer.

          I think that should not block us from moving forward and landing on trunk ey?

          Should we rename MissingValues to MissingValue? Ie it holds the single
          value for your type that represents "missing"?

          True, I was also thinking to rename some of the classes like
          Values -> DocValueType
          PackedIntsImpl -> Ints

          We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4

          hehe yeah - lemme change the one I added and let's fix the rest on
          trunk. I will open an issue once I have a reliable inet connection
          again.

          It'd be nice to have a more approachable test case that shows the
          "simple" way to index doc values, ie using ValuesField instead of
          getting the attr, getting the intsRef, setting it, etc. I think
          such an "example" should be very compact right?

          done on my checkout!

          so on my list there are the following topics until landing:

          • missing testcase for addIndexes and a simple one to show how to use the api
          • split up existing tests into smaller ones - they test too much and
            they are hard to understand
          • Add JavaDoc to main classes like DocValues, Source, ValuesEnum, ValuesField
          • Document the different types
          • Consistent class naming - see above
          • enable ram usage tracking for all DocValuesProducer to support
            flush by RAM usage

          That seems very, very close to me. Let's see how much I get done on my
          flight to Boston.

          Michael McCandless added a comment -

          BTW. it is ok to have the same name as a existing field.

          It is, usually... but we should add a test to assert this is still the
          case for other field + ValuesField?

          I'm thinking it's really important now to carry over the same FieldInfos from the last segment when opening the writer (LUCENE-1737)... because hitting that IllegalStateExc during merge is a trap.

          I think that should not block us from moving forward and landing on trunk ey?

          It makes me mighty nervous though... I'll try to get that issue done
          soon.

          Well it is a nice way of extending field but I am not sure if we
          should keep it since it is heavy weight.

          The ValuesAttr for ValuesField is actually really heavyweight. Not
          only must it fire up an AttrSource, but ValuesAttrImpl itself has
          a field for each type. Worse, for the type you do actually use, it's
          then another object, eg FloatsRef, which in turn holds
          array/offset/len, a new length-1 array, etc.

          Maybe we shouldn't use attrs here? And instead somehow let
          ValuesField store a single value as its own private member?

          FloatsRef, LongsRef are missing the ASL header. Maybe it's time to
          run RAT
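The attr-free alternative could look roughly like this - a toy stand-in, not the branch's ValuesField or AbstractField code, with the single value held as a plain member so one instance can be cheaply reused across documents (which is the indexing-performance point raised earlier):

```java
// Toy sketch of a doc-values field holding one value as a plain member
// instead of going through an AttributeSource. Names and types are
// illustrative, not the branch's actual API.
public class SimpleValuesField {
    public enum Type { INT, FLOAT, BYTES }

    private final String name;
    private final Type type;
    private long longValue;
    private double doubleValue;
    private byte[] bytesValue;

    public SimpleValuesField(String name, Type type) {
        this.name = name;
        this.type = type;
    }

    // Reusing the same instance across documents avoids per-doc
    // allocation of attr sources, refs, and length-1 arrays.
    public void setInt(long value)      { checkType(Type.INT);   this.longValue = value; }
    public void setFloat(double value)  { checkType(Type.FLOAT); this.doubleValue = value; }
    public void setBytes(byte[] value)  { checkType(Type.BYTES); this.bytesValue = value; }

    public long getInt()      { checkType(Type.INT);   return longValue; }
    public double getFloat()  { checkType(Type.FLOAT); return doubleValue; }
    public byte[] getBytes()  { checkType(Type.BYTES); return bytesValue; }
    public String name()      { return name; }

    private void checkType(Type expected) {
        if (type != expected) {
            throw new IllegalArgumentException(
                "field " + name + " has type " + type + ", not " + expected);
        }
    }
}
```

The fixed type per field also mirrors the rule discussed above that a field's DocValues type may not change once set.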

          Simon Willnauer added a comment -

          It is, usually... but we should add a test to assert this is still the
          case for other field + ValuesField?

          I already implemented a simple testcase that shows this works, as an example (I will commit that soon), but I think we need to add another test that ensures this works with all types of DocValues. Also, I am working on making the tests more "atomic" so each tests only a single thing, so I will add that too.

          It makes me mighty nervous though... I'll try to get that issue done soon.

          Well until then I just go on and get the remaining stuff done here.

          Maybe we shouldn't use attrs here? And instead somehow let ValuesField store a single value as it's own private member?

          I more and more think we can nuke ValuesAttribute completely, since its other purpose on ValuesEnum is somewhat obsolete too. It is actually a leftover from earlier days when I was experimenting with using DocEnum to serve CSF too. There it would have made sense, but now we can provide a dedicated API. It still bugs me that Field is so hard to extend. We really need to fix that soon!

          I think what we should do is extend AbstractField and simply use a long/double/BytesRef, and force folks to add another field instance if they want the value indexed and stored too.

          Maybe it's time to run RAT

          +1

          Michael McCandless added a comment -

          Are there any test cases that cover the new FieldComparators that use the doc values?

          I think a good test case would be to sort w/ FieldCache and then again w/ doc values and verify they match...
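That cross-check can be sketched abstractly: sort doc ids by two independent per-doc value lookups and assert the resulting orders match. A self-contained illustration (not an actual Lucene test; IntToLongFunction stands in for both the FieldCache and doc values lookups):

```java
import java.util.Comparator;
import java.util.function.IntToLongFunction;
import java.util.stream.IntStream;

// Sketch of the proposed cross-check: sort docs with two value sources
// (e.g. FieldCache-backed vs doc-values-backed) and verify the orders
// match. The per-doc lookup is abstracted as IntToLongFunction.
public class SortCrossCheck {
    public static int[] sortDocs(int maxDoc, IntToLongFunction values) {
        return IntStream.range(0, maxDoc)
            .boxed()
            .sorted(Comparator.<Integer>comparingLong(values::applyAsLong)
                              .thenComparingInt(d -> d))  // stable tie-break by doc id
            .mapToInt(Integer::intValue)
            .toArray();
    }
}
```

In the real test the two lookups would be wired to FieldCache and to a doc values Source for the same field, and `java.util.Arrays.equals` would compare the two orders.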

          Simon Willnauer added a comment -

          Are there any test cases that cover the new FieldComparators that use the doc values?

          not yet; I added it to my internal roadmap for landing on trunk. I just committed my latest changes, including a simple testcase showing how to use the API, and used-bytes tracking.

          here is a list of what is missing:

            /*
             * TODO:
             * Roadmap to land on trunk
             *   - Cut over to a direct API on ValuesEnum vs. ValuesAttribute 
             *   - Add documentation for:
             *      - Source and ValuesEnum
             *      - DocValues
             *      - ValuesField
             *      - ValuesAttribute
             *      - Values
             *   - Add @lucene.experimental to all necessary classes
             *   - Try to make ValuesField more lightweight -> AttributeSource
             *   - add test for unoptimized case with deletes
             *   - add a test for addIndexes
             *   - split up existing testcases and give them meaningful names
             *   - use consistent naming throughout DocValues
             *     - Values -> DocValueType
             *     - PackedIntsImpl -> Ints
             *   - run RAT
             *   - add tests for FieldComparator FloatIndexValuesComparator vs. FloatValuesComparator etc.
             */
          

          once I am through with it I will create a new issue and create the final patch so we can iterate over it if needed.

          simon

          Yonik Seeley added a comment -

          Whew... this interface is more expansive than I thought it would be (but I guess it's really many issues rolled into one... like sorting, caching, etc).
          So it seems like DocValuesEnum is the traditional lowest level "read the index", and Source is a cached version of that?

          A higher level question I have is why we're not reusing the FieldCache for caching/sorting?

          Simon Willnauer added a comment -

          Whew... this interface is more expansive than I thought it would be (but I guess it's really many issues rolled into one... like sorting, caching, etc).

          sorry about that

          So it seems like DocValuesEnum is the traditional lowest level "read the index", and Source is a cached version of that?

          Not quite. DocValuesEnum is iterator-based access to the DocValues which does not load everything into memory, while Source is entirely RAM-resident, offering random access to values similar to the field cache. Yet you can also obtain a DocValuesEnum from a Source, since its values are already in memory.
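The distinction can be sketched with two tiny interfaces - illustrative only, not the branch's actual DocValuesEnum/Source signatures: a forward-only enum for streaming from disk, and a RAM-resident source that offers random access and can hand back an enum over its in-memory values:

```java
// Illustrative sketch of the two access styles (names and signatures
// are invented here, not the branch's DocValuesEnum/Source).
public class AccessStyles {
    /** Forward-only, disk-friendly iteration: nextDoc() then value(). */
    public interface LongValuesEnum {
        int nextDoc();   // next docID, or -1 when exhausted
        long value();    // value for the current doc
    }

    /** RAM-resident random access, like the field cache. */
    public interface LongSource {
        long get(int docID);
        int size();

        /** A Source can also serve as an enum since values are in memory. */
        default LongValuesEnum asEnum() {
            return new LongValuesEnum() {
                private int doc = -1;
                public int nextDoc() { return ++doc < size() ? doc : -1; }
                public long value() { return get(doc); }
            };
        }
    }

    /** Simple array-backed Source for demonstration. */
    public static LongSource arraySource(long[] values) {
        return new LongSource() {
            public long get(int docID) { return values[docID]; }
            public int size() { return values.length; }
        };
    }
}
```

The one-way enum mirrors the DocIdSetIterator heritage mentioned later in this thread, while the source gives the O(1) per-doc get() that sorting and function queries want.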

          A higher level question I have is why we're not reusing the FieldCache for caching/sorting?

          You mean as a replacement for Source? For caching, what we did here is to leave it to the user to do the caching, or to cache based on the Source instance - how would that relate to FieldCache in your opinion?

          Jason Rutherglen added a comment -

          Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 4.x? What's left to be done?

          Simon Willnauer added a comment -

          Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 4.x? What's left to be done?

          we are very close - to land on trunk there is about an evening of work left. JDoc is missing here and there, plus some tests for FieldComparators - that's it!

          Jason Rutherglen added a comment -

          we are very close - to land on trunk there is about an evening of work left. JDoc is missing here and there, plus some tests for FieldComparators - that's it!

          Nice! Once it's in, I'll try to get started on the RT field cache/doc values, which can likely be implemented and tested somewhat independently of the RT inverted index.

          Jason Rutherglen added a comment -

          I'm wondering whether there is a limitation on randomly accessing the doc values from the underlying Directory implementation, rather than needing to load all the values into the main heap. This seems doable; if so, let me know and I can provide a patch.

          Simon Willnauer added a comment -

          I'm wondering whether there is a limitation on randomly accessing the doc values from the underlying Directory implementation, rather than needing to load all the values into the main heap. This seems doable; if so, let me know and I can provide a patch.

          the current implementation for accessing docValues not loaded into memory uses DocIdSetIterator as its parent interface, so it currently works only in one direction. Changing this to a random access "seekable" API should not be too hard.
          Look at http://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/lucene/src/java/org/apache/lucene/index/values/DocValuesEnum.java

          simon

          Jason Rutherglen added a comment -

          changing this to a random access "seekable" API should be not too hard

          I think we can offer the option of MMap'ing the field caches, which I think will help alleviate OOMs?

          Lance Norskog added a comment -

          What's the current status on this?

          Simon Willnauer added a comment -

          What's the current status on this?

          I am currently focusing on getting RT and DWPT landed on trunk. Once that is done I can merge DocValues and finish the last remaining limitations. It's pretty close; there are a couple of issues left, like javadoc (not complete but close), and Codec integration is somewhat flaky and needs some new API. Feature-wise it's complete.

          Simon Willnauer added a comment -

          currently landing on LUCENE-3108


            People

            • Assignee:
              Simon Willnauer
            • Reporter:
              Michael McCandless
            • Votes:
              1
            • Watchers:
              8
