Details

    • Lucene Fields:
      New

      Description

      Its time to move another feature from branch to trunk. I want to start this process now while still a couple of issues remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go.
      The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk.

      Here is a quick feature overview of what has been implemented:

      • DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size each in sorted, straight and deref variations)
      • Integration into Flex-API, Codec provides a PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read)
      • By-Default enabled in all codecs except of PreFlex
      • Follows other flex-API patterns like non-segment reader throw UOE forcing MultiPerDocValues if on DirReader etc.
      • Integration into IndexWriter, FieldInfos etc.
      • Random-testing enabled via RandomIW - injecting random DocValues into documents
      • Basic checks in CheckIndex (which runs after each test)
      • FieldComparator for int and float variants (Sorting, currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually)
      • Extended TestSort for DocValues
      • RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential access) -> Source.java / DocValuesEnum.java
      • Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into RAM only once and freed once IR is closed) -> SourceCache.java

      PS: Currently the RAM resident API is named Source (Source.java) which seems too generic. I think we should rename it into RamDocValues or something like that, suggestion welcome!

      Any comments, questions (rants ) are very much appreciated.

      1. LUCENE-3108.patch
        21 kB
        Simon Willnauer
      2. LUCENE-3108.patch
        765 kB
        Simon Willnauer
      3. LUCENE-3108.patch
        414 kB
        Simon Willnauer
      4. LUCENE-3108_CHANGES.patch
        2 kB
        Simon Willnauer
      5. LUCENE-3108.patch
        452 kB
        Simon Willnauer
      6. LUCENE-3108.patch
        451 kB
        Simon Willnauer

        Activity

        Hide
        Aliaksandr Zhuhrou added a comment -

        Thank you. The thing that make me ask this question is that in the org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl.FixedBytesWriterBase#add we have logic that handles cases when => (lastDocID+1 < docID) so I assumed that docId may have gap greater than 1.

        "in general you should rather ask these kind of questions on the main dev mailing list."
        Sure.

        Show
        Aliaksandr Zhuhrou added a comment - Thank you. The thing that make me ask this question is that in the org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl.FixedBytesWriterBase#add we have logic that handles cases when => (lastDocID+1 < docID) so I assumed that docId may have gap greater than 1. "in general you should rather ask these kind of questions on the main dev mailing list." Sure.
        Hide
        Simon Willnauer added a comment -

        Hi, Simon. Can doc values be optional? I am looking into org.apache.lucene.codecs.DocValuesConsumer#merge and see that the logic assumes that for every docId we have a existing value. Or we use the default value instead?

        hey, DocValues are dense and assume a value for each document. Yet, if you don't enable DocValues on a fields its not stored so you only store it for certain fields. If you have just a small set of repeated values DocValues can store them efficiently and dedupliate if you are concerned about that.

        in general you should rather ask these kind of questions on the main dev mailing list.

        simon

        Show
        Simon Willnauer added a comment - Hi, Simon. Can doc values be optional? I am looking into org.apache.lucene.codecs.DocValuesConsumer#merge and see that the logic assumes that for every docId we have a existing value. Or we use the default value instead? hey, DocValues are dense and assume a value for each document. Yet, if you don't enable DocValues on a fields its not stored so you only store it for certain fields. If you have just a small set of repeated values DocValues can store them efficiently and dedupliate if you are concerned about that. in general you should rather ask these kind of questions on the main dev mailing list. simon
        Hide
        Aliaksandr Zhuhrou added a comment -

        Hi, Simon. Can doc values be optional? I am looking into org.apache.lucene.codecs.DocValuesConsumer#merge and see that the logic assumes that for every docId we have a existing value. Or we use the default value instead?

        Show
        Aliaksandr Zhuhrou added a comment - Hi, Simon. Can doc values be optional? I am looking into org.apache.lucene.codecs.DocValuesConsumer#merge and see that the logic assumes that for every docId we have a existing value. Or we use the default value instead?
        Hide
        Simon Willnauer added a comment -

        Reintegrated, Tested, Committed to trunk in revision 1134311

        thanks guys, its in eventually!

        Show
        Simon Willnauer added a comment - Reintegrated, Tested, Committed to trunk in revision 1134311 thanks guys, its in eventually!
        Hide
        Ryan McKinley added a comment -

        +1 This looks great.

        To avoid more svn work, I think committing soon is better then later.

        Show
        Ryan McKinley added a comment - +1 This looks great. To avoid more svn work, I think committing soon is better then later.
        Hide
        Simon Willnauer added a comment -

        Patch that reflects the last changes to sync with trunk after I ran svn merge -reintegrate
        The reintegrated branch looks good, no unchanged additions etc.

        I think we are ready to land this on trunk... I will wait a day or two if somebody has objections.

        here is my +1 to commit

        Show
        Simon Willnauer added a comment - Patch that reflects the last changes to sync with trunk after I ran svn merge -reintegrate The reintegrated branch looks good, no unchanged additions etc. I think we are ready to land this on trunk... I will wait a day or two if somebody has objections. here is my +1 to commit
        Hide
        Simon Willnauer added a comment -

        here is the latest diff for docvalues I will now reintegrate the branch and post diffs later.

        Show
        Simon Willnauer added a comment - here is the latest diff for docvalues I will now reintegrate the branch and post diffs later.
        Hide
        Simon Willnauer added a comment -

        There seems to be a merge missing in file TestIndexSplitter, the changes in there are unrelated, so this reverts a commit on trunk for improving tests.

        fixed revision 1133794

        thanks uwe!

        Show
        Simon Willnauer added a comment - There seems to be a merge missing in file TestIndexSplitter, the changes in there are unrelated, so this reverts a commit on trunk for improving tests. fixed revision 1133794 thanks uwe!
        Hide
        Uwe Schindler added a comment -

        One small issue:

        There seems to be a merge missing in file TestIndexSplitter, the changes in there are unrelated, so this reverts a commit on trunk for improving tests.

        The problem with the README.txt is already fixed.

        ...still digging

        Show
        Uwe Schindler added a comment - One small issue: There seems to be a merge missing in file TestIndexSplitter, the changes in there are unrelated, so this reverts a commit on trunk for improving tests. The problem with the README.txt is already fixed. ...still digging
        Hide
        Simon Willnauer added a comment -

        here is a changes entry for docvalues - comments welcome

        Show
        Simon Willnauer added a comment - here is a changes entry for docvalues - comments welcome
        Hide
        Michael McCandless added a comment -

        I did another review here – I think it's ready to land on trunk! Nice work Simon!

        Show
        Michael McCandless added a comment - I did another review here – I think it's ready to land on trunk! Nice work Simon!
        Hide
        Simon Willnauer added a comment -

        here is a smaller version that doesn't contain all the solr svn attribute changes etc.

        Show
        Simon Willnauer added a comment - here is a smaller version that doesn't contain all the solr svn attribute changes etc.
        Hide
        Simon Willnauer added a comment -

        here is a first patch for review. Any comments very much appreciated!

        I know its big but what can I say

        Show
        Simon Willnauer added a comment - here is a first patch for review. Any comments very much appreciated! I know its big but what can I say
        Hide
        Simon Willnauer added a comment -

        Hey folks,
        we are ready for the final review rounds here, I resolved the naming conflict by renaming DocValues to IndexDocValues, fixed all the outstanding documentation issues and added a fixed ints impl that automatically switches over to fixed int/long if packed ints can not handle the range of the values in a field.

        I am preparing a review patch now.

        Show
        Simon Willnauer added a comment - Hey folks, we are ready for the final review rounds here, I resolved the naming conflict by renaming DocValues to IndexDocValues, fixed all the outstanding documentation issues and added a fixed ints impl that automatically switches over to fixed int/long if packed ints can not handle the range of the values in a field. I am preparing a review patch now.
        Hide
        Simon Willnauer added a comment -

        It is tricky... but, eg, when someone does SortField("title", SortField.STRING), which cache (DV or FC) should we populate?

        I think we should have a specialized sort field eventually. FCSortField / DVSortField?

        Both ValueSource and DocValues have long been used by function queries.

        Suggestions welcome - nothing is fixed yet so we should find non-conflicting names. Maybe we can call it o.a.l.index.columns.Columns and o.a.l.index.columns.ColumnsEnum / ColumnsArray (instead of source)

        OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can handle all long values? But in that case we'd require the app to pick a sentinel to mean "unset"?

        yes, I will open an issue.

        Show
        Simon Willnauer added a comment - It is tricky... but, eg, when someone does SortField("title", SortField.STRING), which cache (DV or FC) should we populate? I think we should have a specialized sort field eventually. FCSortField / DVSortField? Both ValueSource and DocValues have long been used by function queries. Suggestions welcome - nothing is fixed yet so we should find non-conflicting names. Maybe we can call it o.a.l.index.columns.Columns and o.a.l.index.columns.ColumnsEnum / ColumnsArray (instead of source) OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can handle all long values? But in that case we'd require the app to pick a sentinel to mean "unset"? yes, I will open an issue.
        Hide
        Yonik Seeley added a comment -

        ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear?

        Both ValueSource and DocValues have long been used by function queries.

        Show
        Yonik Seeley added a comment - ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear? Both ValueSource and DocValues have long been used by function queries.
        Hide
        Michael McCandless added a comment -

        How come codecID changed from String to int on the branch?

        due to DocValues I need to compare the ID to certain fields to see for
        what field I stored and need to open docValues. I always had to parse
        the given string which is kind of odd. I think its more natural to
        have the same datatype on FieldInfo, SegmentCodecs and eventually in
        the Codec#files() method. Making a string out of it is way simpler /
        less risky than parsing IMO.

        OK that sounds great.

        Can SortField somehow detect whether the needed field was stored in FC vs DV

        This is tricky though. You can have a DV field that is indexed too so its hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all.

        It is tricky... but, eg, when someone does SortField("title",
        SortField.STRING), which cache (DV or FC) should we populate?

        Should we rename oal.index.values.Type -> .ValueType?

        agreed. I also think we should rename Source but I don't have a good name yet. Any idea?

        ValueSource? (conflicts w/ FQs though) Though, maybe we can just
        refer to it as DocValues.Source, then it's clear?

        Since we dynamically reserve a value to mean "unset", does that mean there are some datasets we cannot index?

        Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalize the range to be 0 based since PackedInts doesn't allow negative values. so the range we can store is (2^63) -1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this which is needed I think but to get around this we need to have a different impl I think or do I miss something?

        OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can handle all long values? But in that case we'd require the app to pick a sentinel to mean "unset"?

        Show
        Michael McCandless added a comment - How come codecID changed from String to int on the branch? due to DocValues I need to compare the ID to certain fields to see for what field I stored and need to open docValues. I always had to parse the given string which is kind of odd. I think its more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO. OK that sounds great. Can SortField somehow detect whether the needed field was stored in FC vs DV This is tricky though. You can have a DV field that is indexed too so its hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all. It is tricky... but, eg, when someone does SortField("title", SortField.STRING), which cache (DV or FC) should we populate? Should we rename oal.index.values.Type -> .ValueType? agreed. I also think we should rename Source but I don't have a good name yet. Any idea? ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear? Since we dynamically reserve a value to mean "unset", does that mean there are some datasets we cannot index? Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalize the range to be 0 based since PackedInts doesn't allow negative values. so the range we can store is (2^63) -1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this which is needed I think but to get around this we need to have a different impl I think or do I miss something? OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can handle all long values? But in that case we'd require the app to pick a sentinel to mean "unset"?
        Hide
        Simon Willnauer added a comment -

        I found some problems with tracking the used bytes during indexing. The AtomicLong was passed only to BytesRefHash but not to the block Allocator. I moved the tracking direct block allocator from DWPT into ByteBlockPool and use it in each of the Bytes Writers now.

        I also made some minor cleanups here and there and added more try / finally around Closeables.

        I plan to commit soon since this is really work in progress.

        Show
        Simon Willnauer added a comment - I found some problems with tracking the used bytes during indexing. The AtomicLong was passed only to BytesRefHash but not to the block Allocator. I moved the tracking direct block allocator from DWPT into ByteBlockPool and use it in each of the Bytes Writers now. I also made some minor cleanups here and there and added more try / finally around Closeables. I plan to commit soon since this is really work in progress.
        Hide
        Simon Willnauer added a comment -

        FYI. I ran indexing benchmarks trunk vs. branch and they are super close together. its like 3 sec difference while branch was faster so its in the noise. I also indexed one docvalues field (floats) which was also about the same 2 sec. slower including merges etc. So we are on the save side that this feature does not influence indexing performance. I didn't expect anything else really since the only difference is a single condition in DocFieldProcessor.

        Show
        Simon Willnauer added a comment - FYI. I ran indexing benchmarks trunk vs. branch and they are super close together. its like 3 sec difference while branch was faster so its in the noise. I also indexed one docvalues field (floats) which was also about the same 2 sec. slower including merges etc. So we are on the save side that this feature does not influence indexing performance. I didn't expect anything else really since the only difference is a single condition in DocFieldProcessor.
        Hide
        Simon Willnauer added a comment - - edited

        Mike thanks for the review!!!!!

        Phew been a long time since I looked at this branch!

        its been changing

        We have some stale jdocs that reference .setIntValue methods (they
        are now .setInt)

        True - thanks I will fix.

        Hmm do we have byte ordering problems? Ie, if I write index on
        machine with little-endian but then try to load values on
        big-endian...? I think we're OK (we seem to always use
        IndexOutput.writeInt, and we convert float-to-raw-int-bits using
        java's APIs)?

        We are ok here since we write big-endian (enforced by DataOutput) and read it back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as the default order. I added a comment for this.

        How come codecID changed from String to int on the branch?

        due to DocValues I need to compare the ID to certain fields to see for what field I stored and need to open docValues. I always had to parse the given string which is kind of odd. I think its more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO.

        What are oal.util.Pair and ParallelArray for?

        legacy I will remove

        FloatsRef should state in the jdocs that it's really slicing a
        double[]?

        yep done!

        Can SortField somehow detect whether the needed field was stored
        in FC vs DV and pick the right comparator accordingly...? Kind of
        like how NumericField can detect whether the ints are encoded as
        "plain text" or as NF? We can open a new issue for this,
        post-landing...

        This is tricky though. You can have a DV field that is indexed too so its hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all.

        It looks like we can sort by int/long/float/double pulled from DV,
        but not by terms? This is fine for landing... but I think we
        should open a post-landing issue to also make FieldComparators for
        the Terms cases?

        Yeah true. I didn't add a FieldComparator for bytes yet. I think this is post landing!

        Should we rename oal.index.values.Type -> .ValueType? Just
        because... it looks so generic when its imported & used as "Type"
        somewhere?

        agreed. I also think we should rename Source but I don't have a good name yet. Any idea?

        Since we dynamically reserve a value to mean "unset", does that
        mean there are some datasets we cannot index? Or... do we tap
        into the unused bit of a long, ie the sentinel value can be
        negative? But if the data set spans Long.MIN_VALUE to
        Long.MAX_VALUE, what do we do...?

        Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalize the range to be 0 based since PackedInts doesn't allow negative values. so the range we can store is (2^63) -1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this which is needed I think but to get around this we need to have a different impl I think or do I miss something?

        I will make the changes once SVN is writeable again.

        Show
        Simon Willnauer added a comment - - edited Mike thanks for the review!!!!! Phew been a long time since I looked at this branch! its been changing We have some stale jdocs that reference .setIntValue methods (they are now .setInt) True - thanks I will fix. Hmm do we have byte ordering problems? Ie, if I write index on machine with little-endian but then try to load values on big-endian...? I think we're OK (we seem to always use IndexOutput.writeInt, and we convert float-to-raw-int-bits using java's APIs)? We are ok here since we write big-endian (enforced by DataOutput) and read it back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as the default order. I added a comment for this. How come codecID changed from String to int on the branch? due to DocValues I need to compare the ID to certain fields to see for what field I stored and need to open docValues. I always had to parse the given string which is kind of odd. I think its more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO. What are oal.util.Pair and ParallelArray for? legacy I will remove FloatsRef should state in the jdocs that it's really slicing a double[]? yep done! Can SortField somehow detect whether the needed field was stored in FC vs DV and pick the right comparator accordingly...? Kind of like how NumericField can detect whether the ints are encoded as "plain text" or as NF? We can open a new issue for this, post-landing... This is tricky though. You can have a DV field that is indexed too so its hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all. It looks like we can sort by int/long/float/double pulled from DV, but not by terms? This is fine for landing... but I think we should open a post-landing issue to also make FieldComparators for the Terms cases? Yeah true. I didn't add a FieldComparator for bytes yet. I think this is post landing! Should we rename oal.index.values.Type -> .ValueType? Just because... it looks so generic when its imported & used as "Type" somewhere? agreed. I also think we should rename Source but I don't have a good name yet. Any idea? Since we dynamically reserve a value to mean "unset", does that mean there are some datasets we cannot index? Or... do we tap into the unused bit of a long, ie the sentinel value can be negative? But if the data set spans Long.MIN_VALUE to Long.MAX_VALUE, what do we do...? Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalize the range to be 0 based since PackedInts doesn't allow negative values. so the range we can store is (2^63) -1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this which is needed I think but to get around this we need to have a different impl I think or do I miss something? I will make the changes once SVN is writeable again.
        Hide
        Michael McCandless added a comment -

        This is an awesome change!

        Phew been a long time since I looked at this branch!

        Some questions on a quick pass – still need to iterate/dig deeper:

        • We have some stale jdocs that reference .setIntValue methods (they
          are now .setInt)
        • Hmm do we have byte ordering problems? Ie, if I write index on
          machine with little-endian but then try to load values on
          big-endian...? I think we're OK (we seem to always use
          IndexOutput.writeInt, and we convert float-to-raw-int-bits using
          java's APIs)?
        • Since we dynamically reserve a value to mean "unset", does that
          mean there are some datasets we cannot index? Or... do we tap
          into the unused bit of a long, ie the sentinel value can be
          negative? But if the data set spans Long.MIN_VALUE to
          Long.MAX_VALUE, what do we do...?
        • How come codecID changed from String to int on the branch?
        • What are oal.util.Pair and ParallelArray for?
        • FloatsRef should state in the jdocs that it's really slicing a
          double[]?
        • Can SortField somehow detect whether the needed field was stored
          in FC vs DV and pick the right comparator accordingly...? Kind of
          like how NumericField can detect whether the ints are encoded as
          "plain text" or as NF? We can open a new issue for this,
          post-landing...
        • It looks like we can sort by int/long/float/double pulled from DV,
          but not by terms? This is fine for landing... but I think we
          should open a post-landing issue to also make FieldComparators for
          the Terms cases?
        • Should we rename oal.index.values.Type -> .ValueType? Just
          because... it looks so generic when its imported & used as "Type"
          somewhere?
        Show
        Michael McCandless added a comment - This is an awesome change! Phew been a long time since I looked at this branch! Some questions on a quick pass – still need to iterate/dig deeper: We have some stale jdocs that reference .setIntValue methods (they are now .setInt) Hmm do we have byte ordering problems? Ie, if I write index on machine with little-endian but then try to load values on big-endian...? I think we're OK (we seem to always use IndexOutput.writeInt, and we convert float-to-raw-int-bits using java's APIs)? Since we dynamically reserve a value to mean "unset", does that mean there are some datasets we cannot index? Or... do we tap into the unused bit of a long, ie the sentinel value can be negative? But if the data set spans Long.MIN_VALUE to Long.MAX_VALUE, what do we do...? How come codecID changed from String to int on the branch? What are oal.util.Pair and ParallelArray for? FloatsRef should state in the jdocs that it's really slicing a double[]? Can SortField somehow detect whether the needed field was stored in FC vs DV and pick the right comparator accordingly...? Kind of like how NumericField can detect whether the ints are encoded as "plain text" or as NF? We can open a new issue for this, post-landing... It looks like we can sort by int/long/float/double pulled from DV, but not by terms? This is fine for landing... but I think we should open a post-landing issue to also make FieldComparators for the Terms cases? Should we rename oal.index.values.Type -> .ValueType? Just because... it looks so generic when its imported & used as "Type" somewhere?
        Hide
        Michael McCandless added a comment -

        +1, excellent!

        Show
        Michael McCandless added a comment - +1, excellent!

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development