Lucene - Core / LUCENE-3216

Store DocValues per segment instead of per field

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently we store docvalues per field, which results in at least one file per field that uses docvalues (and at most two per field per segment, depending on the impl.). Instead, we should by default pack docvalues into a single file where possible. To enable this we need to hold all docvalues in memory during indexing and write them to disk once we flush a segment.
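
As a rough illustration of the idea in the description (this is not code from any of the attached patches, and every class and method name below is hypothetical rather than a Lucene API), values can be buffered per field in RAM while documents are indexed and then written into one byte stream when the segment is flushed:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Lucene.
class SegmentDocValuesBuffer {
    // field name -> one long value per added document, held in RAM while indexing
    private final Map<String, List<Long>> pending = new LinkedHashMap<>();

    void add(String field, long value) {
        pending.computeIfAbsent(field, f -> new ArrayList<>()).add(value);
    }

    // Called once per segment flush: all fields end up in the same byte stream.
    byte[] flush() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(pending.size());             // number of fields
            for (Map.Entry<String, List<Long>> e : pending.entrySet()) {
                out.writeUTF(e.getKey());             // field name
                out.writeInt(e.getValue().size());    // number of values
                for (long v : e.getValue()) {
                    out.writeLong(v);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        pending.clear();                              // RAM is released after the flush
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        SegmentDocValuesBuffer buffer = new SegmentDocValuesBuffer();
        buffer.add("price", 10L);
        buffer.add("price", 20L);
        buffer.add("rating", 5L);
        byte[] segmentFile = buffer.flush();
        System.out.println("fields packed into one file of " + segmentFile.length + " bytes");
    }
}
```

The trade-off this sketch makes visible is the one named in the description: the file count drops to one per segment, at the cost of keeping every value in memory until the flush.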

      1. LUCENE-3216_floats.patch
        10 kB
        Simon Willnauer
      2. LUCENE-3216.patch
        60 kB
        Simon Willnauer
      3. LUCENE-3216.patch
        71 kB
        Simon Willnauer
      4. LUCENE-3216.patch
        69 kB
        Simon Willnauer
      5. LUCENE-3216.patch
        17 kB
        Simon Willnauer
      6. LUCENE-3216.patch
        40 kB
        Simon Willnauer
      7. LUCENE-3216.patch
        30 kB
        Simon Willnauer
      8. LUCENE-3216.patch
        25 kB
        Simon Willnauer

          Activity

          Simon Willnauer added a comment -

          Committed in revision 1143776.

          Simon Willnauer added a comment -

          I plan to commit this soon if nobody objects.

          Simon Willnauer added a comment -

          Here is a new patch that moves the DocValues configuration to setters.

          I also added a randomizeCodec(Codec) to LuceneTestCase that sets the CFS flag at random.

          Simon Willnauer added a comment -

          I will back out the config stuff and make it default to CFS. Anybody who eventually needs it will figure out how to make it non-private or whatever.

          Robert Muir added a comment -

          > I am not sure here. I had the same thought, but when you look at Solr and other high-level users, they need to configure stuff somehow, so I put all reasonable core stuff in there. I think it's ok to have this for only one codec. Thoughts?

          I don't like CodecConfig, actually. It doesn't make sense that it contains all these codec-specific parameters, which should be private to the codec. I think Lucene's codecs should just be APIs and have ordinary ctors.

          As for higher-level stuff like Solr, we can improve it there so it's easier for users to configure this stuff. For example, the Solr codec configuration allows you to specify a codecproviderfactory that takes arbitrary nested XML and parses it however you want.

          The only problem is we don't have a concrete (i.e. non-mock/test) implementation in Solr that actually exposes all of what Lucene can offer... I would prefer we instead just do this, and make a SolrCodecProviderFactory that lets you configure skip intervals, pulsing cutoffs, and all these other codec-specific options in a type-safe way.
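
To make the "ordinary ctors" argument concrete, here is a minimal sketch, assuming two made-up codec classes; the interface, class names, and parameters are hypothetical and are not Lucene's codec API. Each codec takes only its own options through its constructor and keeps them private, so no shared config object needs to know about them:

```java
// Hypothetical sketch only: these types are not part of Lucene.
interface SketchCodec {
    String name();
}

final class PulsingSketchCodec implements SketchCodec {
    // Codec-specific option, private to this codec
    private final int freqCutoff;

    PulsingSketchCodec(int freqCutoff) {
        if (freqCutoff < 1) {
            throw new IllegalArgumentException("freqCutoff must be >= 1, got " + freqCutoff);
        }
        this.freqCutoff = freqCutoff;
    }

    @Override
    public String name() {
        return "Pulsing(freqCutoff=" + freqCutoff + ")";
    }
}

final class StandardSketchCodec implements SketchCodec {
    // A different codec has a different option and knows nothing about freqCutoff
    private final int skipInterval;

    StandardSketchCodec(int skipInterval) {
        if (skipInterval < 1) {
            throw new IllegalArgumentException("skipInterval must be >= 1, got " + skipInterval);
        }
        this.skipInterval = skipInterval;
    }

    @Override
    public String name() {
        return "Standard(skipInterval=" + skipInterval + ")";
    }
}

class CodecCtorDemo {
    public static void main(String[] args) {
        SketchCodec[] codecs = { new PulsingSketchCodec(1), new StandardSketchCodec(16) };
        for (SketchCodec codec : codecs) {
            System.out.println(codec.name());
        }
    }
}
```

A higher-level layer (the role suggested for a Solr-side factory in the comment) would then be the only place that maps user configuration onto these constructor arguments.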

          Simon Willnauer added a comment -

          > So this means, if you use default StandardCodec, and 3 fields store
          > doc values, and "main" CFS is off but doc values CFS is on, you'll see
          > a cfs file holding the 3-6 sub-files that your docvalues created,
          > right?

          Correct!

          > But e.g. if some fields use another codec, then that codec will have its
          > own CFS for any fields it has with docvalues (this is the TODO)?
          > That seems fine for starters.

          Again, correct. What I have in mind is a "global" CFS that a codec can pull via PerDocWriteState or something, which holds all of them, but for now having this per codec is fine IMO. I will create a follow-up for this.

          > For the nested test... couldn't you createCompoundOutput directly from an opened CompoundFileDirectory? (Vs creating externally & copying in).

          Yes, I could, but this functionality is tricky and not currently needed, so I left it out for now.

          > I like CodecConfig, but I'm not sure it should hold things specific
          > only to 1 codec, like the Pulsing cutoff? The other settings seem
          > more widely applicable... though I guess even terms cache size is not
          > used by various codecs, but it is by enough to have it in
          > CodecConfig, I think?

          I am not sure here. I had the same thought, but when you look at Solr and other high-level users, they need to configure stuff somehow, so I put all reasonable core stuff in there. I think it's ok to have this for only one codec. Thoughts?

          Michael McCandless added a comment -

          Looks great!

          So this means, if you use default StandardCodec, and 3 fields store
          doc values, and "main" CFS is off but doc values CFS is on, you'll see
          a cfs file holding the 3-6 sub-files that your docvalues created,
          right?

          But e.g. if some fields use another codec, then that codec will have its
          own CFS for any fields it has with docvalues (this is the TODO)?
          That seems fine for starters.

          I like CodecConfig, but I'm not sure it should hold things specific
          only to 1 codec, like the Pulsing cutoff? The other settings seem
          more widely applicable... though I guess even terms cache size is not
          used by various codecs, but it is by enough to have it in
          CodecConfig, I think?

          CodecConfig needs @experimental?

          For the nested test... couldn't you createCompoundOutput directly from
          an opened CompoundFileDirectory? (Vs creating externally & copying
          in).

          Simon Willnauer added a comment -

          One more iteration, adding a NestedCompoundDirectory that uses the parent's openInputSlice method for efficiency.
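
The efficiency point can be sketched roughly as follows, assuming a simple byte-array-backed slice; the class below is hypothetical and is not the NestedCompoundDirectory from the patch. A nested slice resolves its offset against the parent's offset, so reading from the inner compound file never copies the outer one:

```java
// Hypothetical sketch only: this is not Lucene's openInputSlice implementation.
class SliceSketch {
    private final byte[] backing; // shared with the parent, never copied
    private final int offset;     // absolute start within backing
    private final int length;

    SliceSketch(byte[] backing, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > backing.length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.backing = backing;
        this.offset = offset;
        this.length = length;
    }

    // A nested slice is addressed relative to this slice, then translated
    // to an absolute offset in the shared backing array.
    SliceSketch slice(int relativeOffset, int sliceLength) {
        if (relativeOffset < 0 || sliceLength < 0 || relativeOffset + sliceLength > length) {
            throw new IllegalArgumentException("nested slice exceeds parent");
        }
        return new SliceSketch(backing, offset + relativeOffset, sliceLength);
    }

    byte get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return backing[offset + index];
    }

    int length() {
        return length;
    }

    public static void main(String[] args) {
        byte[] file = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        SliceSketch whole = new SliceSketch(file, 0, file.length);
        SliceSketch outer = whole.slice(2, 6); // e.g. the segment-level compound file
        SliceSketch inner = outer.slice(1, 3); // e.g. a docvalues compound file nested inside it
        System.out.println("inner length = " + inner.length() + ", first byte = " + inner.get(0));
    }
}
```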

          Simon Willnauer added a comment -

          We are getting closer to the overall target here. This patch enables each codec to decide whether to use CFS for DocValues or to write individual files.

          To configure this and more stuff per codec, I introduced a CodecConfig (just like IWC) that holds configuration for core codecs and is passed to each codec on creation. I added test cases for the config and for nested CFS, for the case where IW or SegmentMerger decides to use CFS too, so that DocValues can still safely open the CFS.

          For test coverage I added a static newCodecConfig() to LuceneTestCase that randomly configures a codec per file to use CFS or individual files for DocValues, plus other stuff I figured makes sense in the CodecConfig.

          All tests pass and there is no nocommit left, so I think it's close. Review is appreciated.

          Simon Willnauer added a comment -

          I committed the latest patch. This patch is a first sketch using the CFS separately in DocValuesConsumer / Producer to reduce the number of files created by DocValues. This currently results in two files per codec in a segment (.cfs & .cfe), which is not too bad, but we could go even further and have a global CFS for all docValues that could be pulled on demand.

          The patch still has some nocommits but all tests pass.
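
For readers unfamiliar with the two-file layout, the following is a minimal in-memory sketch, assuming one data blob plus one entry table; it is not Lucene's actual .cfs/.cfe format, and the class and file names are made up. Sub-files are concatenated into the data part, and the entry table records where each one starts and how long it is:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: not the on-disk format used by Lucene.
class CompoundFileSketch {
    // Entry table (the role of the ".cfe" file): sub-file name -> {offset, length}
    final Map<String, long[]> entries = new LinkedHashMap<>();
    // Concatenated sub-file contents (the role of the ".cfs" file)
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();

    void addFile(String name, byte[] content) {
        if (entries.containsKey(name)) {
            throw new IllegalArgumentException("duplicate sub-file: " + name);
        }
        entries.put(name, new long[] {data.size(), content.length});
        data.write(content, 0, content.length);
    }

    byte[] readFile(String name) {
        long[] entry = entries.get(name);
        if (entry == null) {
            throw new IllegalArgumentException("no such sub-file: " + name);
        }
        byte[] all = data.toByteArray();
        return Arrays.copyOfRange(all, (int) entry[0], (int) (entry[0] + entry[1]));
    }

    public static void main(String[] args) {
        CompoundFileSketch cfs = new CompoundFileSketch();
        cfs.addFile("price.dat", new byte[] {1, 2, 3});
        cfs.addFile("rating.dat", new byte[] {9, 8});
        System.out.println("sub-files: " + cfs.entries.keySet());
        System.out.println("rating.dat = " + Arrays.toString(cfs.readFile("rating.dat")));
    }
}
```

However many fields store docvalues, the number of physical files stays at two, which is the reduction the comment describes.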

          Simon Willnauer added a comment -

          This patch converts all docvalue types to index into memory. The majority now also merge directly to disk, except for the PackedInts, sorted and deref byte variants.

          Simon Willnauer added a comment -

          Next iteration. This patch also includes FixedStraightBytes converted to use an in-memory ByteBlockPool for indexing and straight disk access for merging. However, I tend to leave out the VarStraightBytes variant and open a follow-up issue that converts the VarStraight case to use a skip list.

          A review would be cool; otherwise I will commit in a day or two if nobody objects.

          Simon Willnauer added a comment -

          Next iteration, this time fixing most of the Byte variants to only write / open one file at a time. Straight variants are still missing.

          Simon Willnauer added a comment -

          Here is a first patch that converts the floats impl to buffer values in RAM during indexing but write values directly during merge. All tests pass.

          I plan to commit this soon too. I'd rather go in small iterations here than in one large patch.


            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer