Lucene - Core
LUCENE-5618

DocValues updates send wrong fieldinfos to codec producers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      Spinoff from LUCENE-5616.

      See the example there: docvalues readers get a FieldInfos, but it doesn't contain the correct entries, so they see invalid field numbers at read time.

      This should really be fixed. Maybe a simple solution is to not write "batches" of fields in updates, but to have only one field per gen?

      This removes the many-to-many relationship and would make things easier to understand.
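
      As a rough illustration of the one-field-per-gen idea (all names below are made up for this sketch, not Lucene's actual API), the bookkeeping could be as simple as mapping each updated field to the single generation that holds its latest values:

          import java.util.HashMap;
          import java.util.Map;

          // Sketch only: each update to a field allocates a fresh gen of its
          // own, so the producer opened for a given gen sees exactly one field.
          class PerFieldGenTracker {
            private final Map<String,Long> latestGenPerField = new HashMap<>();
            private long nextGen = 1;

            long writeUpdate(String field) {
              long gen = nextGen++;
              latestGenPerField.put(field, gen);
              return gen;
            }

            // -1 means the field was never updated (read it from the base segment).
            long genFor(String field) {
              Long gen = latestGenPerField.get(field);
              return gen == null ? -1 : gen;
            }
          }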

      Attachments

      1. LUCENE-5618.patch
        63 kB
        Shai Erera
      2. LUCENE-5618.patch
        63 kB
        Shai Erera
      3. LUCENE-5618.patch
        63 kB
        Shai Erera
      4. LUCENE-5618.patch
        60 kB
        Shai Erera
      5. LUCENE-5618.patch
        56 kB
        Shai Erera

          Activity

          Michael McCandless added a comment -

          The problem here is we only store the latest gen for a given field, not all gens that field participated in, so we don't have enough information at read-time to know when loading gen N all fields that were there.
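
          To make this concrete, here is a hedged illustration with made-up field names and gens (not data from the issue):

              import java.util.HashMap;
              import java.util.Map;

              class LatestGenOnly {
                public static void main(String[] args) {
                  // Only the latest gen per field is recorded:
                  Map<String,Long> latestGenPerField = new HashMap<>();
                  latestGenPerField.put("a", 1L); // "a" last written in gen 1
                  latestGenPerField.put("b", 2L); // "b" last rewritten in gen 2

                  // Suppose gen 1 originally wrote "a" and "b" in one batch. When
                  // the reader opens gen 1, this map only points it at "a", yet the
                  // gen-1 files also encode "b" -- that membership is lost.
                  System.out.println("fields recorded for gen 1: [a] (but the files hold a and b)");
                }
              }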

          Robert Muir added a comment -

          I think it's a design flaw in how this stuff is written. It only "worked" because the DV reader didn't validate this stuff in init. I'm still unsure if this should block 4.8 or if it's OK to defer to 4.9, but this is really bogus. The simple solution is for DV update gens to always only have one field.

          Shai Erera added a comment -

          I think it's a design flaw in how this stuff is written

          Today we pass only the FIs the Codec should "care" about, rather than all the FIs it "knows" about. This allows the Codec to optimize. Today we don't take advantage of that, though: a DVP reads all the metadata of a field, even if the field isn't passed in the FIS (and therefore will never be asked for).

          If we write each field in its own gen, then since we don't allow adding new fields through dvUpdates, for gen=-1 we just pass all known dvFieldInfos, and for gen > 0 we pass a single FI only. The Codec therefore always receives FIs it knows about, even though for gen=-1 it is given some FIs it shouldn't care about. Our Codecs only read metadata into memory, the actual data is loaded lazily, so perhaps optimizing them is less important at the moment.

          I wish our code could be more flexible, though. It feels odd to me that each field is written in its own gen just because we cannot add a FIS.exists() check in the Codec. If we always pass all DV FIS to every DVP, each will be able to do the exists() check, but a DVP will see fields it doesn't know about. Is that bad? It still covers the corruption case of a bad field number being encoded in the first place...
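
          A rough sketch of the exists()-style validation being discussed, assuming the producer is handed a FieldInfos that covers every field it may encounter (this is not the actual Lucene45DocValuesProducer code):

              import java.io.IOException;
              import org.apache.lucene.index.CorruptIndexException;
              import org.apache.lucene.index.FieldInfo;
              import org.apache.lucene.index.FieldInfos;

              class FieldNumberValidation {
                // For each field number decoded from the DV metadata file: if the
                // number is unknown to the FieldInfos we were given, the file is
                // corrupt (e.g. a bogus field number was encoded in the first place).
                static void checkFieldNumber(FieldInfos infos, int fieldNumber) throws IOException {
                  FieldInfo fi = infos.fieldInfo(fieldNumber);
                  if (fi == null) {
                    throw new CorruptIndexException("invalid field number: " + fieldNumber);
                  }
                }
              }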

          Shai Erera added a comment -

          I'm still unsure if this should block 4.8 or if it's OK to defer to 4.9

          I don't think it should block 4.8...

          Robert Muir added a comment -

          It's not "allows codec to optimize". The codec should NOT be aware of dv updates.

          Shai Erera added a comment -

          The codec should NOT be aware of dv updates.

          Why? Are they so horrible?

          So what if we always pass all FIS to the Codec?

          Robert Muir added a comment -

          Why? Are they so horrible?

          Because that's not the codec's responsibility.

          Please, pass all the FieldInfos. That's an easy solution. My check that "fieldinfo is valid" is perfectly legitimate. This is totally broken today!

          Shai Erera added a comment -

          OK, I'll take a look at passing all FIs. I agree the current check is important. If we pass only the set of FIs the Codec sees, though, I think we should also check that the Codec read all the fields the FIS says it should read (currently it only checks that all the fields it reads are in the FIS, but not whether there are "missing" fields). But if we pass all FIS, that's not a problem.

          Shai Erera added a comment -

          I modified the code to pass all the FIs to the codec, no matter the gen, and tests failed with FileNotFoundException. The reason is that PerFieldDVF tries to open DVPs of e.g. gen=1 for all fields, whether they were written in that gen or not, which leads to the FNFE. I am not sure that we can pass all FIs to the Codec that way ... so our options are:

          • Pass all the fields that were written in a gen (whether we need them or not) – this does not make sense to me, as we'll need to track it somewhere, and it seems a waste
          • Add leniency in the form of "here are the fields you should care about" – this makes the codec partially updates aware, but I don't think it's a bad idea
          • Write each updated field in its own gen – if you update many fields, many times, this will create many files in the index directory. Technically it's not "wrong", it just looks weird
          • Remain with the current code's corruption detection when the read fieldNumber < 0

          Anyway, I think the issue's title is wrong – DocValues updates do pass the correct fieldInfos to the producers. They pass only the infos that the producer should care about, and we see that passing too many is wrong (PerFieldDVF).

          I will think about it more. If you see other alternatives, feel free to propose them.

          Robert Muir added a comment -

          Write each updated field in its own gen – if you update many fields, many times, this will create many files in the index directory. Technically it's not "wrong", it just looks weird

          Why? This is how separate norms worked. It's the obvious solution. The current behavior is broken: let's fix the bug. This optimization is what is to blame. The optimization is invalid.

          Anyway, I think the issue's title is wrong – DocValues updates do pass the correct fieldInfos to the producers. They pass only the infos that the producer should care about, and we see that passing too many is wrong (PerFieldDVF).

          Absolutely not! You get a different FieldInfos at read time than you got at write time. This is broken!

          Shai Erera added a comment -

          If we separate each DV update into its own file, I think we will need to track another gen in SegmentCommitInfo, making three: deletes, fieldInfos and dvUpdates. Though each FI writes its dvGen in the FIS file, we need to know from where to increment the gen for the next update. This isn't a big deal, it just adds complexity to SCI (4 methods and an index format change).
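
          For illustration, the extra generation might look roughly like this in SegmentCommitInfo (method names here are guesses for the sketch, not the committed API):

              // Hypothetical sketch: SCI carries a third gen next to the deletes
              // gen and the fieldInfos gen.
              class SegmentCommitInfoGens {
                private long delGen = -1;        // generation of deletes (live docs)
                private long fieldInfosGen = -1; // generation of the fieldInfos file
                private long docValuesGen = -1;  // generation of DocValues updates

                long getDocValuesGen() { return docValuesGen; }

                // From where to increment for the next update: 1 if none written yet.
                long getNextDocValuesGen() {
                  return docValuesGen == -1 ? 1 : docValuesGen + 1;
                }

                void advanceDocValuesGen() { docValuesGen = getNextDocValuesGen(); }
              }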

          But why do you think that it's wrong to write 2 fields and then at read time ask to provide only 1 field? I.e. what if the Codecs API was more "lazy", or a Codec wants to implement lazy loading of even just the metadata?

          Passing all the fields a Codec wrote, e.g. in the gen=-1 case, even though none of them is going to be used because they were all updated in later gens, seems awkward to me as well.

          What sort of index corruption does this check detect? As I see it, the Codec gets a subset of the fields that it already wrote. It's worse if it gets a superset of those fields, because then you don't know, e.g., whether there are missing fields that disappeared from the file system.

          Robert Muir added a comment -

          What sort of index corruption does this check detect? As I see it, the Codec gets a subset of the fields that it already wrote.

          That the field numbers it is using are valid!

          Please, stop pushing back on this. I will make this a blocker issue for 4.9. Maybe we should disable the DV update tests and enable this check in the meantime? That seems the most fair.

          I don't like this pushing back against completely valid checks.

          Shai Erera added a comment -

          I don't like this pushing back against completely valid checks.

          I'm not pushing back, I'm trying to have a discussion. Why do you assume that questions indicate pushback?

          Do you also think that it's OK for a Codec to receive fields it never handled? If not, we should check that too. That to me indicates a bigger problem than sending a subset of fields.

          I will look into adding another gen to SCI. But if all that we want to achieve is "That the field numbers it is using are valid!", there's another way to do that – we can pass to a Codec a FieldsValidator or something for this purpose. That way we don't need to pass all FIs to a Codec and don't run into the PerFieldDVF issue I mentioned above, and don't complicate SCI with another gen. Just mentioning there are other ways to achieve consistency checks...
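
          For illustration only, such a hook might look like this (entirely hypothetical; this interface was never adopted):

              // Hypothetical: instead of receiving all FieldInfos, a producer could
              // be given a validator and still verify every field number it decodes.
              interface FieldsValidator {
                // True if the field number is valid for the segment being opened.
                boolean isValidFieldNumber(int fieldNumber);
              }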

          Shai Erera added a comment -

          Patch addresses the following:

          • Modifies Lucene45/42DocValuesProducer to assert that all encoded fields exist in the FieldInfos.
          • Improves ReaderAndUpdates.writeFieldUpdates readability by breaking the updates out into separate methods.
          • Each DocValues field's updates are written to separate files.
          • Adds SegmentCommitInfo.docValuesGen, separate from fieldInfosGen.
          • Fixes LUCENE-5636 by tracking per-field updates files, as well as fieldInfos files.
            • Per-generation update files are kept as deprecated, needed for back-compat with 4.6-4.8 indexes. These become empty after the segment is merged.
          • Improves testDeleteUnusedUpdatesFiles to cover updates of two fields (this exposes the bug in LUCENE-5636).

          In terms of backwards compatibility, 4.6-4.8 indexes will continue to reference unneeded files until the segment is merged. This is impossible to fix without breaking back-compat or introducing weird hacks which assume the default codec. It's not terrible though, since the number of unneeded-but-referenced files is limited by the number of DV fields the app has updated.

          I'd appreciate a review on this. Before I commit it though, I want to take care of LUCENE-5619, so we're sure the back-compat logic in this patch indeed works.

          Shai Erera added a comment -

          After committing LUCENE-5619, TestBackwardsCompatibility failed with the previous patch, since it didn't handle pre-4.9 indexes well – specifically, the case where one generation references multiple fields. To resolve that, this patch adds:

          • SegmentReader acts specially for pre-4.9 indexes only: besides sending all the FieldInfos to each DocValuesProducer's gen, it ensures each such DVP is initialized only once per generation.
          • Lucene45DocValuesProducer does a lenient fields check if the segment's version is pre-4.9.

          Note that I didn't add this leniency to Lucene42DocValuesProducer since that one doesn't support DocValues updates anyway, and so doesn't experience this issue at all.

          Robert Muir added a comment -

          This looks good to me. My only concern (which can be a follow-up issue) is to try to simplify a lot of stuff in SegmentReader.initDocValuesProducers.

          In general I think SegmentReader needs a cleanup to ensure things are fast.

          This logic is now more complex than before, as there is back-compat etc. going on, and it involves multiple full passes over fieldinfos/dv fields. In general we should really try to avoid this.

          It's now quite difficult to see what is happening in the common case (no updates for a segment) via the initDocValuesProducers/SegmentDocValues/getNumericXXX codepaths.

          On that issue we should clean up other inefficiencies while we are there: e.g. we also want to try to reduce the overhead in SR.getNumericDocValues. For example today this is doing two hash lookups, when this method could just try 'dvFields' first and optimize the common case.

          But let's fix the bugs first; this approach looks good to me. Long-term we should also investigate refactoring the livedocs format, maybe to use this "files" approach recorded in the commit. Currently the LiveDocs codec API is really horrible, and really it's just an updatable 1-bit numeric docvalues.

          Shai Erera added a comment -

          Thanks Rob!

          My only concern (which can be a follow-up issue) is to try to simplify a lot of stuff in SegmentReader.initDocValuesProducers

          I tried, but I didn't see that it simplified much. However, I took a second look and I think what prevents the simplification in the no-updates case is the API of SegDocValues, which insists on getting a List<FieldInfo> rather than FieldInfos. I will try to change the API (it's internal) and hopefully it will allow a simple if (!si.hasFieldUpdates()) code block.

          For example today this is doing two hash lookups, when this method could just try 'dvFields' first and optimize the common case.

          Where do you see the two hash lookups? Only the first getNumeric() does two hash lookups; after that it's always one. Or did I miss something?

          Shai Erera added a comment -

          I modified the SegDocValues API to take FieldInfos, and SegReader.initDocValues now distinguishes between the no-updates, with-updates and back-compat cases. I wish we didn't have the back-compat case, but that's what we have.

          I'm still unsure about the getNumeric inefficiencies that you mentioned, so if you can clarify I will try to address them here too. If it's a general SR refactoring and simplification, I agree that we should pursue that in a separate issue.

          Michael McCandless added a comment -

          In SegmentReader.initDocValuesProducers, when there are no DV updates, can't you just init dvp right off (not lazily)? Because up above we only call it if FIS.hasDocValues.

          I think what Rob meant by the double-lookup is we should just call dvFields.get(field) first, and only if that's null do we do the logic to initialize it. I.e., the common case here is retrieving a DV field that's already loaded.

          In this code from ReadersAndUpdates.writeFieldUpdates:

              // update the doc-values updates files
              assert !newDVFiles.isEmpty();
              for (Entry<Integer,Set<String>> e : info.getDocValuesUpdatesFiles().entrySet()) {
               if (!newDVFiles.containsKey(e.getKey())) {
                 newDVFiles.put(e.getKey(), e.getValue());
               }
              }
          

          Why would the newDVFiles contain e.getKey()? Aren't we only writing the new generation update here? Also the indent is off a bit.

          Shai Erera added a comment -

          In SegmentReader.initDocValuesProducers, when there are no DV updates, can't you just init dvp right off (not lazily)? Because up above we only call it if FIS.hasDocValues.

          Oops, you're right. Will fix!

          I think what Rob meant by the double-lookup is we should just call dvFields.get(field) first, and only if that's null do we do the logic to initialize it. I.e., the common case here is retrieving a DV field that's already loaded.

          This is the code of getNumeric():

          NumericDocValues dvs = (NumericDocValues) dvFields.get(field);
          if (dvs == null) {
            DocValuesProducer dvProducer = dvProducersByField.get(field);
            assert dvProducer != null;
            dvs = dvProducer.getNumeric(fi);
            dvFields.put(field, dvs);
          }
          

          It seems already optimized to do one lookup in the common case?

          Why would the newDVFiles contain e.getKey()? Aren't we only writing the new generation update here?

          Notice that the key is the fieldNumber (Integer) and not the generation (Long). I modified SegmentCommitInfo to track the files per fieldNumber instead of per generation, to avoid future issues, and also because I think it lets us be more flexible, i.e. easier back-compat support if we want to change things again. Therefore it could be that the existing files mapping contains a fieldNumber which we just rewrote (updated), and hence the if.
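
          Folding that explanation into the quoted loop as a comment might look like this (a sketch of the clarification, not necessarily the committed text):

              // newDVFiles maps fieldNumber -> files of that field's latest update,
              // so it already has entries for the fields rewritten in this round;
              // only carry over the files of fields that were NOT updated now.
              for (Entry<Integer,Set<String>> e : info.getDocValuesUpdatesFiles().entrySet()) {
                if (!newDVFiles.containsKey(e.getKey())) {
                  newDVFiles.put(e.getKey(), e.getValue());
                }
              }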

          Also the indent is off a bit.

          Thanks, I'll fix.

          Shai Erera added a comment -

          Patch folds in latest feedback.

          Robert Muir added a comment -

          This is the code of getNumeric():
          ...
          It seems already optimized to do one lookup in the common case?

          That's not the code. You are omitting the first lookup!

          Shai Erera added a comment -

          That's not the code. You are omitting the first lookup!

          Oh, you mean the FieldInfos lookup? That was the original code from before DV updates, so I completely ignored it. I'll check whether it can be moved inside the if (dvs == null) check and upload a patch later.

          Shai Erera added a comment -

          I looked and I can easily pull all the initialization inside the if (dvs == null). But since this involves the DV API as well as getDocsWithField, I'll open a separate issue just to keep this one focused.
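
          Based on the getNumeric() snippet quoted above, the refactor would move the FieldInfos lookup inside the dvs == null branch, so a cache hit pays a single hash lookup (a sketch; the exact guards on the slow path are assumptions):

              NumericDocValues dvs = (NumericDocValues) dvFields.get(field);
              if (dvs == null) {
                // Slow path, first access only: resolve the FieldInfo and producer here.
                FieldInfo fi = fieldInfos.fieldInfo(field); // previously the first lookup
                if (fi == null || fi.getDocValuesType() != FieldInfo.DocValuesType.NUMERIC) {
                  return null; // field doesn't exist or isn't a numeric DV field
                }
                DocValuesProducer dvProducer = dvProducersByField.get(field);
                assert dvProducer != null;
                dvs = dvProducer.getNumeric(fi);
                dvFields.put(field, dvs);
              }
              return dvs;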

          Michael McCandless added a comment -

          Therefore it could be that the existing files mapping contains a fieldNumber which we just rewrote (updated), and hence the if.

          Ahh right. OK that makes sense. Maybe just add a comment explaining that?

          Shai Erera added a comment -

          Good idea Mike, patch clarifies this code block.

          Michael McCandless added a comment -

          +1, thanks Shai.

          ASF subversion and git services added a comment -

          Commit 1596570 from Shai Erera in branch 'dev/trunk'
          [ https://svn.apache.org/r1596570 ]

          LUCENE-5618, LUCENE-5636: write each DocValues update in a separate file; stop referencing old fieldInfos files

          ASF subversion and git services added a comment -

          Commit 1596582 from Shai Erera in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596582 ]

          LUCENE-5618, LUCENE-5636: write each DocValues update in a separate file; stop referencing old fieldInfos files

          Shai Erera added a comment -

          Committed to trunk and 4x.


            People

             • Assignee: Shai Erera
             • Reporter: Robert Muir
             • Votes: 0
             • Watchers: 4
