Lucene - Core
  1. Lucene - Core
  2. LUCENE-2881

Track FieldInfo per segment instead of per-IW-session

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA, Realtime Branch, CSF branch
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming / ordering. IW carries FI instances over from previous segments which also carries over field properties like isIndexed etc. While having consistent field ordering per IW session appears to be important due to bulk merging stored fields etc. carrying over other properties might become problematic with Lucene's Codec support. Codecs that rely on consistent properties in FI will fail if FI properties are carried over.

      The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed in a particular segment but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true since those values are reused from previous segments.

      We already work around this "limitation" in SegmentInfo with properties like hasVectors or hasProx which is really something we should manage per Codec & Segment. Ideally FieldInfo would be managed per Segment and Codec such that its properties are valid per segment. It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.

      1. lucene-2881.patch
        85 kB
        Michael Busch
      2. lucene-2881.patch
        73 kB
        Michael Busch
      3. lucene-2881.patch
        73 kB
        Michael Busch
      4. lucene-2881.patch
        60 kB
        Michael Busch
      5. lucene-2881.patch
        47 kB
        Michael Busch
      6. LUCENE-2881.patch
        133 kB
        Simon Willnauer
      7. LUCENE-2881.patch
        123 kB
        Simon Willnauer
      8. LUCENE-2881.patch
        112 kB
        Simon Willnauer
      9. LUCENE-2881.patch
        91 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Michael Busch added a comment -

          It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata.

          Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.

          Show
          Michael Busch added a comment - It also seems to be necessary to bind FieldInfoS to SegmentInfo logically since its really just per segment metadata. Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.
          Hide
          Simon Willnauer added a comment -

          Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs.

          i think we should do that on trunk and then merge to RT - do you have time to work on this soon?

          Show
          Simon Willnauer added a comment - Yeah, I was going to do this in the realtime branch, because I don't think we can get DWPTs working correctly without SegmentInfos having private FIs. i think we should do that on trunk and then merge to RT - do you have time to work on this soon?
          Hide
          Michael Busch added a comment -

          It would probably make sense to have a new class (maybe an extension of SegmentInfo) for in-memory (not-yet-flushed) segments that references the corresponding FieldInfos and SegmentDeletes. That'd be better I think that adding another map SegmentInfo -> FieldInfos and we could then also remove the SegmentInfo -> SegmentDeletes map (in BufferedDeletes).

          Show
          Michael Busch added a comment - It would probably make sense to have a new class (maybe an extension of SegmentInfo) for in-memory (not-yet-flushed) segments that references the corresponding FieldInfos and SegmentDeletes. That'd be better I think that adding another map SegmentInfo -> FieldInfos and we could then also remove the SegmentInfo -> SegmentDeletes map (in BufferedDeletes).
          Hide
          Michael Busch added a comment -

          i think we should do that on trunk and then merge to RT - do you have time to work on this soon?

          Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.

          Show
          Michael Busch added a comment - i think we should do that on trunk and then merge to RT - do you have time to work on this soon? Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.
          Hide
          Simon Willnauer added a comment -

          Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday.

          Michael, if you start something I can work on it tomorrow for a while too. That way we can get it in quickly

          Show
          Simon Willnauer added a comment - Yeah I agree. Hmm maybe I can spend some hours tonight on this, otherwise I don't think I'll have much time before Thursday. Michael, if you start something I can work on it tomorrow for a while too. That way we can get it in quickly
          Hide
          Michael Busch added a comment -
          • Creates for every segment a new FieldInfos
          • Changes FieldInfos, so that the FieldInfo numbers within a single FieldInfos don't have to be contiguous - this allows using the same numbering as the previous segment(s), even if not all fields are present in the new segment
          • Adds a global fieldName -> fieldNumber map; if possible when a new field is added to a FieldInfo it tries to use an already assigned number for that field

          All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.

          Show
          Michael Busch added a comment - Creates for every segment a new FieldInfos Changes FieldInfos, so that the FieldInfo numbers within a single FieldInfos don't have to be contiguous - this allows using the same numbering as the previous segment(s), even if not all fields are present in the new segment Adds a global fieldName -> fieldNumber map; if possible when a new field is added to a FieldInfo it tries to use an already assigned number for that field All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.
          Hide
          Simon Willnauer added a comment -

          Michael, this looks very good!

          All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue.

          I agree we should make FieldInfos a member of SegmentInfo and remove the hasVectors / hasProx an push them down. I tried to apply this patch to the DocValues branch but I didn't get very far - I haven't merged for a while that killed me ;(
          I need to push that to after my vacation. I really hoped that I can get further but it didn't work out ;(

          Show
          Simon Willnauer added a comment - Michael, this looks very good! All tests pass. Though I need to verify if the global map works correctly (it'd probably be good to add a test for that). Also it'd be nice to remove hasVectors and hasProx from SegmentInfo, but we could also do that in a separate issue. I agree we should make FieldInfos a member of SegmentInfo and remove the hasVectors / hasProx an push them down. I tried to apply this patch to the DocValues branch but I didn't get very far - I haven't merged for a while that killed me ;( I need to push that to after my vacation. I really hoped that I can get further but it didn't work out ;(
          Hide
          Michael Busch added a comment -

          New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos.

          For backwards-compatibility reasons we can't completely remove the hasVectors and hasProx bytes from the serialized SegmentInfo yet. Eg. if someone uses addIndexes(Directory...) to add external old pre-4.0 segments to a new index, we "upgrade" the SegmentInfo to the latest version. However, we don't modify the FieldInfos of that segment, instead we just copy it over to the new dir. So the hasVector and hasProx bits in the FieldInfos might not be accurate and we have to keep those bits in the SegmentInfo instead. Not an ideal solution, but we can remove it entirely in Lucene 5.0 . The alternative would be to
          rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

          All core & contrib tests pass.

          Show
          Michael Busch added a comment - New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos. For backwards-compatibility reasons we can't completely remove the hasVectors and hasProx bytes from the serialized SegmentInfo yet. Eg. if someone uses addIndexes(Directory...) to add external old pre-4.0 segments to a new index, we "upgrade" the SegmentInfo to the latest version. However, we don't modify the FieldInfos of that segment, instead we just copy it over to the new dir. So the hasVector and hasProx bits in the FieldInfos might not be accurate and we have to keep those bits in the SegmentInfo instead. Not an ideal solution, but we can remove it entirely in Lucene 5.0 . The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files. All core & contrib tests pass.
          Hide
          Simon Willnauer added a comment -

          New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos.

          Wow! nice work Michael! I like how you preserve bw compat in SegmentInfo and FieldInfos is now bound to SegmentInfo - yay! this solves two problems at once for DocValues branch.

          The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files.

          I think copying over is fine. Ideally we will move all those boolean etc to the codec level so that we don't need that at all. Once stored fields and vectors are written by the codec we can push all that into PreFlex codec (maybe!?) and get rid of the bw compat code.

          I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

          Show
          Simon Willnauer added a comment - New patch that removes the tracking of 'hasVectors' and 'hasProx' in SegmentInfo. Instead SegmentInfo now has a reference to its corresponding FieldInfos. Wow! nice work Michael! I like how you preserve bw compat in SegmentInfo and FieldInfos is now bound to SegmentInfo - yay! this solves two problems at once for DocValues branch. The alternative would be to rewrite the FieldInfos instead of just copying the files, but then we have to rewrite the cfs files. I think copying over is fine. Ideally we will move all those boolean etc to the codec level so that we don't need that at all. Once stored fields and vectors are written by the codec we can push all that into PreFlex codec (maybe!?) and get rid of the bw compat code. I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.
          Hide
          Simon Willnauer added a comment -

          I gave the patch another glance - here are a couple of very minor comments:

          • maybe we should return Iterable<FieldInfo> from FieldInfos#getFieldInfoIterator() this would make the iteration syntactically more javaish
             
            
            for (FieldInfo info : getFieldInfoIterator()) {
              // do something with it
            }
            
            
          • If we return Iterable<FieldInfo> should we name it getFieldInfos?
          • Maybe we can simply implement Iterable<FieldInfo>?
          • Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply SI#clear() this would make thing less coupled to the SI internal cache but rather something that is called to clear internal state after flush?
          Show
          Simon Willnauer added a comment - I gave the patch another glance - here are a couple of very minor comments: maybe we should return Iterable<FieldInfo> from FieldInfos#getFieldInfoIterator() this would make the iteration syntactically more javaish for (FieldInfo info : getFieldInfoIterator()) { // do something with it } If we return Iterable<FieldInfo> should we name it getFieldInfos? Maybe we can simply implement Iterable<FieldInfo>? Maybe we can rename SI#clearFilesCache() to SI#clearCache() or simply SI#clear() this would make thing less coupled to the SI internal cache but rather something that is called to clear internal state after flush?
          Hide
          Michael Busch added a comment -

          Thanks for reviewing!

          I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue.

          I just want to add another tests for the global fieldname->number map, after that I think it'll be ready to commit. Will do that tonight

          Show
          Michael Busch added a comment - Thanks for reviewing! I think you should commit that patch. I'll port to docvalues and run some tests that rely on this issue. I just want to add another tests for the global fieldname->number map, after that I think it'll be ready to commit. Will do that tonight
          Hide
          Michael Busch added a comment -

          Maybe we can simply implement Iterable<FieldInfo>?

          good idea - done.

          Maybe we can rename SI#clearFilesCache()

          Actually I renamed it intentionally, because all this method does is really clearing the files cache. SI has a separate reset() method for resetting its state entirely.

          Show
          Michael Busch added a comment - Maybe we can simply implement Iterable<FieldInfo>? good idea - done. Maybe we can rename SI#clearFilesCache() Actually I renamed it intentionally, because all this method does is really clearing the files cache. SI has a separate reset() method for resetting its state entirely.
          Hide
          Michael Busch added a comment -

          New patch that adds a new junit for testing that field numbering is consistent across segments. It tests two cases: 1) one IW is used to write two segments; 2) two IWs are used to write two segments.
          And it also tests that addIndexes(Directory...) doesn't mess up the field numbering of the external segment.

          All tests pass. I'll commit this in a day or two if nobody objects.

          Show
          Michael Busch added a comment - New patch that adds a new junit for testing that field numbering is consistent across segments. It tests two cases: 1) one IW is used to write two segments; 2) two IWs are used to write two segments. And it also tests that addIndexes(Directory...) doesn't mess up the field numbering of the external segment. All tests pass. I'll commit this in a day or two if nobody objects.
          Hide
          Simon Willnauer added a comment -

          All tests pass. I'll commit this in a day or two if nobody objects.

          +1

          Show
          Simon Willnauer added a comment - All tests pass. I'll commit this in a day or two if nobody objects. +1
          Hide
          Simon Willnauer added a comment -

          I just integrated this patch to the docvalues branch! It works like a charm! Nice work Michael, this brings docValues a huge step closer. All tests pass which failed before, FieldInfo is reliable now!

          Show
          Simon Willnauer added a comment - I just integrated this patch to the docvalues branch! It works like a charm! Nice work Michael, this brings docValues a huge step closer. All tests pass which failed before, FieldInfo is reliable now!
          Hide
          Michael Busch added a comment -

          Awesome, thanks for letting me know! I hope I'll be able to say the same about the RT branch after I tried it there...

          Show
          Michael Busch added a comment - Awesome, thanks for letting me know! I hope I'll be able to say the same about the RT branch after I tried it there...
          Hide
          Simon Willnauer added a comment -

          I actually have some intermitted failures on docvalues which seem to be caused by some mixed up codec IDs. I found one which fixed most of the cases in FieldInfo#clone() where the codec ID is lost. But there must be some other situation where the codec ID gets mixed up. I will be AFK for 10 days at least so I have to leave you alone with that. Seems that we need a test for that though. The docValues test bring that up quickly

          Show
          Simon Willnauer added a comment - I actually have some intermitted failures on docvalues which seem to be caused by some mixed up codec IDs. I found one which fixed most of the cases in FieldInfo#clone() where the codec ID is lost. But there must be some other situation where the codec ID gets mixed up. I will be AFK for 10 days at least so I have to leave you alone with that. Seems that we need a test for that though. The docValues test bring that up quickly
          Hide
          Michael Busch added a comment -

          I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon.

          About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).

          All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.

          Show
          Michael Busch added a comment - I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon. About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec). All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.
          Hide
          Simon Willnauer added a comment -

          I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon

          ah that could be the reason. I will need to patch my branch again to see if your patch helps. Will do tomorrow.

          About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec).

          The clone issue seems to be fixed in your latest patch. While the setter seems kind of wrong. Lemme explain how that numbering works. If you create a SI for a flushed segment SegmentCodecs creates an ordinal for each codec and sets it to the corresponding fields. The ordinal (codecID) is the array index in the SegmentCodec's Codec array which holds the codec instance used for that field. So codecID = 0 is a valid value for segments having PreFlex or a >= 4.0 codec. But if we open a segment that is pre-flex there will only be one codec for the entire segment with codecID=0, thats why this is assigned. (Note: I need to document this where its set!) I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1. Maybe we should just use an assert here instead of an exception, this is somewhat internal though.


          All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier.

          Will do tomorrow! What exactly was the problem with the previous patch beside the codecID clone issue?

          Show
          Simon Willnauer added a comment - I fixed a bug in FieldInfos that could lead to wrong field numbers, that might have been related to the wrong behavior you're seeing, Simon ah that could be the reason. I will need to patch my branch again to see if your patch helps. Will do tomorrow. About codecIds: I made the fix to FieldInfo.clone() to set the codecId on the clone. I also made FieldInfo.codecId private and added getter and setter. The setter checks whether the new value for codecId is different from the previous one, and throws in exception in that case (unless it was set to the default 0 before, which I think means Preflex codec). The clone issue seems to be fixed in your latest patch. While the setter seems kind of wrong. Lemme explain how that numbering works. If you create a SI for a flushed segment SegmentCodecs creates an ordinal for each codec and sets it to the corresponding fields. The ordinal (codecID) is the array index in the SegmentCodec's Codec array which holds the codec instance used for that field. So codecID = 0 is a valid value for segments having PreFlex or a >= 4.0 codec. But if we open a segment that is pre-flex there will only be one codec for the entire segment with codecID=0, thats why this is assigned. (Note: I need to document this where its set!) I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1. Maybe we should just use an assert here instead of an exception, this is somewhat internal though. All tests pass. Please let me know if that fixes your problem. If not then you should at least see the new exception that I added, which might make debugging easier. Will do tomorrow! What exactly was the problem with the previous patch beside the codecID clone issue?
          Hide
          Michael Busch added a comment -

          I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1.

          Yeah, I had the same though. I changed it to use -1 and use an assert now instead of throwing the exception.
          (will post the new patch shortly)

          What exactly was the problem with the previous patch beside the codecID clone issue?

          Not sure if that's what caused your codecID issues, but the previous patch had a problem with assigning field numbers. It could happen that a global number for a FieldInfo was acquired, but that number wasn't available anymore in the local FieldInfos. I think this would be quite rare, but now I'm preventing this from happening.

          Show
          Michael Busch added a comment - I think we should initialize the codecID with a different value and replace the this.codecId != 0 check with something like this.codecId != -1. Yeah, I had the same though. I changed it to use -1 and use an assert now instead of throwing the exception. (will post the new patch shortly) What exactly was the problem with the previous patch beside the codecID clone issue? Not sure if that's what caused your codecID issues, but the previous patch had a problem with assigning field numbers. It could happen that a global number for a FieldInfo was acquired, but that number wasn't available anymore in the local FieldInfos. I think this would be quite rare, but now I'm preventing this from happening.
          Hide
          Michael Busch added a comment -
          • Uses -1 now as initial value for codecID.
          • updated to current trunk

          Let me know if it works without problems now in the doc values branch, Simon!

          Show
          Michael Busch added a comment - Uses -1 now as initial value for codecID. updated to current trunk Let me know if it works without problems now in the doc values branch, Simon!
          Hide
          Simon Willnauer added a comment -

          Uses -1 now as initial value for codecID.

          alright that looks good!

          Let me know if it works without problems now in the doc values branch, Simon!

          this looks good now. I think we need to add a CHANGES.TXT entry for this issue since it changes the runtime behavior. +1 to commit Nice work!

          Show
          Simon Willnauer added a comment - Uses -1 now as initial value for codecID. alright that looks good! Let me know if it works without problems now in the doc values branch, Simon! this looks good now. I think we need to add a CHANGES.TXT entry for this issue since it changes the runtime behavior. +1 to commit Nice work!
          Hide
          Michael Busch added a comment -

          Committed revision 1073110.

          Thanks for reviewing the patch, Simon!

          Show
          Michael Busch added a comment - Committed revision 1073110. Thanks for reviewing the patch, Simon!
          Hide
          Michael McCandless added a comment -

          One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

          If an app indexes docs where each doc has different fields... does this mean we are not going to reclaim the DFPPerField for the fields that are no longer seen after flushing? Ie, would it be a memory leak? Somewhere we have a test case for this I think.

          Show
          Michael McCandless added a comment - One question: we seem to have lost DocFieldProcessorPerThread.trimFields? If an app indexes docs where each doc has different fields... does this mean we are not going to reclaim the DFPPerField for the fields that are no longer seen after flushing? Ie, would it be a memory leak? Somewhere we have a test case for this I think.
          Hide
          Michael Busch added a comment -

          One question: we seem to have lost DocFieldProcessorPerThread.trimFields?

          I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?

          Show
          Michael Busch added a comment - One question: we seem to have lost DocFieldProcessorPerThread.trimFields? I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?
          Hide
          Michael McCandless added a comment -

          I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine?

          Ahh, good, I think you're right! I missed that we clear this on flush.

          Show
          Michael McCandless added a comment - I actually renamed it to doAfterFlush(). It now resets the whole hashmap in DocFieldProcessorPerThread, because we don't want to carry over any field settings into the next segment anymore with per-segment FieldInfos. I think this should be fine? Ahh, good, I think you're right! I missed that we clear this on flush.
          Hide
          Michael Busch added a comment -

          I mentioned on dev that assigning the same field number across segments is best effort now and wanted to explain in greater detail here how it works:

          There is now a global fieldName <-> fieldNumber bi-map in FieldInfos, which contains all fieldName/number pairs seen in a IndexWriter session. It is passed into each new FieldInfos that is created in the same IndexWriter session.

          Also, when a new IndexWriter is opened, the FieldInfos of all segments are read and the global map created - this is tested in a new unit test this issue adds.

          A FieldInfos has in addition to the reference to the global map also a "private" map, which holds all FieldInfo objects that belong to the corresponding segment (remember there's now a 1-1 mapping SegmentInfo->FieldInfos).

          Now the fieldNumber assignment strategy works as follows: If a new FI is added to FieldInfos, the global map is checked for the number of that field. If the field name hasn't been seen before, the smallest number available in the local map is picked (to keep the numbers dense).
          Otherwise, if we have seen the field before, the global number is used. The problem now might be, that the global number might already be taken in the local FieldInfos. In this case the global and local numbers for the same fieldName would differ. This is not a problem in terms of correctness, but could prevent that field from being efficiently bulk-merged.

          With DocumentsWriterPerThreads (DWPTs) in mind I don't see how we could guarantee consistent field numbering across DWPTs, that's why I implemented it in this "best effort" way.

          Here's an example on how we can get into a situation where a field would get different numbers in different segments:
          segment_1 has fields A and B, therefore these mappings A -> 1, B -> 2.
          Now in segment_2 the first field we add is C, which hasn't been seen ever before, so we pick locally number 1 for it. Then we add the next document which has field A, but since number 1 is already taken, it would get a different number than in segment_1. This means A would not get bulk merged.

          Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?

          Show
          Michael Busch added a comment - I mentioned on dev that assigning the same field number across segments is best effort now and wanted to explain in greater detail here how it works: There is now a global fieldName <-> fieldNumber bi-map in FieldInfos, which contains all fieldName/number pairs seen in a IndexWriter session. It is passed into each new FieldInfos that is created in the same IndexWriter session. Also, when a new IndexWriter is opened, the FieldInfos of all segments are read and the global map created - this is tested in a new unit test this issue adds. A FieldInfos has in addition to the reference to the global map also a "private" map, which holds all FieldInfo objects that belong to the corresponding segment (remember there's now a 1-1 mapping SegmentInfo->FieldInfos). Now the fieldNumber assignment strategy works as follows: If a new FI is added to FieldInfos, the global map is checked for the number of that field. If the field name hasn't been seen before, the smallest number available in the local map is picked (to keep the numbers dense). Otherwise, if we have seen the field before, the global number is used. The problem now might be, that the global number might already be taken in the local FieldInfos. In this case the global and local numbers for the same fieldName would differ. This is not a problem in terms of correctness, but could prevent that field from being efficiently bulk-merged. With DocumentsWriterPerThreads (DWPTs) in mind I don't see how we could guarantee consistent field numbering across DWPTs, that's why I implemented it in this "best effort" way. Here's an example on how we can get into a situation where a field would get different numbers in different segments: segment_1 has fields A and B, therefore these mappings A -> 1, B -> 2. Now in segment_2 the first field we add is C, which hasn't been seen ever before, so we pick locally number 1 for it. Then we add the next document which has field A, but since number 1 is already taken, it would get a different number than in segment_1. This means A would not get bulk merged. Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?
          Hide
          Simon Willnauer added a comment -

          Thanks for the clarification, michael!

          ... the smallest number available in the local map is picked (to keep the numbers dense). ...

          Oh man I was not aware of this. I got totally confused... see next comment

          Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal?

          This is how I thought it works but I obviously got confused by global vs. local this is also why I had trouble to understand how that failure could ever have happened. But after looking at the code again this makes sense I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.

          Show
          Simon Willnauer added a comment - Thanks for the clarification, michael! ... the smallest number available in the local map is picked (to keep the numbers dense). ... Oh man I was not aware of this. I got totally confused... see next comment Hmm, after writing this example down I'm realizing that it would be better to just always pick the next available global field number for a new field, then, at least until we get DWPTs, we should never get different numbers across segments, I think? The disadvantage would be that FieldInfos could have "gaps" in the numbers. I implemented the current approach because I wanted to avoid those gaps, but having them would probably not be a big deal? This is how I thought it works but I obviously got confused by global vs. local this is also why I had trouble to understand how that failure could ever have happened. But after looking at the code again this makes sense I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.
          Hide
          Michael Busch added a comment -

          Reopening to make the described improvement that ensures consistent field numbers.

          Show
          Michael Busch added a comment - Reopening to make the described improvement that ensures consistent field numbers.
          Hide
          Michael Busch added a comment -

          I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least.

          I was thinking that we probably write field numbers as VInts in a lot of places, and it would therefore be less efficient to have gaps... but this is probably negligible.

          Show
          Michael Busch added a comment - I don't see any problems in FieldInfo number gaps. this should work just fine and guarantee the bulk copy just for now at least. I was thinking that we probably write field numbers as VInts in a lot of places, and it would therefore be less efficient to have gaps... but this is probably negligible.
          Hide
          Michael McCandless added a comment -

          This was an impressive change Not only did we shift FieldInfos under SegmentInfo, we also fixed a given FieldInfos to allow for sparse mapping (ie, it only contains certain field numbers). Various places previously assumed this was a dense mapping.

          I think even in the DWPT case we should try to assign consistent field numbers? I'm already worried enough about the possible perf hit DWPT will put on normal indexing and the NRT case, since it basically pushes merging out from RAM and onto the "normal" more costly disk based merging. If we also suddenly risk these merges not being bulk merges, that's even worse.

          Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case?

          Show
          Michael McCandless added a comment - This was an impressive change Not only did we shift FieldInfos under SegmentInfo, we also fixed a given FieldInfos to allow for sparse mapping (ie, it only contains certain field numbers). Various places previously assumed this was a dense mapping. I think even in the DWPT case we should try to assign consistent field numbers? I'm already worried enough about the possible perf hit DWPT will put on normal indexing and the NRT case, since it basically pushes merging out from RAM and onto the "normal" more costly disk based merging. If we also suddenly risk these merges not being bulk merges, that's even worse. Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case?
          Hide
          Michael Busch added a comment -

          Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case?

          Yes. The global map is already shared across DWPTs and the lookup is synchronized on the global map. I think if we change the logic to always pick the next available global number we would increase the likelihood that fields get bulk-merged.

          It can't be perfect though, because e.g. if you use addIndexes() to add an external segment that has a different field number assignment. That's why we definitely have to keep the code that can fallback to a different local number if it's not possible to use the global number in a segment. But I agree that we should optimize for the "normal" indexing case. And it seems like we all agree that field number gaps are fine.

          Show
          Michael Busch added a comment - Can't we sync globally on the assignment of field name -> number (the global map lookup)? And FieldInfos per-DWPT would share the same global map. Wouldn't that keep us consistent in the DWPT case? Yes. The global map is already shared across DWPTs and the lookup is synchronized on the global map. I think if we change the logic to always pick the next available global number we would increase the likelihood that fields get bulk-merged. It can't be perfect though, because e.g. if you use addIndexes() to add an external segment that has a different field number assignment. That's why we definitely have to keep the code that can fallback to a different local number if it's not possible to use the global number in a segment. But I agree that we should optimize for the "normal" indexing case. And it seems like we all agree that field number gaps are fine.
          Hide
          Simon Willnauer added a comment -

          For the record, robert reverted the changes made by this issue since we have been experiencing a fair bit of problems lately.

          eventually reproducible with:

          ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValued -Dtests.seed=-4971136915249645135:5200209917417531291 -Dtests.multiplier=3
          
          ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValuedFcs -Dtests.seed=-4971136915249645135:-3738166620811568832 -Dtests.multiplier=3
          
          ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixMultiValued -Dtests.seed=-4971136915249645135:4594369826150277150 -Dtests.multiplier=3
          
          ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValued -Dtests.seed=-4971136915249645135:-7702531001769827248 -Dtests.multiplier=3
          
          ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValuedFcs -Dtests.seed=-4971136915249645135:698398490325732548 -Dtests.multiplier=3
          
          

          I found the problem causing this where certain field numbers got mixed up when the FieldInfos get build initially in IndexWriter and a segment is loaded first which had gaps in its field numbering.
          FieldInfos is ignoring the FieldInfo's number if the FieldInfo does not exist yet and tries to assigne a new "local" field number. But if the next available field number x while the actual FI's number was > x+1 the new added FI will be set to x instead.

          in other words, lets say we have 2 segments:

           seg1 : { fields : [(a:0, c:2)] } 
           seg2 : { fields : [(a:0, b:1, c:2)] } 
          

          if we load seg1's FI we end up with

          fields : [(a:0, c:1)] 

          then we add seg2's FI's and end up with

          fields : [(a:0, c:1, b:2)] 

          this will also explain the TestNRTThreads.testNRTThreads failure where bulkMerge could not be applied due to different field numbers across segments.

          I will upload a patch tomorrow.

          Show
          Simon Willnauer added a comment - For the record, robert reverted the changes made by this issue since we have been experiencing a fair bit of problems lately. eventually reproducible with: ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValued -Dtests.seed=-4971136915249645135:5200209917417531291 -Dtests.multiplier=3 ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetSingleValuedFcs -Dtests.seed=-4971136915249645135:-3738166620811568832 -Dtests.multiplier=3 ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixMultiValued -Dtests.seed=-4971136915249645135:4594369826150277150 -Dtests.multiplier=3 ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValued -Dtests.seed=-4971136915249645135:-7702531001769827248 -Dtests.multiplier=3 ant test -Dtestcase=SimpleFacetsTest -Dtestmethod=testFacetPrefixSingleValuedFcs -Dtests.seed=-4971136915249645135:698398490325732548 -Dtests.multiplier=3 I found the problem causing this where certain field numbers got mixed up when the FieldInfos get build initially in IndexWriter and a segment is loaded first which had gaps in its field numbering. FieldInfos is ignoring the FieldInfo's number if the FieldInfo does not exist yet and tries to assigne a new "local" field number. But if the next available field number x while the actual FI's number was > x+1 the new added FI will be set to x instead. in other words, lets say we have 2 segments: seg1 : { fields : [(a:0, c:2)] } seg2 : { fields : [(a:0, b:1, c:2)] } if we load seg1's FI we end up with fields : [(a:0, c:1)] then we add seg2's FI's and end up with fields : [(a:0, c:1, b:2)] this will also explain the TestNRTThreads.testNRTThreads failure where bulkMerge could not be applied due to different field numbers across segments. I will upload a patch tomorrow.
          Hide
          Simon Willnauer added a comment -

          Here is an updated patch against trunk that fixes the problems we had with SimpleFacetsTest.
          I added a testcase that triggered anyNonBulkMerges to be false as well as field number to be mixed up across segments. This patch also tracks the fieldNumbers globally within a IW session.

          I have been running tests with while(true)

          { ant clean test }

          on a top level directory without any failure for 12 hours now. Aside of that the testcase I added failed reliably with the previous version and it explains all failures we have see so far especially the one with NRTThreads.

          Show
          Simon Willnauer added a comment - Here is an updated patch against trunk that fixes the problems we had with SimpleFacetsTest. I added a testcase that triggered anyNonBulkMerges to be false as well as field number to be mixed up across segments. This patch also tracks the fieldNumbers globally within a IW session. I have been running tests with while(true) { ant clean test } on a top level directory without any failure for 12 hours now. Aside of that the testcase I added failed reliably with the previous version and it explains all failures we have see so far especially the one with NRTThreads.
          Hide
          Michael McCandless added a comment -

          Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did...

          I'm also seeing this in TestIndexWriterExceptions:

          NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testDocumentsWriterExceptions -Dtests.seed=-1056365072156857559:-3700694839465596125
          
          
          There was 1 failure:
          1) testDocumentsWriterExceptions(org.apache.lucene.index.TestIndexWriterExceptions)
          java.io.FileNotFoundException: _1.tvx
          	at org.apache.lucene.store.RAMDirectory.fileLength(RAMDirectory.java:147)
          	at org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:524)
          	at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:290)
          	at org.apache.lucene.index.LogMergePolicy.sizeBytes(LogMergePolicy.java:211)
          	at org.apache.lucene.index.LogByteSizeMergePolicy.size(LogByteSizeMergePolicy.java:45)
          	at org.apache.lucene.index.LogMergePolicy.useCompoundFile(LogMergePolicy.java:165)
          	at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:629)
          	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2538)
          	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2503)
          	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1077)
          	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1041)
          	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1005)
          	at org.apache.lucene.index.TestIndexWriterExceptions.testDocumentsWriterExceptions(TestIndexWriterExceptions.java:565)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
          	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
          	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213)
          	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145)
          	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
          	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
          	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          	at org.junit.runners.Suite.runChild(Suite.java:128)
          	at org.junit.runners.Suite.runChild(Suite.java:24)
          	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
          	at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
          	at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
          	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
          	at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
          	at org.junit.runner.JUnitCore.main(JUnitCore.java:45)
          
          FAILURES!!!
          

          The failure reproduces for me (with the above seed)...

          Show
          Michael McCandless added a comment - Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did... I'm also seeing this in TestIndexWriterExceptions: NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions -Dtestmethod=testDocumentsWriterExceptions -Dtests.seed=-1056365072156857559:-3700694839465596125 There was 1 failure: 1) testDocumentsWriterExceptions(org.apache.lucene.index.TestIndexWriterExceptions) java.io.FileNotFoundException: _1.tvx at org.apache.lucene.store.RAMDirectory.fileLength(RAMDirectory.java:147) at org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:524) at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:290) at org.apache.lucene.index.LogMergePolicy.sizeBytes(LogMergePolicy.java:211) at org.apache.lucene.index.LogByteSizeMergePolicy.size(LogByteSizeMergePolicy.java:45) at org.apache.lucene.index.LogMergePolicy.useCompoundFile(LogMergePolicy.java:165) at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:629) at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2538) at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2503) at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1077) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1041) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1005) at org.apache.lucene.index.TestIndexWriterExceptions.testDocumentsWriterExceptions(TestIndexWriterExceptions.java:565) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1213) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1145) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:24) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at org.junit.runner.JUnitCore.run(JUnitCore.java:136) at org.junit.runner.JUnitCore.run(JUnitCore.java:117) at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98) at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53) at org.junit.runner.JUnitCore.main(JUnitCore.java:45) FAILURES!!! The failure reproduces for me (with the above seed)...
          Hide
          Michael McCandless added a comment -

          Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

          Show
          Michael McCandless added a comment - Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).
          Hide
          Michael McCandless added a comment -

          This is a big patch! Some comments...:

          • Do we even use FieldInfos.add(Document)? Maybe we can remove
            it...
          • FieldNumberBiMap.addOrGet doesn't need to take the FieldInfoBiMap?
            (It's an unused param).
          • Why create the FieldInfoBiMap class? Ie, why not "merge" that
            back into FieldInfos itself (like it used to be)? (I understand
            why we need FieldNumberBiMap – so we can share a single instance
            across FieldInfos).
          • We mix up autoboxing then unboxing, eg
            FieldInfos.addOrUpdateInternal takes int preferredFieldNumber,
            which is boxed when calling localFieldInfos.nextFieldNumber, then
            manually unboxed on calling globalFieldNumbers.addOrGet.
          • On working with a pre-4.0 index that has non-congruent assignments
            across segments... I fear we may not necessarily ever "stabilize" on
            a fixed global name/number bimap, because we re-compute this map
            on every IW init? Ie, when you first open 4.0 IW on pre-4.0
            index, you'll compute a certain global map, and write new segs
            with those bindings. But say some fields differed in their
            assignment... but then some of those conflicting segs are merged.
            Later, when you open the IW again, you'll get a different global
            map? And write new segments conflicting with the previous new
            segments you had written?
          • The fact that SegmentInfo.clearFilesCache is now a public API and
            consumer is responsible for knowing when to call this
            is... spooky. Previously this cache was a fully private thing (ie
            invalidated whenever a change was made to the SegmentInfo). But I
            don't see any way around it; since we now embed a FieldInfos and
            that FieldInfos (hasVectors) could change... maybe, whenever we
            call this method, add a comment explaining why?
          • It makes me nervous that the API that's allowed to pick a new
            field number (FieldInfo.addInternal) is the same API that used
            when reading a FieldInfos from _X.fnm (when we better not pick a
            different field number!). In theory of course those
            field numbers will never conflict and we'll always get our
            preferred field number... but still. Maybe add an assert in
            FieldInfos.read() verifying we always get that field number?
          • The call to localFieldInfos.setIfNotSet in FieldInfos.addInternal
            makes me nervous... is it actually possible for it to already be
            set, to a conflicting binding? Shouldn't it always match? (Ie,
            above, in addOrUpdateInternal, we just consulted the global map to
            get the binding). Can't we assert the global binding is either
            not present (and we add it) or if it is present it "matches"?
          • Why do we have FieldInfos.clearVectors? Nobody should call
            that...?
          • It's not great that we open then close the CFS reader inside
            SegmentInfo, just to read the FieldInfos. Ie, this means on
            opening an SR we will open this CFS reader twice... it also means
            that opening a SegmentInfos is quite a bit more costly than it
            used to be. EG creating an IW must now go and open/close a CFS reader
            per-segment... not sure what we can really do about that
            though... maybe, we should store the FieldInfos inside the
            segments file? Hmmm....
          • Shouldn't IndexWriter.getFieldInfos(SegmentInfo) use the
            SegmentInfo's fieldInfos rather than loading it again from the
            directory...?

          Indentation is also off in various places, for us lonely people who
          still use Emacs

          Show
          Michael McCandless added a comment - This is a big patch! Some comments...: Do we even use FieldInfos.add(Document)? Maybe we can remove it... FieldNumberBiMap.addOrGet doesn't need to take the FieldInfoBiMap? (It's an unused param). Why create the FieldInfoBiMap class? Ie, why not "merge" that back into FieldInfos itself (like it used to be)? (I understand why we need FieldNumberBiMap – so we can share a single instance across FieldInfos). We mix up autoboxing then unboxing, eg FieldInfos.addOrUpdateInternal takes int preferredFieldNumber, which is boxed when calling localFieldInfos.nextFieldNumber, then manually unboxed on calling globalFieldNumbers.addOrGet. On working with a pre-4.0 index that has non-congruent assignments across segments... I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init? Ie, when you first open 4.0 IW on pre-4.0 index, you'll compute a certain global map, and write new segs with those bindings. But say some fields differed in their assignment... but then some of those conflicting segs are merged. Later, when you open the IW again, you'll get a different global map? And write new segments conflicting with the previous new segments you had written? The fact that SegmentInfo.clearFilesCache is now a public API and consumer is responsible for knowing when to call this is... spooky. Previously this cache was a fully private thing (ie invalidated whenever a change was made to the SegmentInfo). But I don't see any way around it; since we now embed a FieldInfos and that FieldInfos (hasVectors) could change... maybe, whenever we call this method, add a comment explaining why? It makes me nervous that the API that's allowed to pick a new field number (FieldInfo.addInternal) is the same API that used when reading a FieldInfos from _X.fnm (when we better not pick a different field number!). In theory of course those field numbers will never conflict and we'll always get our preferred field number... but still. Maybe add an assert in FieldInfos.read() verifying we always get that field number? The call to localFieldInfos.setIfNotSet in FieldInfos.addInternal makes me nervous... is it actually possible for it to already be set, to a conflicting binding? Shouldn't it always match? (Ie, above, in addOrUpdateInternal, we just consulted the global map to get the binding). Can't we assert the global binding is either not present (and we add it) or if it is present it "matches"? Why do we have FieldInfos.clearVectors? Nobody should call that...? It's not great that we open then close the CFS reader inside SegmentInfo, just to read the FieldInfos. Ie, this means on opening an SR we will open this CFS reader twice... it also means that opening a SegmentInfos is quite a bit more costly than it used to be. EG creating an IW must now go and open/close a CFS reader per-segment... not sure what we can really do about that though... maybe, we should store the FieldInfos inside the segments file? Hmmm.... Shouldn't IndexWriter.getFieldInfos(SegmentInfo) use the SegmentInfo's fieldInfos rather than loading it again from the directory...? Indentation is also off in various places, for us lonely people who still use Emacs
          Hide
          Michael Busch added a comment -

          maybe, we should store the FieldInfos inside the segments file? Hmmm....

          I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format

          I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init?

          We could also store the global map on disk? addIndexes() would have to ignore the global map from the external index(es).

          Show
          Michael Busch added a comment - maybe, we should store the FieldInfos inside the segments file? Hmmm.... I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format I fear we may not necessarily ever "stabilize" on a fixed global name/number bimap, because we re-compute this map on every IW init? We could also store the global map on disk? addIndexes() would have to ignore the global map from the external index(es).
          Hide
          Michael McCandless added a comment -

          I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format

          Human-readable format would be sweet

          Though I'm still generally nervous that this means just opening the segments file will become quite a bit more costly. Apps that have many fields will be especially penalized, though, apps really should not be creating so many fields.

          We could also store the global map on disk?

          That's an interesting idea? That'd ensure stability on the bindings, even for pre-4.0 indices. This way a pre-4.0 index would gradually work itself towards being fully consistent...

          addIndexes() would have to ignore the global map from the external index(es).

          Well, addIndexes(IR[]) would get fully remapped to the correct bindings? (since it's a real merge).

          But, yes, addIndexes(Dir[]) would not – they are just file-copied. Hmm, they'd also presumably have a different global map, so if we stored the index global map in the Directory, how would we resolve conflicts on the incoming addIndexes...? I guess the local mapping for the incoming segments would override the global one, on conflict.

          Show
          Michael McCandless added a comment - I had the same thought while adding the ref to FieldInfos to SegmentInfo. Actually this is probably the right thing to do. At the same time we could switch to a human-readable format Human-readable format would be sweet Though I'm still generally nervous that this means just opening the segments file will become quite a bit more costly. Apps that have many fields will be especially penalized, though, apps really should not be creating so many fields. We could also store the global map on disk? That's an interesting idea? That'd ensure stability on the bindings, even for pre-4.0 indices. This way a pre-4.0 index would gradually work itself towards being fully consistent... addIndexes() would have to ignore the global map from the external index(es). Well, addIndexes(IR[]) would get fully remapped to the correct bindings? (since it's a real merge). But, yes, addIndexes(Dir[]) would not – they are just file-copied. Hmm, they'd also presumably have a different global map, so if we stored the index global map in the Directory, how would we resolve conflicts on the incoming addIndexes...? I guess the local mapping for the incoming segments would override the global one, on conflict.
          Hide
          Yonik Seeley added a comment -

          w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?

          Show
          Yonik Seeley added a comment - w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?
          Hide
          Simon Willnauer added a comment -

          Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did...

          this is a problem we have seen before. This is an exception where we write a single doc segment and fail inverting the doc after FI has been updated. Since hasVectors has been moved to FI out of SI this is inconsistent and the TV tries to open the files since hasVectors is true.

          This is also why buschmi added clearVectors (just as a workaround afaik)

          Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!).

          man I still try to trigger it in isolation but I can spend too much time on that right now. Got sucked into Tiered Flushing

          About the other comments (the big bulletpoint list) - in the meanwhile I think we should split this up in several smaller issues.

          • start with resetting the FI after flush to make the flags consistent for each segment
          • move FIs into SI with hasVectors etc still in SI
          • factor hasVectors out of SI into FI
          • introduce a global field num map (maybe on realtime first?) and / or store global map
          • ...
            We can still use this patch as a PoC does that make sense?
          Show
          Simon Willnauer added a comment - Unfortunately, I'm still seeing the assert trip (that there are no non-bulk merges) in TestNRTThreads. Took beast a while to repro but eventually it did... this is a problem we have seen before. This is an exception where we write a single doc segment and fail inverting the doc after FI has been updated. Since hasVectors has been moved to FI out of SI this is inconsistent and the TV tries to open the files since hasVectors is true. This is also why buschmi added clearVectors (just as a workaround afaik) Also, I think we should understand why Solr's SimpleFacetsTest failed from the previous patch and hopefully make a standalone test showing the problem (and that we fixed it!). man I still try to trigger it in isolation but I can spend too much time on that right now. Got sucked into Tiered Flushing About the other comments (the big bulletpoint list) - in the meanwhile I think we should split this up in several smaller issues. start with resetting the FI after flush to make the flags consistent for each segment move FIs into SI with hasVectors etc still in SI factor hasVectors out of SI into FI introduce a global field num map (maybe on realtime first?) and / or store global map ... We can still use this patch as a PoC does that make sense?
          Hide
          Michael Busch added a comment -

          w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)?

          I think we have three options here:
          1) Store FIs inside of compound file (current patch)
          2) Store FIs in SI
          3) Store FIs in own file outside of compound file

          Each has disadvantages:

          1) more expensive to open SegmentInfos, as it now has to open cfs files to load FIs
          2) SI files becomes bigger
          3) one more file descriptor per segment - but can be closed as soon as FIs was read into memory

          Show
          Michael Busch added a comment - w.r.t. storing FielInfos in the segments file - don't we get some consistency benefit from the segments file being small (i.e. it's all in one disk block so it either gets written in it's entirety or not, and readers will see all of it, or not)? I think we have three options here: 1) Store FIs inside of compound file (current patch) 2) Store FIs in SI 3) Store FIs in own file outside of compound file Each has disadvantages: 1) more expensive to open SegmentInfos, as it now has to open cfs files to load FIs 2) SI files becomes bigger 3) one more file descriptor per segment - but can be closed as soon as FIs was read into memory
          Hide
          Simon Willnauer added a comment -

          Attached next iteration.

          • merged FieldInfoBiMap into FieldInfos
          • fixed AutoBoxing issues
          • reverted the deprecation of SegmentInfo#hasVector / hasProx since they where causing many failures due to missing files on exceptions. This also allowed me to make SegmentInfo.clearFilesCache private again.
            *the FieldInfoBiMap now asserts that if you add a field via setIfNotSet that if either the number or the name exists they are the same as in the mapping.
          • added lazy loading of FieldInfos in SegmentInfo to prevent opening the CFS file while opening SegmentInfo. SR#CoreReader now passes in the already opened CFSReader to load the FieldInfos
          • when we open a IW we load the BiMap directly from the SegmentInfos
          • SegmentInfos now owns the FieldInfoBiMap and writes the map upon commit into an additional .fnx file outside of the CFS file. The SegmentInfos persist the bimap only if it has changed otherwise references the previously written map.

          I still have one nocommit since I want to add more tests for the persisted global map. Beside that all tests pass and I would appreciate a review though.
          Mike would you be so kind and let beast break it?

          Show
          Simon Willnauer added a comment - Attached next iteration. merged FieldInfoBiMap into FieldInfos fixed AutoBoxing issues reverted the deprecation of SegmentInfo#hasVector / hasProx since they where causing many failures due to missing files on exceptions. This also allowed me to make SegmentInfo.clearFilesCache private again. *the FieldInfoBiMap now asserts that if you add a field via setIfNotSet that if either the number or the name exists they are the same as in the mapping. added lazy loading of FieldInfos in SegmentInfo to prevent opening the CFS file while opening SegmentInfo. SR#CoreReader now passes in the already opened CFSReader to load the FieldInfos when we open a IW we load the BiMap directly from the SegmentInfos SegmentInfos now owns the FieldInfoBiMap and writes the map upon commit into an additional .fnx file outside of the CFS file. The SegmentInfos persist the bimap only if it has changed otherwise references the previously written map. I still have one nocommit since I want to add more tests for the persisted global map. Beside that all tests pass and I would appreciate a review though. Mike would you be so kind and let beast break it?
          Hide
          Michael McCandless added a comment -

          Mike would you be so kind and let beast break it?

          Beast ran Lucene's core+contrib tests 398 times!

          There was one failure, but I suspect it's unrelated. I'll open a separate issue...

          Show
          Michael McCandless added a comment - Mike would you be so kind and let beast break it? Beast ran Lucene's core+contrib tests 398 times! There was one failure, but I suspect it's unrelated. I'll open a separate issue...
          Hide
          Simon Willnauer added a comment -

          Beast ran Lucene's core+contrib tests 398 times!

          awesome. Thanks for letting it chew on that patch for a while.

          I added more tests to ensure field numbers are stable and some jdoc comments to package private apis. This patch also (hopefully) solves all indentation issues.

          Show
          Simon Willnauer added a comment - Beast ran Lucene's core+contrib tests 398 times! awesome. Thanks for letting it chew on that patch for a while. I added more tests to ensure field numbers are stable and some jdoc comments to package private apis. This patch also (hopefully) solves all indentation issues.
          Hide
          Michael McCandless added a comment -

          Patch looks much better – I think it's close! Comments:

          • I think we should put header (id + version, ie
            CodecUtil.write/readHeader) on fnx file?
          • When does FieldNumberBiMap init from another...? (We have ctor
            that takes a FieldNumberBiMap in... but I don't see who uses
            that).
          • Who uses BiMap.entries()? Looks like just for testing? Can you
            add comment saying "just for testing.."?
          • FieldInfos ctor that loads from index file name but makes a new
            bimap seems spooky...? Is this only used by tests now...?
          • The passed in "version" to SIS.readGlobalFieldMap is unused
          • I'm a little worried that we name the new file _X.fnx, because it
            will appear that this file 'belongs' to segment X, which is
            dangerous because in some recovery cases we will remove all
            files associated w/ a given segment (ie, _X.*). Maybe, we can
            name it without the leading _? Ie, 0.fnx, 1.fnx, etc.?
          • In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be
            moved inside SegmentInfos? Ie it's a bit weird now how we first
            pull an (empty) biMap from the legacy SegmentInfos then we go and
            populate it ourselves...? Then we don't need this method in IW
            (we just ask the SIS for the bimap).
          • Should FieldInfo.putInternal(FieldInfo) assert that the global
            bimap "matches"? (Ie, it makes no effort to update the global
            bimap).
          • The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't
            needed; we pass hasProx to the ctor of that si.
          • Why do we default SegmentInfos.format now...? Seems spooky?
          • In SegmentInfos.rollbackCommit shouldn't we set the
            pendingMapVersion to -1?

          Should we backport this to 3.x (after sufficient aging)?

          Show
          Michael McCandless added a comment - Patch looks much better – I think it's close! Comments: I think we should put header (id + version, ie CodecUtil.write/readHeader) on fnx file? When does FieldNumberBiMap init from another...? (We have ctor that takes a FieldNumberBiMap in... but I don't see who uses that). Who uses BiMap.entries()? Looks like just for testing? Can you add comment saying "just for testing.."? FieldInfos ctor that loads from index file name but makes a new bimap seems spooky...? Is this only used by tests now...? The passed in "version" to SIS.readGlobalFieldMap is unused I'm a little worried that we name the new file _X.fnx, because it will appear that this file 'belongs' to segment X, which is dangerous because in some recovery cases we will remove all files associated w/ a given segment (ie, _X.*). Maybe, we can name it without the leading _? Ie, 0.fnx, 1.fnx, etc.? In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be moved inside SegmentInfos? Ie it's a bit weird now how we first pull an (empty) biMap from the legacy SegmentInfos then we go and populate it ourselves...? Then we don't need this method in IW (we just ask the SIS for the bimap). Should FieldInfo.putInternal(FieldInfo) assert that the global bimap "matches"? (Ie, it makes no effort to update the global bimap). The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't needed; we pass hasProx to the ctor of that si. Why do we default SegmentInfos.format now...? Seems spooky? In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1? Should we backport this to 3.x (after sufficient aging)?
          Hide
          Simon Willnauer added a comment -

          I think we should put header (id + version, ie
          CodecUtil.write/readHeader) on fnx file?

          man, I told myself to add it about 10 times during that patch

          When does FieldNumberBiMap init from another...?

          ah legacy

          I'm a little worried that we name the new file _X.fnx, because it
          will appear that this file 'belongs' to segment X, which is
          dangerous because in some recovery cases we will remove all
          files associated w/ a given segment (ie, _X.*). Maybe, we can
          name it without the leading _? Ie, 0.fnx, 1.fnx, etc.?

          right, good catch! that we can simple remove and then it should be clear that its not a file belonging to a certain segment

          In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be....

          yeah thats is where is should be though. I will move it in.

          The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't...

          true I will remove

          Why do we default SegmentInfos.format now...? Seems spooky?

          this hasn't been used in SIS before so I think it didn't matter before.
          Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....

          In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1

          ah good catch! thanks

          I will fix those issues and upload another patch. Thanks mike for reviewing!!!!

          Show
          Simon Willnauer added a comment - I think we should put header (id + version, ie CodecUtil.write/readHeader) on fnx file? man, I told myself to add it about 10 times during that patch When does FieldNumberBiMap init from another...? ah legacy I'm a little worried that we name the new file _X.fnx, because it will appear that this file 'belongs' to segment X, which is dangerous because in some recovery cases we will remove all files associated w/ a given segment (ie, _X.*). Maybe, we can name it without the leading _? Ie, 0.fnx, 1.fnx, etc.? right, good catch! that we can simple remove and then it should be clear that its not a file belonging to a certain segment In IW.getGlobalFieldNumberMap... shouldn't that "legacy" logic be.... yeah thats is where is should be though. I will move it in. The addition of "si.hasProx = hasProx" in SegmentInfo.clone isn't... true I will remove Why do we default SegmentInfos.format now...? Seems spooky? this hasn't been used in SIS before so I think it didn't matter before. Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine.... In SegmentInfos.rollbackCommit shouldn't we set the pendingMapVersion to -1 ah good catch! thanks I will fix those issues and upload another patch. Thanks mike for reviewing!!!!
          Hide
          Simon Willnauer added a comment -

          Should we backport this to 3.x (after sufficient aging)?

          I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.

          Show
          Simon Willnauer added a comment - Should we backport this to 3.x (after sufficient aging)? I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.
          Hide
          Michael McCandless added a comment -

          Why do we default SegmentInfos.format now...? Seems spooky?

          this hasn't been used in SIS before so I think it didn't matter before.
          Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine....

          Ahh, I see: it's for the case where we make a new SIS() in RAM, because we'll now look @ the format in files(). OK this sounds right then.

          Should we backport this to 3.x (after sufficient aging)?

          I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while.

          Definitely let it bake!

          Also, I have lots of pending backports to 3.2... which this patch likely overlaps on, so we should try to do them "in order" to reduce conflicts I think.

          Show
          Michael McCandless added a comment - Why do we default SegmentInfos.format now...? Seems spooky? this hasn't been used in SIS before so I think it didn't matter before. Yet, I check the format in files() so if you create the SIS without reading it its set to 0. I can certainly make that work with default to 0 but it seemed just natural to have it assigned the current_format. I think its fine.... Ahh, I see: it's for the case where we make a new SIS() in RAM, because we'll now look @ the format in files(). OK this sounds right then. Should we backport this to 3.x (after sufficient aging)? I think we should let it bake in first though. Maybe we can also factor out the hasVectors in another issues and then backport both once they have been random-tested for a little while. Definitely let it bake! Also, I have lots of pending backports to 3.2... which this patch likely overlaps on, so we should try to do them "in order" to reduce conflicts I think.
          Hide
          Michael McCandless added a comment -

          Hmm, but: if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...?

          Show
          Michael McCandless added a comment - Hmm, but: if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...?
          Hide
          Simon Willnauer added a comment -

          if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...?

          hmm actually we don't while the DefaultSIWriter changes it upon write which is not reflected to ram. We might should do that either before we write in DSIWriter. The other thing is that we should carry over the format from the oldSegment in the case we open a specific commit point. I will write a test first that hopefully triggers a fail and fix that accordingly.

          Show
          Simon Willnauer added a comment - if we open a pre-4.0 SIS, make changes, write a new SIS (commit), do we change that new instance's format to 4.0 (in RAM)? Ie, so that files() is correct...? hmm actually we don't while the DefaultSIWriter changes it upon write which is not reflected to ram. We might should do that either before we write in DSIWriter. The other thing is that we should carry over the format from the oldSegment in the case we open a specific commit point. I will write a test first that hopefully triggers a fail and fix that accordingly.
          Hide
          Simon Willnauer added a comment -

          next iteration, I think we are ready to commit here. I added a couple of testcases regarding the fnx file and made sure they get deleted accordingly even if we fail with Exceptions during prepareCommit & finishCommit. Moved the BW-Compat code for building the initial global map to SegmentInfos and cleaned up assertions in FieldInfos (also added the suggested assertion to FIs#putInternal). To prevent that we miss a fnx file if we open an old index, write the fnx file and keep that SIS in memory with format set to some old version I removed the check if we are on 4.0 index but used the latestGlobalFieldNumberVersion which is kept consistent and only include the file if its not set to 0.

          I run whileTrue tests on this patch now for a while and things are looking good from my side. Mike if you have time let beast chew it again. If its fine I will commit tomorrow.

          Show
          Simon Willnauer added a comment - next iteration, I think we are ready to commit here. I added a couple of testcases regarding the fnx file and made sure they get deleted accordingly even if we fail with Exceptions during prepareCommit & finishCommit. Moved the BW-Compat code for building the initial global map to SegmentInfos and cleaned up assertions in FieldInfos (also added the suggested assertion to FIs#putInternal). To prevent that we miss a fnx file if we open an old index, write the fnx file and keep that SIS in memory with format set to some old version I removed the check if we are on 4.0 index but used the latestGlobalFieldNumberVersion which is kept consistent and only include the file if its not set to 0. I run whileTrue tests on this patch now for a while and things are looking good from my side. Mike if you have time let beast chew it again. If its fine I will commit tomorrow.
          Hide
          Michael McCandless added a comment -

          OK beast chewed on this for a few hours – ran all (Lucene + Solr)
          tests 144 times.

          The UIMAUpdateRequestProcessorTest had a number of failures, but I
          suspect they are unrelated. (They could also be due to how I run the
          test – I'm using the python runner from luceneutil).

          So I think it's good! I'll review the patch one more time...

          Show
          Michael McCandless added a comment - OK beast chewed on this for a few hours – ran all (Lucene + Solr) tests 144 times. The UIMAUpdateRequestProcessorTest had a number of failures, but I suspect they are unrelated. (They could also be due to how I run the test – I'm using the python runner from luceneutil). So I think it's good! I'll review the patch one more time...
          Hide
          Michael McCandless added a comment -

          Patch looks great! Only a few things:

          • We still need a header (id + version) on the fnx file?
          • FieldInfos ctor that loads from index file name but makes a new
            bimap seems spooky...? Is this only used by tests now...?
          • Can you add a comment where we create the N.fnx name, explaining
            why it has no leading _? (ie, because it's not a per-segment
            file, but rather a global file, shared by multiple segments)
          • In CHANGES entry, persistend is mis-spelled (need to drop the n);
            also remove the _ from _X.fnx, and add . after "successful commit".
          Show
          Michael McCandless added a comment - Patch looks great! Only a few things: We still need a header (id + version) on the fnx file? FieldInfos ctor that loads from index file name but makes a new bimap seems spooky...? Is this only used by tests now...? Can you add a comment where we create the N.fnx name, explaining why it has no leading _? (ie, because it's not a per-segment file, but rather a global file, shared by multiple segments) In CHANGES entry, persistend is mis-spelled (need to drop the n); also remove the _ from _X.fnx, and add . after "successful commit".
          Hide
          Simon Willnauer added a comment -

          We still need a header (id + version) on the fnx file?

          its there I guess you got the wrong patch.

          FieldInfos ctor that loads from index file name but makes a new
          bimap seems spooky...? Is this only used by tests now...?

          Well this on is only for the read case where we open a FIS. Yet, we need to do this since we store a fnx file per SIS and upon IW#addIndexed(Directory) we could have a FIS that has field number different to the global map. This is fine as long as we don't seed the FIS on read. I will open another issue to make this case more efficient and assert that the FIS is read-only once we created the FIS from a directory.

          I fixed all the remaining issues and will go ahead and commit now. Thanks mike

          Show
          Simon Willnauer added a comment - We still need a header (id + version) on the fnx file? its there I guess you got the wrong patch. FieldInfos ctor that loads from index file name but makes a new bimap seems spooky...? Is this only used by tests now...? Well this on is only for the read case where we open a FIS. Yet, we need to do this since we store a fnx file per SIS and upon IW#addIndexed(Directory) we could have a FIS that has field number different to the global map. This is fine as long as we don't seed the FIS on read. I will open another issue to make this case more efficient and assert that the FIS is read-only once we created the FIS from a directory. I fixed all the remaining issues and will go ahead and commit now. Thanks mike
          Hide
          Simon Willnauer added a comment -

          Committed to trunk - I will keep this open until RT and docvalues have synced up with it.

          Show
          Simon Willnauer added a comment - Committed to trunk - I will keep this open until RT and docvalues have synced up with it.
          Hide
          Simon Willnauer added a comment -

          merged to docvalues and realtime branch.

          Show
          Simon Willnauer added a comment - merged to docvalues and realtime branch.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development