Lucene - Core
  1. Lucene - Core
  2. LUCENE-5680

Allow updating multiple DocValues fields atomically

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This has come up on the list (http://markmail.org/message/2wmpvksuwc5t57pg) – it would be good if we can allow updating several doc-values fields, atomically. It will also improve/simplify our tests, where today we index two fields, e.g. the field itself and a control field. In some multi-threaded tests, since we cannot be sure which updates came through first, we limit the test such that each thread updates a different set of fields, otherwise they will collide and it will be hard to verify the index in the end.

      I was working on a patch and it looks pretty simple to do, will post a patch shortly.

      1. LUCENE-5680.patch
        34 kB
        Shai Erera
      2. LUCENE-5680.patch
        27 kB
        Shai Erera
      3. LUCENE-5680.patch
        23 kB
        Shai Erera
      4. LUCENE-5680.patch
        23 kB
        Shai Erera

        Activity

        Hide
        Shai Erera added a comment -

        Patch add IndexWriter.updateDocValues(DocValuesUpdate...) API which acts basically like delDocs(Term...) or delDocs(Query...) under the hood. I kept the sugar updateNumeric/Binary for convenience.

        What this allows you to do is e.g. update atomically a set of numeric and binary fields. Also, I started by adding update(Term,Map<String,Long>) but chose this method since it allows both mixing numeric/binary as well as is easier to code (e.g. look at tests and how it's easier than creating a map). As a side-effect, this allows you to atomically update fields of different sets of documents, since each DocValuesUpdate has its own Term. I think that could be nice, but I don't mind for now to make sure the Term of all updates is the same (though, I don't see a reason to prevent it..).

        I started to migrate tests to the new API (those that rely on a field and control-field to assert a value), but haven't finished - I want to get some feedback on the API first. Also, I want to add tests that update numeric/binary DVs atomically (to exercise this capability). And I want to add tests that update different sets of docs atomically (i.e. different Term in each update).

        I'd appreciate some comments before I move on w/ the tests.

        Show
        Shai Erera added a comment - Patch add IndexWriter.updateDocValues(DocValuesUpdate...) API which acts basically like delDocs(Term...) or delDocs(Query...) under the hood. I kept the sugar updateNumeric/Binary for convenience. What this allows you to do is e.g. update atomically a set of numeric and binary fields. Also, I started by adding update(Term,Map<String,Long>) but chose this method since it allows both mixing numeric/binary as well as is easier to code (e.g. look at tests and how it's easier than creating a map). As a side-effect, this allows you to atomically update fields of different sets of documents, since each DocValuesUpdate has its own Term. I think that could be nice, but I don't mind for now to make sure the Term of all updates is the same (though, I don't see a reason to prevent it..). I started to migrate tests to the new API (those that rely on a field and control-field to assert a value), but haven't finished - I want to get some feedback on the API first. Also, I want to add tests that update numeric/binary DVs atomically (to exercise this capability). And I want to add tests that update different sets of docs atomically (i.e. different Term in each update). I'd appreciate some comments before I move on w/ the tests.
        Hide
        Shai Erera added a comment -

        I made a slight change to the reduce the API: instead of exposing the public Numeric/BinaryDVUpdate, I added a factory DVUpdate.createNumeric/Binary. Otherwise, someone might confuse and think that by extend DVU to e.g. SortedDVU, he can automagically update SortedDV fields. I also changed DVUpdate ctor to private.

        Show
        Shai Erera added a comment - I made a slight change to the reduce the API: instead of exposing the public Numeric/BinaryDVUpdate, I added a factory DVUpdate.createNumeric/Binary. Otherwise, someone might confuse and think that by extend DVU to e.g. SortedDVU, he can automagically update SortedDV fields. I also changed DVUpdate ctor to private.
        Hide
        Robert Muir added a comment -

        I don't think DocValuesFieldUpdate should become public, nor should it become a parameter in IW's api.

        I think whats wrong here is that these methods on indexwriter need to use Field api or something instead. this way the user just calls IW.update(new Term("id", "12345"), new NumericDocValuesField("xyz", 6)) and its similar to how they build a document initally versus a totally separate API.

        Show
        Robert Muir added a comment - I don't think DocValuesFieldUpdate should become public, nor should it become a parameter in IW's api. I think whats wrong here is that these methods on indexwriter need to use Field api or something instead. this way the user just calls IW.update(new Term("id", "12345"), new NumericDocValuesField("xyz", 6)) and its similar to how they build a document initally versus a totally separate API.
        Hide
        Shai Erera added a comment -

        I think whats wrong here is that these methods on indexwriter need to use Field api or something instead

        I thought about it at some point, but I was worried it might suggest that you can pass whatever Field, even if the method is called updateDocValues(). If we have a DocValuesField base class for all DV field types, that might be better. Of course, even if we go w/ a Field..., we can document that only NDV and BDV fields are supported, and throw appropriate exception otherwise.

        I wanted to avoid having must-read-the-jdocs API, and DocValuesUpdate is the that or something instead that you mentioned. Note that it's NOT DocValuesFieldUpdate which is a heavier class, it's only DocValuesUpdate, which record only the term, field and value. What do you think?

        Show
        Shai Erera added a comment - I think whats wrong here is that these methods on indexwriter need to use Field api or something instead I thought about it at some point, but I was worried it might suggest that you can pass whatever Field, even if the method is called updateDocValues(). If we have a DocValuesField base class for all DV field types, that might be better. Of course, even if we go w/ a Field... , we can document that only NDV and BDV fields are supported, and throw appropriate exception otherwise. I wanted to avoid having must-read-the-jdocs API, and DocValuesUpdate is the that or something instead that you mentioned. Note that it's NOT DocValuesFieldUpdate which is a heavier class, it's only DocValuesUpdate , which record only the term, field and value. What do you think?
        Hide
        Robert Muir added a comment -

        I don't think we should add this DocValuesUpdate class, sorry I'm -1 here.

        I'm already unhappy with the current .document api: tons of abstractions etc, but yet really screwed up for docvalues (they are represented as 'stored' her). This will be a large enough challenge to fix before 5.0: we don't need a second api to deal with as well.

        Show
        Robert Muir added a comment - I don't think we should add this DocValuesUpdate class, sorry I'm -1 here. I'm already unhappy with the current .document api: tons of abstractions etc, but yet really screwed up for docvalues (they are represented as 'stored' her). This will be a large enough challenge to fix before 5.0: we don't need a second api to deal with as well.
        Hide
        Shai Erera added a comment -

        I started to implement an updateDocValues(Term, Field...), ensuring that the field's DVType is not null, and only NUMERIC and BINARY etc. and it worked, except for unsetting a value. Since NumericDocValuesField does not let you pass null as the value. We can add a ctor which takes a Long, and allow you to do that, but it looks odd I think? I haven't thought it through yet, perhaps you have another idea?

        Show
        Shai Erera added a comment - I started to implement an updateDocValues(Term, Field...) , ensuring that the field's DVType is not null, and only NUMERIC and BINARY etc. and it worked, except for unsetting a value. Since NumericDocValuesField does not let you pass null as the value. We can add a ctor which takes a Long , and allow you to do that, but it looks odd I think? I haven't thought it through yet, perhaps you have another idea?
        Hide
        Robert Muir added a comment -

        Whats the value in allowing users to unset things here? I'm not sure of the use case.

        Show
        Robert Muir added a comment - Whats the value in allowing users to unset things here? I'm not sure of the use case.
        Hide
        Shai Erera added a comment - - edited

        Well, first updateNumeric/BinaryDV() allows you to unset a value, and I think we should preserve that capability here. As for unsetting, this could be very useful e.g. for a "saleprice" field as well as any other field which is transient. I think you proposed the "unset" capability while I was working on LUCENE-5189, but I cannot find the reference .

        I agree we shouldn't bloat the API if unnecessary, and when I wrote the update(Term,Field...) version it looked very simple and tests even passed. So I think this direction is promising, but we should allow unsetting. Perhaps we can put somewhere a constant NumericDocValuesField UNSET = ...? Maybe it can be on IndexWriter .. the thing is, we don't need it for e.g. Binary, since they take a BytesRef and at least for now, allow passing a null value, but we can have a similar UNSET constant for binary too.

        Show
        Shai Erera added a comment - - edited Well, first updateNumeric/BinaryDV() allows you to unset a value, and I think we should preserve that capability here. As for unsetting, this could be very useful e.g. for a "saleprice" field as well as any other field which is transient. I think you proposed the "unset" capability while I was working on LUCENE-5189 , but I cannot find the reference . I agree we shouldn't bloat the API if unnecessary, and when I wrote the update(Term,Field...) version it looked very simple and tests even passed. So I think this direction is promising, but we should allow unsetting. Perhaps we can put somewhere a constant NumericDocValuesField UNSET = ... ? Maybe it can be on IndexWriter .. the thing is, we don't need it for e.g. Binary, since they take a BytesRef and at least for now, allow passing a null value, but we can have a similar UNSET constant for binary too.
        Hide
        Robert Muir added a comment -

        I don't think i proposed unsetting

        I don't see how it provides anything over an explicit 'onsale' and 'saleprice' that you update atomically here, which will work easier in general for the search APIs (ranking etc)

        So I really don't think this should drive the API at all, I'm still not seeing a use case.

        Show
        Robert Muir added a comment - I don't think i proposed unsetting I don't see how it provides anything over an explicit 'onsale' and 'saleprice' that you update atomically here, which will work easier in general for the search APIs (ranking etc) So I really don't think this should drive the API at all, I'm still not seeing a use case.
        Hide
        Shai Erera added a comment -

        Patch contains the Field... vs DocValuesUpdate... variants side-by-side, with support for unsetting a field's value. To support that with the Field... method, I added Field.setNumberValue(Number value) and a .createMissing static method to NumericDocValuesField. This now works.

        Now we have them side-by-side in a patch to review.

        Show
        Shai Erera added a comment - Patch contains the Field... vs DocValuesUpdate... variants side-by-side, with support for unsetting a field's value. To support that with the Field... method, I added Field.setNumberValue(Number value) and a .createMissing static method to NumericDocValuesField. This now works. Now we have them side-by-side in a patch to review.
        Hide
        Robert Muir added a comment -

        A few concerns:
        why do we have both MISSING and createMissing? It seems only one is necessary.

        How can we remove the confusion about MISSING, it wont work normally at index time. it only applies to updates right? This needs to either be clear in the API, or also recognized by XDocValuesWriter at index-time and treated as equivalent to null there, too.

        Show
        Robert Muir added a comment - A few concerns: why do we have both MISSING and createMissing? It seems only one is necessary. How can we remove the confusion about MISSING, it wont work normally at index time. it only applies to updates right? This needs to either be clear in the API, or also recognized by XDocValuesWriter at index-time and treated as equivalent to null there, too.
        Hide
        Robert Muir added a comment -

        I'm gonna restate: I don't think we should allow updating values to missing.

        There is no real use case for this, and it is absurdly complicated (it has completely hijacked this issue).

        Show
        Robert Muir added a comment - I'm gonna restate: I don't think we should allow updating values to missing. There is no real use case for this, and it is absurdly complicated (it has completely hijacked this issue).
        Hide
        Shai Erera added a comment -

        why do we have both MISSING and createMissing? It seems only one is necessary.

        It's because I didn't want to add a ctor to NDVField which takes a Long. And if the user will want to create that, by e.g. creating a new NDVField w/ dummy value, then call setNumberValue, it's possible. But then NDVField ctor will create a Long object out of the dummy value .. a waste.

        You know what, perhaps we can leave unsetting through the atomic API out of the picture for now, and handle it when the need arises. Maybe we should also nuke that from the other updateXXXDocValue? I.e. if the need arises, we fix all of them, and make them consistent?

        Show
        Shai Erera added a comment - why do we have both MISSING and createMissing? It seems only one is necessary. It's because I didn't want to add a ctor to NDVField which takes a Long. And if the user will want to create that, by e.g. creating a new NDVField w/ dummy value, then call setNumberValue, it's possible. But then NDVField ctor will create a Long object out of the dummy value .. a waste. You know what, perhaps we can leave unsetting through the atomic API out of the picture for now, and handle it when the need arises. Maybe we should also nuke that from the other updateXXXDocValue? I.e. if the need arises, we fix all of them, and make them consistent?
        Hide
        Shai Erera added a comment -

        Opened LUCENE-5706 to disallow unsetting values. I will handle it before this issue.

        Show
        Shai Erera added a comment - Opened LUCENE-5706 to disallow unsetting values. I will handle it before this issue.
        Hide
        Shai Erera added a comment -

        Patch ports all tests to use the atomic updates. This removed the complexity in TestIWExceptions that we've added recently, since now each set of updates is either atomically applied or not.

        Show
        Shai Erera added a comment - Patch ports all tests to use the atomic updates. This removed the complexity in TestIWExceptions that we've added recently, since now each set of updates is either atomically applied or not.
        Hide
        Shai Erera added a comment -

        I think it's ready so if there are no objections I'll commit later today.

        Show
        Shai Erera added a comment - I think it's ready so if there are no objections I'll commit later today.
        Hide
        ASF subversion and git services added a comment -

        Commit 1598272 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1598272 ]

        LUCENE-5680: add aotmic DocValues updates

        Show
        ASF subversion and git services added a comment - Commit 1598272 from Shai Erera in branch 'dev/trunk' [ https://svn.apache.org/r1598272 ] LUCENE-5680 : add aotmic DocValues updates
        Hide
        ASF subversion and git services added a comment -

        Commit 1598277 from Shai Erera in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1598277 ]

        LUCENE-5680: add aotmic DocValues updates

        Show
        ASF subversion and git services added a comment - Commit 1598277 from Shai Erera in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1598277 ] LUCENE-5680 : add aotmic DocValues updates
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x.

        Show
        Shai Erera added a comment - Committed to trunk and 4x.

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development