Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today the DVConsumer API requires the caller to pass a value for every document, where null means "this doc has no value". The Codec can then choose how to encode the values, i.e. whether it writes a 0 for a numeric field or encodes only the docs that have values. In practice, from what I see, we choose to encode the 0s.

      I wonder whether it would make a better API if we e.g. added an Iterable<Number> to DVConsumer.addXYZField(). The caller would pass only <doc,value> pairs, and it would be up to the Codec to decide how it wants to encode the missing values. For example, if a user's app truly has a sparse NDV (NumericDocValues) field, IndexWriter wouldn't need to "fill the gaps" artificially. That would be the job of the Codec.
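To make the contrast concrete, here is a minimal sketch of the two shapes being discussed. It is not Lucene code: `DocValue`, `toSparse` and `fillGaps` are hypothetical names invented for illustration, and the only assumption taken from the description is that the current API sees one entry per document with null marking a missing value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparseDocValuesSketch {

    /** Hypothetical (doc, value) pair; not a Lucene class. */
    static final class DocValue {
        final int docID;
        final long value;

        DocValue(int docID, long value) {
            this.docID = docID;
            this.value = value;
        }
    }

    /**
     * Converts the dense shape (one entry per document, null = no value)
     * into the proposed sparse shape (only docs that have a value).
     */
    static List<DocValue> toSparse(Iterable<Number> dense) {
        List<DocValue> pairs = new ArrayList<>();
        int docID = 0;
        for (Number n : dense) {
            if (n != null) {
                pairs.add(new DocValue(docID, n.longValue()));
            }
            docID++;
        }
        return pairs;
    }

    /**
     * One possible codec-side choice: fill the gaps with 0.
     * With the sparse shape this decision lives in the codec, not the caller.
     */
    static long[] fillGaps(Iterable<DocValue> sparse, int maxDoc) {
        long[] out = new long[maxDoc]; // Java zero-initializes the array
        for (DocValue dv : sparse) {
            out[dv.docID] = dv.value;
        }
        return out;
    }

    public static void main(String[] args) {
        // Five documents, only docs 0 and 3 have a value.
        List<Number> dense = Arrays.asList(7L, null, null, 42L, null);
        List<DocValue> sparse = toSparse(dense);
        System.out.println(sparse.size());                          // 2
        System.out.println(Arrays.toString(fillGaps(sparse, 5)));   // [7, 0, 0, 42, 0]
    }
}
```

In this sketch the caller never materializes the nulls; a codec that prefers the dense encoding can still rebuild it, as `fillGaps` does.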

      To be clear, I don't propose to change any Codec implementation in this issue (w.r.t. sparse encoding - yes/no), only to change the API to reflect that sparseness. I think that if we ever want to encode sparse values, it will be a more convenient API.

      Thoughts? I volunteer to do this work, but want to get others' opinion before I start.

          Activity

          rcmuir Robert Muir added a comment -

          The codec can already decide how to encode the values. Making the API more complicated doesn't seem to buy us anything. I'm open to a benchmark showing this, but I'm not seeing it.

          shaie Shai Erera added a comment -

          I don't think it makes the API more complicated. To the users of the API we say "pass only docs with values". To the Codec developers we say "you are going to get only docs with values, so encode however you see fit such that you can later provide docsWithFields efficiently". It's not about performance yet, but about making the API clear (in my opinion) - stating that null denotes a missing value for a document is not better than just not passing the document in the first place.
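As an illustration of "encode however you see fit such that you can later provide docsWithFields efficiently", the following is a rough sketch of one way a codec could consume only the docs that have values. It uses `java.util.BitSet` rather than any Lucene class, the names `SparseNumericField`, `encode` and `get` are hypothetical, and it assumes the doc IDs arrive in ascending order.

```java
import java.util.BitSet;

public class DocsWithFieldSketch {

    /** Hypothetical encoded field; not Lucene's on-disk format. */
    static final class SparseNumericField {
        final BitSet docsWithField; // bit set for every doc that has a value
        final long[] values;        // values packed in ascending docID order

        SparseNumericField(BitSet docsWithField, long[] values) {
            this.docsWithField = docsWithField;
            this.values = values;
        }
    }

    /**
     * Encodes parallel arrays of (docID, value) pairs.
     * Assumes docIDs are unique and sorted ascending.
     */
    static SparseNumericField encode(int[] docIDs, long[] values, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docID : docIDs) {
            bits.set(docID);
        }
        return new SparseNumericField(bits, values.clone());
    }

    /** Returns the value for docID, or null if the doc has no value. */
    static Long get(SparseNumericField field, int docID) {
        if (!field.docsWithField.get(docID)) {
            return null;
        }
        // Number of docs with a value before docID gives the index into values.
        // A linear-time rank is enough for a sketch; a real codec would index this.
        int ordinal = field.docsWithField.get(0, docID).cardinality();
        return field.values[ordinal];
    }

    public static void main(String[] args) {
        SparseNumericField f = encode(new int[] {0, 3}, new long[] {7L, 42L}, 5);
        System.out.println(get(f, 3)); // 42
        System.out.println(get(f, 1)); // null
    }
}
```

Here a missing value is represented by the absence of a bit rather than by a null passed through the API, which is the distinction the comment above is drawing.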

          mikemccand Michael McCandless added a comment -

          Dup of LUCENE-7407. We now pass a DocValuesProducer to all the addXYZField methods when writing doc values.


            People

            • Assignee:
              Unassigned
              Reporter:
              shaie Shai Erera
            • Votes:
              2
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: