Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Even though norms now have an iterator API, they are still always dense in practice since documents that do not have a value get assigned 0 as a norm value.
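To illustrate why an iterator API alone does not make norms sparse, here is a minimal, self-contained sketch of the dense behavior described above. `DenseNormsFill` and `fillDense` are hypothetical names used only for this illustration; they are not Lucene classes, and the real writer is more involved.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of Lucene. */
public final class DenseNormsFill {

    /**
     * Builds one norm per document in [0, maxDoc).
     * docs and norms are parallel arrays: norms[i] belongs to docs[i].
     * Documents that are not listed keep the default value 0.
     */
    public static long[] fillDense(int maxDoc, int[] docs, long[] norms) {
        long[] dense = new long[maxDoc]; // Java zero-initializes the array
        for (int i = 0; i < docs.length; i++) {
            dense[docs[i]] = norms[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Doc 1 has norm 7, doc 4 has a genuine norm of 0, docs 0, 2 and 3 have no value.
        long[] dense = fillDense(5, new int[] {1, 4}, new long[] {7L, 0L});
        System.out.println(Arrays.toString(dense)); // [0, 7, 0, 0, 0]
    }
}
```

In the output, doc 4 (a real norm of 0) cannot be told apart from docs 0, 2 and 3 (no value at all), and every document still costs one slot.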

      1. LUCENE-7475.patch
        79 kB
        Adrien Grand
      2. LUCENE-7475.patch
        79 kB
        Adrien Grand

          Activity

          jpountz Adrien Grand added a comment -

          Here is a patch that:

          • fixes NormValuesWriter to support sparse norms
          • adds a new Lucene70NormsFormat that supports sparsity and only encodes norms for documents that have a norm
          • adds a codecSupportsSparsity method to BaseNormsFormatTestCase so that modern norms formats can get proper testing of the sparse case
          • fixes SimpleTextNormsFormat to support sparsity
          • moves Lucene53NormsFormat to the backward-codecs module

          Notes:

          • the current patch assigns a norm value of zero to fields that generate no tokens (this can happen e.g. with the empty string or if all tokens are stop words) and only considers that a document does not have norms if no text field was indexed at all. We could also decide that fields that generate no tokens are considered missing too; I think both approaches can make sense.
          • the new Lucene70NormsFormat is only a first step, it can certainly be improved in further issues
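The idea of encoding norms only for documents that have one can be sketched as follows. This is an in-memory illustration under stated assumptions, not the on-disk layout or the API of Lucene70NormsFormat: `SparseNormsSketch` is a hypothetical class, and it pairs a `java.util.BitSet` of documents that have a norm with a packed array of values.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not the Lucene70NormsFormat implementation. */
public final class SparseNormsSketch {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final BitSet docsWithNorm; // one bit per document that has a norm
    private final long[] values;       // one value per set bit, in doc-ID order
    private int doc = -1;              // current document, -1 before iteration starts
    private int index = -1;            // position of the current document in values

    public SparseNormsSketch(BitSet docsWithNorm, long[] values) {
        if (docsWithNorm.cardinality() != values.length) {
            throw new IllegalArgumentException("need exactly one value per document with a norm");
        }
        this.docsWithNorm = docsWithNorm;
        this.values = values;
    }

    /** Advances to the next document that has a norm, skipping all others. */
    public int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc;
        }
        int next = docsWithNorm.nextSetBit(doc + 1);
        if (next < 0) {
            doc = NO_MORE_DOCS;
        } else {
            doc = next;
            index++;
        }
        return doc;
    }

    /** Returns the norm of the current document. */
    public long longValue() {
        if (doc < 0 || doc == NO_MORE_DOCS) {
            throw new IllegalStateException("iterator is not positioned on a document");
        }
        return values[index];
    }
}
```

With bits 1 and 4 set and values `{7, 0}`, iteration visits only docs 1 and 4. Doc 4 mirrors the first note above: a field that produced no tokens has a norm of zero but still counts as having a norm, while docs 0, 2 and 3 are never visited.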
          mikemccand Michael McCandless added a comment -

          Woops, I just pushed a small speedup to the old norms format (Lucene43NormsProducer) to avoid the wrapper class (over dense norms) before seeing this new issue.

          jpountz Adrien Grand added a comment -

          No worries, it's a good change! I was just going to ask whether you would be against making longValue() throw an exception.

          jpountz Adrien Grand added a comment -

          Rebased patch against Mike's last changes to LUCENE-7407.

          mikemccand Michael McCandless added a comment -

          This is a great change. I would almost call it fixing a "bug", in that
          it fixes the norms iteration to never iterate to a document that did
          not have that field. Sort of as if we had added docsWithField to
          norms, in the past.

          So if only 1 doc out of zillions is missing the value, we use the
          sparse form. We can improve how we encode it on future issues.

          And of course for very sparse fields, it will be a big win ("pay for
          what you actually use", like postings and (nearly) stored fields).

          I saw some minor things:

          • In Lucene70NormsProducer you can use
            DocValues.emptyNumeric instead of making your own?
          • You can let longValue directly throw IOException now, in
            Lucene70NormsProducer (it's still re-throwing as
            RuntimeException in a few places).

          The test improvements are wonderful.

          +1 to push!

          jpountz Adrien Grand added a comment -

          "We can improve how we encode it on future issues."

          Yes, we will indeed need to improve the format. The current sparse format uses a bitset to store docs with norms, so it is still wasteful in the very sparse case: if fewer than 1/32 of the docs have a value, even storing the full 4-byte doc IDs would be more efficient. On the other hand, if the norms are almost dense, there will be a performance hit, so we might want to keep the dense encoding above a certain threshold of documents that have a value.

          Thanks for having a look. I'll address your comments and push.
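The 1/32 break-even point mentioned above follows from simple arithmetic: a bitset costs one bit per document in the segment, while a plain list of doc IDs costs 32 bits per document that has a norm. The sketch below works through the numbers; `SparseNormsCost` and its methods are hypothetical helpers for illustration and ignore any headers or compression a real format would add.

```java
/** Hypothetical illustration only; not part of Lucene. */
public final class SparseNormsCost {

    /** Bytes for a bitset holding one bit per document in the segment. */
    public static long bitsetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    /** Bytes for storing each doc ID that has a norm as a plain 4-byte int. */
    public static long docIdListBytes(int docsWithNorm) {
        return 4L * docsWithNorm;
    }

    /** True when the doc-ID list is strictly smaller than the bitset. */
    public static boolean docIdListIsSmaller(int maxDoc, int docsWithNorm) {
        return docIdListBytes(docsWithNorm) < bitsetBytes(maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 3_200_000;
        System.out.println(bitsetBytes(maxDoc));                 // 400000
        System.out.println(docIdListBytes(maxDoc / 32));         // 400000
        System.out.println(docIdListIsSmaller(maxDoc, 100_000)); // false: exactly 1/32, a tie
        System.out.println(docIdListIsSmaller(maxDoc, 99_999));  // true: just under 1/32
    }
}
```

For a segment of 3,200,000 documents the bitset takes 400,000 bytes regardless of how many documents have a norm, so the doc-ID list wins only when fewer than 100,000 documents (1/32 of the segment) have one.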

          jira-bot ASF subversion and git services added a comment -

          Commit 9128bdbaf547429667740cdc95370c7c606f83fc in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9128bdb ]

          LUCENE-7475: Make norms sparse.

          jira-bot ASF subversion and git services added a comment -

          Commit e1370d2c2060463da8baffa19719249db1aa1a7d in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e1370d2 ]

          LUCENE-7475: Make Lucene70NormsFormat's SparseDISI use the slice API rather than RandomAccessSlice.

          jira-bot ASF subversion and git services added a comment -

          Commit 5394d29fca8546936dc8227f23c6561d6b386832 in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5394d29 ]

          LUCENE-7475: Remove one layer of abstraction in the Lucene70 norms impl.


            People

            • Assignee:
              Unassigned
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0
              Watchers:
              4
