Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.2, 5.5.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TokenStream.end() calls getAttribute(), which is pretty costly to do per-stream.

      It does its current hack, because in the ctor of TokenStream is "too early".

      Instead, we can just add a variant of clear(), called end() to AttributeImpl. For most attributes it defers to clear, but for PosIncAtt it can handle the special case.

        Activity

        Hide
        rcmuir Robert Muir added a comment -

        Here is one solution. There may be other ways to fix this.

        Show
        rcmuir Robert Muir added a comment - Here is one solution. There may be other ways to fix this.
        Hide
        mikemccand Michael McCandless added a comment -

        +1, thanks Robert Muir

        This is the apparent source of the very unexpected slowdown here: https://github.com/elastic/elasticsearch/pull/19867#issuecomment-240841821

        Was quite "fun" to track down

        I'm still not sure why this affects my 2-socket (72 core) box but not my single socket (8 core) boxes, but when indexing geonames data with Elasticsearch this dynamic attribute lookup per StringField was exceptionally costly.

        Show
        mikemccand Michael McCandless added a comment - +1, thanks Robert Muir This is the apparent source of the very unexpected slowdown here: https://github.com/elastic/elasticsearch/pull/19867#issuecomment-240841821 Was quite "fun" to track down I'm still not sure why this affects my 2-socket (72 core) box but not my single socket (8 core) boxes, but when indexing geonames data with Elasticsearch this dynamic attribute lookup per StringField was exceptionally costly.
        Hide
        mikemccand Michael McCandless added a comment -

        The performance hit caused by this is really surprising: see the nightly Elasticsearch indexing benchmark: https://benchmarks-old.elastic.co/index.html#indexing

        That drop on 8/9/2016 was from the "simple" change to Elasticsearch to pass binary terms (BytesRef) to Lucene instead of String terms via StringField.

        It dropped indexing performance for the fastsettings run from ~80 K docs/sec to ~39 K docs/sec, but Robert Muir's patch here gets the performance back to ~80 K docs/sec.

        Show
        mikemccand Michael McCandless added a comment - The performance hit caused by this is really surprising: see the nightly Elasticsearch indexing benchmark: https://benchmarks-old.elastic.co/index.html#indexing That drop on 8/9/2016 was from the "simple" change to Elasticsearch to pass binary terms ( BytesRef ) to Lucene instead of String terms via StringField . It dropped indexing performance for the fastsettings run from ~80 K docs/sec to ~39 K docs/sec, but Robert Muir 's patch here gets the performance back to ~80 K docs/sec.
        Hide
        mikemccand Michael McCandless added a comment -

        I think we should fix this for 6.2.0.

        Show
        mikemccand Michael McCandless added a comment - I think we should fix this for 6.2.0.
        Hide
        jpountz Adrien Grand added a comment -

        So if I understand correctly, the slow down was caused by the fact that binary terms use a different impl for their TermToBytesRefAttribute (BytesTermAttributeImpl rather than PackedTokenAttributeImpl) and this confuses hotspot when getAttribute is called (regardless of which attribute is looked up)?

        Should PackedTokenAttributeImpl only do positionIncrement = 0; after calling super.end();? The position increment seems to be the only attribute that deserves special handling? Or maybe you wanted to have explicit handling of all attributes that are wrapper in this über attribute impl?

        Show
        jpountz Adrien Grand added a comment - So if I understand correctly, the slow down was caused by the fact that binary terms use a different impl for their TermToBytesRefAttribute (BytesTermAttributeImpl rather than PackedTokenAttributeImpl) and this confuses hotspot when getAttribute is called (regardless of which attribute is looked up)? Should PackedTokenAttributeImpl only do positionIncrement = 0; after calling super.end(); ? The position increment seems to be the only attribute that deserves special handling? Or maybe you wanted to have explicit handling of all attributes that are wrapper in this über attribute impl?
        Hide
        rcmuir Robert Muir added a comment -

        That's not the issue as I see it.

        To me, the problem is that we do getAttribute per-stream here (rather than once in a ctor, like everywhere else). We should not be doing that.

        As far as the Packed impl, i did it the way i did to keep it simple and make it clear what its doing. If we were to call super.end(), which then would call close(), setting positionIncrement to 1, then we were to set it back to 0, i mean, that is totally unclear. I can't imagine why we would do it in that way, its not like we are paying a tax for lines of code. Even if we were, this would be worth it for the extra lines, because its way simpler not to do stuff like that!

        Show
        rcmuir Robert Muir added a comment - That's not the issue as I see it. To me, the problem is that we do getAttribute per-stream here (rather than once in a ctor, like everywhere else). We should not be doing that. As far as the Packed impl, i did it the way i did to keep it simple and make it clear what its doing. If we were to call super.end(), which then would call close(), setting positionIncrement to 1, then we were to set it back to 0, i mean, that is totally unclear. I can't imagine why we would do it in that way, its not like we are paying a tax for lines of code. Even if we were, this would be worth it for the extra lines, because its way simpler not to do stuff like that!
        Hide
        thetaphi Uwe Schindler added a comment -

        I agree, the latest patch is the only correct fix. The special case for PositionIncrement inside the TokenStream base class is just wrong + it adds overhead in getAttribute. I agree attributes should have the option to say on their own what they need to do on end!

        It would be good to find out when this problem in end() was added. To me thois code looked completely new, never have seen it before!

        Show
        thetaphi Uwe Schindler added a comment - I agree, the latest patch is the only correct fix. The special case for PositionIncrement inside the TokenStream base class is just wrong + it adds overhead in getAttribute. I agree attributes should have the option to say on their own what they need to do on end! It would be good to find out when this problem in end() was added. To me thois code looked completely new, never have seen it before!
        Hide
        rcmuir Robert Muir added a comment -

        I am sure the problem originated when factoring lots of +1/-1 logic outside of documentsinverter. The indexer used to have a lot of this "special case logic" itself, to handle multi-valued fields which was really confusing.

        Show
        rcmuir Robert Muir added a comment - I am sure the problem originated when factoring lots of +1/-1 logic outside of documentsinverter. The indexer used to have a lot of this "special case logic" itself, to handle multi-valued fields which was really confusing.
        Hide
        jpountz Adrien Grand added a comment -

        I was just curious to understand whit made hotspot unhappy since this the getAttribute call did not seem to be much of an issue is the case the field is a string rather than opaque bytes. I agree the removal of this per-stream getAttribute call is a good move regardless.

        Show
        jpountz Adrien Grand added a comment - I was just curious to understand whit made hotspot unhappy since this the getAttribute call did not seem to be much of an issue is the case the field is a string rather than opaque bytes. I agree the removal of this per-stream getAttribute call is a good move regardless.
        Hide
        rcmuir Robert Muir added a comment -

        As soon as mike pointed to this area of the code being problematic, i saw this getAttribute() call, said "that is bogus" and made a patch. From my perspective thats the whole story, i didn't investigate any further.

        Show
        rcmuir Robert Muir added a comment - As soon as mike pointed to this area of the code being problematic, i saw this getAttribute() call, said "that is bogus" and made a patch. From my perspective thats the whole story, i didn't investigate any further.
        Hide
        mikemccand Michael McCandless added a comment -

        I tested indexing all geonames, using lucene server (https://github.com/mikemccand/luceneserver) and this patch gets a ~5% speedup:

          before: Total: 426514.0 docs/sec for 26.1 sec
           after: Total: 447550.4 docs/sec for 24.8 sec
        

        This is on an 8 core box, so it's nice to confirm I can see a speedup not just on the crazy 72 core box.

        Show
        mikemccand Michael McCandless added a comment - I tested indexing all geonames, using lucene server ( https://github.com/mikemccand/luceneserver ) and this patch gets a ~5% speedup: before: Total: 426514.0 docs/sec for 26.1 sec   after: Total: 447550.4 docs/sec for 24.8 sec This is on an 8 core box, so it's nice to confirm I can see a speedup not just on the crazy 72 core box.
        Hide
        dsmiley David Smiley added a comment -

        Nice; +1 to the patch and to commit to v6.2

        Show
        dsmiley David Smiley added a comment - Nice; +1 to the patch and to commit to v6.2
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f8536ce72606af6c75cf9137f354da57bb0f3dbc in lucene-solr's branch refs/heads/master from Robert Muir
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8536ce ]

        LUCENE-7419: Don't lookup PositionIncrementAttribute every time in TokenStream.end()

        Show
        jira-bot ASF subversion and git services added a comment - Commit f8536ce72606af6c75cf9137f354da57bb0f3dbc in lucene-solr's branch refs/heads/master from Robert Muir [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8536ce ] LUCENE-7419 : Don't lookup PositionIncrementAttribute every time in TokenStream.end()
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 326db813fa1791a3ada5d5dafebfb47268f03632 in lucene-solr's branch refs/heads/branch_6x from Robert Muir
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=326db81 ]

        LUCENE-7419: Don't lookup PositionIncrementAttribute every time in TokenStream.end()

        Show
        jira-bot ASF subversion and git services added a comment - Commit 326db813fa1791a3ada5d5dafebfb47268f03632 in lucene-solr's branch refs/heads/branch_6x from Robert Muir [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=326db81 ] LUCENE-7419 : Don't lookup PositionIncrementAttribute every time in TokenStream.end()
        Hide
        rcmuir Robert Muir added a comment -

        thanks Michael McCandless for tracking this down!

        Show
        rcmuir Robert Muir added a comment - thanks Michael McCandless for tracking this down!
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.
        Hide
        mikemccand Michael McCandless added a comment -

        Reopening for eventual possible backport to a potential 5.5.5 release.

        Show
        mikemccand Michael McCandless added a comment - Reopening for eventual possible backport to a potential 5.5.5 release.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 4dbaed52a0a721b2b9668ee8074da42585fd54ea in lucene-solr's branch refs/heads/branch_5_5 from Robert Muir
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4dbaed5 ]

        LUCENE-7419: Don't lookup PositionIncrementAttribute every time in TokenStream.end()

        Show
        jira-bot ASF subversion and git services added a comment - Commit 4dbaed52a0a721b2b9668ee8074da42585fd54ea in lucene-solr's branch refs/heads/branch_5_5 from Robert Muir [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4dbaed5 ] LUCENE-7419 : Don't lookup PositionIncrementAttribute every time in TokenStream.end()
        Hide
        mikemccand Michael McCandless added a comment -

        OK I backported to 5.5.x branch so next release we do here will have it.

        Show
        mikemccand Michael McCandless added a comment - OK I backported to 5.5.x branch so next release we do here will have it.

          People

          • Assignee:
            Unassigned
            Reporter:
            rcmuir Robert Muir
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development