Lucene - Core
  1. Lucene - Core
  2. LUCENE-2548

Remove all interning of field names from flex API

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In previous versions of Lucene, interning of fields was important to minimize string comparison cost when iterating TermEnums, to detect changes in field name. As we separated field names from terms in flex, no query compares field names anymore, so the whole performance problematic interning can be removed. I will start with doing this, but we need to carefully review some places e.g. in preflex codec.

      Maybe before this issue we should remove the Term class completely. Robert?

      1. LUCENE-2548.patch
        91 kB
        Michael McCandless
      2. LUCENE-2548.patch
        70 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          Maybe before this issue we should remove the Term class completely.

          Sounds great... but there is a lot of code (eg in contrib, Solr) to fix if you want to do this.
          I guess when i considered this option, i thought it was gonna be a ton of work.

          Show
          Robert Muir added a comment - Maybe before this issue we should remove the Term class completely. Sounds great... but there is a lot of code (eg in contrib, Solr) to fix if you want to do this. I guess when i considered this option, i thought it was gonna be a ton of work.
          Hide
          Shai Erera added a comment -

          Why remove Term? I think it's a nice API, and for most cases, a term will still be a Term and not a BytesRef + Field. Isn't it a convenient class? Is there an alternative one?

          Show
          Shai Erera added a comment - Why remove Term? I think it's a nice API, and for most cases, a term will still be a Term and not a BytesRef + Field. Isn't it a convenient class? Is there an alternative one?
          Hide
          Robert Muir added a comment -

          I think it's a nice API, and for most cases, a term will still be a Term and not a BytesRef + Field

          Even if a term is a Term, a Term now is always a BytesRef + field behind the scenes anyway.

          Isn't it a convenient class?

          Basically, this is why i didnt go this route of removing it (instead modifying Term class to work with bytesref).
          The problem I saw was: if we have to modify tons of code to get rid of it, so would users too on upgrading.

          Show
          Robert Muir added a comment - I think it's a nice API, and for most cases, a term will still be a Term and not a BytesRef + Field Even if a term is a Term, a Term now is always a BytesRef + field behind the scenes anyway. Isn't it a convenient class? Basically, this is why i didnt go this route of removing it (instead modifying Term class to work with bytesref). The problem I saw was: if we have to modify tons of code to get rid of it, so would users too on upgrading.
          Hide
          Uwe Schindler added a comment -

          I think the discussion about Term removal is not really related to this issue. Removing Term would only have the big advantage that we don't suddenly change Term to no longer intern() the field name and so maybe code outside Lucene using Terms and relying on the fact that the term field name is interned, may break. Removal of intern() must then be clearly noted in migration.

          Show
          Uwe Schindler added a comment - I think the discussion about Term removal is not really related to this issue. Removing Term would only have the big advantage that we don't suddenly change Term to no longer intern() the field name and so maybe code outside Lucene using Terms and relying on the fact that the term field name is interned, may break. Removal of intern() must then be clearly noted in migration.
          Hide
          Robert Muir added a comment -

          Uwe, but removing intern() from Term is almost just as bad as removing Term, because we at least have to review all uses (e.g. Solr) and see if it would cause incorrect code (e.g. == comparison that is suddenly wrong) or performance problems in containers sorting terms or anything of the like?

          Again, I don't personally have an opinion either way, I just mentioned why I didn't remove it, its like Token, still lots of code using it

          Show
          Robert Muir added a comment - Uwe, but removing intern() from Term is almost just as bad as removing Term, because we at least have to review all uses (e.g. Solr) and see if it would cause incorrect code (e.g. == comparison that is suddenly wrong) or performance problems in containers sorting terms or anything of the like? Again, I don't personally have an opinion either way, I just mentioned why I didn't remove it, its like Token, still lots of code using it
          Hide
          Shai Erera added a comment -

          I agree. Term is frequently used (at least in our apps) and the wrapping around BytesRef is nice too. One can still call text() or the like and get the string rep. of it which in most cases is what you put there in the first place.

          And I also agree about stopping interning field suddenly. What is the reason for stop doing that?

          Show
          Shai Erera added a comment - I agree. Term is frequently used (at least in our apps) and the wrapping around BytesRef is nice too. One can still call text() or the like and get the string rep. of it which in most cases is what you put there in the first place. And I also agree about stopping interning field suddenly. What is the reason for stop doing that?
          Hide
          Uwe Schindler added a comment -

          And I also agree about stopping interning field suddenly. What is the reason for stop doing that?

          I don't understand the question.

          The reason for removing interning is to remove the cost of doing this without need in trunk. The interning was solely done for speeding up typical TermEnum iteration where each term's field need to be compared to detect a change. As fields are now no longer coupled to terms and Term*s*Enums (TermEnum was removed) only iterate over one field, this is useless and the cost for creating terms does no retify to keep it.

          Show
          Uwe Schindler added a comment - And I also agree about stopping interning field suddenly. What is the reason for stop doing that? I don't understand the question. The reason for removing interning is to remove the cost of doing this without need in trunk. The interning was solely done for speeding up typical TermEnum iteration where each term's field need to be compared to detect a change. As fields are now no longer coupled to terms and Term*s*Enums (TermEnum was removed) only iterate over one field, this is useless and the cost for creating terms does no retify to keep it.
          Hide
          Shai Erera added a comment -

          Ohh, I see. I don't remember if I ever relied on interning for other purposes, but if that's the only reason, then I agree there's no point in interning anymore. But perhaps we should allow that through another API, in case someone relies on it elsewhere?

          Show
          Shai Erera added a comment - Ohh, I see. I don't remember if I ever relied on interning for other purposes, but if that's the only reason, then I agree there's no point in interning anymore. But perhaps we should allow that through another API, in case someone relies on it elsewhere?
          Hide
          Robert Muir added a comment -

          after seeing LUCENE-3105, i think we should take steps to remove this interning.

          it looks like this can probably be done safely, according to http://www.cs.umd.edu/~jfoster/papers/issre04.pdf , findbugs, PMD, and JLint all support looking for string equality with == or !=, so we should be able to review all occurrences.

          Show
          Robert Muir added a comment - after seeing LUCENE-3105 , i think we should take steps to remove this interning. it looks like this can probably be done safely, according to http://www.cs.umd.edu/~jfoster/papers/issre04.pdf , findbugs, PMD, and JLint all support looking for string equality with == or !=, so we should be able to review all occurrences.
          Hide
          Michael McCandless added a comment -

          Initial patch.

          Tests are passing, at least a few iterations (I'll beast it). There are still a few nocommits...

          I used PMD and findbugs to find == and Unable to render embedded object: File (= on strings, but surprisingly there are cases that these tools seem to miss. I also did various greps to try to find cases... but I'm sure I've missed some) not found.

          Show
          Michael McCandless added a comment - Initial patch. Tests are passing, at least a few iterations (I'll beast it). There are still a few nocommits... I used PMD and findbugs to find == and Unable to render embedded object: File (= on strings, but surprisingly there are cases that these tools seem to miss. I also did various greps to try to find cases... but I'm sure I've missed some) not found.
          Hide
          Robert Muir added a comment -

          is there any reason to keep Term.createTerm() after we do this? seems useless after interning is removed.

          Show
          Robert Muir added a comment - is there any reason to keep Term.createTerm() after we do this? seems useless after interning is removed.
          Hide
          Michael McCandless added a comment -

          I agree – I removed createTerm!

          And fixed the nocommits....

          Beast chewed on this for a while and didn't hit any failures except various Solr tests that still intermittently fail... I think it's ready!

          Show
          Michael McCandless added a comment - I agree – I removed createTerm! And fixed the nocommits.... Beast chewed on this for a while and didn't hit any failures except various Solr tests that still intermittently fail... I think it's ready!
          Hide
          Uwe Schindler added a comment -

          Yupee Juhee. I was on business trip whole day. Insane! Will review soon!

          Show
          Uwe Schindler added a comment - Yupee Juhee. I was on business trip whole day. Insane! Will review soon!
          Hide
          Uwe Schindler added a comment -

          Hi Mike,

          patch looks great, thanks for doing this hard work PreFlexCodec looks fine, see no problems there. Lucene code iterating TermsEnums was successfully cleaned up (the lovely MTQs) from T.createTerm and equals added at some places.

          I cannot check if there are comparisons missing, I wonder why PMD/Findbugs has bugs that it does not find all occurences, maybe because some SuppressWarnings also hiding those occurences? Can you explain shortly what "Unable to render embedded object: File" has to do with interning?

          Solr code is fine, I expected more to change. Some places in Solr still seems to use some "placeholder" terms (called idTerm and other names). We should maybe check if they are only field names in reality?

          GREAT WORK! I AM SO HAPPY, dumdidumm...!

          Show
          Uwe Schindler added a comment - Hi Mike, patch looks great, thanks for doing this hard work PreFlexCodec looks fine, see no problems there. Lucene code iterating TermsEnums was successfully cleaned up (the lovely MTQs) from T.createTerm and equals added at some places. I cannot check if there are comparisons missing, I wonder why PMD/Findbugs has bugs that it does not find all occurences, maybe because some SuppressWarnings also hiding those occurences? Can you explain shortly what "Unable to render embedded object: File" has to do with interning? Solr code is fine, I expected more to change. Some places in Solr still seems to use some "placeholder" terms (called idTerm and other names). We should maybe check if they are only field names in reality? GREAT WORK! I AM SO HAPPY, dumdidumm...!
          Hide
          Uwe Schindler added a comment -

          Can you explain shortly what "Unable to render embedded object: File" has to do with interning?

          That was just a JIRA formatting issue in Mike's comment I was referring to.

          Show
          Uwe Schindler added a comment - Can you explain shortly what "Unable to render embedded object: File" has to do with interning? That was just a JIRA formatting issue in Mike's comment I was referring to.
          Hide
          Michael McCandless added a comment -

          Woops – my comment was just saying that both == and ! = cases weren't always caught by PMD/findbugs. But maybe I somehow messed up running them!

          Show
          Michael McCandless added a comment - Woops – my comment was just saying that both == and ! = cases weren't always caught by PMD/findbugs. But maybe I somehow messed up running them!
          Hide
          Michael McCandless added a comment -

          Committed! Uwe, I think I fixed all the places where we were making a placeholder term just to hold a field...

          Show
          Michael McCandless added a comment - Committed! Uwe, I think I fixed all the places where we were making a placeholder term just to hold a field...

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development