Details

    • Type: Task Task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Often on the mailing list there is confusion about NOT_ANALYZED.

      Besides being useless (Just use KeywordAnalyzer instead), people trip up on this
      not being consistent at query time (you really need to configure KeywordAnalyzer for the field
      on your PerFieldAnalyzerWrapper so it will do the same thing at query time... oh wait
      once you've done that, you dont need NOT_ANALYZED).

      So I think StringField is a trap too for the same reasons, just under a
      different name, lets remove it.

      1. LUCENE-3792_javadocs_3x.patch
        0.8 kB
        Robert Muir
      2. LUCENE-3792_javadocs_3x.patch
        1.0 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          +1 for KeywordField

          For symmetry i would propose to add a KeywordQuery, too (that rewrites to new ConstantScoreQuery(new TermQuery())).

          Show
          Uwe Schindler added a comment - +1 for KeywordField For symmetry i would propose to add a KeywordQuery, too (that rewrites to new ConstantScoreQuery(new TermQuery())).
          Hide
          Robert Muir added a comment -

          NOT_ANALYZED has two variants - with and without norms.

          You are right, I forgot about this. For NOT_ANALYZED with norms, we should probably just throw CoderMalfunctionError()

          Show
          Robert Muir added a comment - NOT_ANALYZED has two variants - with and without norms. You are right, I forgot about this. For NOT_ANALYZED with norms, we should probably just throw CoderMalfunctionError()
          Hide
          Shai Erera added a comment -

          A couple of comments:

          • NOT_ANALYZED has two variants - with and without norms. I always uses NOT_ANALYZED_NO_NORMS. Is the KWD_ANALYZED suggestion meant to create two variants too? If not, we should somehow let the user be able to express a KWD analyzed field with and without norms.
          • How about renaming StringField to KeywordField? Same as Chris's idea, but removing Analyzed ... shorter name and still captures the essence of it.
          • And maybe TextField -> TokenizedField?
          Show
          Shai Erera added a comment - A couple of comments: NOT_ANALYZED has two variants - with and without norms. I always uses NOT_ANALYZED_NO_NORMS. Is the KWD_ANALYZED suggestion meant to create two variants too? If not, we should somehow let the user be able to express a KWD analyzed field with and without norms. How about renaming StringField to KeywordField? Same as Chris's idea, but removing Analyzed ... shorter name and still captures the essence of it. And maybe TextField -> TokenizedField?
          Hide
          Chris Male added a comment -

          one negative is that both StringField and TextField confusingly take String in their ctors, (I've chosen the wrong one myself before on accident).

          Would renaming StringField, as suggested earlier, alleviate this? KeywordAnalyzedField?

          Show
          Chris Male added a comment - one negative is that both StringField and TextField confusingly take String in their ctors, (I've chosen the wrong one myself before on accident). Would renaming StringField, as suggested earlier, alleviate this? KeywordAnalyzedField?
          Hide
          Robert Muir added a comment -

          Hossman I think KEYWORD_ANALYZED is the ideal name for 3.x actually. I think in combination with the javadocs it would be more clear.

          This still leaves the question for trunk (currently StringField):
          positives are that its actually a "nice" name, concise and to the point.
          another positive is that StringField omits things like positions, and in trunk we don't silently fail if you form a phrase from this.

          one negative is that both StringField and TextField confusingly take String in their ctors, (I've chosen the wrong one myself before on accident).

          Basically to me, this is a combination of traps. Trunk is somewhat better because it throws exceptions for positional queries if
          you actually excluded positions...

          in all cases in 3.x, the wrong 'configuration' here creates a situation where the user just 'does not get results' and they have
          no idea why... despite the fact they used the same Analyzer at query-time and index-time like a good user. thats what I find so frustrating.

          Show
          Robert Muir added a comment - Hossman I think KEYWORD_ANALYZED is the ideal name for 3.x actually. I think in combination with the javadocs it would be more clear. This still leaves the question for trunk (currently StringField): positives are that its actually a "nice" name, concise and to the point. another positive is that StringField omits things like positions, and in trunk we don't silently fail if you form a phrase from this. one negative is that both StringField and TextField confusingly take String in their ctors, (I've chosen the wrong one myself before on accident). Basically to me, this is a combination of traps. Trunk is somewhat better because it throws exceptions for positional queries if you actually excluded positions... in all cases in 3.x, the wrong 'configuration' here creates a situation where the user just 'does not get results' and they have no idea why... despite the fact they used the same Analyzer at query-time and index-time like a good user. thats what I find so frustrating.
          Hide
          Hoss Man added a comment -

          StrawMan suggestion off the top of my head:

          • rename NOT_ANALYZED to something like KEYWORD_ANALYZED
          • document KEYWORD_ANALYZED as being a convenience flag (and/or optimization?) for achieving equivalent behavior as using PerFieldAnalyzer with KeywordAnalyzer for this field, and keep using / re-word rmuir's patch warning to make it clear that if you use this at index time, any attempts to construct queries against it using the QueryParser will need KeywordAnalyzer.

          ...would that flag name == analyzer name equivalence help people remember not to get trapped by this?

          Show
          Hoss Man added a comment - StrawMan suggestion off the top of my head: rename NOT_ANALYZED to something like KEYWORD_ANALYZED document KEYWORD_ANALYZED as being a convenience flag (and/or optimization?) for achieving equivalent behavior as using PerFieldAnalyzer with KeywordAnalyzer for this field, and keep using / re-word rmuir's patch warning to make it clear that if you use this at index time, any attempts to construct queries against it using the QueryParser will need KeywordAnalyzer. ...would that flag name == analyzer name equivalence help people remember not to get trapped by this?
          Hide
          Robert Muir added a comment -

          OK, i think seriously it would take major work to do something here that would make everyone happy.

          I still don't like the situation, but unless there are serious objections, I'd like to commit the javadocs,
          just to hopefully reduce the amount of time this trap gets answered on the user list.

          Show
          Robert Muir added a comment - OK, i think seriously it would take major work to do something here that would make everyone happy. I still don't like the situation, but unless there are serious objections, I'd like to commit the javadocs, just to hopefully reduce the amount of time this trap gets answered on the user list.
          Hide
          Uwe Schindler added a comment -

          I am fine with that. As I said: We need correct documentation to prevent traps. But removing traps by removing possibilities is the wrong way to go. It will force users to have crazy code for simple things, thats my problem. I don't use QueryParser, so I don't want solutions prefering QueryParser over anything else. QueryParser is just a convenience API around new BQ()...new TQ... E.g., I have my own query parser (and I suggest everybody else to have the same for user-facing queries) that simply analyzes the text and creates a BQ out of all anaylzed terms. No syntax, nothing. And I only execute that one on the field in question. Thats very simplistic but another way to look on things.

          Show
          Uwe Schindler added a comment - I am fine with that. As I said: We need correct documentation to prevent traps. But removing traps by removing possibilities is the wrong way to go. It will force users to have crazy code for simple things, thats my problem. I don't use QueryParser, so I don't want solutions prefering QueryParser over anything else. QueryParser is just a convenience API around new BQ()...new TQ... E.g., I have my own query parser (and I suggest everybody else to have the same for user-facing queries) that simply analyzes the text and creates a BQ out of all anaylzed terms. No syntax, nothing. And I only execute that one on the field in question. Thats very simplistic but another way to look on things.
          Hide
          Robert Muir added a comment -

          Sorry, incomplete wording (I forgot to save before svn diff).

          Show
          Robert Muir added a comment - Sorry, incomplete wording (I forgot to save before svn diff).
          Hide
          Robert Muir added a comment -

          Its obvious Uwe and I aren't going to agree here immediately, so here is a patch adding a big warning to 3.x javadocs.

          For now I'd like to apply the same warning to StringField in trunk (I just made the patch against 3.x)

          Show
          Robert Muir added a comment - Its obvious Uwe and I aren't going to agree here immediately, so here is a patch adding a big warning to 3.x javadocs. For now I'd like to apply the same warning to StringField in trunk (I just made the patch against 3.x)
          Hide
          Robert Muir added a comment -

          Not everybody is using Lucene purely as a full-text engine.

          But we cannot let "non-fulltext" uses break the design for the intended use case (full-text).

          Show
          Robert Muir added a comment - Not everybody is using Lucene purely as a full-text engine. But we cannot let "non-fulltext" uses break the design for the intended use case (full-text).
          Hide
          Robert Muir added a comment -

          If you remove StringField completely, pleaese also remove NumericField and force users to use PerFieldAnalyzerWrapper with a NumericTokenStream.

          I actually am not sure this is such a bad idea?

          If we were to enforce such a thing, it would also be possible to add a modification to the queryparser (instanceof NumericTokenStream)
          so that numeric fields then work out of the box with the query parser nicely?

          Show
          Robert Muir added a comment - If you remove StringField completely, pleaese also remove NumericField and force users to use PerFieldAnalyzerWrapper with a NumericTokenStream. I actually am not sure this is such a bad idea? If we were to enforce such a thing, it would also be possible to add a modification to the queryparser (instanceof NumericTokenStream) so that numeric fields then work out of the box with the query parser nicely?
          Hide
          Uwe Schindler added a comment -

          Well we are at a standstill. We constantly get these problems on the users list from NOT_ANALYZED

          You cannot prevent users from doing the wrong thing. If you remove StringField completely, pleaese also remove NumericField and force users to use PerFieldAnalyzerWrapper with a NumericTokenStream. If you add a numeric field you cannot ask for it with query parser. If you add a StringField, you cann ask with QueryParser. Simple rule. It must just be clearly documented. And possible StringField renamed.

          People using primary keys or other untokenized values should simply not use QueryParser. Use a ComstantScoreTermyQuery and you are fine.

          This is all just a documentation problem, so I am completely against removing that. Not everybody is using Lucene purely as a full-text engine.

          Show
          Uwe Schindler added a comment - Well we are at a standstill. We constantly get these problems on the users list from NOT_ANALYZED You cannot prevent users from doing the wrong thing. If you remove StringField completely, pleaese also remove NumericField and force users to use PerFieldAnalyzerWrapper with a NumericTokenStream. If you add a numeric field you cannot ask for it with query parser. If you add a StringField, you cann ask with QueryParser. Simple rule. It must just be clearly documented. And possible StringField renamed. People using primary keys or other untokenized values should simply not use QueryParser. Use a ComstantScoreTermyQuery and you are fine. This is all just a documentation problem, so I am completely against removing that. Not everybody is using Lucene purely as a full-text engine.
          Hide
          Robert Muir added a comment -

          If you don't run those through a query analyzer (e.g. generally produce TermQuery directly) then you have lots of additional work.

          Thats not true, because keywordanalyzer does nothing to the terms, you can continue to produce termquery directly and it will work.
          So expert users are fine.

          This issue isnt about expert users, its about how our API traps people that are not expert users.

          Show
          Robert Muir added a comment - If you don't run those through a query analyzer (e.g. generally produce TermQuery directly) then you have lots of additional work. Thats not true, because keywordanalyzer does nothing to the terms, you can continue to produce termquery directly and it will work. So expert users are fine. This issue isnt about expert users, its about how our API traps people that are not expert users.
          Hide
          Uwe Schindler added a comment -

          I said call it different.

          Show
          Uwe Schindler added a comment - I said call it different.
          Hide
          Robert Muir added a comment -

          Well we are at a standstill. We constantly get these problems on the users list from NOT_ANALYZED
          and I don't like reintroducing the trap again.

          So I'm -1 to StringField in 4.0

          Show
          Robert Muir added a comment - Well we are at a standstill. We constantly get these problems on the users list from NOT_ANALYZED and I don't like reintroducing the trap again. So I'm -1 to StringField in 4.0
          Hide
          Uwe Schindler added a comment - - edited

          The backside of this is now, that you need to explicitely use a KeywordAnalyzer now for Primary Key fields. If you don't run those through a query analyzer (e.g. generally produce TermQuery directly) then you have lots of additional work. For simple "lookup" queries and indexing a PK key, this is a no go.

          -1 on removing that completely, it should simply called different. We should maybe add PKQuery (a ConstantScore TermQuery) and PKField to have a symmetry.

          Show
          Uwe Schindler added a comment - - edited The backside of this is now, that you need to explicitely use a KeywordAnalyzer now for Primary Key fields. If you don't run those through a query analyzer (e.g. generally produce TermQuery directly) then you have lots of additional work. For simple "lookup" queries and indexing a PK key, this is a no go. -1 on removing that completely, it should simply called different. We should maybe add PKQuery (a ConstantScore TermQuery) and PKField to have a symmetry.
          Hide
          Robert Muir added a comment -

          On 3.x, I'd like to deprecate NOT_ANALYZED for the same reasons. This at
          least discourages people from running into that trap there and using
          KeywordAnalyzer instead.

          Show
          Robert Muir added a comment - On 3.x, I'd like to deprecate NOT_ANALYZED for the same reasons. This at least discourages people from running into that trap there and using KeywordAnalyzer instead.
          Hide
          Robert Muir added a comment -

          Setting this as blocker (sorry).

          Its a huge trap when someone sets the same Analyzer on IndexWriter and QueryParser
          but the analysis isn't "actually the same".

          Show
          Robert Muir added a comment - Setting this as blocker (sorry). Its a huge trap when someone sets the same Analyzer on IndexWriter and QueryParser but the analysis isn't "actually the same".

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development