Lucene - Core
LUCENE-6212

Remove IndexWriter's per-document analyzer add/updateDocument APIs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 5.1, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      IndexWriter already takes an analyzer up-front (via
      IndexWriterConfig), but it also allows you to specify a different one
      for each add/updateDocument.

      I think this is quite dangerous/trappy, since it means you can easily
      index tokens for that document that won't match at search time, because
      the search-time analyzer produces different tokens.

      I think we should remove this trap in 5.0.
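
      For illustration, a rough sketch of the trap against the 4.10-era API; the
      analyzer choices and field name here are hypothetical, not from this issue:

        import org.apache.lucene.analysis.de.GermanAnalyzer;
        import org.apache.lucene.analysis.en.EnglishAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.RAMDirectory;
        import org.apache.lucene.util.Version;

        public class PerDocAnalyzerTrap {
          public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            // The analyzer configured up front is the one query parsers normally reuse at search time.
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LATEST, new EnglishAnalyzer(Version.LATEST)));

            Document doc = new Document();
            doc.add(new TextField("body", "ein kurzer deutscher Text", Field.Store.NO));

            // 4.x only: per-document analyzer override (the API removed by this issue).
            // The German stems indexed here won't match what the English
            // search-time analyzer produces for the same words.
            writer.addDocument(doc, new GermanAnalyzer(Version.LATEST));

            writer.close();
            dir.close();
          }
        }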

        1. LUCENE-6212.patch (80 kB) - Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch, I think it's ready.

        Uwe Schindler added a comment -

        +1 to get this in 5.0

        Ryan Ernst added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1656272 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1656272 ]

        LUCENE-6212: remove per-doc analyzers

        ASF subversion and git services added a comment -

        Commit 1656273 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1656273 ]

        LUCENE-6212: remove per-doc analyzers

        ASF subversion and git services added a comment -

        Commit 1656274 from Michael McCandless in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1656274 ]

        LUCENE-6212: remove per-doc analyzers

        ASF subversion and git services added a comment -

        Commit 1656276 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1656276 ]

        LUCENE-6212: deprecate per-doc analyzers

        Shai Erera added a comment -

        How do you index multi-lingual documents in one index then? We used to do it by pulling the correct Analyzer for the document's language and calling addDoc(doc, langAnalyzer). What's the alternative without that API? Is there an easy alternative, or should we add all fields to the document with a language-specific TokenStream? That is much less convenient, but still an alternative.

        Is it worth having a CHANGES / MIGRATION entry for this? I think if users depend on that API for good reasons (i.e. it's not a 'trap' for them), it should be mentioned somewhere.

        Uwe Schindler added a comment -

        PerFieldAnalyzerWrapper?
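
        For illustration, a rough sketch of that suggestion – one analyzer per field, decided once on IndexWriterConfig (the field names and analyzer choices are made up):

          import java.util.HashMap;
          import java.util.Map;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.de.GermanAnalyzer;
          import org.apache.lucene.analysis.en.EnglishAnalyzer;
          import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriterConfig;

          public class PerFieldSetup {
            static IndexWriterConfig newConfig() {
              Map<String, Analyzer> perField = new HashMap<>();
              perField.put("title_en", new EnglishAnalyzer());
              perField.put("title_de", new GermanAnalyzer());
              // StandardAnalyzer handles any field not listed above.
              Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
              return new IndexWriterConfig(analyzer);
            }
          }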

        Shai Erera added a comment -

        That doesn't help. If all your documents have 'title' and 'body' fields (plus an additional 'language' field), you want the content to be indexed under the 'title' and 'body' fields, and not 'title_en' and 'title_de'. Well maybe you do/should but the point is that you have a single schema for your documents. The only thing that changes is how they are tokenized, and that's decided per document, depending on its language.

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Christopher Cudennec added a comment -

        Hi! Do you have any updates on this issue? We have just tried to upgrade from 4.3.1 to 5.0.0 and have exactly the same problem as Shai Erera.

        Ryan Ernst added a comment -

        you want the content to be indexed under the 'title' and 'body' fields, and not 'title_en' and 'title_de'. Well maybe you do/should but the point is that you have a single schema for your documents.

        Shai Erera Christopher Cudennec That is exactly the problem. It wasn't really a single schema. It was a trappy API that required deciding at query time which analyzer to use. It also means term statistics can be skewed, so the results could be skewed. Using separate fields for each language is much better. It's not really any more work, since you would have had separate analyzers for each of those languages anyway. A sketch of what that looks like is below.
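
        For illustration, a rough sketch of the separate-fields-per-language approach (the field naming scheme and language tag are made up); it pairs with a per-field analyzer configuration like the one sketched above:

          import java.io.IOException;

          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.document.StringField;
          import org.apache.lucene.document.TextField;
          import org.apache.lucene.index.IndexWriter;

          public class MultiLingualIndexing {
            // Route each document's text to the field whose analyzer matches its
            // language (e.g. "title_en", "title_de"), so term statistics stay
            // per-language; at query time, search the field matching the query's language.
            static void addTitle(IndexWriter writer, String title, String lang) throws IOException {
              Document doc = new Document();
              doc.add(new StringField("language", lang, Field.Store.YES));
              doc.add(new TextField("title_" + lang, title, Field.Store.YES));
              writer.addDocument(doc);
            }
          }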

        ASF subversion and git services added a comment -

        Commit 1675278 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1675278 ]

        LUCENE-6212: add MIGRATE.txt entry

        Sanne Grinovero added a comment -

        Hello,
        I understand there are good reasons to prevent this for the "average user", but I would beg you to restore the functionality for those who know what they are doing.

        There are perfectly valid use cases for using a different Analyzer at query time than at indexing time; for example, when synonyms are applied at indexing time you don't need to apply the substitutions again at query time.
        Beyond synonyms, it's also possible to have text from different sources that has been pre-processed in different ways, and so needs to be tokenized differently to get consistent output.

        I love the idea of Lucene becoming stricter about consistent schema choices, but I would hope we could stick to field types and encoding, while Analyzer mappings keep a bit more flexibility?

        Would you accept a patch to overload

        org.apache.lucene.index.IndexWriter.updateDocument(Term, Iterable<? extends IndexableField>)

        with the expert version:

        org.apache.lucene.index.IndexWriter.updateDocument(Term, Iterable<? extends IndexableField>, Analyzer overrideAnalyzer)

        ?

        That would greatly help me migrate to Lucene 5. My alternative is to close and reopen the IndexWriter for each Analyzer change, but that would have a significant performance impact; I'd rather cheat and pass in a mutable Analyzer instance, even though that would prevent me from using the IndexWriter concurrently.

        Adrien Grand added a comment -

        There are perfectly valid use cases to use a different Analyzer at query time rather than indexing time

        This change doesn't force you to use the same analyzer at index time and search time, just to always use the same analyzer at index time.

        it's also possible to have text of different sources which has been pre-processed in different ways, so needs to be tokenized differently to get a consistent output

        One way that this feature was misused was to handle multi-lingual content, but this would break term statistics as different words could be filtered to the same stem and a single word could be filtered to two different stems depending on the language. In general, if different analysis chains are required, it's better to just use different fields or even different indices.

        Sanne Grinovero added a comment -

        Hi Adrien, thanks for replying!
        Yes, I agree with you that in general this could be abused, and I understand the caveats; still, I would like to do it. Since Lucene is a library for developers and it's not an "end user product" I would prefer it could give me a bit more flexibility.

        Hoss Man added a comment -

        Since Lucene is a library for developers and it's not an "end user product" I would prefer it could give me a bit more flexibility.

        Unless I'm misunderstanding the context of your concern, total flexibility in the terms indexed is still available, because you can index Documents containing IndexableFields that produce whatever TokenStream you want – ignoring the Analyzer specified on the IndexWriter if you so choose.

        What this change did is make "the uncommon and easy to mess up case" (asking IndexWriter to analyze your text using a different analyzer for each doc) impossible – but meanwhile "the simple common case" (same analyzer for all docs) and "the expert level case" (I want to produce an arbitrary set of terms for each field and each document) are both still possible and easy. A sketch of the expert case follows.
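
        For illustration, a rough sketch of that expert-level case – handing the field a ready-made TokenStream so the IndexWriter's configured Analyzer is bypassed for it (the analyzer lookup is hypothetical):

          import java.io.IOException;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.TextField;
          import org.apache.lucene.index.IndexWriter;

          public class PreAnalyzedField {
            // 'analyzerForDoc' stands in for however you pick the analysis chain for
            // this particular document; the IndexWriter's configured Analyzer is not
            // consulted for a field that carries its own TokenStream.
            static void addBody(IndexWriter writer, String text, Analyzer analyzerForDoc) throws IOException {
              TokenStream tokens = analyzerForDoc.tokenStream("body", text);
              Document doc = new Document();
              doc.add(new TextField("body", tokens));   // tokens are indexed as-is, not stored
              writer.addDocument(doc);                  // IndexWriter consumes and closes the stream
            }
          }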


        In any event – trying to have a discussion about this in the comments of a Jira issue that's been closed for several months is a really bad idea. If you have questions/concerns about how to use the API, or how to upgrade your existing code, please address those to the java-user@lucene list, where the entire community can help you (not just the handful of devs watching every Jira issue).


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 1
          • Watchers: 7
