Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This was mostly superseded by the new points API so should we remove auto-prefix terms?

      1. LUCENE-7317.patch
        91 kB
        Adrien Grand

        Activity

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        jim.ferenczi Jim Ferenczi added a comment -

        Sorry for the late reply. Yep, min=1/max=2B is not a reasonable setting, but I have similar results with min=1/max=20, so I think it is worth investigating.
        I opened https://issues.apache.org/jira/browse/LUCENE-7423 which re-implements auto prefix in a new PostingsFormat that builds the prefixes in two passes, like the previous implementation. The nice thing is that it avoids the combinatorial explosion that affected the previous implementation, where we needed to visit all the matching terms for each prefix.

        rcmuir Robert Muir added a comment -

        I don't think min=1/max=2B is a reasonable setting for n-grams.

        Also, keep in mind this feature is not just a standalone codec. It had tentacles, including in places like TermQuery.

        jim.ferenczi Jim Ferenczi added a comment -

        I wanted to see what we're losing with the removal of AutoPrefix, so I ran a small test with English Wikipedia titles.

        I indexed the 12M titles in three indices:

        • default: keyword analyzer and the default postings format
        • auto_prefix: keyword analyzer and the AutoPrefixPostings format with minAutoPrefix=24, maxAutoPrefix=Integer.MAX
        • edge: edge ngram analyzer with minGram=1,maxGram=Integer.MAX and the default postings format.

        index         default    auto_prefix    edge
        size in MB    231        274            1600

        This table shows the size that each index takes on disk, in MB. As you can see, the auto_prefix index is very close in size to the default one even though we compute all the prefixes with more than 24 terms. Compared to the edge n-gram index, which multiplies the index size by a factor of 7, auto prefix seems to be a good trade-off for fields where prefix queries are the norm. I didn't compare query times, but any prefix with more than 24 terms could be resolved by one inverted list in the auto_prefix index, so it is equivalent to the edge n-gram index in that respect.
        The downside of auto_prefix seems to be the merge: it takes more than 1 minute to optimize, which is 10 times slower than the default index. This is expected, though, since the default index uses a keyword analyzer.
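
        For reference, the size ratios implied by the table can be worked out directly. This is only arithmetic on the three figures reported above; the variable and function names are made up for the illustration.

```python
# Index sizes in MB, copied from the table in this comment.
sizes_mb = {"default": 231, "auto_prefix": 274, "edge": 1600}


def ratio_to_default(name: str) -> float:
    """Return how many times larger the named index is than the default one."""
    return sizes_mb[name] / sizes_mb["default"]


auto_prefix_ratio = ratio_to_default("auto_prefix")
edge_ratio = ratio_to_default("edge")

# Print both ratios rounded to two decimals.
print(f"auto_prefix: {auto_prefix_ratio:.2f}x, edge: {edge_ratio:.2f}x")
```

        So the auto_prefix index is roughly 1.2 times the default size, while the edge n-gram index is close to 7 times the default size, which matches the "factor of 7" quoted above.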

        I understand that the new points API is better for numeric prefix/range queries, but auto prefix seems to be a good fit for prefix string queries. It saves a lot of space compared to edge n-grams and indexing is faster. I am not saying we should restore the functionality inside the default BlockTreeTerms, but maybe we could create a separate postings format that exposes this feature?
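
        The space difference comes from which prefixes get materialized. The sketch below is a toy model in plain Python, not Lucene code: it assumes that "auto prefix" means keeping only prefixes shared by at least min_items distinct terms, while the edge n-gram approach keeps every prefix of every term. The function names, the sample terms and the exact threshold semantics are assumptions made for the illustration.

```python
from collections import Counter


def edge_ngrams(term, min_gram=1):
    """All leading prefixes of term, from min_gram characters up to the full term."""
    return [term[:n] for n in range(min_gram, len(term) + 1)]


def frequent_prefixes(terms, min_items):
    """Prefixes shared by at least min_items distinct terms (toy auto-prefix rule)."""
    counts = Counter()
    for term in set(terms):
        for prefix in edge_ngrams(term):
            counts[prefix] += 1
    return {prefix for prefix, count in counts.items() if count >= min_items}


# Hypothetical sample terms standing in for the Wikipedia titles.
titles = ["lucene", "lucent", "lucid", "luck", "solr", "solar"]

# Edge n-gram style: every prefix of every term becomes an indexed term.
all_prefixes = {prefix for title in titles for prefix in edge_ngrams(title)}

# Auto-prefix style: only prefixes that match enough terms are kept.
shared = frequent_prefixes(titles, min_items=3)

print(len(all_prefixes), sorted(shared))
```

        On this sample, the edge n-gram set holds 16 distinct prefixes while only "l", "lu" and "luc" pass the threshold of 3. With a threshold of 24 over 12M titles the same effect would explain why the auto_prefix index stays close to the default size.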

        jpountz Adrien Grand added a comment -

        Auto-prefix terms proved hard to integrate with other APIs. For example, is it right to require all postings formats to support auto prefixes? Is it fine to expose fake terms to other APIs? (The answer to the latter is no, as we had to add a couple of checks to make sure we never create term queries on fake terms.) At the same time we got points / BKD trees working, which solved the prefix/range problem too and added support for multiple dimensions, so it superseded the auto-prefix efforts.

        Even with auto-prefix gone, it is still possible to index prefixes; it just has to be done up-front by indexing prefixes as the edge n-gram filter would. It is just a bit less optimized since we cannot compute optimal automatic prefixes based on the data.
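
        As a rough illustration of indexing prefixes up-front, here is a minimal sketch in plain Python. It is not the Lucene edge n-gram filter: it uses a toy in-memory inverted index, and the function names, the sample documents and the max_gram=20 cap are assumptions chosen for the example.

```python
from collections import defaultdict


def edge_ngrams(term, min_gram=1, max_gram=20):
    """Leading prefixes of term whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(docs, min_gram=1, max_gram=20):
    """Map every edge n-gram of each document's term to the ids of matching docs."""
    index = defaultdict(set)
    for doc_id, term in docs.items():
        for gram in edge_ngrams(term, min_gram, max_gram):
            index[gram].add(doc_id)
    return index


def prefix_lookup(index, prefix):
    """Answer a prefix query as a single exact lookup on the pre-indexed prefix."""
    return sorted(index.get(prefix, ()))


# Hypothetical documents: id -> single keyword term.
docs = {1: "lucene", 2: "lucid", 3: "solr"}
index = build_prefix_index(docs)

print(prefix_lookup(index, "luc"))
```

        Every prefix query turns into one postings lookup, at the cost of indexing every prefix of every term. In this sketch a query longer than max_gram characters finds nothing, so the cap has to cover the longest prefix users are expected to type.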

        ljcollins25 Lance Collins added a comment - - edited

        What is the motivation for removing it if it still has some utility of its own? I was planning to use it to reduce my index size for our search-as-you-type experience.

        jpountz Adrien Grand added a comment -

        Yes. For the record, however, this optimization was not enabled by default and required using a custom PostingsFormat that was not supported in terms of backwards compatibility.

        ljcollins25 Lance Collins added a comment -

        Does this mean that prefix queries can no longer take advantage of this optimization?

        jira-bot ASF subversion and git services added a comment -

        Commit bac521d1aac6d5d7bbdfde195286c5a50e653364 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bac521d ]

        LUCENE-7317: Remove auto-prefix terms.

        jira-bot ASF subversion and git services added a comment -

        Commit dc95f6d62a192018522caada139008fe57d6126d in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dc95f6d ]

        LUCENE-7317: Remove auto-prefix terms.

        mikemccand Michael McCandless added a comment -

        +1. How much easier it is to remove than it was to add!!

        jpountz Adrien Grand added a comment -

        Here is a patch. It removes writes of auto prefix terms in the block tree writer and the AutoPrefixTermsPostingsFormat.


          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand