SOLR-10351

Add analyze Stream Evaluator to support streaming NLP

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: None
    • Fix Version/s: 6.6, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:

      Description

      The analyze Stream Evaluator uses a Solr analyzer to return a collection of tokens from a text field. The collection of tokens can then be streamed out by the cartesianProduct Streaming Expression or attached to documents as multi-valued fields by the select Streaming Expression.

      This allows Streaming Expressions to leverage all the existing tokenizers and filters and provides a place for future NLP analyzers to be added to Streaming Expressions.

      Sample syntax:

      cartesianProduct(expr, analyze(analyzerField, textField) as outfield )
      
      select(expr, analyze(analyzerField, textField) as outfield )
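The difference between the two sample expressions can be modeled in plain Java. This is only a sketch of the semantics described above, not Solr code: the `analyze` method here is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, whereas the real evaluator runs whatever analyzer is configured for `analyzerField`. Tuples are modeled as maps, and the field names `id`, `body`, `tokens`, and `token` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AnalyzeSketch {
    // Hypothetical stand-in for a Solr analyzer: lowercase, split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Models select(expr, analyze(...) as outField):
    // one tuple in, one tuple out, with the tokens attached as a multi-valued field.
    static Map<String, Object> select(Map<String, Object> tuple, String textField, String outField) {
        Map<String, Object> out = new LinkedHashMap<>(tuple);
        out.put(outField, analyze((String) tuple.get(textField)));
        return out;
    }

    // Models cartesianProduct(expr, analyze(...) as outField):
    // one tuple in, one tuple out per token.
    static List<Map<String, Object>> cartesianProduct(Map<String, Object> tuple, String textField, String outField) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String token : analyze((String) tuple.get(textField))) {
            Map<String, Object> copy = new LinkedHashMap<>(tuple);
            copy.put(outField, token);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("id", "1");
        tuple.put("body", "Hello, streaming NLP");

        System.out.println(select(tuple, "body", "tokens"));
        cartesianProduct(tuple, "body", "token").forEach(System.out::println);
    }
}
```

In this model `select` keeps the tuple count unchanged, while `cartesianProduct` multiplies it by the number of tokens, which is the shape that matters when choosing between the two.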
      

      Combined with Solr's batch text processing capabilities, this provides a complete parallel NLP framework. Solr's batch processing capabilities are described here:

      Batch jobs, Parallel ETL and Streaming Text Transformation
      http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

      1. SOLR-10351.patch
        21 kB
        Joel Bernstein
      2. SOLR-10351.patch
        21 kB
        Joel Bernstein
      3. SOLR-10351.patch
        18 kB
        Joel Bernstein
      4. SOLR-10351.patch
        9 kB
        Joel Bernstein

        Activity

        joel.bernstein Joel Bernstein added a comment -

        Patch with the basic implementation. Test to follow.

        joel.bernstein Joel Bernstein added a comment -

        Added a very basic test. Expanded tests still to come.

        joel.bernstein Joel Bernstein added a comment -

        More tests

        joel.bernstein Joel Bernstein added a comment - - edited

        Added a test with the select function

        jira-bot ASF subversion and git services added a comment -

        Commit 6c2155c02434bfae2ff5aa62c9ffe57318063626 in lucene-solr's branch refs/heads/master from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c2155c ]

        SOLR-10351: Add analyze Stream Evaluator to support streaming NLP

        jira-bot ASF subversion and git services added a comment -

        Commit bdd0c7e32087f534de04657fb3ef1b3afa93cc68 in lucene-solr's branch refs/heads/master from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bdd0c7e ]

        SOLR-10351: Fix pre-commit

        jira-bot ASF subversion and git services added a comment -

        Commit 434a61e1edcf425ae24213b4fddb2a6e4ed741be in lucene-solr's branch refs/heads/branch_6x from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=434a61e ]

        SOLR-10351: Add analyze Stream Evaluator to support streaming NLP

        jira-bot ASF subversion and git services added a comment -

        Commit 8fcf55634cd1e7335eed1c220c5ab628bbea8202 in lucene-solr's branch refs/heads/branch_6x from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8fcf556 ]

        SOLR-10351: Fix pre-commit

        dpgove Dennis Gove added a comment -

        What's the purpose of a StreamContext in the evaluators?

        joel.bernstein Joel Bernstein added a comment -

        The SolrCore is passed in through the StreamContext. And we get the analyzer through the core.

        joel.bernstein Joel Bernstein added a comment -

        So right now only the AnalyzeEvaluator needs the StreamContext. But since all Streams get the StreamContext, it probably makes sense to pass it to the Evaluators as well.

        dpgove Dennis Gove added a comment -

        That makes sense and I agree that it probably makes sense to include it in evaluators. In the most recent patch for SOLR-10356 (planning to commit to master and branch_6x tonight) I've refactored it a little bit to move the implementation

        public void setStreamContext(StreamContext streamContext)

        back a level into ComplexEvaluator.
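The arrangement discussed here, where a base evaluator class owns `setStreamContext` and a concrete evaluator reads what it needs from the context, might look roughly like the sketch below. All class names are hypothetical stand-ins (prefixed `Demo`) rather than the Solr classes, the `"solr-core"` key is an assumption for illustration, and a plain string stands in for the core object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for StreamContext: a bag of shared objects.
final class DemoStreamContext {
    private final Map<String, Object> entries = new HashMap<>();

    void put(String key, Object value) {
        entries.put(key, value);
    }

    Object get(String key) {
        return entries.get(key);
    }
}

// Hypothetical stand-in for a base evaluator that owns the context plumbing,
// so subclasses do not each re-implement setStreamContext.
abstract class DemoComplexEvaluator {
    protected DemoStreamContext streamContext;

    public void setStreamContext(DemoStreamContext streamContext) {
        this.streamContext = streamContext;
    }

    abstract Object evaluate(String input);
}

// Hypothetical stand-in for an evaluator that needs an object from the context.
final class DemoAnalyzeEvaluator extends DemoComplexEvaluator {
    @Override
    Object evaluate(String input) {
        if (streamContext == null) {
            throw new IllegalStateException("setStreamContext was not called");
        }
        // Assumed key; a string stands in for the core that would supply the analyzer.
        Object core = streamContext.get("solr-core");
        return core + ":" + input;
    }
}

final class StreamContextSketch {
    public static void main(String[] args) {
        DemoStreamContext context = new DemoStreamContext();
        context.put("solr-core", "core1");

        DemoAnalyzeEvaluator evaluator = new DemoAnalyzeEvaluator();
        evaluator.setStreamContext(context);
        System.out.println(evaluator.evaluate("some text"));
    }
}
```

Evaluators that never touch the context simply inherit the setter and ignore the field, which is why hoisting it into the base class costs them nothing.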

        dsmiley David Smiley added a comment -

        Wouldn't the NLP processing as advertised in the title of this issue be most likely to put its processing into analysis attributes? This stream evaluator only emits the character data attribute.

        BTW, please use try-finally (or, better, try-with-resources) to close token streams wherever possible. Analyzer internals are shared via thread-locals, and the ramifications can be nasty for the entire Solr node if a filter ever hits a bug on a particular value. Your Solr node then becomes poisoned, in a sense, and only a restart will fix it.
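To illustrate the point about closing, here is a minimal, self-contained sketch of the try-with-resources pattern. `DemoTokenStream` is a hypothetical stand-in, not Lucene's `TokenStream`; it fails on an empty token to simulate a filter bug and records whether `close()` ran. With a real Lucene `TokenStream`, the consumer would also call `reset()` before the first `incrementToken()` and `end()` after the last one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token stream; tracks whether close() ran.
final class DemoTokenStream implements AutoCloseable {
    private final Iterator<String> tokens;
    private String current;
    boolean closed = false;

    DemoTokenStream(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    // Advances to the next token; fails on an empty token to simulate a filter bug.
    boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false;
        }
        current = tokens.next();
        if (current.isEmpty()) {
            throw new IllegalStateException("simulated filter bug on a particular value");
        }
        return true;
    }

    String term() {
        return current;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class TokenStreamCloseSketch {
    // Collects all tokens; the stream is closed even if incrementToken() throws.
    static List<String> collect(DemoTokenStream stream) {
        List<String> out = new ArrayList<>();
        try (DemoTokenStream ts = stream) {
            while (ts.incrementToken()) {
                out.add(ts.term());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        DemoTokenStream bad = new DemoTokenStream(Arrays.asList("a", ""));
        try {
            collect(bad);
        } catch (IllegalStateException e) {
            System.out.println("failed, closed=" + bad.closed);
        }
    }
}
```

The exception still propagates to the caller, but the stream has been closed by the time it does, so shared state is not left half-consumed.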

        joel.bernstein Joel Bernstein added a comment - - edited

        > Wouldn't the NLP processing as advertised in the title of this issue be most likely to put its processing into analysis attributes? This stream evaluator only emits the character data attribute.

        Possibly. I definitely have much to learn about the analysis chain. In the first pass I was mostly interested in getting the token stream from the analysis chain. What I had envisioned for the future was analysis chains that perform sentence chunking, entity extraction, noun phrase extraction, etc. I was seeing these as finished token streams. But exposing the analysis attributes would seem to make sense in the future.

        > BTW, please use try-finally (or, better, try-with-resources) to close token streams wherever possible. Analyzer internals are shared via thread-locals, and the ramifications can be nasty for the entire Solr node if a filter ever hits a bug on a particular value. Your Solr node then becomes poisoned, in a sense, and only a restart will fix it.

        Will do.

        jira-bot ASF subversion and git services added a comment -

        Commit e872dc7913036c81b9ef48cf35c3456321b758b7 in lucene-solr's branch refs/heads/master from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e872dc7 ]

        SOLR-10351: Add try-with-resources clause around TokenStream

        jira-bot ASF subversion and git services added a comment -

        Commit 7d00d50f6cf7f759e5a3b5863ae1c4395daa3b54 in lucene-solr's branch refs/heads/branch_6x from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d00d50 ]

        SOLR-10351: Add try-with-resources clause around TokenStream

        jira-bot ASF subversion and git services added a comment -

        Commit 8bfe70fbeaf8bbea5d60b9ecb81c6cbc9924dea0 in lucene-solr's branch refs/heads/master from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8bfe70f ]

        SOLR-10351: Update CHANGES.txt

        jira-bot ASF subversion and git services added a comment -

        Commit c5a29c12876c8b000015bae24fc2dc2a42de9889 in lucene-solr's branch refs/heads/branch_6x from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c5a29c1 ]

        SOLR-10351: Update CHANGES.txt

        jira-bot ASF subversion and git services added a comment -

        Commit f631c986772a148c15ea34ad9a30b256a256afaa in lucene-solr's branch refs/heads/master from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f631c98 ]

        SOLR-10351: Add documention

        jira-bot ASF subversion and git services added a comment -

        Commit afe14bf751ee76bfd19972775fb0d484571fbb3e in lucene-solr's branch refs/heads/branch_6x from Joel Bernstein
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=afe14bf ]

        SOLR-10351: Add documention


          People

          • Assignee: joel.bernstein Joel Bernstein
          • Reporter: joel.bernstein Joel Bernstein
          • Votes: 0
          • Watchers: 5
