Lucene - Core

Add OpenNLP Analysis capabilities as a module

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:

      • Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
      • NamedEntity recognition as a TokenFilter

      We are also planning a Tokenizer/TokenFilter that can add parts of speech to a token either as payloads (PartOfSpeechAttribute?) or as separate tokens at the same position.

      I'd propose it go under:
      modules/analysis/opennlp

        Attachments

      1. OpenNLPTokenizer.java
        6 kB
        Em
      2. OpenNLPFilter.java
        8 kB
        Em
      3. LUCENE-2899-RJN.patch
        317 kB
        Rene Nederhand
      4. LUCENE-2899.patch
        247 kB
        Lance Norskog

        Issue Links

          Activity

          Joern Kottmann added a comment -

          The first release is now out. I guess you will use Maven for dependency management; here is how to add the released version as a dependency:
          http://incubator.apache.org/opennlp/maven-dependency.html

          Lance Norskog added a comment - - edited

          This is a patch for the trunk (as of a few days ago) that supplies the OpenNLP Sentence Detector, Tokenizer, Parts-of-Speech, Chunking and Named Entity Recognition tools.

          This has nothing to do with the code mentioned above.
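          (For readers new to OpenNLP: the sketch below shows roughly what driving the underlying OpenNLP 1.5.x sentence detector and tokenizer directly looks like. It is only an illustration with placeholder model paths, not code from the patch.)

          import java.io.FileInputStream;
          import java.io.InputStream;

          import opennlp.tools.sentdetect.SentenceDetectorME;
          import opennlp.tools.sentdetect.SentenceModel;
          import opennlp.tools.tokenize.TokenizerME;
          import opennlp.tools.tokenize.TokenizerModel;

          public class OpenNlpSketch {
            public static void main(String[] args) throws Exception {
              // The model files are downloaded separately; these paths are placeholders.
              InputStream sentIn = new FileInputStream("en-sent.bin");
              InputStream tokIn = new FileInputStream("en-token.bin");
              try {
                SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                String text = "OpenNLP is now an ASF project. It has a nice license.";
                for (String sentence : sentenceDetector.sentDetect(text)) {
                  // Note: OpenNLP keeps punctuation as separate tokens.
                  for (String token : tokenizer.tokenize(sentence)) {
                    System.out.println(token);
                  }
                }
              } finally {
                sentIn.close();
                tokIn.close();
              }
            }
          }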

          Lance Norskog added a comment -

          Notes for a Wiki page:

          OpenNLP Integration

          What is the integration? The first integration is a Tokenizer and three Filters.

          • The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools instead of the standard Lucene Tokenizers. This requires statistical model files. One quirk of these is that all punctuation is maintained.
          • The OpenNLPFilter implements Parts-of-Speech tagging, Chunking (finding noun/verb phrases), and Named Entity Recognition (tagging people, place names etc.). This filter will add all tags as payload attributes to the tokens.
          • The FilterPayloadsFilter removes tokens by checking their payloads. Given a list of payloads, it will either keep only tokens with one of those payloads, or remove only the matching tokens and keep the rest. (This filter maintains position increments correctly; a sketch of the idea appears after this list.)
          • The StripPayloadsFilter removes payloads from Tokens.
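          (To make the FilterPayloadsFilter idea above concrete, here is a minimal sketch of a payload-based keep/remove filter written against the Lucene 4.x TokenFilter API. It is not the patch's implementation; the class and field names are made up, and a production version would also fold any trailing skipped positions into end().)

          import java.io.IOException;
          import java.util.Set;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.util.BytesRef;

          /** Keeps (or drops) tokens whose payload matches one of the given values. */
          final class PayloadKeepFilter extends TokenFilter {
            private final Set<BytesRef> payloads;
            private final boolean keepMatching;
            private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
            private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

            PayloadKeepFilter(TokenStream input, Set<BytesRef> payloads, boolean keepMatching) {
              super(input);
              this.payloads = payloads;
              this.keepMatching = keepMatching;
            }

            @Override
            public boolean incrementToken() throws IOException {
              int skipped = 0;
              while (input.incrementToken()) {
                BytesRef payload = payloadAtt.getPayload();
                boolean matches = payload != null && payloads.contains(payload);
                if (matches == keepMatching) {
                  // Fold the position increments of the dropped tokens into this one.
                  posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skipped);
                  return true;
                }
                skipped += posIncAtt.getPositionIncrement();
              }
              return false;
            }
          }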

          How do I get going?

          • Pull the latest trunk.
          • Apply the patch.
          • Download the models (everything that starts with 'en') from http://opennlp.sourceforge.net/models-1.5/ to contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/.
          • Download the OpenNLP distribution from http://opennlp.apache.org/cgi-bin/download.cgi (currently apache-opennlp-1.5.2-incubating-bin.tar.gz), unpack it, and copy the jar files from lib/ to solr/contrib/opennlp/lib.

          Now, go to trunk-dir/solr and run 'ant test-contrib'. It compiles against the libraries and uses the model files.
          Next, run 'ant example', cd to the example directory and run 'java -Dsolr.solr.home=opennlp -jar start.jar'.
          You should now start without any Exceptions. At this point, go to the Schema analyzer, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should get text tokenized with payloads. Unfortunately, the analysis page shows them as bytes instead of text; if you would like this fixed, go vote on SOLR-3493.

          Lance Norskog added a comment -

          About the build-

          1. This should be a Lucene module. I got lost trying to make the build work copying jars around, so it ended up in Solr/contrib.
          2. Downloading the jars. I don't know how to put together license validation with the OpenNLP Maven build. I think it takes some upgrading in the OpenNLP project.
          3. Why download the models from a separate place? The models are not Apache licensed. They are binaries derived from GNU- and otherwise licensed training data. The OpenNLP people archived them on Sourceforge.
          Lance Norskog added a comment -

          I consider the code and feature set mostly cooked as a first release. The toolkit as is lets you do two things:

          1. Do named entity recognition and filter out names for an autosuggest dictionary
          2. Pick nouns and verbs out of text and only index those. This gives you a field with a smaller, more focused set of terms. MoreLikeThis might work better.

          Please review it for design, bugs, code nits, whatever.

          Lance Norskog added a comment -

          An explanation about the OpenNLPUtil factory class: the statistical models are several megabytes apiece. This class loads them and caches them by file name. It does not reload them across commits.

          The models are immutable objects. The factory class creates another object that consults the model. There is one of these for each field analysis.

          The models are large enough that loading them all at once in the different unit tests needs more than the default RAM. Therefore, the unit tests unload all models between tests and only run single-threaded.
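          (A minimal sketch of the kind of file-name-keyed cache described above; the patch's actual OpenNLPUtil class may look different, and the class and method names here are illustrative only. Each field analysis would then wrap the shared model in its own tagger instance, e.g. a POSTaggerME, since only the model is shared.)

          import java.io.FileInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.util.HashMap;
          import java.util.Map;

          import opennlp.tools.postag.POSModel;

          /** Loads POS models once per file name and hands the immutable model out to callers. */
          final class ModelCache {
            private static final Map<String, POSModel> MODELS = new HashMap<String, POSModel>();

            static synchronized POSModel posModel(String fileName) throws IOException {
              POSModel model = MODELS.get(fileName);
              if (model == null) {
                InputStream in = new FileInputStream(fileName);
                try {
                  model = new POSModel(in); // several megabytes; immutable, safe to share
                } finally {
                  in.close();
                }
                MODELS.put(fileName, model);
              }
              return model;
            }

            /** For tests: drop all cached models so the JVM can reclaim the memory. */
            static synchronized void clear() {
              MODELS.clear();
            }
          }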

          Lance Norskog added a comment -

          License-ready.
          Ivy-ready.
          OpenNLP libraries available through Ivy.
          You still have to download jwnl-1.3.3 from http://sourceforge.net/projects/jwordnet/files/

          And of course download the model files. But this is committable to the Solr side.

          Grant Ingersoll added a comment -

          Very cool Lance. The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all. I wonder how hard it would be to create much smaller ones based on training just a few things.

          Tommaso Teofili added a comment -

          I wonder how hard it would be to create much smaller ones based on training just a few things.

          There was an idea of using the OpenNLP CorpusServer with some Wikinews articles to train them (see OPENNLP-385).

          Joern Kottmann added a comment -

          I am using this mentioned Corpus Server together with the Apache UIMA Cas Editor for labeling projects. If someone wants to set something up to label data we (OpenNLP people) are happy to help with that!

          Grant Ingersoll added a comment -

          Cool!

          I think if we could just get a very small model that can be checked in and used for testing purposes, that is all that would be needed. We don't really need to test OpenNLP, we just need to test that the code properly interfaces with OpenNLP, so a really small model should be fine.

          Grant Ingersoll added a comment -

          This really should just be a part of the analysis modules (with the exception of the Solr example parts). I don't know exactly how we are handling Solr examples anymore, but I seem to recall the general consensus was to not proliferate them. Can we just expose the functionality in the main one?

          I'll update the patch to move this to the module for starters. Not sure on what to do w/ the example part.

          Joern Kottmann added a comment -

          For a test you can run OpenNLP just over a piece of training data, even when trained on a tiny amount of data this will give good results. It does not test OpenNLP, but is sufficient for the desired interface testing.

          Lance Norskog added a comment -

          This really should just be a part of the analysis modules (with the exception of the Solr example parts). I don't know exactly how we are handling Solr examples anymore, but I seem to recall the general consensus was to not proliferate them. Can we just expose the functionality in the main one?

          A lot of Solr/Lucene features are only demoed in solrconfig/schema unit test files (DIH for example). That is fine.

          The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all.

          D'oh! Forgot about that. If we have tagged data in the project, it helps show the other parts of an NLP suite. It's hard to get a full picture of the jigsaw puzzle if you don't know NLP software.

          Lance Norskog added a comment -

          Wiki page is up! http://wiki.apache.org/solr/OpenNLP

          Also, the Solr fancy toolkits had no links from the Solr front page, so I added 'Advanced Tools' with links to UIMA and this.

          Lance Norskog added a comment -

          The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all.

          I have mini training data for sentence detection, tokenization, POS and chunking. The purpose is to make the matching unit tests pass. The data and build script are in a new (unattached) patch.

          NER is proving a tougher nut to crack. I tried annotating several hundred lines of Reuters but no go.

          How would I make an NER dataset that will make OpenNLP spit out one or two tags? Is there a large NER dataset that is Apache-friendly?

          Joern Kottmann added a comment -

          For NER you should try the perceptron and a cutoff of zero; with a cutoff of 5 you otherwise need much more training data.
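          (For reference, those settings map onto OpenNLP 1.5.x's TrainingParameters roughly as sketched below. The exact NameFinderME.train(...) overload varies between 1.5.x releases, so treat this as an outline rather than the commands used to build the patch's test models.)

          import opennlp.tools.util.TrainingParameters;

          final class NerTrainingParams {
            /** Perceptron trainer with a feature cutoff of 0, as suggested above. */
            static TrainingParameters perceptronCutoffZero() {
              TrainingParameters params = new TrainingParameters();
              params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
              params.put(TrainingParameters.CUTOFF_PARAM, "0");
              params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // iteration count is arbitrary here
              // The returned parameters are then passed to NameFinderME.train(...)
              // together with a NameSampleDataStream over the annotated training text.
              return params;
            }
          }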

          Lance Norskog added a comment -

          For NER you should try the perceptron and a cutoff of zero.

          Thanks! This patch generates all models needed by tests, and the tests are rewritten to use the poor quality data from the models. To make the models, go to solr/contrib/opennlp/src/test-files/training and run bin/training.sh. This populates solr/contrib/opennlp/src/test-files/opennlp/conf/opennlp. I don't have windows anymore so I can't make a .bat version.

          Lance Norskog added a comment -

          General status:

          • At this point you have to download 1 library (jwnl) and run a script to make the unit tests work.
          • You have to download several model files from sourceforge to do real work. There is no script to help.
          • The tokenizer and filter are in solr/ not lucene/

          What is missing to make this a full package:

          • Payload handling
            • TokenFilter to parse TAG/term or term_TAG into term/payload.
            • Output code in Solr for the reverse.
            • Payload query for tags.
            • Similarity scoring algorithms for tags.
          • Tag handling
            • There is a universal set of 12 parts-of-speech tags ("A Universal Part-of-Speech Tagset"), with mappings from many language tagsets (Treebank etc.) into the 12 common tags; a sketch of such a mapping follows below. Multi-language sites would benefit from this. I persuaded the authors to switch from GNU to Apache licensing.
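          (A tiny sketch of what such a Treebank-to-universal mapping could look like as a static table; the twelve categories come from the tagset mentioned above, and only a handful of Treebank tags are shown for illustration.)

          import java.util.HashMap;
          import java.util.Map;

          /** Maps a few Penn Treebank tags onto the 12-tag universal part-of-speech set. */
          final class UniversalTagMap {
            static final Map<String, String> TREEBANK_TO_UNIVERSAL = new HashMap<String, String>();
            static {
              TREEBANK_TO_UNIVERSAL.put("NN", "NOUN");
              TREEBANK_TO_UNIVERSAL.put("NNS", "NOUN");
              TREEBANK_TO_UNIVERSAL.put("NNP", "NOUN");
              TREEBANK_TO_UNIVERSAL.put("VB", "VERB");
              TREEBANK_TO_UNIVERSAL.put("VBD", "VERB");
              TREEBANK_TO_UNIVERSAL.put("JJ", "ADJ");
              TREEBANK_TO_UNIVERSAL.put("RB", "ADV");
              TREEBANK_TO_UNIVERSAL.put("IN", "ADP");
              TREEBANK_TO_UNIVERSAL.put("DT", "DET");
              TREEBANK_TO_UNIVERSAL.put("CD", "NUM");
              // The full published mapping covers every Treebank tag, targeting only the categories
              // NOUN, VERB, PRON, ADJ, ADV, ADP, CONJ, DET, NUM, PRT, X and ".".
            }
          }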

          What NLP apps would be useful for search? Coordinate expansion, for example.

          Lance Norskog added a comment -

          This is about finished. The Tokenizer and TokenFilters are moved over into lucene/analysis/opennlp. They do not have unit tests in lucene/ because of the difficulty in supplying model data. They are unit-tested by the factories in solr/contrib/opennlp.

          The solr/example/opennlp directory is gone, as per request. Possible field types are documented in the solrconfig.xml in the unit test resources.

          All jars are downloaded via ivy. The jwnl library is one rev after what this was compiled with. It is only used in collocation, which is not exposed in this release.

          To build, test and commit, there is a bootstrap sequence. In the top-level directory:

            ant clean compile
          

          This downloads the OpenNLP jars

          cd solr/contrib/opennlp/test-files/training
          sh bin/training.sh
          

          This creates low-quality model files in solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp. In the trunk/solr directory, run

           
          ant example test-contrib
          

          You now have committable binary models. They are small, and only there to run the OpenNLP unit tests. They generate results that are objectively bogus, but the unit tests are matched to the results. If you want real models, you have to download them from sourceforge.

          Lance Norskog added a comment -

          Oops: remove solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp/.gitignore; otherwise it will prevent you from committing the models.

          Lance Norskog added a comment - - edited

          dev-tools needs updating. I don't have IntelliJ and don't feel comfortable making the right Eclipse files.

          This patch works on both trunk and 4.x. I made a few changes in the build files where modules were out of alphabetic order. Also, the reams of copied code in module-build.xml had blocks out of order. I can't easily see where, but it seems like some of them are missing a few lines that others have.

          Lance Norskog added a comment -

          The Wiki is updated for testing and committing this patch: http://wiki.apache.org/solr/OpenNLP.

          Lance Norskog added a comment -

          There is a regression in Solr which causes this to not work in a Solr example: SOLR-3625. Until this is fixed, you have to copy the Lucene opennlp jar, the Solr opennlp jar, and the solr/contrib/opennlp/lib jars into the solr war.

          Lance Norskog added a comment -

          SOLR-3623 should give a final answer for how to build contribs and Lucene libraries and external dependencies. I've found it a little confusing.

          Lance Norskog added a comment -

          New patch for current build system on trunk & 4.x.

          Lance Norskog added a comment - - edited

          As it turns out, building is still confused: solr/example/solr-webapps comes and goes.

          This build parks the lucene-analyzer-opennlp jar in solr/contrib/opennlp/lucene-libs. example/..../solrconfig.xml includes a reference to ../....../contrib/opennlp/lib and lucene-libs and ../...../dist.

          A jar-of-jars or a fully repacked jar in dist/ is the best way to ship this.

          Bug status: payloads added by this filter do not get written to the index!

          Build-fiddling status: forbidden api checks fail. checksums and licenses validate. rat-sources validate. No dev-tools changes.

          If you want this committed, I'm quite happy to do the last mile.

          alexey added a comment -

          Yes, please, it would be awesome if someone could make this last effort and commit this issue. Many thanks!

          Lance Norskog added a comment -

          Committable except for dev-tools/ and production builds. I've updated dev-tools/eclipse, I don't have IntelliJ. These dev-tools build files contain 'uima' and so need parallel work for 'opennlp':

          dev-tools/maven/lucene/analysis/pom.xml.template
          dev-tools/maven/lucene/analysis/uima/pom.xml.template
          dev-tools/maven/pom.xml.template
          dev-tools/maven/solr/contrib/pom.xml.template
          dev-tools/maven/solr/contrib/uima/pom.xml.template
          dev-tools/scripts/SOLR-2452.patch.hack.pl
            - this one seems to be dead
          
          Lance Norskog added a comment -

          The latest patch is tested fully and painfully in trunk. I'm sure it works as-is in 4.x, but it is not going into 4.0, so I'm not spending time on that.

          Em added a comment -

          Could you please create a new Patch for the current Trunk? I had some problems on applying it to my working copy...

          I am not entirely sure whether it's the trunk or your code, but it seems like your OpenNLP code only works for the first request.

          As far as I was able to debug, the create() method of the TokenFilterFactory is only called every now and then (are TokenFilters reused for more than one call in Solr?).

          If create() of your FilterFactory is called, everything works. However, if the TokenFilter is somehow reused, it fails.

          Is this a bug in Solr or in your patch?

          Em added a comment - - edited

          Some attributes (i.e. the "first" attribute in OpenNLPTokenizer and "indexToken" in OpenNLPFilter) were not reset correctly.

          Since I had trouble applying your patch, I'd like to provide the working source code. Please, create a patch for the current Trunk.
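          (For illustration, the kind of fix being described is a reset() override that clears the per-stream state; the field name comes from the comment above, and this sketch is not Em's actual code. The "first" flag in OpenNLPTokenizer would be restored in the tokenizer's reset() in the same way.)

          import java.io.IOException;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;

          /** Sketch of a filter whose buffered state must be cleared when the stream is reset. */
          abstract class StatefulOpenNlpFilter extends TokenFilter {
            private int indexToken; // position within the buffered sentence tokens

            StatefulOpenNlpFilter(TokenStream input) {
              super(input);
            }

            @Override
            public void reset() throws IOException {
              super.reset();   // resets the upstream tokenizer/filter
              indexToken = 0;  // without this, the next request starts from stale state
            }
          }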

          Lance Norskog added a comment -

          Thank you!

          This worked when I posted it. There have been many changes in 4.x and trunk since then. For example, all of the tokenizer and filter factories moved to Lucene from Solr. I'm waiting until 4.0 is finished before I redo this patch.

          Phani Vempaty added a comment -

          Will there be a patch for 4.0 once it is released?

          Patricia Gorla added a comment -

          Thanks for this patch!

          I'm able to get the posTagger working, yet I still have not found a way to incorporate either the Chunker or the NER Models into my Solr project.

          Setting posTagger by itself works, but when I add a link to the chunkerModel (or even just the chunkerModel by itself), I obtain only the tokenized text.

          <fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
             <tokenizer class="solr.OpenNLPTokenizerFactory"
                tokenizerModel="opennlp/en-token.bin" />
             <filter class="solr.OpenNLPFilterFactory" 
                chunkerModel="opennlp/en-chunking.bin"/>
          </analyzer>
          </fieldType>
          

          I'm new to OpenNLP, so any pointers in the right direction would be greatly appreciated.

          Lance Norskog added a comment -

          Wow, someone tried it! I apologize for not noticing your question.

          I'm able to get the posTagger working, yet I still have not found a way to incorporate either the Chunker or the NER Models into my Solr project.

          The schema.xml file includes samples for all of the models:

          /lusolr_4x_opennlp/solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/schema.xml

          This is for the chunker. The chunker works from parts-of-speech tags, not the original words. The chunker needs a parts-of-speech model as well as a chunker model. This should throw an error if the parts-of-speech model is not there. I will fix this.

           <filter class="solr.OpenNLPFilterFactory" 
                    posTaggerModel="opennlp/en-test-pos-maxent.bin"
                    chunkerModel="opennlp/en-test-chunker.bin"
                  />
          

          Is the NER configuration still not working?

          Kai Gülzau added a comment -

          The patch seems to be a bit out of date.
          Applying it to branch_4x or trunk fails (build scripts).

          Kai Gülzau added a comment - - edited

          End of OpenNLPTokenizer.fillBuffer() should be:

          while(length == size) {
            offset += size;
            fullText = Arrays.copyOf(fullText, offset + size);
            length = input.read(fullText, offset, size);
          }
          if (length == -1) {
            length = 0;
          }
          fullText = Arrays.copyOf(fullText, offset + length);
          
          Lance Norskog added a comment -

          Thank you. Have you tried this on the trunk? The Solr components did not work; they could not find the OpenNLP jars.

          Kai Gülzau added a comment - - edited

          I have applied the Patch to trunk, modified the build scripts manually (ignoring javadoc tasks) and built the opennlp jars.
          Jars are running in a vanilla Solr 4.1 environment.

          • solr_server4.1\solr\lib\opennlp\
            • jwnl-1.4_rc3.jar
            • lucene-analyzers-opennlp-5.0-SNAPSHOT.jar (build with patch)
            • opennlp-maxent-3.0.2-incubating.jar
            • opennlp-tools-1.5.2-incubating.jar
            • solr-opennlp-5.0-SNAPSHOT.jar (build with patch)

          with <lib dir="../lib/opennlp" /> in solrconfig.xml

          Works for me: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201301.mbox/%3CB65DA877C3F93B4FB39EA49A1A03C95CC27AB1%40email.novomind.com%3E

          edit: removed jwnl*.jar as stated by Joern

          Joern Kottmann added a comment -

          The jwnl library is only needed if you use the OpenNLP coreference component; otherwise it's safe to exclude it. The 1.4_rc3 version is untested anyway, and the Coreferencer probably does not run with it.

          Rene Nederhand added a comment -

          New patch for both trunk and 4.1 stable. Tested on revision 1450998.

          ant compile
          cd solr/contrib/src/test-files/training
          sh bin/trainall.sh
          cd ../../../../../../solr
          ant example test-contrib
          

          Hope this helps more people in testing OpenNLP integration with Solr.

          TODO:

          • Implementing dev-tools
          • Include references to javadocs
          Maciej Lizewski added a comment -

          Why don't you prepare this as a separate project that produces some jars and config files, with instructions on how to add it to a Solr configuration, instead of publishing all changes as patches to the Solr sources? I am interested in doing some tests with your library, but setting everything up seems quite complicated and hard to maintain in the future... it is just a thought.

          Zack Zullick added a comment -

          Some information for those wanting to try this after fighting it for a day: the latest patch posted, LUCENE-2899-RJN.patch for 4.1, does not have Em's OpenNLPFilter.java and OpenNLPTokenizer.java fixes applied. So after applying the patch, make sure to replace those classes with Em's versions, or the bug that causes the NLP system to be used only on the first request will still be present. I was also able to successfully apply this patch to 4.2.1 with minor modification (mostly to the build/ivy xml files).

          Lance Norskog added a comment -

          Maciej- This is a good point. This package needs changes in a lot of places and it might be easier to package it the way you say.

          Zack- The "churn" in the APIs is a major problem in the Lucene code management. The original patch worked in the 4.x branch and trunk when it was posted. What Em fixed is in an area which is very very basic to Lucene. The API changed with no notice and no change in versions or method names.

          Everyone- It's great that this has gained some interest. Please create a new master patch with whatever changes are needed for the current code base.

          Lucene grand masters- Please don't say "hey kids, write plugins, they're cool!" and then make subtle incompatible changes in APIs.

          Lance Norskog added a comment -

          I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did not attempt to analyse text that is longer than the fixed size temp buffer, and thus the code for copying successive buffers was never exercised. Kai's fix handles this problem. I've added a unit test.

          Em: the Lucene Tokenizer lifecycle is that the Tokenizer is created with a Reader, and each call to incrementToken() walks the input. When incrementToken() returns false, that is all: the Tokenizer is finished. TokenStream can support a 'stateful' token stream: with OpenNLPFilter, you call incrementToken() until it returns false, and then you can call 'reset' and it will start over from the beginning. The unit tests include a check that reset() works. The changes you made support a feature that is not supported by Lucene. Also, the changes break most of the unit tests. Please create a unit test that shows the bug, and fix the existing unit tests. No unit test = no bug report.
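          (For context, the sketch below is the standard Lucene 4.x consumer sequence that the unit tests and Solr follow; any Tokenizer or TokenFilter in the patch has to behave correctly under it. This is generic API usage, not code from the patch.)

          import java.io.IOException;
          import java.io.StringReader;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          final class ConsumeTokenStream {
            static void printTokens(Analyzer analyzer, String field, String text) throws IOException {
              TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
              CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
              try {
                ts.reset();                      // must be called before the first incrementToken()
                while (ts.incrementToken()) {
                  System.out.println(termAtt.toString());
                }
                ts.end();                        // records the final offset state
              } finally {
                ts.close();
              }
            }
          }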

          I'm posting a patch for the current 4.x and trunk. It includes some changes for TokenStream/TokenFilter method signatures, some refactoring in the unit tests, a little tightening in the Tokenizer & Filter, and Kai's fix. There are unit tests for the problem Kai found, and also a test that has TokenizerFactory create multiple Tokenizer streams. If there is a bug in this patch, please write a unit test which demonstrates it.

          The patch is called LUCENE-2899-current.patch. It is tested against the current 4.x branch and the current trunk.

          Thanks for your interest and hard work- I know it is really tedious to understand this code

          Lance Norskog

          Lance Norskog added a comment -

          I found the problem with multiple documents. The API for reusing Tokenizers changed to something more sensible, but I only noticed and implemented part of the change. The result was that when you upload multiple documents, it just re-processes the first document.

          File LUCENE-2899-x.patch has this fix. It applies against the 4.x branch and the trunk. It does not apply against Lucene 4.0, 4.1, 4.2 or 4.3. For all released Solr versions you want LUCENE-2899.patch from August 27, 2012. There are no new features since that release.

          Joern Kottmann added a comment -

          Lance, does the patch get jwnl from our old SourceForge page? That page is often overloaded and probably makes your build unstable. To solve this issue (see OPENNLP-510) we moved jwnl for 1.5.3 to the central repo. Anyway, as long as you don't use the coreference component you can exclude this dependency.

          Lance Norskog added a comment -

          Yup- upgrading to 1.5.3 is next on the list.

          Lance Norskog added a comment - - edited

          I did not make the right changes to OpenNLPFilter.java to handle the API changes. I have attached a fixed version of this to this issue. Please try it and see if it fixes what you see.

          A-a-a-a-a-a-n-n-n-n-d chunking is broken. Oy.

          Lance Norskog added a comment -

          Fixed the Chunker problem. I switched to the new released version of the OpenNLP packages. The MaxEnt implementation (statistical modeling) for chunking changed slightly, and my test data now produces different noun&verb phrase chunks for the sample text.

          At this point the only problem I know of is that the licenses are slightly wrong, and so 'ant validate' fails.

          These comments only apply to LUCENE-2899x.patch, which applies to the current 4.x and trunk codelines. LUCENE-2899.patch applies to the 4.0-4.3 releases. It is not upgraded to the new OpenNLP release.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Andrew Janowczyk added a comment -

          A little bit of a shameless plug, but we just wrote a blog post here about using the Stanford library for NER as a processor factory / request handler for Solr. It seems applicable to the audience on this ticket; is it worth contributing it to the community via a patch of some sort?

          Lance Norskog added a comment -

          Yup! Another NER is always helpful. But the big problem with NLP software is not the code but the models- do you have a good source of free models?

          Joern Kottmann added a comment -

          Stanford NLP is licensed under GPLv2; this license is not compatible with the AL 2.0, and therefore such a component can't be contributed to an Apache project directly.

          Andrew Janowczyk added a comment -

          Ahhh, thanks for the info. I found a relevant link discussing the licenses which clearly explains the details here. Oh well, it was worth a try.

          Joern Kottmann added a comment -

          @Lance Norskog we now have support in OpenNLP to train the name finder on a corpus in the Brat [1] data format, which makes it much easier to annotate custom data within a couple of days/weeks.

          [1] http://brat.nlplab.org/

          Lance Norskog added a comment -

          Wow! Brat looks bitchin! Looking forward to using it.

          rashi gandhi added a comment -

          Hi,
          I have applied this patch successfully on the latest Solr 4.x branch. But now I am not sure how to perform contextual searches on the data I have. I need to search a text field using some NLP process. I am new to NLP, so I need some help on how to proceed further. How do I train a model using this integrated Solr? Do I need to study something else before moving ahead with this?

          I designed an analyzer and tried indexing data, but the results are weird and inconsistent. Kindly provide some pointers to move ahead.

          Thanks in advance.

          rashi gandhi added a comment -

          Hi,

          I designed an analyzer using OpenNLP filters and indexed some data on it.

          <fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="opennlp/en-sent.bin" tokenizerModel="opennlp/en-token.bin"/>
              <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
              <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW" keepPayloads="true"/>
              <filter class="solr.StripPayloadsFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            </analyzer>
          </fieldType>

          <field name="Detail_Nvf" type="text_opennlp_nvf" indexed="true" stored="true" omitNorms="true" omitPositions="true"/>

          My problem is: while searching, Solr sometimes returns results and sometimes not (but the documents are there).
          For example, if I search for Detail_Nvf:brett, it returns a document,
          and after some time, if I fire the same query again, it returns zero documents.
          I am not getting why the Solr results are unstable.
          Please help me with this.

          Thanks in Advance

          Zack Zullick added a comment -

          I have seen this behavior before (see previous comments, especially from user Em and his earlier fix) and I am experiencing similar results with the latest patch uploaded (Jun-16-2013) on 4.4/branch_4x. In my case, the OpenNLP system only works when indexing the first document and no longer works thereafter. It seems you are having a similar issue, except that yours happens on the query end rather than during indexing. I sent an email to Lance to see if he has any advice for us.

          rashi gandhi added a comment -

          Thanks Zack

          Waiting for a reply from Lance

          simon raphael added a comment -

          Hi,

          I'm new to Solr and OpenNLP.
          I have followed the tutorial to install this patch. I downloaded branch_4x, then downloaded and applied LUCENE-2899-current.patch, and then ran "ant compile".

          Everything works fine, but no opennlp folder in /solr/contrib/ is created.

          What am I doing wrong?

          Thanks for your help

          Lance Norskog added a comment -

          Hi-

          The latest patch is LUCENE-2899-x.patch, pls try that. Also, apply it with:
          patch -p0 < patchfile

          Lance

          Lance Norskog added a comment - - edited

          This patch includes a fix for the problem where searching twice doesn't work. The file is LUCENE-2899.patch
          It has been tested with trunk, branch_4x and the 4.5.1 release.

          I do not know of any outstanding issues. To avoid confusion, I have removed all old patches.

          simon raphael added a comment -

          Hi,

          I have a problem after installing the patch: I can't launch Solr anymore. I get the following error:

          Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'

          Though the opennlp*.jar files are correctly added:

          Adding 'file:/var/www/lucene_solr_4_5_1/solr/contrib/opennlp/lib/opennlp-tools-1.5.3.jar' to classloader
          5453 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrResourceLoader – Adding 'file:/var/www/lucene_solr_4_5_1/solr/contrib/opennlp/lib/opennlp-maxent-3.0.3.jar' to classloader

          Any idea what I am doing wrong?

          Thank you

          Lance Norskog added a comment -

          The solrconfig.xml file should have these lines in the library set:

          <lib dir="../../../contrib/opennlp/lib" regex=".*\.jar" />
          <lib dir="../../../dist/" regex="solr-opennlp-\d.*\.jar" />

          Also, you have to copy lucene/build/analysis/opennlp/lucene-analyzers-opennlp*.jar to solr/contrib/opennlp/lib/.

          This last problem was a mess. I have not followed these issues: SOLR-3664, LUCENE-5249, LUCENE-5257. I don't know if they handle the problem I described. Shipping this thing as a Lucene/Solr contrib module patch was a mistake: it intersects the build and code structure in too many places.

          Markus Jelsma added a comment -

          Hi - any chance this is going to get committed some day?

          Robert Muir added a comment -

          Hi Markus: I haven't looked at this patch. I'll review it now and give my thoughts.

          Robert Muir added a comment -

          Just some thoughts:

          I think it would be best to split out the different functionality here into subtasks for each piece, and figure out how each should best be integrated.

          The current patch does strange things to try to deal with some impedance mismatch due to the design here, such as the token filter which consumes the entire analysis chain and then replays the whole thing back with POS or NER as payloads. Is it really necessary to give this thing more scope than a single sentence? Typically such tagging models (at least the ones I've worked with) tend to be trained only within sentence scope.

          Also, payloads should not be used internally; instead, things like TypeAttribute should be used for POS tags. If someone wants to filter out or keep certain POS, they can use already existing stuff like TypeTokenFilter; if they want to index the type as a payload, they can use TypeAsPayloadTokenFilter, and so on.

          While I can see this POS tagging being useful inside the analysis chain, the NER case is much less clear. I think it's more important for NER to be integrated outside of the analysis chain so that named entities/mentions can be faceted on, added to separate fields for search (likely with a different analysis chain for that), etc. So for Lucene that would be an easier way to add these as facets; for Solr it probably makes more sense as an UpdateProcessor than as an analysis chain.

          Finally: I'm confused as to what benefit we get from using OpenNLP directly, versus integrating with it via opennlp-uima. Our UIMA integration at various levels (analysis chain/update processor) is already there, so I'm just wondering if that's a much shorter path.

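          A minimal sketch of the TypeAttribute route described above, assuming the upstream tagger writes each POS tag into TypeAttribute; Lucene's existing TypeTokenFilter already provides this behavior, so the class below is only an illustration of the idea, not part of the patch:

          import java.io.IOException;
          import java.util.Set;

          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
          import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

          /** Keeps only tokens whose TypeAttribute (e.g. a POS tag) is in the whitelist. */
          public final class KeepTypesFilter extends TokenFilter {
            private final Set<String> keepTypes;
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
            private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

            public KeepTypesFilter(TokenStream input, Set<String> keepTypes) {
              super(input);
              this.keepTypes = keepTypes;
            }

            @Override
            public boolean incrementToken() throws IOException {
              int skipped = 0;
              while (input.incrementToken()) {
                if (keepTypes.contains(typeAtt.type())) {
                  // account for the dropped tokens so phrase queries keep working
                  posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skipped);
                  return true;
                }
                skipped += posIncAtt.getPositionIncrement();
              }
              return false;
            }
          }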
          Benson Margulies added a comment -

          I know of an NER model that looks at the entire text to bias towards consistent tagging of entities in larger units. However, I agree that crocks are bad. Perhaps this is an opportunity to think about how to expand the analysis protocol to support this sort of thing more smoothly?

          It would be desirable if this integration were to start with a set of Token Attributes that could be used in any number of analysis components, inside or outside of Lucene, that were in a position to deliver similar items. I suppose I'm late to ask for this, as the UIMA component must pose the same question.

          In some languages, NER is very clumsy as a token filter, because entities don't obey token boundaries very well. Also, in my experience, entities aren't useful as additional tokens in the same field as their source text, but rather in their own field (where they can be faceted upon, for example). Is there any appetite to look at Lucene support for a stream that delivers to more than one field? Or is there such a thing and I've missed it?

          I agree with Rob about UIMA because I think that Lucene analysis attributes are a weak data model for interconnecting NLP modules and flowing data through them – and one frequently needs to do that.

          Robert Muir added a comment -

          I don't think we should expand the analysis protocol: I think it's actually already more complicated than it needs to be.

          It doesn't need to work across multiple fields or support things like NER.

          I know people disagree, but I don't care (typically they don't do a lot of work to maintain this code).

          I'll fight it to the death: Lucene's analysis is about doing information retrieval (search and query), and it's already overly complex. It should stay per-field, and it should stay the state machine it is.

          Stuff like this NER should NOT be in the analysis chain. As I said, it's more useful in the "document build" phase anyway.

          Benson Margulies added a comment -

          Fair enough. Solr URPs do this very well upstream of analysis. ES doesn't have the concept; perhaps it should. It clarifies the situation nicely to think of Lucene as serial token operations.

          Christian Moen added a comment -

          Stuff like this NER should NOT be in the analysis chain. As I said, it's more useful in the "document build" phase anyway.

          +1

          Benson, as far as I understand, ES doesn't have the concept by design.

          Joern Kottmann added a comment -

          UIMA-based NLP pipelines can use components like Solrcas or Lucas to write their results to an index. This works really well in my experience.

          rashi gandhi added a comment -

          Hi,

          I have successfully applied LUCENE-2899.patch to Solr 4.5.1 and it's working properly.
          Now, my requirement is to combine OpenNLP with JWNL.
          Is it possible to combine OpenNLP with JWNL, and what changes are required in the Solr schema.xml for this?
          Kindly provide some pointers to move ahead.

          Thanks in Advance

          Lance Norskog added a comment - - edited

          All fair criticisms.

          About UIMA: clearly it is much more advanced than this design, but I'm not smart enough to use it. I've tried to put together something useful (a few times) and each time I was completely confused. I learn by example, and the examples are limited. Also there is very little traffic on the mailing lists etc. about UIMA.

          About payloads vs. internal attributes: the examples don't use this feature, but payloads are stored in the index. This supports a question-answering system. Add PERSON payloads with all records, then search for "word X AND 'payload PERSON anywhere'" when someone asks "who is X". This does the tagging during indexing, but not searching. A better design would be to add PERSON as a synonym rather than a payload. I also don't see much traffic about payloads.

          About doing this in the analysis pipeline vs. upstream: yes, upstream update request processors are the right place for this in Solr. URPs don't exist in ES or in plain Lucene coding.

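          A rough sketch of the payload-at-query-time idea above, assuming the indexing chain stored the NER tag as the token payload and that the field name ("text") and byte encoding (UTF-8 "PERSON") match what was actually written; this variant requires the payload on the matched term itself, since a true "payload PERSON anywhere in the document" query would need custom code:

          import java.nio.charset.StandardCharsets;
          import java.util.Collections;

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.spans.SpanPayloadCheckQuery;
          import org.apache.lucene.search.spans.SpanTermQuery;

          public class PersonPayloadQueryExample {
            /** Matches the given term only where it was indexed with a PERSON payload. */
            public static Query whoIs(String name) {
              SpanTermQuery term = new SpanTermQuery(new Term("text", name));
              return new SpanPayloadCheckQuery(
                  term, Collections.singletonList("PERSON".getBytes(StandardCharsets.UTF_8)));
            }
          }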
          Lance Norskog added a comment -

          JWNL is WordNet. Lucene has a WordNet parser for use as a synonym filter.
          http://lucene.apache.org/core/4_0_0/analyzers-common/index.html?org/apache/lucene/analysis/synonym/SynonymMap.html

          I don't know how to use this from a Solr filter factory. Please ask this on the Solr mailing list.

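          For the Lucene-level route in that link, a minimal sketch assuming the WordNet prolog synonym file (wn_s.pl) and the 4.x synonym APIs; the file path and analyzer choice here are only examples, and on the Solr side the stock SynonymFilterFactory with format="wordnet" may be the simpler way to consume the same file:

          import java.io.FileReader;
          import java.io.Reader;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
          import org.apache.lucene.analysis.synonym.SynonymFilter;
          import org.apache.lucene.analysis.synonym.SynonymMap;
          import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
          import org.apache.lucene.util.Version;

          public class WordnetSynonymsExample {
            /** Parses a WordNet prolog file (e.g. wn_s.pl) into a SynonymMap. */
            public static SynonymMap loadWordnet(String path) throws Exception {
              Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_45);
              WordnetSynonymParser parser = new WordnetSynonymParser(true, true, analyzer);
              Reader in = new FileReader(path);
              try {
                parser.parse(in);
              } finally {
                in.close();
              }
              return parser.build();
            }

            /** Wraps an analysis chain so synonyms are injected at the same positions. */
            public static TokenStream addSynonyms(TokenStream input, SynonymMap map) {
              return new SynonymFilter(input, map, true /* ignoreCase */);
            }
          }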
          rashi gandhi added a comment - - edited

          OK, thanks Lance. One more question:
          I want to design an analyzer that can support a location containment relationship,
          for example Europe->France->Paris.
          My requirement is: when a user searches for any country, the results must include the documents containing that country, as well as the documents containing states and cities that fall under that country.
          But documents with the country name must have higher relevancy.
          It must obey the containment relationship up to 4 levels, i.e. Continent->Country->State->City.
          I want to know whether there is any way in OpenNLP that can support this type of search.
          Can the location tagger model be used for this?
          Please provide me some pointers to move ahead.

          Thanks in Advance

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          rashi gandhi added a comment -

          Hi,

          I have one running Solr core with some data indexed, with Solr deployed on Tomcat.
          This core is designed to provide OpenNLP functionality for indexing and searching.
          So I have kept the following binary models at this location: \apache-tomcat-7.0.53\solr\collection1\conf\opennlp
          • en-sent.bin
          • en-token.bin
          • en-pos-maxent.bin
          • en-ner-person.bin
          • en-ner-location.bin

          My problem is: when I unload the running core and try to delete the conf directory from it,
          it does not allow me to delete the directory, prompting that en-sent.bin and en-token.bin are in use.
          All other files in the conf directory are deleted except en-sent.bin and en-token.bin.
          If I have unloaded the core, why is it not releasing its hold on these files?
          Is this a known issue with the OpenNLP binaries?
          How can I release the connection between the unloaded core and the conf directory (especially the binary models)?

          Please provide me some pointers on this.
          Thanks in Advance

          vivek added a comment -

          I followed this link to integrate OpenNLP: https://wiki.apache.org/solr/OpenNLP

          Installation

          For English language testing: Until LUCENE-2899 is committed:

          1. pull the latest trunk or 4.0 branch
          2. apply the latest LUCENE-2899 patch
          3. do 'ant compile'
          cd solr/contrib/opennlp/src/test-files/training
          ...
          I followed the first two steps but got the following error while executing the third step:

          common.compile-core:
          [javac] Compiling 10 source files to /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/build/analysis/opennlp/classes/java

          [javac] warning: [path] bad path element "/home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/lib/jwnl-1.3.3.jar": no such file or directory

          [javac] /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/FilterPayloadsFilter.java:43: error: cannot find symbol

          [javac] super(Version.LUCENE_44, input);

          [javac] ^
          [javac] symbol: variable LUCENE_44
          [javac] location: class Version
          [javac] /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/OpenNLPTokenizer.java:56: error: no suitable constructor found for Tokenizer(Reader)
          [javac] super(input);
          [javac] ^
          [javac] constructor Tokenizer.Tokenizer(AttributeFactory) is not applicable
          [javac] (actual argument Reader cannot be converted to AttributeFactory by method invocation conversion)
          [javac] constructor Tokenizer.Tokenizer() is not applicable
          [javac] (actual and formal argument lists differ in length)
          [javac] 2 errors
          [javac] 1 warning

          I'm really stuck on how to get past this step. I wasted my entire day trying to fix this but couldn't move a bit. Can someone please help me?

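          The compiler output above shows what happens when the 4.x-targeted patch is built against a trunk checkout: Version.LUCENE_44 no longer exists there, and Tokenizer only offers the () and (AttributeFactory) constructors, so the patch's super(input) call with a Reader cannot compile. A minimal trunk-style tokenizer skeleton, shown only to illustrate the constructor difference (the real OpenNLPTokenizer has more to it), looks roughly like this:

          import java.io.IOException;

          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          // The Reader is no longer passed to super(); the indexing chain supplies it via setReader().
          public final class SkeletonTokenizer extends Tokenizer {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

            public SkeletonTokenizer() {
              super();               // branch_4x code would have called super(input) with a Reader here
            }

            @Override
            public boolean incrementToken() throws IOException {
              clearAttributes();
              return false;          // placeholder; a real tokenizer reads from the inherited 'input' field
            }
          }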

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              21
              Watchers:
              40
