Nutch / NUTCH-961

Expose Tika's boilerpipe support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

      Description

      Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

      1. BoilerpipeExtractorRepository.java
        3 kB
        Markus Jelsma
      2. NUTCH-961-1.3-3.patch
        2 kB
        Markus Jelsma
      3. NUTCH-961-1.3-tikaparser.patch
        2 kB
        Markus Jelsma
      4. NUTCH-961-1.3-tikaparser1.patch
        3 kB
        Gabriele Kahlout
      5. NUTCH-961-1.4-dombuilder-1.patch
        0.6 kB
        Markus Jelsma
      6. NUTCH-961-1.5-1.patch
        7 kB
        Markus Jelsma
      7. NUTCH-961-1.8-1.patch
        7 kB
        Markus Jelsma
      8. NUTCH-961-2.1-v1.patch
        7 kB
        Roland von Herget
      9. NUTCH-961-2.1-v2.patch
        7 kB
        Roland von Herget
      10. NUTCH-961v2.patch
        3 kB
        Gabriele Kahlout

        Issue Links

          Activity

          Julien Nioche added a comment -

          Tika 0.8 has some issues with PDF parsing; it would be better to use the next release instead. This won't be done as part of the 1.3 release, as this is new functionality and not a bugfix.

          Markus Jelsma added a comment -

          Boilerpipe comes with several algorithms for stripping away the boilerplate content. Although the ArticleExtractor is recommended, it certainly fails for many types of pages. Pages such as news overviews with blocks and lists are much better extracted with the CanolaExtractor instead. This poses a problem: we cannot have just one single configuration directive telling the parser which extractor to use for a whole crawl.

          Some thoughts on how to deal with it:

          • use Boilerpipe's estimator to automatically determine which extractor to use
          • have a facility to override false positives returned by the estimator and hardcode which extractor to use for URL groups (not unlike the subcollection plugin)
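The second idea above, hardcoding an extractor per URL group, could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `ExtractorOverrides`, its methods, and the example URLs are hypothetical and do not come from any of the attached patches. It keeps a list of URL prefixes and falls back to a default extractor name when no prefix matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the attached patches): maps URL prefixes to
// Boilerpipe extractor names and falls back to a default for all other URLs.
class ExtractorOverrides {
  private final Map<String, String> byPrefix = new LinkedHashMap<>();
  private final String defaultExtractor;

  ExtractorOverrides(String defaultExtractor) {
    this.defaultExtractor = defaultExtractor;
  }

  // Registers an override; returns this so calls can be chained.
  ExtractorOverrides add(String urlPrefix, String extractorName) {
    byPrefix.put(urlPrefix, extractorName);
    return this;
  }

  // The first matching prefix wins, in insertion order.
  String extractorFor(String url) {
    for (Map.Entry<String, String> e : byPrefix.entrySet()) {
      if (url.startsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return defaultExtractor;
  }

  public static void main(String[] args) {
    ExtractorOverrides overrides = new ExtractorOverrides("ArticleExtractor")
        .add("http://news.example.org/overview/", "CanolaExtractor");
    System.out.println(overrides.extractorFor("http://news.example.org/overview/today"));
    System.out.println(overrides.extractorFor("http://news.example.org/story/123"));
  }
}
```

An estimator-based choice could plug in as the default, with the prefix table only correcting the cases it gets wrong.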
          Markus Jelsma added a comment -

          Here's a WIP for 1.3 adding a repository (or factory) and patching parse-tika. Use the following settings to enable:

          tika.use_boilerpipe=true
          tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor

          Test with bin/nutch org.apache.nutch.parse.ParserChecker -dumpText <url>

          There is an issue with extracting anchors of outlinks from the source text. There may also be issues with the repository of which I'm currently unaware.
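For readers who want to see how the two settings might be consumed, here is a minimal sketch. It assumes a plain `java.util.Properties` object as a stand-in for the Nutch configuration; the class `BoilerpipeSettings`, the default values (`false` and `ArticleExtractor`), and the validation step are illustrative assumptions, not the code in the patch.

```java
import java.util.Properties;
import java.util.Set;

// Hypothetical reader for the two settings named above. Properties stands in
// for the real Nutch configuration object; the defaults are assumptions.
class BoilerpipeSettings {
  static final Set<String> KNOWN_EXTRACTORS = Set.of("ArticleExtractor", "CanolaExtractor");

  final boolean enabled;
  final String extractor;

  BoilerpipeSettings(Properties conf) {
    this.enabled = Boolean.parseBoolean(conf.getProperty("tika.use_boilerpipe", "false"));
    String name = conf.getProperty("tika.boilerpipe.extractor", "ArticleExtractor");
    // Reject anything other than the two extractor names mentioned in this issue.
    if (!KNOWN_EXTRACTORS.contains(name)) {
      throw new IllegalArgumentException("Unknown extractor: " + name);
    }
    this.extractor = name;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("tika.use_boilerpipe", "true");
    conf.setProperty("tika.boilerpipe.extractor", "CanolaExtractor");
    BoilerpipeSettings settings = new BoilerpipeSettings(conf);
    System.out.println(settings.enabled + " " + settings.extractor);
  }
}
```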

          Gabriele Kahlout added a comment -

          @Markus - BoilerpipeExtractorRepository.java == NUTCH-961-1.3-tikaparser.patch, content-wise.

          Markus Jelsma added a comment -

          Here's the correct file.

          Gabriele Kahlout added a comment -

          @Markus - Thank you.

          Watch out for [1] in parse-plugins.xml. .html pages may indeed be XHTML. You can safely delete all parse-html mimeType associations, as long as you have [2] (and you want to use parse-tika instead of parse-html).

          [1]
          <mimeType name="application/xhtml+xml">
          <plugin id="parse-html" />
          </mimeType>

          [2]
          <!-- by default if the mimeType is set to *, or
          if it can't be determined, use parse-tika -->
          <mimeType name="*">
          <plugin id="parse-tika" />
          </mimeType>

          Markus Jelsma added a comment -

          Not safely; there are still issues regarding HTML parsing with Tika, even without this nasty boilerpipe hack.

          Gabriele Kahlout added a comment -

          Yeah, I was looking for an issue that I think was about replacing parse-html with parse-tika as the default, but I found only NUTCH-869 [1]. It has just been mentioned on the mailing list (by Julien) and I thought an issue was filed for it.

          [1] https://issues.apache.org/jira/browse/NUTCH-869

          Gabriele Kahlout added a comment -

          Same as NUTCH-961-1.3-tikaparser.patch by Markus but adds the necessary configuration to nutch-default.xml (not nutch-site.xml), as discussed on the mailing list or privately some time ago.

          Gabriele Kahlout added a comment -

          Modified to include necessary changes to parse-plugins.xml also.

          Gabriele Kahlout added a comment -

          Tested the patch against a checkout of 1.3 branch at revision 1101540, and made some trivial changes to TikaParser code.
          More interestingly, I've also removed the following from parse-plugins.xml:

          <mimeType name="application/xhtml+xml">
          <plugin id="parse-html" />
          </mimeType>
          Gabriele Kahlout added a comment -

          cleaned up patch.
          To reproduce:

          export NUTCH_HOME=`pwd`"/nutch"; svn co -r 1101540 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 $NUTCH_HOME
          cp $MR_HOME/BoilerpipeExtractorRepository.java $NUTCH_HOME/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
          cd $NUTCH_HOME; patch -p0 -ui $MR_HOME/NUTCH-961v2.patch
          ant
          
          Gabriele Kahlout added a comment -

          BTW, have you considered a more general patch to support (rather than expose) all of Tika's options? I'm just thinking that perhaps no special Boilerpipe support per se should (for the sake of code maintainability) be exposed at the Nutch level, but only an ability to pass parameters to Tika. So at the Nutch level one sets properties in nutch-site.xml (or even tika-site.xml) and those are forwarded to Tika by the Tika-delegating parser plugin.
          There should therefore be no need for any Boilerpipe testing for example, but rather tika integration testing.
          I'm just thinking out loud (w/o any patch).

          Markus Jelsma added a comment -

          This is not a general patch and won't be. It can, however, be a dependency for a broader Tika patch, but I haven't seen other tickets as of yet.

          This patch cannot work by just passing parameters to Tika as it needs to use a different ContentHandler in parse-tika itself.

          Gabriele Kahlout added a comment -

          "it needs to use a different ContentHandler in parse-tika itself."

          [Documentation opportunity] why?

          My intuition is that the default SAX ContentHandler returns the full page and then Tika handles it, this time with the boilerpipe option.

          Ken Krugler added a comment -

          The way that Boilerpipe in Tika works is that it acts as a delegate, processing the SAX events generated by the default content handler that knows how to help clean up broken HTML.

          So it's incremental processing (you don't need to get the full page first).

          Separate note: Tika's Boilerpipe support now has an option to return HTML markup, so you could run it in this mode to get anchors/anchor text.
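To illustrate the delegate pattern described above, the following sketch chains two SAX handlers using only the JDK's built-in parser. It is not Boilerpipe's algorithm and does not use Tika: the rule "drop everything inside <nav>" is a toy stand-in for a real boilerplate classifier, and the class names `SaxDelegateDemo`, `SkipNavHandler`, and `TextCollector` are hypothetical. The point is that filtering happens event by event, without first buffering the full page.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: parses well-formed XHTML and returns the text that
// survives the filtering handler.
class SaxDelegateDemo {
  static String extract(String xhtml) {
    TextCollector collector = new TextCollector();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), new SkipNavHandler(collector));
    } catch (Exception e) {
      throw new IllegalStateException("parse failed", e);
    }
    return collector.text.toString();
  }

  public static void main(String[] args) {
    System.out.println(
        extract("<html><body><nav>Home | Login</nav><p>Main text</p></body></html>"));
  }
}

// Forwards SAX events to the delegate unless they occur inside a <nav> element.
// This toy rule stands in for a real boilerplate classifier.
class SkipNavHandler extends DefaultHandler {
  private final DefaultHandler delegate;
  private int navDepth = 0;

  SkipNavHandler(DefaultHandler delegate) {
    this.delegate = delegate;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    if ("nav".equals(qName)) {
      navDepth++;
    } else if (navDepth == 0) {
      delegate.startElement(uri, localName, qName, atts);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    if ("nav".equals(qName)) {
      navDepth--;
    } else if (navDepth == 0) {
      delegate.endElement(uri, localName, qName);
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    if (navDepth == 0) {
      delegate.characters(ch, start, length);
    }
  }
}

// Final handler in the chain: accumulates whatever character data reaches it.
class TextCollector extends DefaultHandler {
  final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }
}
```

Because each handler only sees a stream of events, swapping the final collector for one that also records anchor elements would be one way to keep link information alongside the filtered text.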

          Markus Jelsma added a comment -

          Ah, that's great! Is this in 0.9 or trunk? We still bind with 0.9. This may be useful because this patch doesn't add anchors to the detected outlinks. The last anchor(s) may contain the complete BP body! =D

          Markus Jelsma added a comment - - edited

          Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found! Does anyone have a good suggestion on where to fetch our outlinks with the anchors from?

          Markus Jelsma added a comment -

          With BP enabled you can get a java.util.EmptyStackException from DOMBuilder. This is fixed in this patch by adding another check around the peek 'n pop methods.

          http://mail-archives.apache.org/mod_mbox/nutch-user/201107.mbox/%3C201107151523.18511.markus.jelsma@openindex.io%3E

          There is no answer yet as to why this can occur, but I think checking before pop or peek is good anyway.
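The kind of guard described here can be sketched as follows. This is not the DOMBuilder code from the patch; `GuardedElementStack` and its method names are hypothetical, and the sketch only shows the general idea of testing for emptiness before calling `peek()` or `pop()` on a `java.util.Stack`.

```java
import java.util.Stack;

// Hypothetical wrapper showing the guard: peek and pop return null on an
// empty stack instead of throwing java.util.EmptyStackException.
class GuardedElementStack {
  private final Stack<String> elements = new Stack<>();

  void push(String name) {
    elements.push(name);
  }

  String peekOrNull() {
    return elements.isEmpty() ? null : elements.peek();
  }

  String popOrNull() {
    return elements.isEmpty() ? null : elements.pop();
  }

  public static void main(String[] args) {
    GuardedElementStack stack = new GuardedElementStack();
    stack.push("p");
    System.out.println(stack.popOrNull());
    // A second pop on the now-empty stack returns null rather than throwing.
    System.out.println(stack.popOrNull());
  }
}
```

Returning null hides the unbalanced event rather than explaining it, so callers still need to decide what a missing element means for the tree being built.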

          Markus Jelsma added a comment -

          It works in production but is still a big hack when dealing with outlinks. Mark as 1.5

          Markus Jelsma added a comment -

          Here's a working patch we use in production. This includes a nasty workaround in TikaParser to collect all outlinks. Without it, only outlinks from the extracted text are collected.

          This is a bit nasty and i'd appreciate if anyone with a bit more experience with Tika can shed some light on this.

          Markus Jelsma added a comment -

          Fixed already. See NUTCH-1233 for a patch!

          Markus Jelsma added a comment -

          20120304-push-1.6

          kiran added a comment -

          Markus, do you think this patch can also work for the 2.x series? If not, is it easy to port to 2.x? Please let me know your suggestions.

          Markus Jelsma added a comment -

          Should work fine, parse plugins have not changed that much. Keep in mind that you may need bp1.2.0 and keep an eye on link extraction. See related issues.

          Roland von Herget added a comment -

          Kiran, did you already start porting it to 2.x?

          kiran added a comment -

          No Roland, not yet. I just switched to using 1.x series, but i will give a try at porting this to 2.x this week

          Roland von Herget added a comment -

          Status:

          • ported
          • compiles
          • yields same results as stock 2.1 if disabled (tika.use_boilerpipe=false)

          more tests needed

          Roland von Herget added a comment -
          • now with working config options
          • cleanup (removed unused useBoilerpipeEstimator)
          Miles Rowland added a comment -

          Roland, thanks for porting to 2.1. I'm having an issue where Nutch only parses the first fetched URL successfully, and all other URLs fail to parse with a warning "unable to successfully parse content [website] of type [x]". If I run ParserChecker on that URL, the parse runs successfully using tika/boilerpipe, so it seems to be an issue that only occurs from the second parse onward in a batch job.

          I'm running Nutch 2.1 with MySQL. The problem occurs with both bp1.1.0 and 1.2.0.

          Markus Jelsma added a comment -

          Updated patch for trunk. Estimator code has been removed. Parser still relies on reparsing without BP for it to obtain all outlinks. See NUTCH-1233!

          Tien Nguyen Manh added a comment -

          I used patch NUTCH-961-2.1-v2.patch for nutch-2.2.1.
          I found that the text parsed by nutch-tika (with boilerpipe support) is different from the text parsed by the demo site http://boilerpipe-web.appspot.com
          I did upgrade to boilerpipe 1.2.0 to match the demo site.

          The url i tested is http://www.medhelp.org/posts/Eye-Care/EYE/show/1199003

          The text from nutch-tika (i use ArticleExtractor)

          EYE - Eye Care - MedHelp Experts My MedHelp Login or Signup Eye Care Community EYE Post a Question « Back to Community About This Community: This patient support community is for discussions relating to eye care, cataracts , glaucoma , retinal detachment , eye infections, misaligned eyes , intra-ocular implants, refractive surgery ( LASIK and CK), glasses, contact lenses, amblyopia , eye injuries, dry eyes , ocular allergy, eye pain and discomfort, pediatric eye disorders, eyelid and tearduct surgery, poor eyesight, and eye surgery. View community archives Font Size: A A ABackground: Search this Community: Go 3 Comments EYE My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ? Watch this discussion Tweet Related Discussions How to decide if glasses are needed for children? (8 replies):How can a Doctor tell if a child has amblyopia? Is t... [more] Astigmatism (1 replies):My 5 year old son has severe astigmatism. He wears glass... [more] Can someone help me in regards to my sons eyes? (6 replies):I had noticed my son had, had an eye issue when he was a... [more] Blurred vision with glasses (2 replies):Hi, I recently got new glasses and but the vision in my ... [more] Eyesight getting worse (2 replies):Hello! So here's the story. My eyesight had never been ... [more]

          AND from demo

          3 Comments
          EYE
          My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ?

          The result from the demo is much better for this URL.
          So parse-tika/boilerpipe not only extracts the main content from the page but also includes the title and other node content.
          Is that expected?

          Otis Gospodnetic added a comment -

          Looks like Ken Krugler is offering to help with publishing Boilerpipe to a Sonatype Maven repo in TIKA-676 (this Nutch issue apparently depends on this Tika issue) - thanks Ken!

          But note that simply moving Nutch to Boilerpipe 1.2.0 won't fix the issue Tien Nguyen Manh just reported.
          Markus Jelsma, if Tien Nguyen Manh provides a patch that makes Nutch Boilerpipe output match that of the Boilerpipe demo, could you commit it to 2.x?

          Markus Jelsma added a comment -

          Hi Otis - there are no significant improvements between the 1.1.0 and 1.2.0 of Boilerpipe, at least not when it comes to better extraction. I am very sure that when the demo was using 1.2.0, we got identical results with 1.2.0 as well, but still poor in cases not suitable such as overviews, blocks etc. I am also very sure that the current 1.2.0 is nowadays different than what the demo returns, it is not identical anymore, and improved quite a lot.

          We don't use it BP anymore but i'm happy to commit whenever 1.2.0 is in maven or part of Tika if it gets donated to the ASF. We need to get NUTCH-1233 in as well then.

          Otis Gospodnetic added a comment -

          "We don't use it BP anymore"

          What do you mean by that? I looked at parse-tika/plugins.xml earlier today and saw BP 1.1.0 there. So I'm not sure what you mean...

          Matzz added a comment -

          "We don't use it BP anymore"

          BP integration will be totally abandoned? Are there any plans to use other content extractor in favour of Boilerpipe?

          Markus Jelsma added a comment -

          I am sorry, I did not mean to speak for the Nutch PMC at all; "we" not using BP means I am not using BP. As I said before, I am happy to commit this issue if the linked issues are resolved first.


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
            • Votes: 7
            • Watchers: 12
