Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

      Use the following properties to enable and control Boilerpipe.

      <property>
        <name>tika.extractor</name>
        <value>none</value>
        <description>
        Which text extraction algorithm to use. Valid values are: boilerpipe or none.
        </description>
      </property>
       
      <property> 
        <name>tika.extractor.boilerpipe.algorithm</name>
        <value>ArticleExtractor</value>
        <description> 
        Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
        or CanolaExtractor.
        </description>
      </property>
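
As a rough illustration of how the two properties above interact, the sketch below resolves them into a single algorithm name. It uses java.util.Properties as a stand-in for Nutch's configuration object, and the names ExtractorSettings and resolveAlgorithm are hypothetical; they are not taken from any attached patch. Treating every value other than "boilerpipe" as "off" is also an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper; java.util.Properties stands in for Nutch's configuration. */
final class ExtractorSettings {
  // Algorithm names listed in the property description.
  static final List<String> ALGORITHMS =
      Arrays.asList("DefaultExtractor", "ArticleExtractor", "CanolaExtractor");

  /** Returns the Boilerpipe algorithm name to use, or null when extraction is off. */
  static String resolveAlgorithm(Properties conf) {
    String extractor = conf.getProperty("tika.extractor", "none");
    if (!"boilerpipe".equals(extractor)) {
      return null; // assumption: anything other than "boilerpipe" disables it
    }
    String algorithm =
        conf.getProperty("tika.extractor.boilerpipe.algorithm", "ArticleExtractor");
    if (!ALGORITHMS.contains(algorithm)) {
      throw new IllegalArgumentException("Unknown Boilerpipe algorithm: " + algorithm);
    }
    return algorithm;
  }
}
```

Failing fast on an unknown algorithm name is one possible design choice; a real plugin might prefer to log a warning and fall back to a default instead.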
      
      1. NUTCH-961.patch
        6 kB
        Markus Jelsma
      2. NUTCH-961.patch
        3 kB
        Markus Jelsma
      3. NUTCH-961-1.11-1.patch
        7 kB
        Vincent Slot
      4. nutch-2.x-boilerpipe.patch
        5 kB
        Alexander Kingson
      5. NUTCH-961-1.8-1.patch
        7 kB
        Markus Jelsma
      6. NUTCH-961-2.1-v2.patch
        7 kB
        Roland von Herget
      7. NUTCH-961-2.1-v1.patch
        7 kB
        Roland von Herget
      8. NUTCH-961-1.5-1.patch
        7 kB
        Markus Jelsma
      9. NUTCH-961-1.4-dombuilder-1.patch
        0.6 kB
        Markus Jelsma
      10. NUTCH-961-1.3-3.patch
        2 kB
        Markus Jelsma
      11. NUTCH-961v2.patch
        3 kB
        Gabriele Kahlout
      12. NUTCH-961-1.3-tikaparser1.patch
        3 kB
        Gabriele Kahlout
      13. BoilerpipeExtractorRepository.java
        3 kB
        Markus Jelsma
      14. NUTCH-961-1.3-tikaparser.patch
        2 kB
        Markus Jelsma

        Issue Links

          Activity

          Julien Nioche added a comment -

          Tika 0.8 has some issues with PDF parsing; it would be better to use the next release instead. This won't be done as part of the 1.3 release, as it is new functionality and not a bugfix.

          Markus Jelsma added a comment -

          Boilerpipe comes with several algorithms for stripping away the boilerplate content. Although the ArticleExtractor is recommended, it certainly fails for many types of pages. Pages such as news overviews with blocks and lists are extracted much better with the CanolaExtractor. This poses a problem: we cannot have just one single configuration directive telling the parser which extractor to use for a whole crawl.

          Some thoughts on how to deal with it:

          • use Boilerpipe's estimator to automatically determine which extractor to use
          • have a facility to override false positives returned by the estimator and hardcode which extractor to use for URL groups (not unlike the subcollection plugin)
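
The second idea (hardcoding an extractor for groups of URLs) could be sketched roughly as follows. This is only an illustration of the lookup, not code from any attached patch: the class ExtractorOverrides, its methods, and the example URL pattern are all hypothetical, and the fallback value stands in for whatever an estimator or global setting would return.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical per-URL-group override table for extractor names. */
final class ExtractorOverrides {
  private final List<Map.Entry<Pattern, String>> rules = new ArrayList<>();
  private final String fallback;

  /** @param fallback extractor name used when no rule matches */
  ExtractorOverrides(String fallback) {
    this.fallback = fallback;
  }

  /** Registers an extractor name for every URL in which the regex is found. */
  ExtractorOverrides add(String urlRegex, String extractorName) {
    rules.add(new SimpleImmutableEntry<>(Pattern.compile(urlRegex), extractorName));
    return this;
  }

  /** The first matching rule wins; otherwise the fallback is returned. */
  String extractorFor(String url) {
    for (Map.Entry<Pattern, String> rule : rules) {
      if (rule.getKey().matcher(url).find()) {
        return rule.getValue();
      }
    }
    return fallback;
  }
}
```

For example, `new ExtractorOverrides("ArticleExtractor").add("^https?://news\\.example\\.org/overview/", "CanolaExtractor")` would send overview pages on that (made-up) host to the CanolaExtractor and everything else to the ArticleExtractor.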
          Markus Jelsma added a comment -

          Here's a WIP for 1.3 adding a repository (or factory) and patching parse-tika. Use the following settings to enable:

          tika.use_boilerpipe=true
          tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor

          Test with bin/nutch org.apache.nutch.parse.ParserChecker -dumpText <url>

          There is an issue with extracting anchors of outlinks from the source text. There may also be issues with the repository of which I'm currently unaware.

          Gabriele Kahlout added a comment -

          @Markus - BoilerpipeExtractorRepository.java == NUTCH-961-1.3-tikaparser.patch, content-wise.

          Markus Jelsma added a comment -

          Here's the correct file.

          Gabriele Kahlout added a comment -

          @Markus - Thank you.

          Watch out for [1] in parse-plugins.xml: .html pages may indeed be XHTML. You can safely delete all parse-html mimeType associations, as long as you have [2] (and you want to use parse-tika instead of parse-html).

          [1]
          <mimeType name="application/xhtml+xml">
            <plugin id="parse-html" />
          </mimeType>

          [2]
          <!-- by default if the mimeType is set to *, or
               if it can't be determined, use parse-tika -->
          <mimeType name="*">
            <plugin id="parse-tika" />
          </mimeType>

          Markus Jelsma added a comment -

          Not safely, there are still issues regarding HTML parsing with Tika, even without this nasty boilerpipe hack.

          Gabriele Kahlout added a comment -

          Yeah, I was looking for an issue that I think was about replacing parse-html with parse-tika as the default, but I found only NUTCH-869[1]. It has just been mentioned on the mailing list (by Julien), and I thought an issue had been filed for it.

          [1] https://issues.apache.org/jira/browse/NUTCH-869

          Gabriele Kahlout added a comment -

          Same as NUTCH-961-1.3-tikaparser.patch by Markus, but adds the necessary configuration to nutch-default.xml (not nutch-site.xml), as discussed on the mailing list or privately some time ago.

          Gabriele Kahlout added a comment -

          Modified to include necessary changes to parse-plugins.xml also.

          Gabriele Kahlout added a comment -

          Tested the patch against a checkout of 1.3 branch at revision 1101540, and made some trivial changes to TikaParser code.
          More interestingly I've also removed the following from parse-plugins.xml:

          <mimeType name="application/xhtml+xml">
            <plugin id="parse-html" />
          </mimeType>

          Gabriele Kahlout added a comment -

          Cleaned-up patch.
          To reproduce:

          export NUTCH_HOME=`pwd`"/nutch"; svn co -r 1101540 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 $NUTCH_HOME
          cp $MR_HOME/BoilerpipeExtractorRepository.java $NUTCH_HOME/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
          cd $NUTCH_HOME; patch -p0 -ui $MR_HOME/NUTCH-961v2.patch
          ant
          
          Gabriele Kahlout added a comment -

          BTW, have you considered a more general patch to support (rather than expose) all of Tika's options? I'm just thinking that perhaps no special Boilerpipe support per se should be exposed at the Nutch level (for the sake of code maintainability), only an ability to pass parameters to Tika. So at the Nutch level one sets properties in nutch-site.xml (or even tika-site.xml), and those are forwarded to Tika by the Tika-delegating parser plugin.
          There should therefore be no need for any Boilerpipe testing for example, but rather tika integration testing.
          I'm just thinking out loud (w/o any patch).

          Markus Jelsma added a comment -

          This is not a general patch and won't be. It can, however, be a dependency for a broader Tika patch, but I haven't seen other tickets as of yet.

          This patch cannot work by just passing parameters to Tika as it needs to use a different ContentHandler in parse-tika itself.

          Gabriele Kahlout added a comment -

          it needs to use a different ContentHandler in parse-tika itself.

          [Documentation opportunity] why?

          My intuition is that the default SAX ContentHandler returns the full page and then Tika handles it, this time with the Boilerpipe option.

          Ken Krugler added a comment -

          The way that Boilerpipe in Tika works is that it acts as a delegate, processing the SAX events generated by the default content handler that knows how to help clean up broken HTML.

          So it's incremental processing (you don't need to get the full page first).

          Separate note: Tika's Boilerpipe support now has an option to return HTML markup, so you could run it in this mode to get anchors/anchor text.
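
To make the delegate idea concrete, here is a minimal JDK-only sketch of a SAX handler that sits in the event chain and filters text as events arrive, without buffering the whole page first. It is not Tika's Boilerpipe handler: the class SkipNavHandler is hypothetical, and the rule "drop text inside <nav>" is a deliberately trivial stand-in for Boilerpipe's block classification.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical filter: forwards character data downstream unless it is inside <nav>. */
final class SkipNavHandler extends DefaultHandler {
  private final ContentHandler downstream;
  private int navDepth = 0; // > 0 while inside one or more <nav> elements

  SkipNavHandler(ContentHandler downstream) {
    this.downstream = downstream;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if ("nav".equals(qName)) navDepth++;
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("nav".equals(qName)) navDepth--;
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    // Each event is decided as it arrives; nothing is held back for a second pass.
    if (navDepth == 0) downstream.characters(ch, start, length);
  }

  /** Parses well-formed XHTML and returns the text that survived the filter. */
  static String extractText(String xhtml) {
    final StringBuilder out = new StringBuilder();
    ContentHandler sink = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), new SkipNavHandler(sink));
    } catch (Exception e) {
      throw new IllegalStateException("SAX parsing failed", e);
    }
    return out.toString();
  }
}
```

The sketch uses the JDK's XML parser, so it only accepts well-formed input; in parse-tika the events would come from Tika's HTML parser, which is what cleans up broken HTML.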

          Markus Jelsma added a comment -

          Ah, that's great! Is this in 0.9 or trunk? We still bind with 0.9. This may be useful because this patch doesn't add anchors to the detected outlinks. The last anchor(s) may contain the complete BP body! =D

          Markus Jelsma added a comment - - edited

          Patch to include markup from Tika. Anchors are now detected, but fewer outlinks are found! Does anyone have a good suggestion on where to fetch our outlinks with the anchors from?

          Markus Jelsma added a comment -

          With BP enabled you can get a java.util.EmptyStackException from DOMBuilder. This is fixed in this patch by adding another check around the peek and pop methods.

          http://mail-archives.apache.org/mod_mbox/nutch-user/201107.mbox/%3C201107151523.18511.markus.jelsma@openindex.io%3E

          There is no answer yet as to why this can occur, but I think checking before pop or peek is good anyway.
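
The kind of guard described here can be sketched as below. The helper class SafeStack and its method names are hypothetical and do not reproduce the patch; the only point is that java.util.Stack throws EmptyStackException from peek() and pop() on an empty stack, so checking isEmpty() first avoids the exception.

```java
import java.util.Stack;

/** Hypothetical helpers showing a guarded peek/pop on java.util.Stack. */
final class SafeStack {
  /** Returns the top element, or null instead of throwing when the stack is empty. */
  static <T> T peekOrNull(Stack<T> stack) {
    return stack.isEmpty() ? null : stack.peek();
  }

  /** Removes and returns the top element, or returns null when the stack is empty. */
  static <T> T popOrNull(Stack<T> stack) {
    return stack.isEmpty() ? null : stack.pop();
  }
}
```

Callers then have to handle the null case explicitly, for example by skipping an unmatched end-element event.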

          Markus Jelsma added a comment -

          It works in production but is still a big hack when dealing with outlinks. Mark as 1.5

          Markus Jelsma added a comment -

          Here's a working patch we use in production. It includes a nasty workaround in TikaParser to collect all outlinks. Without it, only outlinks from the extracted text are collected.

          This is a bit nasty, and I'd appreciate it if anyone with a bit more experience with Tika could shed some light on this.

          Markus Jelsma added a comment -

          Fixed already. See NUTCH-1233 for a patch!

          Markus Jelsma added a comment -

          20120304-push-1.6

          kiran added a comment -

          Markus, do you think this patch can also work for the 2.x series? If not, is it easy to port to 2.x? Please let me know your suggestions.

          Markus Jelsma added a comment -

          Should work fine, parse plugins have not changed that much. Keep in mind that you may need bp1.2.0 and keep an eye on link extraction. See related issues.

          Roland von Herget added a comment -

          Kiran, did you already start porting it to 2.x?

          kiran added a comment -

          No Roland, not yet. I just switched to using the 1.x series, but I will try porting this to 2.x this week.

          Roland von Herget added a comment -

          Status:

          • ported
          • compiles
          • yields same results as stock 2.1 if disabled (tika.use_boilerpipe=false)

          more tests needed

          Roland von Herget added a comment -
          • now with working config options
          • cleanup (removed unused useBoilerpipeEstimator)
          Miles Rowland added a comment -

          Roland, thanks for porting to 2.1. I'm having an issue where Nutch successfully parses only the first fetched URL, and all other URLs fail to parse with the warning "unable to successfully parse content [website] of type [x]". If I run ParserChecker on such a URL, the parse runs successfully using tika/boilerpipe, so the issue seems to occur only from the second parse onward in a batch job.

          I'm running Nutch 2.1 with MySQL. The problem occurs with both bp1.1.0 and 1.2.0.

          Markus Jelsma added a comment -

          Updated patch for trunk. Estimator code has been removed. Parser still relies on reparsing without BP for it to obtain all outlinks. See NUTCH-1233!

          Tien Nguyen Manh added a comment -

          I used patch NUTCH-961-2.1-v2.patch for nutch-2.2.1.
          I found that the text parsed by nutch-tika (with boilerpipe support) differs from the text parsed by the demo site http://boilerpipe-web.appspot.com
          I upgraded to boilerpipe 1.2.0 to match the demo site.

          The URL I tested is http://www.medhelp.org/posts/Eye-Care/EYE/show/1199003

          The text from nutch-tika (I use ArticleExtractor):

          EYE - Eye Care - MedHelp Experts My MedHelp Login or Signup Eye Care Community EYE Post a Question « Back to Community About This Community: This patient support community is for discussions relating to eye care, cataracts , glaucoma , retinal detachment , eye infections, misaligned eyes , intra-ocular implants, refractive surgery ( LASIK and CK), glasses, contact lenses, amblyopia , eye injuries, dry eyes , ocular allergy, eye pain and discomfort, pediatric eye disorders, eyelid and tearduct surgery, poor eyesight, and eye surgery. View community archives Font Size: A A ABackground: Search this Community: Go 3 Comments EYE My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ? Watch this discussion Tweet Related Discussions How to decide if glasses are needed for children? (8 replies):How can a Doctor tell if a child has amblyopia? Is t... [more] Astigmatism (1 replies):My 5 year old son has severe astigmatism. He wears glass... [more] Can someone help me in regards to my sons eyes? (6 replies):I had noticed my son had, had an eye issue when he was a... [more] Blurred vision with glasses (2 replies):Hi, I recently got new glasses and but the vision in my ... [more] Eyesight getting worse (2 replies):Hello! So here's the story. My eyesight had never been ... [more]

          And from the demo:

          3 Comments
          EYE
          My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ?

          The result from the demo is much better for this URL.
          So parse-tika/boilerpipe not only extracts the main content from the page but also includes the title and other node content.
          Is that expected?

          Otis Gospodnetic added a comment -

          Looks like Ken Krugler is offering to help with publishing Boilerpipe to a Sonatype Maven repo in TIKA-676 (this Nutch issue apparently depends on this Tika issue) - thanks Ken!

          But note that simply moving Nutch to Boilerpipe 1.2.0 won't fix the issue Tien Nguyen Manh just reported.
          Markus Jelsma, if Tien Nguyen Manh provides a patch that makes Nutch Boilerpipe output match that of the Boilerpipe demo, could you commit it to 2.x?

          Markus Jelsma added a comment -

          Hi Otis - there are no significant improvements between the 1.1.0 and 1.2.0 of Boilerpipe, at least not when it comes to better extraction. I am very sure that when the demo was using 1.2.0, we got identical results with 1.2.0 as well, but still poor in cases not suitable such as overviews, blocks etc. I am also very sure that the current 1.2.0 is nowadays different than what the demo returns, it is not identical anymore, and improved quite a lot.

          We don't use BP anymore, but I'm happy to commit whenever 1.2.0 is in Maven, or is part of Tika if it gets donated to the ASF. We need to get NUTCH-1233 in as well then.

          Otis Gospodnetic added a comment -

          We don't use it BP anymore

          What do you mean by that? I looked at parse-tika/plugins.xml earlier today and saw BP 1.1.0 there. So I'm not sure what you mean...

          Mateusz Zakarczemny added a comment -

          We don't use it BP anymore

          Will BP integration be totally abandoned? Are there any plans to use another content extractor instead of Boilerpipe?

          Markus Jelsma added a comment -

          I am sorry, I did not mean to speak for the Nutch PMC at all; "we" not using BP means that I am not using BP. As I said before, I am happy to commit this issue if the linked issues are resolved first.

          sarath chandra chama added a comment -

          Hi Markus, is it possible to release a patch for Nutch 1.9?

          Alexander Kingson added a comment -

          Hello,

          Since I was not getting satisfactory results after upgrading to boilerpipe 1.2.0 with parse-tika (with boilerpipe support), I have added some code to the nutch-2.x parser to get the same results as the boilerpipe demo website. It uses some code from .v2.patch.
          Attaching the patch.

          Thanks.
          Alex.

          Vincent Slot added a comment -

          Modified the NUTCH-961 patch for 1.11

          Otis Gospodnetic added a comment -

          Any chance we could commit this, Markus Jelsma?

          Markus Jelsma added a comment -

          Yes but it requires NUTCH-1233.

          Markus Jelsma added a comment -

          Update: I've updated NUTCH-1233 for current trunk and provided a fix for the outlink extraction in Tika via TIKA-1835.

          Tien Nguyen Manh added a comment -

          I'm using the patch NUTCH-961-1.11-1.patch. It works fine when run from Eclipse and when run in Hadoop, but it has a problem when I run it in local mode.
          It throws the exception "Can't retrieve Tika parser for mime-type text/html". The problem is not in parse-plugins.xml. It seems to be in the TikaConfig constructor TikaConfig(ClassLoader loader), which fails to load some config via the class loader when run in local mode.

          Markus Jelsma added a comment -

          Hello - that doesn't seem related to this issue, as the patch doesn't interfere with how it's loaded. Also, we cannot reproduce that locally or in Hadoop mode. But there was a thread on the mailing list a couple of days ago that mentioned an issue like the one you describe.

          Markus Jelsma added a comment -

          Some news: the upstream Tika issue has been committed and resolved, and I have requested an earlier Tika RC, to which Chris Mattmann responded positively. An early Tika 1.12 might come soon, after which I can quickly resolve NUTCH-1233 and, of course, this issue.

          One question to all of you, and the PMC specifically: I would like to propose enabling the Boilerpipe ArticleExtractor by default. I cannot think of any scenario in which a user would not want this. Please share your thoughts.

          Tien Nguyen Manh added a comment -

          One note on boilerpipe support: it is significantly slower than parse-html. I parsed the same segment with each, and here are the results:
          parse-html: 3hm, parse-tika with boilerpipe: 5h10m, and parse-tika without boilerpipe: 4h.

          Markus Jelsma added a comment -

          That is probably due to the patch parsing twice. Once with BP for text, and once without for link extraction.

          Tien Nguyen Manh added a comment - - edited

          Ah yes. Could you explain why we need to parse it twice? With NUTCH-1233, can we use just one parse?

          Markus Jelsma added a comment -

          With Boilerpipe, you get only very few outlinks, namely those found in the extracted text, and that is a problem.

          Tien Nguyen Manh added a comment -

          Can NUTCH-1233 (use Tika to extract outlinks) solve that problem?

          Markus Jelsma added a comment -

          Yes!
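
The exchange above turns on whether text extraction and outlink extraction can share one parse. As a rough illustration of that general idea only (this is not code from any NUTCH-961 or NUTCH-1233 patch), the sketch below runs a single SAX pass over a small well-formed XHTML string and forwards each event to two handlers: one that collects text (where a boilerplate filter could sit) and one that collects every `href`, so the link handler sees all anchors regardless of what the text handler keeps. The class names `SinglePassSketch`, `TextCollector`, `LinkCollector` and `Tee` are hypothetical, and the JDK's XML parser stands in for Tika's HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; names are not taken from Nutch or Tika.
class SinglePassSketch {

  /** Collects character data; a stand-in for the text extraction side. */
  static class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }
  }

  /** Collects href values of <a> elements; a stand-in for outlink extraction. */
  static class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      String href = atts.getValue("href");
      if ("a".equals(qName) && href != null) {
        links.add(href);
      }
    }
  }

  /** Forwards the two event types used in this sketch to both delegates. */
  static class Tee extends DefaultHandler {
    private final DefaultHandler first;
    private final DefaultHandler second;

    Tee(DefaultHandler first, DefaultHandler second) {
      this.first = first;
      this.second = second;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
      first.startElement(uri, localName, qName, atts);
      second.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
      first.characters(ch, start, length);
      second.characters(ch, start, length);
    }
  }

  /** Parses the input once and feeds both handlers from the same event stream. */
  static void parseOnce(String xhtml, DefaultHandler first, DefaultHandler second) {
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), new Tee(first, second));
    } catch (Exception e) {
      throw new IllegalStateException("parse failed", e);
    }
  }
}
```

Calling `SinglePassSketch.parseOnce(xhtml, new SinglePassSketch.TextCollector(), new SinglePassSketch.LinkCollector())` fills both collectors from one pass; whether the real patches take this route is not stated in the thread.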

          Markus Jelsma added a comment -

          Patch for trunk.

          Markus Jelsma added a comment -

          Tests pass as expected, and Boilerpipe works as well. Will commit shortly.

          Markus Jelsma added a comment -

          Updated patch. ExtractorRepository was missing.

          Markus Jelsma added a comment -

          Committed to trunk in revision 1730694. Thanks, everyone, for the contributions.

          Hudson added a comment -

          SUCCESS: Integrated in Nutch-trunk #3347 (See https://builds.apache.org/job/Nutch-trunk/3347/)
          NUTCH-961 Expose Tika's Boilerpipe support (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1730694)

          • trunk/CHANGES.txt
          • trunk/conf/nutch-default.xml
          • trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
          • trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
          ASF GitHub Bot added a comment -

          GitHub user jeremie70 opened a pull request:

          https://github.com/apache/nutch/pull/92

          Add the boilerpipe parsing adapted from NUTCH-961

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/jeremie70/nutch my-branch

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/nutch/pull/92.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #92


          commit f185bc4461c57a1a85578de0ecf0884c7026c3a6
          Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
          Date: 2016-02-26T10:37:28Z

          improve parser with boilerpipe

          commit 93ea2e51f444447be41ec93b2c0b0b61c117eeb3
          Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
          Date: 2016-02-26T10:37:28Z

          NUTCH-961 improve parser with boilerpipe

          commit be91764fdf59d4f6930fc3211a84a252e5452674
          Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
          Date: 2016-02-26T11:00:36Z

          Merge branch 'my-branch' of https://github.com/jeremie70/nutch into my-branch


          ASF GitHub Bot added a comment -

          Github user lewismc commented on a diff in the pull request:

          https://github.com/apache/nutch/pull/92#discussion_r54332145

          --- Diff: conf/nutch-default.xml ---
          @@ -876,6 +876,19 @@
          </description>
          </property>

          +<!-- tika properties -->
          +
          +<property>
          + <name>tika.boilerpipe</name>
          + <value>false</value>
          --- End diff --

          Can you provide descriptions of these properties please?

          ASF GitHub Bot added a comment -

          Github user lewismc commented on a diff in the pull request:

          https://github.com/apache/nutch/pull/92#discussion_r54332155

          --- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
          @@ -0,0 +1,62 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +package org.apache.nutch.parse.tika;
          +
          +import java.lang.ClassLoader;
          +import java.lang.InstantiationException;
          +import java.util.WeakHashMap;
          +import org.apache.commons.logging.Log;
          --- End diff --

          Nutch currently uses Slf4j

          org.slf4j.Logger
          org.slf4j.LoggerFactory

          I think!

          ASF GitHub Bot added a comment -

          Github user lewismc commented on a diff in the pull request:

          https://github.com/apache/nutch/pull/92#discussion_r54332193

          --- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
          @@ -0,0 +1,62 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +package org.apache.nutch.parse.tika;
          +
          +import java.lang.ClassLoader;
          +import java.lang.InstantiationException;
          +import java.util.WeakHashMap;
          +import org.apache.commons.logging.Log;
          +import org.apache.commons.logging.LogFactory;
          +import org.apache.tika.parser.html.BoilerpipeContentHandler;
          +import de.l3s.boilerpipe.BoilerpipeExtractor;
          +import de.l3s.boilerpipe.extractors.*;
          +
          +class BoilerpipeExtractorRepository {
          +
          + public static final Log LOG = LogFactory.getLog(BoilerpipeExtractorRepository.class);
          + public static final WeakHashMap<String, BoilerpipeExtractor> extractorRepository = new WeakHashMap<String, BoilerpipeExtractor>();
          +
          + /**
          + * Returns an instance of the specified extractor
          + */
          + public static BoilerpipeExtractor getExtractor(String boilerpipeExtractorName) {
          + // Check if there's no instance of this extractor
          + if (!extractorRepository.containsKey(boilerpipeExtractorName)) {
          + // FQCN
          + boilerpipeExtractorName = "de.l3s.boilerpipe.extractors." + boilerpipeExtractorName;
          +
          + // Attempt to load the class
          + try {
          + ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
          + Class extractorClass = loader.loadClass(boilerpipeExtractorName);
          +
          + // Add an instance to the repository
          + extractorRepository.put(boilerpipeExtractorName, (BoilerpipeExtractor)extractorClass.newInstance());
          +
          + } catch (ClassNotFoundException e) {
          + LOG.error("BoilerpipeExtractor " + boilerpipeExtractorName + " not found!");
          --- End diff --

          In slf4j we can better structure the catch
          http://www.slf4j.org/faq.html#logging_performance
          e.g.
          ```
          logger.debug("The entry is {}.", entry);
          ```
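
One detail in the quoted excerpt may be worth a second look: the `containsKey` check uses the short extractor name, while the `put` stores the instance under the fully qualified class name. The sketch below shows one way a name-keyed cache could keep the same key for lookup and insertion. It is only an illustration under assumptions: `ExtractorCacheSketch`, the nested `Extractor` interface and the nested `ArticleExtractor` class are hypothetical stand-ins so the example runs without Boilerpipe on the classpath, and the choice of `ConcurrentHashMap` is mine, not something the patch uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins; the real extractors live in de.l3s.boilerpipe.extractors.
class ExtractorCacheSketch {

  /** Stand-in for BoilerpipeExtractor. */
  interface Extractor {
    String name();
  }

  /** Stand-in for an extractor implementation that is loaded by name. */
  static class ArticleExtractor implements Extractor {
    public String name() {
      return "ArticleExtractor";
    }
  }

  // Keyed by the short name, the same key that every lookup uses.
  private static final Map<String, Extractor> CACHE = new ConcurrentHashMap<>();

  /** Returns a cached instance for the short name, or null if it cannot be loaded. */
  static Extractor getExtractor(String shortName) {
    Extractor cached = CACHE.get(shortName);
    if (cached != null) {
      return cached;
    }
    // Build the class name in a separate variable so the cache key never changes.
    String className = ExtractorCacheSketch.class.getName() + "$" + shortName;
    try {
      Extractor created =
          (Extractor) Class.forName(className).getDeclaredConstructor().newInstance();
      Extractor previous = CACHE.putIfAbsent(shortName, created);
      return previous != null ? previous : created;
    } catch (ReflectiveOperationException | ClassCastException e) {
      // Unknown or unusable name: the caller decides how to log or fall back.
      return null;
    }
  }
}
```

With this shape, two calls to `getExtractor("ArticleExtractor")` return the same instance, and an unknown name yields `null` instead of a cached entry.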

          ASF GitHub Bot added a comment -

          Github user lewismc commented on a diff in the pull request:

          https://github.com/apache/nutch/pull/92#discussion_r54332201

          --- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java ---
          @@ -109,7 +114,18 @@ public Parse getParse(String url, WebPage page) {
          HTMLDocumentImpl doc = new HTMLDocumentImpl();
          doc.setErrorChecking(false);
          DocumentFragment root = doc.createDocumentFragment();

          - DOMBuilder domhandler = new DOMBuilder(doc, root);
          + // DOMBuilder domhandler = new DOMBuilder(doc, root);
          + ContentHandler domHandler;
          + // Check whether to use Tika's BoilerplateContentHandler
          + if (useBoilerpipe) {
          + LOG.debug("Using Tikas's Boilerpipe with Extractor: " + boilerpipeExtractorName);
          --- End diff --

          Can also use more efficient slf4j convention
          logger.debug("The entry is {}.", entry);
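
To make the efficiency point concrete, here is a small sketch of why the placeholder style can be cheaper than string concatenation when the log level is disabled. So that it runs with no extra dependencies, it uses `java.util.logging` from the JDK instead of slf4j; the placeholder there is written `{0}` where slf4j uses `{}`, but the deferred-formatting idea is the same. `DeferredLoggingSketch` and `CountingEntry` are hypothetical names introduced only for this illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration using java.util.logging in place of slf4j.
class DeferredLoggingSketch {

  /** Counts toString() calls so the cost of building a message is observable. */
  static class CountingEntry {
    int toStringCalls = 0;

    @Override
    public String toString() {
      toStringCalls++;
      return "entry";
    }
  }

  private static final Logger LOG = Logger.getLogger("DeferredLoggingSketch");

  /** Concatenation builds the full message even though FINE is disabled. */
  static int callsWithConcatenation() {
    LOG.setLevel(Level.INFO);
    CountingEntry entry = new CountingEntry();
    LOG.fine("The entry is " + entry + ".");
    return entry.toStringCalls;
  }

  /** The placeholder form passes the argument along; formatting is skipped when FINE is disabled. */
  static int callsWithPlaceholder() {
    LOG.setLevel(Level.INFO);
    CountingEntry entry = new CountingEntry();
    LOG.log(Level.FINE, "The entry is {0}.", entry);
    return entry.toStringCalls;
  }
}
```

With the logger at `INFO`, the concatenating call invokes `toString()` once and the placeholder call does not invoke it at all.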

          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/nutch/pull/92


            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              6
              Watchers:
              16
