Nutch / NUTCH-745

MyHtmlParser.getParse returns a non-null result, so the Analyzer-(zh|fr) plugins cannot run


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.0.0
    • None
    • None
    • None
    • Environment: JDK 1.6 + Tomcat 6 + Eclipse 3.3 + Nutch 1.0

    Description

      MyHtmlParser.getParse returns a non-null (failed) ParseResult instead of null, so the Analyzer-(zh|fr) plugins never get to run:

      public ParseResult getParse(Content content) {
        return ParseResult.createParseResult(content.getUrl(),
            new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_MISSING_CONTENT,
                "No textual content available").getEmptyParse(conf));
        // return null;
      }

      ========nutch-site.xml=======
      <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(myHtml|html|text|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier|analysis-(zh)</value>
      <description><![CDATA[

      ]]> </description>
      </property>
      ==========parse-plugins.xml============
      <mimeType name="text/html">
      <plugin id="parse-myHtml" />
      <plugin id="parse-html" />
      </mimeType>
      <alias name="parse-myHtml"
      extension-id="org.apache.nutch.parse.html.MyHtmlParser" />

      ===src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java========
      public ParseResult getParse(Content content) {
      .....
      // the following call is never reached:
      ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
      metaTags, root);
      .......
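
      The behaviour described above can be illustrated with a small stand-alone sketch. It assumes, as the report's title implies, that the parser chain stops at the first parser returning a non-null result. This is not Nutch code: the class ParserChainSketch, the method firstNonNull, and the use of plain strings in place of ParseResult are all hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ParserChainSketch {

    /**
     * Hypothetical stand-in for a parser chain (NOT the Nutch API):
     * tries each parser in order and returns the first non-null result.
     */
    public static String firstNonNull(List<Function<String, String>> parsers, String content) {
        for (Function<String, String> parser : parsers) {
            String result = parser.apply(content);
            if (result != null) {
                return result; // parsers later in the list are never consulted
            }
        }
        return null; // every parser declined
    }

    public static void main(String[] args) {
        // Plays the role of parse-myHtml returning a "failed" but non-null result.
        Function<String, String> failingButNonNull = c -> "FAILED: No textual content available";
        // Plays the role of parse-myHtml returning null to decline the content.
        Function<String, String> declining = c -> null;
        // Plays the role of parse-html, listed second for text/html.
        Function<String, String> fallback = c -> "parsed: " + c;

        System.out.println(firstNonNull(Arrays.asList(failingButNonNull, fallback), "<html/>"));
        System.out.println(firstNonNull(Arrays.asList(declining, fallback), "<html/>"));
    }
}
```

      Under that assumption, a first parser that returns a failed but non-null result hides the fallback parser, while one that returns null lets the next parser in the list handle the content.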

          People

            Assignee: Unassigned
            Reporter: Tian Xia (xiatian)