Nutch / NUTCH-2276

Tika Boilerpipe Parser in combo with RSS items doesn't work

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.11, 1.12
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None
    • Environment:

      feed parser for RSS
      Tika parser with boilerpipe (with ArticleExtractor) for HTML

      Description

      Sometimes the text (description) of an RSS item is so short, or has such characteristics, that Tika with Boilerpipe discards the entire text, resulting in an empty string.

      This happens because the feed plugin selects its parser with the call:
      Parser parser = parserFactory.getParsers(contentType, link)[0];
      Since the content is HTML, this returns the Tika Boilerpipe article extractor.

      Since the description text of an RSS item is, as far as I know, always HTML, we could set a dedicated MIME type for this specific case instead of reading the contentType from the response metadata:
      String contentType = contentMeta.get(Response.CONTENT_TYPE);
      ->String contentType = "text/html-short";
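
The proposed change can be sketched in isolation as follows. This is only an illustration under stated assumptions, not Nutch code: the class FeedParserSelectionSketch, the parser names ("tika-boilerpipe", "plain-html", "fallback") and the lookup table are hypothetical stand-ins for the real ParserFactory and its MIME type mapping. It only shows that a dedicated MIME type such as "text/html-short" would let feed item descriptions be routed away from the Boilerpipe extractor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the parser lookup; NOT the real Nutch ParserFactory.
class FeedParserSelectionSketch {

    // Assumed MIME type -> parser name table (names are made up for illustration).
    static final Map<String, String> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("text/html", "tika-boilerpipe");   // full pages: article extraction
        PARSERS.put("text/html-short", "plain-html");  // short snippets: no boilerplate removal
    }

    // Returns the parser registered for the MIME type, or "fallback" if none is registered.
    static String selectParser(String contentType) {
        return PARSERS.getOrDefault(contentType, "fallback");
    }

    // The idea from the report: ignore the reported type for feed item
    // descriptions and always use the dedicated MIME type instead.
    static String contentTypeForFeedItem(String reportedContentType) {
        return "text/html-short";
    }

    public static void main(String[] args) {
        String reported = "text/html";
        // Current behaviour: the reported type selects the Boilerpipe extractor.
        System.out.println(selectParser(reported));
        // Proposed behaviour: the dedicated type selects a different parser.
        System.out.println(selectParser(contentTypeForFeedItem(reported)));
    }
}
```

In a real setup the dedicated MIME type would also have to be mapped to a suitable parser in the parser configuration; otherwise the lookup would simply fall through to whatever default applies.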

            People

            • Assignee:
              Unassigned
            • Reporter:
              capponi.francesco@gmail.com Francesco Capponi
            • Votes:
              0
            • Watchers:
              1
