Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchbase, 1.2, nutchgora
    • Fix Version/s: nutchbase, 1.2, nutchgora
    • Component/s: parser
    • Labels:
      None

      Description

      We need to add the parse-html plugin back. Tika 0.7 has a few serious problems with HTML parsing, so it is not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so once a version of Tika later than 0.7 passes our tests we can remove this plugin again and rely on parse-tika alone.

          People

          • Assignee:
            Julien Nioche
            Reporter:
            Andrzej Bialecki
          • Votes:
            1
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: