  1. Nutch
  2. NUTCH-2696

Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: segment
    • Labels: None
    • Environment:
      Hadoop version: 3.0.0 (CDH 6.1)
      Nutch: 1.15
      Mode: distributed

    Description

      All Nutch tasks work properly with Hadoop 3.x, except SegmentReader.
      SegmentReader with the -get option works fine.
      SegmentReader with the -dump option replaces every non-ASCII character with ?

      Example URL: http://www.wikipedia.fr/index.php

       

      command : ./runtime/deploy/bin/nutch readseg -dump /user/nutch/crawl1.15/segments/20190221093756 /tmp/dump1.15 -nocontent -nogenerate -noparse -noparsedata
      ParseText::
       Wikipedia.fr - Portail de recherche sur les projets Wikim?dia
       Chercher sur Wikip?dia en fran?ais
       L?encyclop?die librement r?utilisable que chacun peut am?liorer.
      

       

       

      command : ./runtime/deploy/bin/nutch readseg -get /user/nutch/crawl1.15/segments/20190221093756 http://www.wikipedia.fr/index.php -nocontent -nogenerate -noparse -noparsedata
      ParseText::
       Wikipedia.fr - Portail de recherche sur les projets Wikimédia
       Chercher sur Wikipédia en français
       L’encyclopédie librement réutilisable que chacun peut améliorer.
      

       

      I tried building with the Hadoop 3.0.0 dependencies in ivy.xml, but I got the same result.

      It works fine in local mode.
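
      The ? substitution in the -dump output matches what Java does when text is encoded with a charset that cannot represent a character. The sketch below illustrates only that mechanism. It assumes, without confirmation from this report, that the JVMs running the -dump job in distributed mode use an ASCII default charset while the client JVM used by -get and local mode uses UTF-8. The class and method names are hypothetical and are not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Nutch code.
public class CharsetDumpDemo {

    // Encode the text with the given charset and decode it again,
    // which mimics writing a dump file and reading it back.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's default replacement, which is '?' for US-ASCII.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String line = "Chercher sur Wikip\u00e9dia en fran\u00e7ais";

        String ascii = roundTrip(line, StandardCharsets.US_ASCII);
        String utf8 = roundTrip(line, StandardCharsets.UTF_8);

        System.out.println("ASCII round trip lossless: " + ascii.equals(line));
        System.out.println("UTF-8 round trip lossless: " + utf8.equals(line));
        System.out.println("ASCII result: " + ascii);

        // The charset an unqualified Writer or getBytes() call would use here.
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

      Running the same class on the client and inside a task container and comparing the "Default charset" line is one way to check whether the assumption holds for a given cluster.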

       

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Laurent Hervaud (lhervaud)
              Votes: 0
              Watchers: 4
