Nutch / NUTCH-625

Non-ASCII characters broken in dumped content with mixed encodings (UTF-8 and other multi-byte)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • 1.0.0

    Description

      If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8 non-ASCII characters (i.e. characters in another multi-byte encoding), the content dumped by the readseg utility shows garbled characters in all of the non-UTF-8 non-ASCII text, and that text cannot be repaired by reloading it with a different encoding.

      The UTF-8 text, by contrast, is dumped correctly; only the non-UTF-8 text is broken.

      Is there any way to repair the broken text?
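
      The report does not identify the cause. One plausible reading, offered here only as an assumption, is that the dump step decodes every page as UTF-8, so bytes in another encoding are replaced with U+FFFD before they are written out. The Python sketch below illustrates that failure mode with Big5 as the second encoding. It does not call Nutch or readseg, and the function name `dump_as_utf8` is hypothetical.

```python
# Illustration only: simulates a dump that assumes UTF-8 for every page.
# Nothing here is Nutch code; dump_as_utf8 is a hypothetical stand-in.

REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character

ORIGINAL_TEXT = "\u4e2d\u6587"  # two CJK characters, meaning "Chinese"


def dump_as_utf8(raw: bytes) -> bytes:
    """Decode bytes as UTF-8 (replacing invalid sequences) and re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# The same text stored in two different encodings.
utf8_page = ORIGINAL_TEXT.encode("utf-8")
big5_page = ORIGINAL_TEXT.encode("big5")

dumped_utf8 = dump_as_utf8(utf8_page)
dumped_big5 = dump_as_utf8(big5_page)

# The UTF-8 page survives the round trip unchanged.
utf8_intact = dumped_utf8 == utf8_page

# The Big5 page now contains replacement characters instead of its bytes.
big5_has_replacement = REPLACEMENT in dumped_big5.decode("utf-8")

# "Reloading" the dump as Big5 cannot bring the text back, because the
# original byte values were discarded when they were replaced.
reloaded = dumped_big5.decode("big5", errors="replace")
big5_recovered = reloaded == ORIGINAL_TEXT

print("UTF-8 page intact:", utf8_intact)
print("Big5 page contains U+FFFD:", big5_has_replacement)
print("Big5 page recovered by reload:", big5_recovered)
```

      If this is what happens, the dumped text cannot be repaired after the fact. The non-UTF-8 pages would have to be decoded with their real encoding before dumping, or read from the raw stored bytes instead.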

          People

            Assignee: Unassigned
            Reporter: Vinci
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: