  1. Nutch
  2. NUTCH-1481

When using MySQL as storage, Unicode characters within URLs cause Nutch to fail

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.1
    • Fix Version/s: 2.3
    • Component/s: crawldb
    • Environment:

      mysql 5.5.28 on centos

      Description

      MySQL's (InnoDB) primary key / unique key is restricted to 767 bytes. Currently, the URL of a web page is used as the primary key in Nutch storage.

      When the 'id' column uses the latin1 character set at a length of 767 characters (767 bytes), Unicode characters in URLs cause JDBC to throw an exception:
      java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE2\x80\x8' for column 'id' at row 1

      When the 'id' column uses the utf8mb4 character set at a length of 190 characters (760 bytes) to fully support Unicode characters, the field length becomes insufficient for longer URLs.
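The arithmetic behind the two column lengths above can be sketched as follows. This is only an illustration: the class and method names (`KeyLimit`, `maxIndexableChars`, `fits`) are hypothetical and are not part of Nutch or MySQL. It assumes the index budget is the 767-byte limit quoted above and that the column is sized by the worst-case bytes per character of its character set.

```java
// Hypothetical helper, not part of Nutch: shows how the 767-byte key
// limit translates into a maximum column length per character set.
final class KeyLimit {
    // Key size limit quoted in this report.
    static final int MAX_KEY_BYTES = 767;

    private KeyLimit() {}

    // Largest column length (in characters) whose worst case still
    // fits in the key, given the charset's maximum bytes per character.
    static int maxIndexableChars(int maxBytesPerChar) {
        return MAX_KEY_BYTES / maxBytesPerChar;
    }

    // True if the URL has at most columnChars characters, counted as
    // Unicode code points rather than UTF-16 units.
    static boolean fits(String url, int columnChars) {
        return url.codePointCount(0, url.length()) <= columnChars;
    }
}
```

With 1 byte per character (latin1) the whole 767 characters are available; with 4 bytes per character (utf8mb4) the ceiling drops to 191, which is why the 190-character column above is close to the largest possible.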

      It may be better to use a hash of the URL as the primary key instead of the URL itself. This would allow URLs of any length and full UTF-8 support.
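A minimal sketch of the hashing idea, assuming SHA-256 as the hash (the report does not name one) and a hypothetical `UrlKey` helper that is not part of Nutch. The key is a fixed 64-character ASCII hex string, so it fits within the 767-byte limit in any character set regardless of the URL's length or content. The full URL would still have to be stored in a separate, non-key column.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: derives a fixed-length
// primary key from a URL of any length.
final class UrlKey {
    private UrlKey() {}

    // Returns the SHA-256 digest of the URL's UTF-8 bytes as
    // 64 lowercase hex characters.
    static String hashKey(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Emit the high and low nibble of each byte.
                hex.append(Character.forDigit((b >> 4) & 0xF, 16));
                hex.append(Character.forDigit(b & 0xF, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The trade-off is that a hashed key no longer sorts or prefix-scans by URL, so any code that relies on key ordering would need another index.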

            People

            • Assignee: Unassigned
            • Reporter: Arni Sumarlidason (sumarlidason)
            • Votes: 0
            • Watchers: 5
