  1. Nutch
  2. NUTCH-1481

When using MySQL as storage, Unicode characters within URLs cause Nutch to fail

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version: 2.1
    • Fix Version: 2.3
    • Component: crawldb
    • Environment: MySQL 5.5.28 on CentOS

    Description

      MySQL's (InnoDB) primary key / unique key is restricted to 767 bytes. Currently, the URL of a web page is used as the primary key in Nutch storage.

      When using the latin1 character set on the 'id' column at a length of 767 bytes/characters, Unicode characters in URLs cause JDBC to throw an exception:
      java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE2\x80\x8' for column 'id' at row 1
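
The failure above can be reproduced in spirit outside MySQL. The sketch below is only an illustration: it uses Python's `cp1252` codec as an approximation of MySQL's latin1, and the character U+200B is an assumed example (the byte sequence in the reported error is truncated, so the actual character is not known). The function name `fits_latin1` is hypothetical.

```python
def fits_latin1(url: str) -> bool:
    """Return True if every character of the URL has a cp1252 encoding.

    cp1252 is used here as a stand-in for MySQL's latin1 (an assumption;
    the two differ for a handful of byte values).
    """
    try:
        url.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# A plain ASCII URL is representable in a single-byte character set.
plain_url = "http://example.com/page"

# Assumed example: a URL containing U+200B (zero width space), whose
# UTF-8 encoding is three bytes and which has no cp1252 equivalent.
unicode_url = "http://example.com/a\u200bb"

print(fits_latin1(plain_url))
print(fits_latin1(unicode_url))
```

A URL for which this check fails is the kind of value a latin1 'id' column would be expected to reject.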

      When using the utf8mb4 character set on the 'id' column at a length of 190 characters / 760 bytes to fully support Unicode characters, the field length becomes insufficient to hold long URLs.
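
The trade-off between character set and key length is simple arithmetic on the 767-byte limit. The following is a minimal sketch, assuming the worst-case bytes per character usually quoted for each MySQL character set (1 for latin1, 3 for utf8, 4 for utf8mb4). The names `max_indexable_chars` and `fits_in_key` are hypothetical helpers, not part of Nutch or MySQL.

```python
# InnoDB key limit cited in the report, in bytes.
INNODB_KEY_LIMIT_BYTES = 767

# Assumed worst-case storage per character for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset: str) -> int:
    """Largest column length (in characters) whose worst case fits the limit."""
    return INNODB_KEY_LIMIT_BYTES // MAX_BYTES_PER_CHAR[charset]


def fits_in_key(url: str, charset: str) -> bool:
    """Check whether a URL's character count fits the indexable length."""
    return len(url) <= max_indexable_chars(charset)


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_indexable_chars(charset))
```

Under these assumptions the utf8mb4 budget is 191 characters, so the 190 characters / 760 bytes in the report sits just under the limit, and any URL longer than that cannot be used as the key.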

      It may be better to use a hash of the URL as the primary key instead of the URL itself. This would allow URLs of any length and full UTF-8 support.
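
One way the hashing proposal could look is sketched below. This is not how Nutch stores pages (the issue was resolved as Won't Fix); the choice of SHA-256 and the function name `url_key` are assumptions made for illustration.

```python
import hashlib


def url_key(url: str) -> str:
    """Derive a fixed-length, ASCII-only key from a URL of any length.

    The URL is encoded as UTF-8 before hashing, so Unicode characters
    are handled, and the hex digest is always 64 characters long.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


short_url = "http://example.com/"
long_unicode_url = "http://example.com/" + "\u200b" * 5000

# Both keys have the same length regardless of the input URL.
print(len(url_key(short_url)), len(url_key(long_unicode_url)))
```

A 64-character ASCII key fits within the 767-byte limit under any of the character sets discussed. With this design the full URL would still need to be stored in a separate, non-key column, and a hash collision, though very unlikely with SHA-256, would map two URLs to the same row.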

          People

            Assignee: Unassigned
            Reporter: Arni Sumarlidason (sumarlidason)
            Votes: 0
            Watchers: 5
