Nutch / NUTCH-950

Content-Length limit, URL filter and a few minor issues

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      1. crawl command (nutch1.patch)

      The class was renamed to Crawler but the references to it were not updated.

      2. URL filter (nutch2.patch)

      This avoids an NPE on bogus URLs whose host does not have a suffix.
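
A minimal sketch of the kind of null guard described here, not the contents of nutch2.patch. The class `SuffixGuardSketch`, the tiny `KNOWN_SUFFIXES` table and the `findSuffix` helper are hypothetical stand-ins for the real suffix lookup, and rejecting suffix-less hosts (rather than letting them through) is an assumption. The only convention borrowed from Nutch is that a URL filter returns the URL to accept it and null to reject it.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SuffixGuardSketch {
    // Hypothetical stand-in for a real domain suffix table.
    private static final Set<String> KNOWN_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "net"));

    /** Returns the known suffix of the host, or null when the host has none. */
    public static String findSuffix(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // e.g. "localhost" has no suffix at all
        }
        String candidate = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(candidate) ? candidate : null;
    }

    /** Returns the URL when it is accepted, or null when it is rejected. */
    public static String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not parseable as a URI
        }
        if (host == null) {
            return null; // no usable host component
        }
        String suffix = findSuffix(host);
        if (suffix == null) {
            // The guard: without this check, dereferencing the suffix
            // further down would throw a NullPointerException.
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://nutch.apache.org/"));
        System.out.println(filter("http://localhost/"));
    }
}
```

The point of the sketch is only that the lookup result is checked before use, so one bogus URL is dropped instead of aborting the whole filtering step.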

      3. Content-Length limit (nutch3.patch)

      This is related to NUTCH-899.
      The patch prevents the entire flush operation on the Gora datastore from crashing when the MySQL blob limit is exceeded by a few bytes. Both the protocol-http and protocol-httpclient plugins were affected.
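
To illustrate the idea of enforcing a hard content limit, here is a rough sketch of a bounded read. It is not the code from nutch3.patch: the class `ContentLimitSketch`, the method `readAtMost` and the limit value used in `main` are all hypothetical. The sketch only shows that a reader which stops exactly at the limit can never hand the datastore a value that is a few bytes too large.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentLimitSketch {
    /**
     * Reads from the stream until it ends or maxBytes have been read,
     * whichever comes first. A non-positive maxBytes reads nothing
     * (a simplification made for this sketch).
     */
    public static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never request more than what is left of the budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int limit = 65535; // illustrative limit only
        byte[] body = new byte[limit + 5]; // response slightly over the limit
        byte[] stored = readAtMost(new ByteArrayInputStream(body), limit);
        System.out.println(stored.length);
    }
}
```

Capping the read inside the loop, rather than truncating afterwards, keeps the stored content within the limit regardless of how the underlying stream chunks its data.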

      4. Ivy configuration (nutch4.patch)

      • Change the xercesImpl and restlet versions. Both changes are required: the current xercesImpl version makes a JUnit test crash, and the current restlet version is missing from the default Maven repository.
      • Add gora-hbase and zookeeper (an HBase dependency), plus the MySQL connector. These jars are necessary to run Gora with the HBase or MySQL datastores. (This is more a suggestion than a requirement.)
      • Add com.jcraft/jsch, which is a protocol-sftp plugin dependency.

      Attachments

        1. nutch4.patch (2 kB, Alexis)
        2. nutch3.patch (2 kB, Alexis)
        3. nutch2.patch (1 kB, Alexis)
        4. nutch1.patch (1 kB, Alexis)

          People

            Assignee: Unassigned
            Reporter: Alexis (alexis779)