NUTCH-406: Metadata tries to write null values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

    Description

      During parsing, some URLs (especially PDFs, it seems) can produce <some_key, null> pairs in ParseData's parseMeta.
      When Metadata.write() tries to serialize such a pair, it throws a NullPointerException (NPE).

      The stack trace looks something like this:
      at org.apache.hadoop.io.Text.encode(Text.java:373)
      at org.apache.hadoop.io.Text.encode(Text.java:354)
      at org.apache.hadoop.io.Text.writeString(Text.java:394)
      at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)

      I can consistently reproduce this with the following URL:
      http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
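
One way to guard against the failure described above is to drop null values before they reach the serializer. The sketch below is only an illustration of that idea and is not taken from the attached patches. The class NullSafeMetadataSketch and its methods are hypothetical, and java.io.DataOutput.writeUTF stands in for Hadoop's Text.writeString so the example runs without Nutch or Hadoop on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata container; this is NOT Nutch's Metadata class.
class NullSafeMetadataSketch {
    private final Map<String, String[]> entries = new LinkedHashMap<>();

    // Stores values unchanged, nulls included, to mimic what a parser might hand over.
    void set(String key, String... values) {
        entries.put(key, values);
    }

    // Writes only non-null values; keys left with no values are skipped entirely.
    void write(DataOutput out) throws IOException {
        Map<String, List<String>> writable = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : entries.entrySet()) {
            List<String> kept = new ArrayList<>();
            if (e.getValue() != null) {
                for (String v : e.getValue()) {
                    if (v != null) kept.add(v);
                }
            }
            if (!kept.isEmpty()) writable.put(e.getKey(), kept);
        }
        out.writeInt(writable.size());
        for (Map.Entry<String, List<String>> e : writable.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String v : e.getValue()) {
                out.writeUTF(v); // writeUTF(null) would throw a NullPointerException
            }
        }
    }

    // Convenience wrapper that serializes into a byte array.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(new DataOutputStream(buffer));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        NullSafeMetadataSketch meta = new NullSafeMetadataSketch();
        meta.set("title", "Merger Agreement");
        meta.set("author", (String) null); // the <some_key, null> case
        System.out.println(meta.toBytes().length + " bytes written");
    }
}
```

Filtering at write time keeps the serialized form free of nulls regardless of which parser produced them. Rejecting or skipping null values when they are first added would be an equally reasonable place for the check.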

      Attachments

        1. NUTCH-406.patch
          0.6 kB
          Dogacan Guney
        2. NUTCH-406.patch
          0.5 kB
          Dogacan Guney

          People

            chrismattmann Chris A. Mattmann
            dogacan Dogacan Guney