  1. Nutch
  2. NUTCH-936

LanguageIdentifier should not set empty lang field on NutchDocument

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3, nutchgora
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

    Description

      For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed in 1.2 when parsing a scanned PDF file that cannot be OCR'd into proper text, which results in an empty content field. Whether or not the parser is at fault, the plugin itself should not add an empty value, because the content field can always be empty. The plugin already checks for a null value and sets the lang field to `unknown` in that case, which is fine. When the lang string is empty, it should be set to `unknown` as well.
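
      A minimal sketch of the proposed check, for illustration only. It is not the plugin's actual code: the class name `LangFieldSketch` and the method `normalizeLang` are hypothetical, and the sketch assumes the detected language arrives as a plain `String`. It treats only a null or zero-length string as missing; whitespace-only values are left untouched.

```java
// Hypothetical helper, not taken from the Nutch source tree.
final class LangFieldSketch {

    // Value the plugin already uses when no language could be detected.
    static final String UNKNOWN = "unknown";

    // Returns the value to store in the lang field: the detected code,
    // or "unknown" when the detector returned null or an empty string.
    static String normalizeLang(String detected) {
        if (detected == null || detected.isEmpty()) {
            return UNKNOWN;
        }
        return detected;
    }
}
```

      With this check, the empty lang string produced for a document with an empty content field would be stored as `unknown`, the same as the existing null case, while a detected code such as `nl` passes through unchanged.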

      This might break clients that have conditional logic on the empty value but not on the `unknown` value: `unknown` may never have occurred in their setup, so they might not have added it to their logic. However, it seems a bit overkill to put this change behind a configuration option and have Nutch keep its current behavior by default. Any thoughts on this one?

      Here's the troublesome URL: http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf. It returns an empty content field and an empty lang string in 1.2, and presumably in trunk and other versions as well.

      Attachments

        1. NUTCH-936-v12-1.patch
          0.4 kB
          Markus Jelsma
        2. NUTCH-936-v13-1.patch
          0.4 kB
          Markus Jelsma
        3. NUTCH-936-v13-1.patch
          0.4 kB
          Markus Jelsma

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)
            Votes: 0
            Watchers: 0
