  1. Nutch
  2. NUTCH-1467

Nutch 1.5.1 not able to parse multiValued metatags

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.9
    • Component/s: None
    • Labels: None

      Description

      Hi,

      I have been able to parse metatags in an HTML page by following http://wiki.apache.org/nutch/IndexMetatags. However, it does not work correctly when a page contains two metatags with the same name but different content.

      Has anyone else encountered this issue?

      Do any changes need to be made to the config files to make it work?

      When there are two tags with the same name and different content, Nutch keeps only the value of the last tag instead of creating a multi-valued field.
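The behaviour described above can be illustrated with a minimal, standalone sketch. This is not Nutch code: the class `MultiValuedMetaTags`, its two methods, and the sample tag data are hypothetical names chosen only for illustration. It assumes the metatags have already been extracted as (name, content) pairs, and contrasts a collector that overwrites on duplicate names with one that accumulates every value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of the Nutch code base.
class MultiValuedMetaTags {

    /** Overwriting collector: a later tag with the same name replaces the earlier value. */
    static Map<String, String> collectSingle(String[][] tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] tag : tags) {
            // put() silently discards any value stored earlier under the same name
            out.put(tag[0].toLowerCase(Locale.ROOT), tag[1]);
        }
        return out;
    }

    /** Accumulating collector: every content value is kept, in document order. */
    static Map<String, List<String>> collectMulti(String[][] tags) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] tag : tags) {
            // create the list on first sight of a name, then append to it
            out.computeIfAbsent(tag[0].toLowerCase(Locale.ROOT), k -> new ArrayList<>())
               .add(tag[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample input: two "keywords" tags with different content
        String[][] tags = {
            {"keywords", "nutch"},
            {"keywords", "crawler"},
            {"description", "example page"}
        };
        System.out.println(collectSingle(tags));
        System.out.println(collectMulti(tags));
    }
}
```

With the sample input, the first collector retains only `crawler` for `keywords`, which mirrors the reported symptom; the second retains both `nutch` and `crawler`, which is what a multi-valued field would need.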

      Edit: I have attached a patch. It is provided by DLA (Digital Library and Archives, http://scholar.lib.vt.edu/) of Virginia Tech.

      Many Thanks,

        Attachments

        1. NUTCH-1467-TEST-1.patch
          5 kB
          Sebastian Nagel
        2. NUTCH-1467-trunk_v1.patch
          9 kB
          kiran
        3. NUTCH-1467-trunk_v2.patch
          5 kB
          kiran
        4. NUTCH-1467-trunk.patch
          6 kB
          Lewis John McGibbney
        5. NUTCH-1467-trunk-v3.patch
          11 kB
          kiran
        6. Patch_HTMLMetaProcessor.patch
          2 kB
          kiran
        7. Patch_HTMLMetaTags.patch
          1 kB
          kiran
        8. Patch_MetadataIndexer.patch
          1.0 kB
          kiran
        9. Patch_MetaTagsParser.patch
          3 kB
          kiran
        10. patch.txt
          1 kB
          kiran


              People

              • Assignee: Unassigned
              • Reporter: kiranch kiran
              • Votes: 0
              • Watchers: 6
