Nutch / NUTCH-1467

Nutch 1.5.1 not able to parse multiValued metatags

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.9
    • Component/s: None
    • Labels: None

    Description

      Hi,

      I have been able to parse metatags in an HTML page by following http://wiki.apache.org/nutch/IndexMetatags. However, it does not work correctly when a page contains two metatags with the same name but different contents.

      Has anyone else encountered this issue?

      Are there any changes that need to be made to the config files to make it work?

      When there are two tags with the same name and different content, Nutch keeps only the value of the later tag instead of creating a multiValued field.
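
      To illustrate the behaviour described above, here is a minimal sketch in plain Java. It uses only java.util collections, not Nutch's own classes, and the class and method names are hypothetical; it is not taken from the attached patches. Assume the page contains two tags such as <meta name="keywords" content="nutch"> and <meta name="keywords" content="crawler">.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: these are NOT Nutch classes or APIs.
public class MultiValuedMetaTagsSketch {

    /** Last-wins collection: a later tag with the same name overwrites the earlier one. */
    public static Map<String, String> collectSingleValued(List<String[]> tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] tag : tags) {
            // tag[0] is the meta name, tag[1] is its content
            out.put(tag[0].toLowerCase(Locale.ROOT), tag[1]);
        }
        return out;
    }

    /** Multi-valued collection: every content value for a given name is kept, in order. */
    public static Map<String, List<String>> collectMultiValued(List<String[]> tags) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] tag : tags) {
            out.computeIfAbsent(tag[0].toLowerCase(Locale.ROOT), k -> new ArrayList<>())
               .add(tag[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two metatags with the same name but different contents
        List<String[]> tags = Arrays.asList(
                new String[] {"keywords", "nutch"},
                new String[] {"keywords", "crawler"});

        System.out.println(collectSingleValued(tags)); // {keywords=crawler}
        System.out.println(collectMultiValued(tags));  // {keywords=[nutch, crawler]}
    }
}
```

      The first method mirrors what I am seeing (only the later value survives); the second is the outcome I would expect for a multiValued field.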

      Edit: I have attached a patch for this. It was provided by DLA (Digital Library and Archives, http://scholar.lib.vt.edu/) of Virginia Tech.

      Many Thanks,

      Attachments

        1. patch.txt (1 kB, Kiran)
        2. Patch_MetaTagsParser.patch (3 kB, Kiran)
        3. Patch_MetadataIndexer.patch (1.0 kB, Kiran)
        4. Patch_HTMLMetaTags.patch (1 kB, Kiran)
        5. Patch_HTMLMetaProcessor.patch (2 kB, Kiran)
        6. NUTCH-1467-trunk-v3.patch (11 kB, Kiran)
        7. NUTCH-1467-trunk.patch (6 kB, Lewis John McGibbney)
        8. NUTCH-1467-trunk_v2.patch (5 kB, Kiran)
        9. NUTCH-1467-trunk_v1.patch (9 kB, Kiran)
        10. NUTCH-1467-TEST-1.patch (5 kB, Sebastian Nagel)

            People

              Assignee: Unassigned
              Reporter: Kiran (kiranch)
              Votes: 0
              Watchers: 6
