Nutch / NUTCH-2567

parse-metatags writes every meta tag twice

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:

          <property>
              <name>plugin.includes</name>
              <value>protocol-http|parse-(tika|metatags)</value>
          </property>
      

      The problem seems to come from MetaTagsParser.java#L104-L111:

      Both the meta tags from the existing ParseResult and those from the HTMLMetaTags are added to the metadata with a "metatag." prefix. However, the ParseResult object already contains the HTML meta tags, because TikaParser added them earlier: TikaParser.java#L198-L206
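
To illustrate the mechanism, here is a minimal, self-contained sketch. It does not use Nutch's own classes: the map of lists is only a stand-in for a multi-valued metadata store, and `MetatagDedupSketch`, `addAlways`, and `addIfAbsent` are hypothetical names invented for this example. It shows how the same tag arriving from two sources ends up stored twice, and one possible guard that skips values already present. Whether such a guard is the right fix for MetaTagsParser is an open question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a multi-valued metadata store; NOT Nutch's Metadata class. */
final class MetatagDedupSketch {

    static final String PREFIX = "metatag.";

    /** Naive add: appends unconditionally, so a repeated value is stored twice. */
    static void addAlways(Map<String, List<String>> metadata, String name, String value) {
        metadata.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
    }

    /** Guarded add: appends only if this exact value is not already stored under the key. */
    static void addIfAbsent(Map<String, List<String>> metadata, String name, String value) {
        List<String> values = metadata.computeIfAbsent(PREFIX + name, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    public static void main(String[] args) {
        // The same tag arrives from two sources: once from the metadata produced
        // by the upstream parser, and once from the HTML meta tags object.
        Map<String, List<String>> naive = new LinkedHashMap<>();
        addAlways(naive, "description", "A sample page");
        addAlways(naive, "description", "A sample page");
        System.out.println(naive);   // the value appears twice under metatag.description

        Map<String, List<String>> guarded = new LinkedHashMap<>();
        addIfAbsent(guarded, "description", "A sample page");
        addIfAbsent(guarded, "description", "A sample page");
        System.out.println(guarded); // the value appears once
    }
}
```

Note that a value-based guard like this would also collapse a page that legitimately declares the same meta tag value twice, so skipping one of the two sources entirely may be a cleaner approach.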

       
      This bug is concerning because it makes the segments needlessly large, especially if we want to store all meta tags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).

      I would also suggest making the output of Metadata::toString more readable (for instance, by adding a newline before each new metadata value). That would have made this bug much easier to spot in the parsechecker output.
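
As a rough idea of what a more readable output could look like, the sketch below prints one name=value pair per line. It is only an illustration: `ReadableMetadataSketch` and `format` are hypothetical names, the map of lists again stands in for the real metadata container, and the actual Metadata::toString may need a different layout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical formatter for a multi-valued metadata store; NOT Nutch's Metadata class. */
final class ReadableMetadataSketch {

    /** Emits each value on its own line as "name=value", so repeated values stand out. */
    static String format(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            for (String value : entry.getValue()) {
                sb.append(entry.getKey()).append('=').append(value).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        // A duplicated value, as produced by the bug described above.
        metadata.put("metatag.description", Arrays.asList("A sample page", "A sample page"));
        metadata.put("metatag.keywords", Arrays.asList("nutch, crawler"));
        System.out.print(format(metadata));
    }
}
```

With one value per line, a duplicated metatag.description shows up as two identical consecutive lines instead of being buried in a single long string.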

            People

            • Assignee: Unassigned
            • Reporter: gbouchar Gerard Bouchar