Nutch / NUTCH-62

Add html META tag information into metaData in index-more plugin

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: indexer
    • Labels:
      None

      Description

      Currently (version dev-0.7), only some metadata from the HTTP response, such as type, date and content-length, is available in the index-more plugin. We cannot index or store the metadata in the HTML header (the <META> tags, to be exact).

        Activity

        Julien Nioche added a comment -

        This can be done in a more flexible way using index-metadata
        https://issues.apache.org/jira/browse/NUTCH-1264

        Lewis John McGibbney added a comment -

        There are various comments above which create slight confusion about what to do to resolve this issue... or in fact what exactly the issue is that needs to be resolved!

        Is there a requirement to rework the HTMLMetaProcessor class to incorporate the suggestions above, e.g. "consistent schema in both cases..."?

        Protocol.metadata aside, what we are essentially talking about is picking up all ParseData.metadata included within meta tags, which I assume we would wish to index at a later stage. Focussing on the HTMLMetaProcessor class, we already acquire the name, http-equiv and content attributes from meta tags. Would an improvement be to configure the class to pick up other attributes not already mentioned?
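
To make the attribute handling discussed in this comment concrete, here is a minimal sketch of collecting the name, http-equiv and content attributes of meta elements from a DOM tree. It is not the HTMLMetaProcessor code: the class MetaTagCollector and its fromXhtml helper are hypothetical names, and the helper assumes well-formed XHTML input so that the JDK's XML parser can stand in for a real HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Nutch.
class MetaTagCollector {
    // <meta name="..." content="..."> pairs, keyed by lower-cased name.
    final Map<String, String> general = new LinkedHashMap<>();
    // <meta http-equiv="..." content="..."> pairs, keyed by lower-cased http-equiv.
    final Map<String, String> httpEquiv = new LinkedHashMap<>();

    // Recursively visits the node and its descendants, recording meta elements.
    void collect(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            String content = meta.getAttribute("content");
            if (meta.hasAttribute("name")) {
                general.put(meta.getAttribute("name").toLowerCase(Locale.ROOT), content);
            } else if (meta.hasAttribute("http-equiv")) {
                httpEquiv.put(meta.getAttribute("http-equiv").toLowerCase(Locale.ROOT), content);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i));
        }
    }

    // Assumption: the input is well-formed XHTML with lower-case attribute names.
    static MetaTagCollector fromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            MetaTagCollector collector = new MetaTagCollector();
            collector.collect(doc.getDocumentElement());
            return collector;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }
}
```

Picking up further attributes would mean adding more branches in collect; whether that is worthwhile is exactly the open question in the comment above.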

        Andrzej Bialecki added a comment -

        The latest SVN version already contains similar code (see parse-html/..../HTMLMetaProcessor.java). The only thing that is missing is to put the content meta tags into ParseData.metadata.

        As you know, we actually have two places to put metadata into: one is Protocol.metadata, where all protocol-related metadata should be stored, and the other is ParseData.metadata, where parse-related metadata should be stored, which is the case here.

        However... potentially this may overwrite other properties coming from protocol handlers, or discovered by other plugins or other parts of the code. The "lang" tag is one such example; "content-encoding" and "charset" are others. The language identifier plugin works around this by using an "X-meta-lang" property name. (BTW: it could be cleaned up to avoid traversing the node tree once again, and instead make use of the already discovered meta tags, which are now passed as an argument to HtmlParseFilters.)

        I suggest reworking this to use a consistent schema in both cases (i.e. Content.metadata and ParseData.metadata): let's put them under an "X-nutch-<name>" prefix (where <name> is e.g. the value of the key retrieved from HtmlMetaTags.getGeneralTags()), or an "X-nutch-http-equiv-<name>" prefix (where <name> is the value of the key retrieved from HtmlMetaTags.getHttpEquivTags()), and so on. So, this would be e.g. "X-nutch-robots", "X-nutch-base", "X-nutch-http-equiv-pragma", "X-nutch-http-equiv-refresh".

        This way we can store all <meta> information, without any danger of overwriting the original values.
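
The naming schema proposed here could be sketched roughly as follows. This is an illustration only: plain java.util.Map instances stand in for the Nutch metadata containers, and the class MetaKeySchema and its merge method are hypothetical names, not existing Nutch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch.
class MetaKeySchema {
    static final String GENERAL_PREFIX = "X-nutch-";
    static final String HTTP_EQUIV_PREFIX = "X-nutch-http-equiv-";

    // Returns a copy of the existing metadata with the meta tags added under
    // prefixed keys, so that unprefixed keys such as "content-encoding" keep
    // their original values.
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> generalTags,
                                     Map<String, String> httpEquivTags) {
        Map<String, String> merged = new LinkedHashMap<>(existing);
        generalTags.forEach((name, value) -> merged.put(GENERAL_PREFIX + name, value));
        httpEquivTags.forEach((name, value) -> merged.put(HTTP_EQUIV_PREFIX + name, value));
        return merged;
    }
}
```

With this sketch, a page whose meta tags declare "content-encoding" would end up with both the original "content-encoding" entry and a separate "X-nutch-content-encoding" entry. One caveat of the simple string prefix: a general tag literally named "http-equiv-refresh" would map to the same key as the http-equiv tag "refresh".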

        Jack Tang added a comment -

        The attachment contains MetaDataParser and a config file. It looks up HTML META tags and stores the name-value pairs in the metaData map, so that you can then index the information in the index-more plugin.


          People

          • Assignee:
            Unassigned
            Reporter:
            Jack Tang
          • Votes:
            2
            Watchers:
            3
