Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4, nutchgora
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Parse-metatags plugin

      The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter. The names are separated by ';', and the default value is '*'.

      To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:

      <property>
        <name>metatags.names</name>
        <value>description;keywords</value>
      </property>
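
As an illustration only, the ';'-separated name list could be interpreted along these lines. This is a minimal Python sketch, not the plugin's Java code: the function names `parse_metatag_names` and `select_metatags`, the case-insensitive matching, and the reading of '*' as "select every metatag" are assumptions made for the example.

```python
def parse_metatag_names(value="*"):
    """Split a ';'-separated metatags.names value into a set of lowercase names."""
    return {name.strip().lower() for name in value.split(";") if name.strip()}


def select_metatags(page_metatags, names_value="*"):
    """Return the metatags of a page selected by the configured names.

    Assumption: '*' selects every metatag, and names match case-insensitively.
    """
    wanted = parse_metatag_names(names_value)
    if "*" in wanted:
        return dict(page_metatags)
    return {k: v for k, v in page_metatags.items() if k.lower() in wanted}


# Hypothetical metatags found on a page.
page = {
    "description": "A web crawler",
    "keywords": "nutch, crawler",
    "robots": "index,follow",
}

# Same value as in the nutch-site.xml property above.
selected = select_metatags(page, "description;keywords")
```

With the value `description;keywords`, only those two entries of `page` are kept; with the default `*`, the sketch keeps all of them.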
      

      The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
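
A rough sketch of how parsed metatag values might be turned into these two index fields follows. It is written in Python rather than the plugin's Java, and the function name `build_index_fields` and the use of a comma as the keyword separator are assumptions for illustration, not confirmed MetatagIndexer behaviour.

```python
def build_index_fields(parsed_metatags):
    """Build index fields from parsed metatag values.

    'description' becomes a single-valued field; 'keywords' becomes a
    multivalued field (a list). Splitting keywords on ',' is an assumption.
    """
    fields = {}

    description = parsed_metatags.get("description", "").strip()
    if description:
        fields["description"] = description

    # Split the raw keywords string and drop empty entries.
    raw_keywords = parsed_metatags.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    if keywords:
        fields["keywords"] = keywords

    return fields


# Hypothetical parse output for one page.
fields = build_index_fields(
    {"description": " A web crawler ", "keywords": "nutch, crawler, , search"}
)
```

Representing 'keywords' as a list mirrors the idea of a multivalued field: each keyword is stored as its own value instead of one concatenated string.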

      The query-basic plugin is used to include these fields in the search, e.g. in nutch-site.xml:

      <property>
        <name>query.basic.description.boost</name>
        <value>2.0</value>
      </property>
      
      <property>
        <name>query.basic.keywords.boost</name>
        <value>2.0</value>
      </property>
      

      This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.

      1. NUTCH-809-trunk.patch
        15 kB
        Julien Nioche
      2. metatags-plugin+tutorial.zip
        29 kB
        Elisabeth Adler
      3. NUTCH-809_metatags_1.3.patch
        14 kB
        Elisabeth Adler
      4. NUTCH-809.patch
        20 kB
        Julien Nioche


          Activity

          Julien Nioche created issue -
          Julien Nioche made changes -
          Field Original Value New Value
          Attachment NUTCH-809.patch [ 12440618 ]
          Julien Nioche made changes -
          Attachment NUTCH-809.patch [ 12440618 ]
          Julien Nioche made changes -
          Attachment NUTCH-809.patch [ 12440620 ]
          Julien Nioche made changes -
          Description
          Original Value:
          h2. Parse-metatags plugin

          *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

          To use the legacy HTML parser, specify in parse-plugins.xml:

          {code:xml}
          <mimeType name="text/html">
            <plugin id="parse-html" />
          </mimeType>
          {code}

          The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter. The names are separated by ';', and the default value is '*'.

          To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:

          {code:xml}
          <property>
            <name>metatags.names</name>
            <value>description;keywords</value>
          </property>
          {code}

          The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
          The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

          This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.

          New Value:
          h2. Parse-metatags plugin

          The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter. The names are separated by ';', and the default value is '*'.

          To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:

          {code:xml}
          <property>
            <name>metatags.names</name>
            <value>description;keywords</value>
          </property>
          {code}

          The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
          The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

          This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.

          Julien Nioche made changes -
          Description: the MetaTagsQueryFilter sentence was replaced by the query-basic boost configuration; the new value is the text shown under Description above.

          Julien Nioche made changes -
          Link This issue is related to NUTCH-422 [ NUTCH-422 ]
          Julien Nioche made changes -
          Link This issue is related to NUTCH-1005 [ NUTCH-1005 ]
          Elisabeth Adler made changes -
          Attachment NUTCH-809_metatags_1.3.patch [ 12497116 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.4 [ 12316519 ]
          Fix Version/s nutchgora [ 12314893 ]
          Affects Version/s 1.4 [ 12316519 ]
          Affects Version/s nutchgora [ 12314893 ]
          Lewis John McGibbney made changes -
          Link This issue relates to NUTCH-422 [ NUTCH-422 ]
          Lewis John McGibbney made changes -
          Link This issue relates to NUTCH-1005 [ NUTCH-1005 ]
          Chris A. Mattmann made changes -
          Fix Version/s 1.5 [ 12318246 ]
          Fix Version/s nutchgora [ 12314893 ]
          Fix Version/s 1.4 [ 12316519 ]
          Elisabeth Adler made changes -
          Attachment metatags-plugin+tutorial.zip [ 12510323 ]
          Julien Nioche made changes -
          Attachment NUTCH-809-trunk.patch [ 12519226 ]
          Markus Jelsma made changes -
          Fix Version/s 1.6 [ 12319941 ]
          Fix Version/s 1.5 [ 12318246 ]
          Julien Nioche made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 1.5 [ 12318246 ]
          Fix Version/s 1.6 [ 12319941 ]
          Resolution Fixed [ 1 ]
          Julien Nioche made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Kristof made changes -
          Link This issue relates NUTCH-1406 [ NUTCH-1406 ]
          Gavin made changes -
          Link This issue relates to NUTCH-1406 [ NUTCH-1406 ]

            People

            • Assignee:
              Julien Nioche
              Reporter:
              Julien Nioche
            • Votes:
              2
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:
