NUTCH-1478

Parse-metatags and index-metadata plugin for Nutch 2.x series

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.3
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).

      The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).

      This is only the first version and does not include the JUnit test. I will upload a new version soon.

      This will parse the tags and index them in Solr. Make sure the fields you list in 'index.parse.md' in nutch-site.xml are also created in Solr's schema.xml.
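
      For illustration only, a rough sketch of the relevant nutch-site.xml entries, assembled from the property names quoted later in this thread (plugin.includes is the usual Nutch switch for enabling plugins; whether the index.parse.md keys carry a 'metatag.' prefix, and which separator they use, varies between the patch versions discussed below):

          <!-- illustrative nutch-site.xml snippet; adjust to the patch version in use -->
          <property>
              <name>plugin.includes</name>
              <!-- append parse-metatags and index-metadata to your existing plugin list -->
              <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
          </property>
          <property>
              <name>metatags.names</name>
              <!-- metatags to extract, separated by ';'; use '*' to extract all of them -->
              <value>description;keywords</value>
          </property>
          <property>
              <name>index.parse.md</name>
              <!-- parse metadata keys to turn into index fields -->
              <value>description,keywords</value>
          </property>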

      Please let me know if you have any suggestions.

      This is supported by DLA (Digital Library and Archives) of Virginia Tech.

      1. Nutch1478.patch
        8 kB
        kiran
      2. Nutch1478.zip
        13 kB
        kiran
      3. metadata_parseChecker_sites.png
        280 kB
        kiran
      4. NUTCH-1478-parse-v2.patch
        17 kB
        Tien Nguyen Manh
      5. NUTCH-1478v3.patch
        30 kB
        Lewis John McGibbney
      6. NUTCH-1478v4.patch
        29 kB
        Yasin Kılınç
      7. NUTCH-1478v5.patch
        36 kB
        Talat UYARER
      8. NUTCH-1478v5.1.patch
        6 kB
        Vangelis Karvounis
      9. NUTCH-1478v6.patch
        34 kB
        Talat UYARER

        Activity

        kiran added a comment -

        Unzip the zip file into src/plugin in the Nutch 2.x source and apply the patch. This worked for me. Please let me know if you have any issues.

        J. Gobel added a comment - edited

        Hi Kiran,

        I applied the patch, but when I check the MySQL field I still see 'garbage': csh�����
        Also, the link provided to your XML file is no longer working.

        kiran added a comment -

        Hi Gobel,

        I have updated the broken link (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).

        Can you please run './bin/nutch parsechecker http://www.google.com' and check whether you are able to see metadata in the output?

        Did you add field names or '*' in the metatags.names property in nutch-site.xml?

        Thank you,
        Kiran.

        J. Gobel added a comment - edited

        Hi Kiran,

        I unpacked the zip file in my plugin folder. Then I downloaded the patch file into my src/plugin folder with wget and applied it using 'patch -p0 < Nutch1478.patch'.

        I used your XML file, changed a few things, and rebuilt the runtime with ant. For example, I use MySQL and changed the path to my plugins folder.

        I checked with parsechecker and this is the result:
        :~/nutch2/nutch/runtime/local# bin/nutch parsechecker http://www.google.nl
        ---------
        Url
        ---------------
        http://www.google.nl
        ---------
        Metadata
        ---------

        I emptied my SQL database to start from scratch and did a crawl, and what I see in the Metadata field is still 'garbage'. I have my Nutch 2.1 configured according to http://nlp.solutions.asia/?p=180

        Perhaps you can share your schema.xml file as well? Maybe I am doing something wrong in there?

        Thanks in advance,

        Jaap

        kiran added a comment -

        This is a screenshot of how my parsechecker is working after I configured Nutch 2.x with the plugins.

        kiran added a comment - edited

        Hi Jaap,

        I ran the same command as you did, and it looks like there are no metatags on that page. Please check the attached screenshot of the different websites I parsed and the metadata found for each.

        Once parsechecker is working, we should make sure indexing works too. For that, we need to define the fields we want indexed in the index.parse.md property in nutch-site.xml. There is a difference between 1.x and 2.x in the way this property should be defined.

        When I was working with this plugin, I was able to define the metatag fields as-is (without the 'metatag.' prefix used in 1.x), and the same way in the schema, and it worked for me. This is my schema (https://github.com/salvager/apache-solr-4.0.0-BETA/blob/master/example/solr/ejournals/conf/schema.xml).

        The dc fields that I have defined are particular to the website I am crawling; they might not be present on all websites.
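
        For illustration, the matching Solr schema.xml entries might look roughly like the sketch below; the field names have to match the keys listed in index.parse.md (the 'metatag.'-prefixed form appears in the nutch-default.xml excerpt further down this thread):

            <!-- illustrative Solr schema.xml fields; names must match the index.parse.md keys -->
            <field name="description" type="string" stored="true" indexed="true"/>
            <field name="keywords" type="string" stored="true" indexed="true"/>
            <!-- with the 'metatag.' prefix convention instead: -->
            <!-- <field name="metatag.description" type="string" stored="true" indexed="true"/> -->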

        I hope this helps.

        J. Gobel added a comment -

        Hi Kiran,

        Thanks for replying and uploading your schema.xml. It is always good to have some sort of reference material.

        I checked other sites, and indeed it works. I was just so into fixing it that I totally forgot that some sites just don't use metadata any longer.

        However, I do notice that it doesn't seem to fetch titles. nutch.apache.org does have a title in its metadata; I checked the source of the page.

        Rgds,
        Jaap

        kiran added a comment -

        I think this is a problem with parsechecker in 2.x. Only the fields from metatags are getting displayed while the other fields are not printed even though they are parsed and indexed.

        For me, those fields are parsed and indexed in Solr; I can see the results, but parsechecker does not display them exactly. A new issue needs to be created for that. This plugin only deals with parsing metatags and indexing them.

        Regards,
        Kiran.

        J. Gobel added a comment - edited

        Hi Kiran,

        I have spent some time checking and monitoring the updates in my MySQL Metadata field, and something odd is happening.
        Just before the crawl finishes, the metadata field is updated with correct information; I can see it being filled with robots index, follow, description, etc. But as soon as the crawl has finished, the metadata field is updated to: csh�����

        I copy-pasted my log below (just the last lines). I am aware that there are still some issues with MySQL as a backend for Nutch 2.x.

        P.S. I use: bin/nutch crawl urls -depth 1 -topN 5 ..

        2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
        2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
        2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots index, follow
        2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
        2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : description Registreer nu uw .com.nl of .net.nl extentie.
        2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
        2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null in cleanup
        2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
        2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
        2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
        2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
        2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
        2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null in cleanup

        J. Gobel added a comment -

        Where can I close my comments? It works as designed.

        Roland von Herget added a comment -

        +1
        works fine for me.
        Thank you kiran

        kiran added a comment -

        Thanks Roland for testing.

        I will try to update this patch based on my update in 1.x by using the Metadata data structure, and also add the test.

        Nick added a comment -

        This plugin works great if the page has the metatags mentioned in the index.content.md but breaks if they are missing. How do I go about making the fields optional?

        <property>
        <name>index.content.md</name>
        <value>description,keywords,author</value>
        </property>

        bin/nutch indexchecker http://localhost/stories/
        fetching: http://localhost/stories/
        parsing: http://localhost/stories/
        contentType: text/html
        Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:151)

        bin/nutch indexchecker http://localhost/stories/cant-be-satisfied
        [1] 5726
        fetching: http://localhost/stories/cant-be-satisfied
        parsing: http://localhost/stories/cant-be-satisfied
        contentType: text/html
        content : Can't be Satisfied
        author : Robert Gordon
        title : Can't be Satisfied
        keywords : blues, music, muddy water
        host : localhost
        description : Life and Times of Muddy Waters
        tstamp : 2013-10-19T01:34:41.440Z
        url : http://localhost/stories/cant-be-satisfied

        kiran added a comment -

        This plugin is not up to date with the patch at NUTCH-1467; it should be updated, a test case added, and the above issue fixed if it still exists. I will work on it soon.

        Tien Nguyen Manh added a comment -

        I ported parse-metatags to 2.x; this patch supports multi-value metatags.

        Lewis John McGibbney added a comment - edited

        The previous patch did not compile.
        This patch adds the index-metadata plugin as per the original patch and adds correct formatting. Finally, in addition to the existing patch, I've added a small improvement which checks that the metatags string array has more than one value before adding \t.
        If you apply the patch you will see the test failing for TestMetatagsParser... this needs to be fixed, but I won't be able to do it right now.
        kiran, do you fancy having a look at this if you get time?

        Yasin Kılınç added a comment -

        I reviewed this patch and fixed some bugs. +1 for commit.

        Yasin Kılınç added a comment -

        I added a new patch. It passes all test cases.

        Anton added a comment - edited

        I tried NUTCH-1478v4.patch.

        When I configure index.content.md = description or index.content.md = metatag.description in nutch-default.xml:

            <property>
                <name>index.parse.md</name>
                <value>metatag.description</value>
                <description>
                    Comma-separated list of keys to be taken from the parse metadata to generate fields.
                    Can be used e.g. for 'description' or 'keywords' provided that these values are generated
                    by a parser (see parse-metatags plugin)
                </description>
            </property>
        
            <property>
                <name>index.content.md</name>
                <value>description</value>
                <description>
                    Comma-separated list of keys to be taken from the content metadata to generate fields.
                </description>
            </property>
        
            <property>
                <name>index.db.md</name>
                <value></value>
                <description>
                    Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
                    Can be used to index values propagated from the seeds with the plugin urlmeta
                </description>
            </property>
        
            <!-- parse-metatags plugin properties -->
            <property>
                <name>metatags.names</name>
                <value>description</value>
                <description> Names of the metatags to extract, separated by;.
                    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
                    in the parse-metadata. For instance to index description and keywords,
                    you need to activate the plugin index-metadata and set the value of the
                    parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
                </description>
            </property>
        

        I got an NPE:

        14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
        java.lang.Exception: java.lang.NullPointerException
        	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
        Caused by: java.lang.NullPointerException
        	at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
        	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
        	at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
        	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:103)
        	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
        	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        	at java.lang.Thread.run(Thread.java:744)
        
        Lewis John McGibbney added a comment -

        Hi Anton, thank you for testing this patch. However, it is difficult for us to reproduce if we do not know when you encountered the NPE. Can you elaborate on how we can reproduce it? Thank you.

        Anton added a comment -

        Steps to reproduce:
        1) Add fields for metatags
        <field name="metatag.description" type="string" stored="true" indexed="true"/>
        in schema.xml, in both Solr and Nutch
        2) restart solr
        3) configure nutch-default.xml as in my comment above
        4) setup urls/seed.txt in nutch
        5) ant clean && ant runtime
        6) run crawl command

        I use solr-4.6.0 and apache-nutch-2.2.1.

        When I run a full crawl with the following command:

        /home/hadoop/webcrawer/apache-nutch-2.2.1/runtime/deploy/bin/crawl urls/seed.txt az http://localhost:8088/solr/ 1

        The metadata is successfully parsed and stored in the database; the problem occurs in SolrIndexerJob:

        14/02/04 13:00:46 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
        14/02/04 13:00:46 INFO plugin.PluginRepository: Plugins: looking in: /home/hadoop/data/hadoop-unjar8289682370547831088/classes/plugins
        14/02/04 13:00:46 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
        14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Plugins:
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	MetaTags (parse-metatags)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Index Metadata (index-metadata)
        14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Extension-Points:
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
        14/02/04 13:00:46 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
        14/02/04 13:00:46 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
        14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
        14/02/04 13:00:46 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
        14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
        14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.metadata.MetadataIndexer
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:host.name=ascompany.info
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_45
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-7-oracle/jre
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/hadoop/hadoop-1.2.1/libexec/../conf:/usr/lib/jvm/java-7-oracle/lib/tools.jar:/home/hadoop/hadoop-1.2.1/libexec/..:/home/hadoop/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/hadoop/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-4-amd64
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d1, negotiated timeout = 180000
        14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d2, negotiated timeout = 180000
        14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d3, negotiated timeout = 180000
        14/02/04 13:00:47 INFO mapred.JobClient: Running job: job_local1932930342_0001
        14/02/04 13:00:47 INFO mapred.LocalJobRunner: Waiting for map tasks
        14/02/04 13:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1932930342_0001_m_000000_0
        14/02/04 13:00:47 INFO util.ProcessTree: setsid exited with exit code 0
        14/02/04 13:00:47 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2ba04d20
        14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d4, negotiated timeout = 180000
        14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d5, negotiated timeout = 180000
        14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d6, negotiated timeout = 180000
        14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:47 INFO mapred.MapTask: Processing split: org.apache.gora.mapreduce.GoraInputSplit@3a37f44d
        14/02/04 13:00:47 INFO mapreduce.GoraRecordReader: gora.buffer.read.limit = 10000
        14/02/04 13:00:47 INFO solr.SolrIndexerJob: Authenticating as: solr-user
        14/02/04 13:00:47 INFO conf.Configuration: found resource solrindex-mapping.xml at file:/home/hadoop/data/hadoop-unjar8289682370547831088/solrindex-mapping.xml
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: content dest: content
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: title dest: title
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: host dest: host
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: batchId dest: batchId
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: boost dest: boost
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: digest dest: digest
        14/02/04 13:00:47 INFO solr.SolrMappingReader: source: tstamp dest: tstamp
        14/02/04 13:00:47 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
        14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
        14/02/04 13:00:47 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
        14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
        14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.metadata.MetadataIndexer
        14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d7, negotiated timeout = 180000
        14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same.
        14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
        14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d8, negotiated timeout = 180000
        14/02/04 13:00:47 INFO mapred.LocalJobRunner: Map task executor complete.
        14/02/04 13:00:47 WARN mapred.FileOutputCommitter: Output path is null in cleanup
        14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
        java.lang.Exception: java.lang.NullPointerException
        	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
        Caused by: java.lang.NullPointerException
        	at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
        	at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
        	at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
        	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:103)
        	at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
        	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        	at java.lang.Thread.run(Thread.java:744)
        14/02/04 13:00:48 INFO mapred.JobClient:  map 0% reduce 0%
        14/02/04 13:00:48 INFO mapred.JobClient: Job complete: job_local1932930342_0001
        14/02/04 13:00:48 INFO mapred.JobClient: Counters: 0
        14/02/04 13:00:48 ERROR solr.SolrIndexerJob: SolrIndexerJob: java.lang.RuntimeException: job failed: name=[az]solr-index, jobid=job_local1932930342_0001
        	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        	at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
        	at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54)
        	at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        	at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.lang.reflect.Method.invoke(Method.java:606)
        	at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
        
        Show
        Anton added a comment - Steps to reproduce: 1) Add fields for metatags <field name="metatag.description" type="string" stored="true" indexed="true"/> in schema.xml both in solr and nutch 2) restart solr 3) configure nutch-default.xml as in my comment above 4) setup urls/seed.txt in nutch 5) ant clean && ant runtime 6) run crawl command I use solr-4.6.0 apache-nutch-2.2.1 When I run full crawl with such command /home/hadoop/webcrawer/apache-nutch-2.2.1/runtime/deploy/bin/crawl urls/seed.txt az http://localhost:8088/solr/ 1 metadata is successfully parsed and stored in database, problem occurs in SolrIndexerJob 14/02/04 13:00:46 INFO solr.SolrIndexerJob: SolrIndexerJob: starting 14/02/04 13:00:46 INFO plugin.PluginRepository: Plugins: looking in: /home/hadoop/data/hadoop-unjar8289682370547831088/classes/plugins 14/02/04 13:00:46 INFO plugin.PluginRepository: Plugin Auto-activation mode: [ true ] 14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Plugins: 14/02/04 13:00:46 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/02/04 13:00:46 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/02/04 13:00:46 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/02/04 13:00:46 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/02/04 13:00:46 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/02/04 13:00:46 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/02/04 13:00:46 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/02/04 13:00:46 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http) 14/02/04 13:00:46 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/02/04 13:00:46 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika) 14/02/04 13:00:46 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/02/04 13:00:46 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/02/04 13:00:46 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor) 14/02/04 13:00:46 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/02/04 13:00:46 INFO plugin.PluginRepository: MetaTags (parse-metatags) 14/02/04 13:00:46 INFO plugin.PluginRepository: Index Metadata (index-metadata) 14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Extension-Points: 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/02/04 13:00:46 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/02/04 13:00:46 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/02/04 13:00:46 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100 14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 14/02/04 13:00:46 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off 14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 14/02/04 
13:00:46 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.metadata.MetadataIndexer 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:host.name=ascompany.info 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_45 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-7-oracle/jre 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/hadoop/hadoop-1.2.1/libexec/../conf:/usr/lib/jvm/java-7-oracle/lib/tools.jar:/home/hadoop/hadoop-1.2.1/libexec/..:/home/hadoop/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/home/hadoop/hadoop-1.2.1/libexec
/../lib/slf4j-api-1.4.3.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/hadoop/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/hadoop/hadoop-1.2.1/libexec/../lib/ native /Linux-amd64-64 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA> 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-4-amd64 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d1, negotiated timeout = 180000 14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 14/02/04 13:00:46 INFO util.NativeCodeLoader: Loaded the native -hadoop library 14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d2, negotiated timeout = 180000 14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d3, negotiated timeout = 180000 14/02/04 13:00:47 INFO mapred.JobClient: Running job: job_local1932930342_0001 14/02/04 13:00:47 INFO mapred.LocalJobRunner: Waiting for map tasks 14/02/04 13:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1932930342_0001_m_000000_0 14/02/04 13:00:47 INFO util.ProcessTree: setsid exited with exit code 0 14/02/04 13:00:47 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2ba04d20 14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d4, negotiated timeout = 180000 14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d5, negotiated timeout = 180000 14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d6, negotiated timeout = 180000 14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 
14/02/04 13:00:47 INFO mapred.MapTask: Processing split: org.apache.gora.mapreduce.GoraInputSplit@3a37f44d 14/02/04 13:00:47 INFO mapreduce.GoraRecordReader: gora.buffer.read.limit = 10000 14/02/04 13:00:47 INFO solr.SolrIndexerJob: Authenticating as: solr-user 14/02/04 13:00:47 INFO conf.Configuration: found resource solrindex-mapping.xml at file:/home/hadoop/data/hadoop-unjar8289682370547831088/solrindex-mapping.xml 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: content dest: content 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: title dest: title 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: host dest: host 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: batchId dest: batchId 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: boost dest: boost 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: digest dest: digest 14/02/04 13:00:47 INFO solr.SolrMappingReader: source: tstamp dest: tstamp 14/02/04 13:00:47 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100 14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 14/02/04 13:00:47 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off 14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.metadata.MetadataIndexer 14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d7, negotiated timeout = 180000 14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'az_webpage' , assuming they are the same. 14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x142ea7be01213d8, negotiated timeout = 180000 14/02/04 13:00:47 INFO mapred.LocalJobRunner: Map task executor complete. 
14/02/04 13:00:47 WARN mapred.FileOutputCommitter: Output path is null in cleanup
14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
    at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)
    at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
    at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
    at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:103)
    at org.apache.nutch.indexer.IndexerJob$IndexerMapper.map(IndexerJob.java:61)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/02/04 13:00:48 INFO mapred.JobClient: map 0% reduce 0%
14/02/04 13:00:48 INFO mapred.JobClient: Job complete: job_local1932930342_0001
14/02/04 13:00:48 INFO mapred.JobClient: Counters: 0
14/02/04 13:00:48 ERROR solr.SolrIndexerJob: SolrIndexerJob: java.lang.RuntimeException: job failed: name=[az]solr-index, jobid=job_local1932930342_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
        Lewis John McGibbney added a comment -

        Can you tell us what is happening at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:95)? I am not running with this patch right now but might get around to it later on.

        Talat UYARER added a comment -

        Hi Anton,

        Your problem is because of your schema definition. You should define meta_description. I can solve this problem.

        Anton added a comment - - edited

        Hi Lewis John McGibbney,
        Snippet of the source code with the NPE below.
        I added a comment to mark line 95 with the NPE:

            // add the fields from contentmd
            if (contentFieldnames != null) {
              for (String metatag : contentFieldnames) {
                // String[] value = parse.getData().getContentMeta().getValues(metatag);
                ByteBuffer bvalues = page.getFromMetadata(new Utf8(metatag));
                String value = new String(bvalues.array());                       //line 95 with NPE
                if (value != null)
                  doc.add("meta_" + metatag, value);
        
              }
            }
        

        Hi Talat UYARER, do you mean that I need to define another field name in schema.xml?
        I have this field definition now:

         <field name="metatag.description" type="string" stored="true" indexed="true"/>
        

        It has the same name as in the wiki http://wiki.apache.org/nutch/IndexMetatags,
        but a different field type ('string' instead of the wiki's 'text').

        PS: I will try 'meta_description', maybe it helps.
        PPS: I tried 'meta_description'. It did not help. I have the same NPE as above at line 95.
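
        For reference, the NPE at line 95 happens because page.getFromMetadata() returns null when the page has no entry for that metatag, and new String(bvalues.array()) dereferences the buffer before the null check. A minimal sketch of a guard that would avoid it (an assumption about a possible fix, not necessarily what the later patches actually do):

            // hypothetical guard, not the committed fix: skip metatags that are absent on this page
            ByteBuffer bvalues = page.getFromMetadata(new Utf8(metatag));
            if (bvalues != null) {
              String value = new String(bvalues.array());
              doc.add("meta_" + metatag, value);
            }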

        Talat UYARER added a comment -

        It is not related to the topic, but this patch has a naming problem. Why do we use the name MetatagIndexer for an index filter? Wdyt Lewis John McGibbney?

        Talat UYARER added a comment -

        Hi Anton,

        Can you share your seed list and the problematic key? I want to try it.

        Anton added a comment -

        I have the issue with this seed.txt:

        http://temel.az
        http://www.sambo.az
        http://avtosalon.az/
        
        Talat UYARER added a comment - - edited

        I fixed several mistakes in the patch. This is the final version. Anton, can you test the patch?

        Anton added a comment - - edited

        Yes, Nutch with the v5 patch works fine without errors.
        Thanks!!!

        In $SOLR_HOME/conf/schema.xml I use these field names, which differ from the current wiki suggestion http://wiki.apache.org/nutch/IndexMetatags:

             <!-- fields for metatags -->
             <field name="meta_description" type="string" stored="true" indexed="true"/>
             <field name="meta_keywords" type="string" stored="true" indexed="true"/> 
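
        To check whether the new fields actually reach the index, a simple Solr query helps; a hypothetical example (the core URL and field names are assumptions based on the configuration above, adjust them to your setup):

            curl "http://localhost:8983/solr/select?q=meta_keywords:*&fl=url,meta_description,meta_keywords&wt=json"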
        
        Vangelis Karvounis added a comment -

        Hi! I have a few questions on how to run this patch:
        1. In nutch-site.xml:
        <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-domain|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
        <description> </description>
        </property>

        2. In nutch-site.xml can you tell us how to use those 4 new properties?
        <property>
        <name>index.parse.md</name>
        <value>description,keywords</value>
        <description></description>
        </property>

        <property>
        <name>index.content.md</name>
        <value></value>
        <description> </description>
        </property>

        <property>
        <name>index.db.md</name>
        <value></value>
        <description> </description>
        </property>

        <!-- parse-metatags plugin properties -->
        <property>
        <name>description;keywords</name>
        <value>*</value>
        <description> </description>
        </property>

        3. I read somewhere that we need to input
        <field name="metatag.description" type="string" stored="true" indexed="true"/>
        in schema.xml both in Solr and Nutch. Is that correct?

        4. I want to see my chosen metatags in MySQL, as I find it more useful for my queries. Any ideas how to implement this?

        5. I want to crawl a page for <meta og:video> or <meta twitter:image>. Any ideas?

        Talat UYARER added a comment -

        Hi Vangelis Karvounis,

        • The configuration in your first question is correct.
        • For the second question, you can configure it like this. The other properties are not necessary; I missed them and will update my patch accordingly.
          <property>
          <name>index.parse.md</name>
          <value>description,keywords</value>
          <description></description>
          </property>
          
          <!-- parse-metatags plugin properties -->
          <property>
          <name>metatags.names</name>
          <value>description;keywords</value>
          <description> </description>
          </property>
          
        • For the third question, you can use an asterisk to accept every generated field:
          <field name="meta_*" type="string" stored="true" indexed="true"/>
        • Fourth question: I don't know.
        • Fifth question: I am not sure. If you share a webpage, I can test it.

        Talat
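
        One caveat, based on stock Solr schema conventions rather than anything in this patch: a wildcard name such as meta_* is normally declared as a dynamicField rather than a plain field, for example:

            <!-- hypothetical catch-all declaration in schema.xml for the generated meta_* fields -->
            <dynamicField name="meta_*" type="string" stored="true" indexed="true"/>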

        Vangelis Karvounis added a comment -

        Thanks for the answer Talat!
        Let's say we crawl the url: http://www.uefa.com/worldcup/video/videoid=2064600.html?autoplay=true.

        Its page source contains:
        <!DOCTYPE html><html lang="en"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# video: http://ogp.me/ns/video# "><title>Veloso's World Cup dream for Portugal - FIFA World Cup - Video - UEFA.com</title><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="description" content=""It will be a unique and unforgettable event," Portugal's Miguel Veloso told UEFA.com as the FIFA World Cup in Brazil nears, but he knows they have been handed a tough group." /><meta name="keywords" content="velosos,world,cup,dream,portugal,Miguel Veloso,Portugal,Ukraine,Dynamo Kyiv" /><meta name="author" content="uefa.com" /><meta property="og:type" content="video.other" /><meta property="og:title" content="The official website for European football – UEFA.com" /><meta property="og:url" content="http://www.uefa.com/worldcup/video/videoid=2064600.html" /><meta property="og:image" content="http://www.uefa.com/MultimediaFiles/Photo/competitions/General/02/06/23/87/2062387_s2.jpg " /><meta property="og:description" content=""It will be a unique and unforgettable event," Portugal's Miguel Veloso told UEFA.com as the FIFA World Cup in Brazil nears, but he knows they have been handed a tough group." /><meta property="og:site_name" content="UEFA.com" /><meta property="video:release_date" content="2014-03-04T9:00Z" /><meta property="video:tag" content="velosos" /><meta property="video:tag" content="world" /><meta property="video:tag" content="cup" /><meta property="video:tag" content="dream" /><meta property="video:tag" content="portugal" /><meta property="video:tag" content="Miguel Veloso" /><meta property="video:tag" content="Portugal" /><meta property="video:tag" content="Ukraine" /><meta property="video:tag" content="Dynamo Kyiv" /><meta name="thumb" content="/multimediafiles/photo/competitions/general/02/06/23/87/2062387_s5.jpg" /><meta name="date" content="Tuesday 4 March 2014" /><meta name="isodate" content="2014-03-04" /><meta name="phototitle" content="Veluso" /><link rel="canonical" href="http://www.uefa.com/worldcup/video/videoid=2064600.html" /><link rel="image_src" href="http://www.uefa.com/multimediafiles/photo/competitions/general/02/06/23/87/2062387_s5.jpg"> </link><meta name="viewport" content="width=device-width, initial-scale=1.0" /><script type="text/javascript">

        I am interested in extracting the info <meta property="og:image" content="http://www.uefa.com/MultimediaFiles/Photo/competitions/General/02/06/23/87/2062387_s2.jpg " /> and/or the info <meta property="video:tag" content="cup" />.

        Do you think the parser can achieve this, or do we need to implement something else?

        Thank you in advance!

        Vangelis Karvounis added a comment -

        I use Eclipse, and with some changes I have managed to implement what I asked about in my previous question. I have some problems understanding how the patching process works. If I figure it out I will upload a patch, or I will upload something else that is explanatory!

        Vangelis Karvounis added a comment -

        I have made a patch but I don't know if I have done it correctly.
        Anyway, my goal here was to pick up both the property and rel tags. I would be glad if I could be of any help!
        Vangelis

        If you want to patch this version, you need to alter plugin/parse-metatags/MetaTagsParser.java from the latest v5 patch as follows:

        Add the following code just before 'return parse' inside the method ParseFilter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc):

        Properties property = metaTags.getPropertyTags();
        Enumeration<?> properNames = property.propertyNames();

        while (properNames.hasMoreElements()) {
          String name1 = (String) properNames.nextElement();
          String value1 = property.getProperty(name1);

          if (metatagset.contains("*") || metatagset.contains(name1.toLowerCase())) {
            LOG.debug("Found meta tag : " + name1 + "\t" + value1);
            page.putToMetadata(new Utf8(PARSE_META_PREFIX + name1.toLowerCase()),
                ByteBuffer.wrap(value1.getBytes()));
          }
        }

        Properties relProp = metaTags.getRelTags();
        Enumeration<?> relNames = relProp.propertyNames();

        while (relNames.hasMoreElements()) {
          String name2 = (String) relNames.nextElement();
          String value2 = relProp.getProperty(name2);

          if (metatagset.contains("*") || metatagset.contains(name2.toLowerCase())) {
            LOG.debug("Found meta tag : " + name2 + "\t" + value2);
            page.putToMetadata(new Utf8(PARSE_META_PREFIX + name2.toLowerCase()),
                ByteBuffer.wrap(value2.getBytes()));
          }
        }
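
        Since the two loops differ only in the Properties source, they could be folded into a single helper. A sketch, assuming it lives in the same class and can reuse the members already visible in the quoted code (LOG, PARSE_META_PREFIX); this is an illustration, not part of any attached patch:

            // hypothetical helper: store every matching name/value pair from the given Properties
            // into the page metadata, mirroring the two loops above
            private void addTagsToMetadata(Properties tags, Set<String> metatagset, WebPage page) {
              Enumeration<?> names = tags.propertyNames();
              while (names.hasMoreElements()) {
                String name = (String) names.nextElement();
                String value = tags.getProperty(name);
                if (metatagset.contains("*") || metatagset.contains(name.toLowerCase())) {
                  LOG.debug("Found meta tag : " + name + "\t" + value);
                  page.putToMetadata(new Utf8(PARSE_META_PREFIX + name.toLowerCase()),
                      ByteBuffer.wrap(value.getBytes()));
                }
              }
            }

            // usage in place of the two loops:
            addTagsToMetadata(metaTags.getPropertyTags(), metatagset, page);
            addTagsToMetadata(metaTags.getRelTags(), metatagset, page);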

        Talat UYARER added a comment -

        I updated the patch to remove the unnecessary configuration. Some trivial updates.

        Lewis John McGibbney added a comment -

        Hi Talat UYARER, let's have a couple of people try this patch out, then we can commit it if there are no problems. Thanks

        Lewis John McGibbney added a comment -

        I am a big fat +1 for the v6 patch to be committed. Tested and verified. All test cases are passing nicely as well.
        Anyone else?

        Vangelis Karvounis added a comment -

        +1. Very nice work

        Lewis John McGibbney added a comment -

        v6 patch committed @revision 1577143 in 2.x HEAD
        Thank you to everyone who worked on this issue. Everyone is credited in CHANGES.txt

        Hudson added a comment -

        SUCCESS: Integrated in Nutch-nutchgora #951 (See https://builds.apache.org/job/Nutch-nutchgora/951/)
        NUTCH-1478 Parse-metatags and index-metadata plugin for Nutch 2.x series (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1577143)

        • /nutch/branches/2.x/CHANGES.txt
        • /nutch/branches/2.x/build.xml
        • /nutch/branches/2.x/conf/nutch-default.xml
        • /nutch/branches/2.x/conf/schema.xml
        • /nutch/branches/2.x/src/plugin/build.xml
        • /nutch/branches/2.x/src/plugin/index-metadata
        • /nutch/branches/2.x/src/plugin/index-metadata/build.xml
        • /nutch/branches/2.x/src/plugin/index-metadata/ivy.xml
        • /nutch/branches/2.x/src/plugin/index-metadata/plugin.xml
        • /nutch/branches/2.x/src/plugin/index-metadata/src
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org/apache
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org/apache/nutch
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org/apache/nutch/indexer
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata
        • /nutch/branches/2.x/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
        • /nutch/branches/2.x/src/plugin/parse-metatags
        • /nutch/branches/2.x/src/plugin/parse-metatags/README.txt
        • /nutch/branches/2.x/src/plugin/parse-metatags/build.xml
        • /nutch/branches/2.x/src/plugin/parse-metatags/ivy.xml
        • /nutch/branches/2.x/src/plugin/parse-metatags/plugin.xml
        • /nutch/branches/2.x/src/plugin/parse-metatags/sample
        • /nutch/branches/2.x/src/plugin/parse-metatags/sample/testMetatags.html
        • /nutch/branches/2.x/src/plugin/parse-metatags/sample/testMultivalueMetatags.html
        • /nutch/branches/2.x/src/plugin/parse-metatags/src
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java/org
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java/org/apache
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java/org/apache/nutch
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test/org
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test/org/apache
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test/org/apache/nutch
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
        • /nutch/branches/2.x/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/TestMetaTagsParser.java
        • /nutch/branches/2.x/src/test/org/apache/nutch/indexer/TestIndexingFilters.java
        Shanaka Jayasundera added a comment -

        Hi All,

        I've downloaded the latest code from the 2.x branch and tried to index metadata to Solr, but the Solr query results are not showing the metadata.

        However, parsechecker is working fine. Do I need to do any additional configuration to get the metadata in the Solr query results?

        $ ./bin/nutch parsechecker http://nutch.apache.org/
        fetching: http://nutch.apache.org/
        parsing: http://nutch.apache.org/
        contentType: text/html
        signature: b2bb805dcd51f12784190d58d619f0bc
        ---------
        Url
        ---------------
        http://nutch.apache.org/
        ---------
        Metadata
        ---------
        meta_forrest-version : 0.10-dev
        meta_generator : Apache Forrest
        meta_forrest-skin-name : nutch_rs_ : �
        meta_content-type : text/html; charset=UTF-8

        The command I'm using to crawl and index is:
        bin/crawl urls/seed.txt TestCrawl3.1 http://localhost:8983/solr/ 2

        I've not made many configuration changes; I've configured nutch-site.xml and gora.properties to use HBase and Gora.

        I'd appreciate it if anyone can help me identify the missing configuration.
        Thanks in advance.

        Lewis John McGibbney added a comment -

        Can you please take this to the user@ mailing list? Thank you


          People

          • Assignee:
            Unassigned
            Reporter:
            kiran
          • Votes:
            6
            Watchers:
            13
