Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.2.1
None
None
Environment: Ubuntu 13.10, Elasticsearch 1, HBase 0.94.9
Description
Hopefully this is something I am doing wrong. I checked out 2.x because I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
<description>The cluster name to discover. Either host and port must be defined
or cluster.</description>
</property>
I then created a folder called urls and added seed.txt to it.
I ran the following commands:
bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch index -all
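For anyone reproducing this, the same sequence can be wrapped in a small script so that a failing step stops the cycle instead of being followed by the next command. This is only a sketch: the NUTCH variable and the run_crawl_cycle function name are made up for illustration, and the arguments are copied verbatim from the commands above.

```shell
#!/bin/sh
# Sketch only: NUTCH and run_crawl_cycle are illustrative names, not part of Nutch.
# NUTCH defaults to the launcher path used in the report; override it if needed.
NUTCH="${NUTCH:-bin/nutch}"

# Run one crawl cycle with the same arguments as in the report.
# The && chain stops at the first step that exits with a non-zero status.
run_crawl_cycle() {
  "$NUTCH" inject urls &&
  "$NUTCH" generate -topN 1000 &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb &&
  "$NUTCH" index -all
}
```

Calling run_crawl_cycle from the runtime directory runs the six steps in order; its exit status is that of the last step that ran. A clean exit only means no step reported a failure, not that any documents were indexed.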
It runs with no errors; however, no documents have been indexed.
I also tried the same setup with Solr, and no documents were indexed there either.
Log:
2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address , publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address , publish_address {inet[/10.0.2.15:9201]}
2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:11,581 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2014-06-24 02:58:11,717 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
2014-06-24 02:58:11,717 INFO elasticsearch.node - [Silver] stopping ...
2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] stopped
2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] closing ...
2014-06-24 02:58:11,756 INFO elasticsearch.node - [Silver] closed
2014-06-24 02:58:11,759 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2014-06-24 02:58:12,511 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-24 02:58:12,511 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] initializing ...
2014-06-24 02:58:12,555 INFO elasticsearch.plugins - [Lifeguard] loaded [], sites []
2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] initialized
2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] starting ...
2014-06-24 02:58:13,032 INFO elasticsearch.transport - [Lifeguard] bound_address , publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:16,063 INFO cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-24 02:58:16,072 INFO elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
2014-06-24 02:58:16,074 INFO elasticsearch.http - [Lifeguard] bound_address , publish_address {inet[/10.0.2.15:9201]}
2014-06-24 02:58:16,076 INFO elasticsearch.node - [Lifeguard] started
2014-06-24 02:58:16,076 INFO indexer.IndexingJob - IndexingJob: done.
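The line that shows the problem in this log is the ElasticIndexWriter summary, "total docs = 0". A quick way to pull that figure out of a log file is sketched below. The function name total_docs_indexed is made up for illustration, and the log path in the usage note is an assumption about a local-mode setup; the pattern matches only the summary line format shown in the log above.

```shell
#!/bin/sh
# Sketch only: total_docs_indexed is an illustrative helper, not part of Nutch.
# Prints the "total docs" figure from the last ElasticIndexWriter summary line
# found in the given log file. Prints nothing if no such line exists.
total_docs_indexed() {
  # $1: path to the log file to scan
  sed -n '/ElasticIndexWriter/s/.*total docs = \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}
```

Assuming the job log is written to logs/hadoop.log, running total_docs_indexed logs/hadoop.log after the index step gives the number of documents the writer reports having sent. For the log in this report the result would be 0.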