Nutch / NUTCH-1076

Solrindex has no documents following bin/nutch solrindex when using protocol-file

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: indexer
    • Environment: Ubuntu Linux 10.04 server
      JDK 1.6
      Nutch 1.3
      Solr 3.1.0

    Description

      Note: When using protocol-http, I am able to update Solr without any problems.

      To test this, my urls directory contains a single seed URL that points to one PDF file, which I am trying to index.

      I execute:

      bin/nutch crawl urls

      Output:

      solrUrl is not set, indexing will be skipped...
      crawl started in: crawl-20110805151045
      rootUrlDir = urls
      threads = 10
      depth = 5
      solrUrl=null
      Injector: starting at 2011-08-05 15:10:45
      Injector: crawlDb: crawl-20110805151045/crawldb
      Injector: urlDir: urls
      Injector: Converting injected urls to crawl db entries.
      Injector: Merging injected urls into crawl db.
      Injector: finished at 2011-08-05 15:10:48, elapsed: 00:00:02
      Generator: starting at 2011-08-05 15:10:48
      Generator: Selecting best-scoring urls due for fetch.
      Generator: filtering: true
      Generator: normalizing: true
      Generator: jobtracker is 'local', generating exactly one partition.
      Generator: Partitioning selected urls for politeness.
      Generator: segment: crawl-20110805151045/segments/20110805151050
      Generator: finished at 2011-08-05 15:10:51, elapsed: 00:00:03
      Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
      Fetcher: starting at 2011-08-05 15:10:51
      Fetcher: segment: crawl-20110805151045/segments/20110805151050
      Fetcher: threads: 10
      QueueFeeder finished: total 1 records + hit by time limit :0
      fetching file:///home/nutch/nutch-1.3/runtime/local/indexdir/Altec.pdf
      -finishing thread FetcherThread, activeThreads=9
      -finishing thread FetcherThread, activeThreads=8
      -finishing thread FetcherThread, activeThreads=7
      -finishing thread FetcherThread, activeThreads=6
      -finishing thread FetcherThread, activeThreads=5
      -finishing thread FetcherThread, activeThreads=4
      -finishing thread FetcherThread, activeThreads=3
      -finishing thread FetcherThread, activeThreads=2
      -finishing thread FetcherThread, activeThreads=1
      -finishing thread FetcherThread, activeThreads=0
      -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
      -activeThreads=0
      Fetcher: finished at 2011-08-05 15:10:53, elapsed: 00:00:02
      ParseSegment: starting at 2011-08-05 15:10:53
      ParseSegment: segment: crawl-20110805151045/segments/20110805151050
      ParseSegment: finished at 2011-08-05 15:10:56, elapsed: 00:00:03
      CrawlDb update: starting at 2011-08-05 15:10:56
      CrawlDb update: db: crawl-20110805151045/crawldb
      CrawlDb update: segments: [crawl-20110805151045/segments/20110805151050]
      CrawlDb update: additions allowed: true
      CrawlDb update: URL normalizing: true
      CrawlDb update: URL filtering: true
      CrawlDb update: Merging segment data into db.
      CrawlDb update: finished at 2011-08-05 15:10:57, elapsed: 00:00:01
      Generator: starting at 2011-08-05 15:10:57
      Generator: Selecting best-scoring urls due for fetch.
      Generator: filtering: true
      Generator: normalizing: true
      Generator: jobtracker is 'local', generating exactly one partition.
      Generator: 0 records selected for fetching, exiting ...
      Stopping at depth=1 - no more URLs to fetch.
      LinkDb: starting at 2011-08-05 15:10:58
      LinkDb: linkdb: crawl-20110805151045/linkdb
      LinkDb: URL normalize: true
      LinkDb: URL filter: true
      LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110805151045/segments/20110805151050
      LinkDb: finished at 2011-08-05 15:10:59, elapsed: 00:00:01
      crawl finished: crawl-20110805151045
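
The crawl output above already carries two facts that matter for this report: one URL was fetched, and indexing was skipped because no Solr URL was given. A minimal sketch for pulling those two facts out of a saved copy of the output follows. The function name `summarize_crawl_log` and the idea of saving the output to a file are assumptions for illustration, not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): summarize a saved copy of the
# "bin/nutch crawl" console output.
summarize_crawl_log() {
  log_file=$1

  # Count the "fetching <url>" lines printed by the Fetcher.
  # grep -c exits non-zero when the count is 0, so guard it with "|| true".
  fetched=$(grep -c '^[[:space:]]*fetching ' "$log_file" || true)

  # The crawl command prints this message when no Solr URL was supplied.
  if grep -q 'solrUrl is not set' "$log_file"; then
    indexing=skipped
  else
    indexing=enabled
  fi

  echo "fetched=$fetched indexing=$indexing"
}
```

For the run shown here, the summary would report one fetched URL and skipped indexing, which is why the separate solrindex step is needed afterwards.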

      Then, with a clean Solr index (stats output from stats.jsp below):

      searcherName : Searcher@14dd758 main
      caching : true
      numDocs : 0
      maxDoc : 0
      reader : SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0}
      readerDir : org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197
      indexVersion : 1312575204101
      openedAt : Fri Aug 05 15:13:24 CDT 2011
      registeredAt : Fri Aug 05 15:13:24 CDT 2011
      warmupTime : 0

      I then execute:

      bin/nutch solrindex http://localhost:8983/solr/ crawl-20110805151045/crawldb/ crawl-20110805151045/linkdb/ crawl-20110805151045/segments/*

      bin/nutch output:

      SolrIndexer: starting at 2011-08-05 15:15:48
      SolrIndexer: finished at 2011-08-05 15:15:50, elapsed: 00:00:01
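
SolrIndexer finishes in about a second here without reporting how many documents it sent. One check that can be made before indexing is whether each segment actually contains parse output, since a Nutch segment normally holds `crawl_parse`, `parse_data` and `parse_text` subdirectories once ParseSegment has run. The sketch below is only a diagnostic aid under that assumption; the function name `segment_is_parsed` is hypothetical, and a missing parse directory is not confirmed as the cause of this issue.

```shell
# Hypothetical helper (not part of Nutch): succeed only if the segment
# directory contains the three parse subdirectories.
segment_is_parsed() {
  segment_dir=$1
  for part in crawl_parse parse_data parse_text; do
    [ -d "$segment_dir/$part" ] || return 1
  done
  return 0
}

# Example use against the crawl directory from this report:
#   for seg in crawl-20110805151045/segments/*; do
#     if segment_is_parsed "$seg"; then
#       echo "parsed:     $seg"
#     else
#       echo "NOT parsed: $seg"
#     fi
#   done
```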

      Solr output:

      Aug 5, 2011 3:15:50 PM org.apache.solr.update.DirectUpdateHandler2 commit
      INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher <init>
      INFO: Opening Searcher@15f1f9c main
      Aug 5, 2011 3:15:50 PM org.apache.solr.update.DirectUpdateHandler2 commit
      INFO: end_commit_flush
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming Searcher@15f1f9c main from Searcher@14dd758 main
      fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming result for Searcher@15f1f9c main
      fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming Searcher@15f1f9c main from Searcher@14dd758 main
      filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming result for Searcher@15f1f9c main
      filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming Searcher@15f1f9c main from Searcher@14dd758 main
      queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=1,evictions=0,size=1,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming result for Searcher@15f1f9c main
      queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming Searcher@15f1f9c main from Searcher@14dd758 main
      documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher warm
      INFO: autowarming result for Searcher@15f1f9c main
      documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.core.QuerySenderListener newSearcher
      INFO: QuerySenderListener sending requests to Searcher@15f1f9c main
      Aug 5, 2011 3:15:50 PM org.apache.solr.core.QuerySenderListener newSearcher
      INFO: QuerySenderListener done.
      Aug 5, 2011 3:15:50 PM org.apache.solr.core.SolrCore registerSearcher
      INFO: [] Registered new searcher Searcher@15f1f9c main
      Aug 5, 2011 3:15:50 PM org.apache.solr.search.SolrIndexSearcher close
      INFO: Closing Searcher@14dd758 main
      fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=1,evictions=0,size=1,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
      Aug 5, 2011 3:15:50 PM org.apache.solr.update.processor.LogUpdateProcessor finish
      INFO: {commit=} 0 8
      Aug 5, 2011 3:15:50 PM org.apache.solr.core.SolrCore execute
      INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=8

      Output from stats.jsp:

      stats:
      searcherName : Searcher@15f1f9c main
      caching : true
      numDocs : 0
      maxDoc : 0
      reader : SolrIndexReader{this=1ee148b,r=ReadOnlyDirectoryReader@1ee148b,refCnt=1,segments=0}

      readerDir : org.apache.lucene.store.NIOFSDirectory@/home/solr/apache-solr-3.1.0/example/solr/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@987197
      indexVersion : 1312575204101
      openedAt : Fri Aug 05 15:15:50 CDT 2011
      registeredAt : Fri Aug 05 15:15:50 CDT 2011
      warmupTime : 2
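
The two stats.jsp listings differ only in the searcher name and timestamps; `numDocs` stays at 0. A small sketch for making that comparison explicit is shown below. It assumes the stats text has been saved to two plain-text files in the `name : value` form shown in this report; the function names `num_docs` and `index_grew` are hypothetical.

```shell
# Hypothetical helpers (not part of Nutch or Solr).

# Print the value of the first "numDocs : N" line read from stdin.
num_docs() {
  sed -n 's/^[[:space:]]*numDocs[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Succeed only if numDocs in the second file is greater than in the first.
index_grew() {
  before=$(num_docs < "$1")
  after=$(num_docs < "$2")
  [ "${after:-0}" -gt "${before:-0}" ]
}

# Example use with two saved copies of the stats text:
#   if index_grew stats-before.txt stats-after.txt; then
#     echo "documents were added"
#   else
#     echo "numDocs did not increase"
#   fi
```

With the values in this report (0 before and 0 after), `index_grew` fails, matching the observation that solrindex added no documents.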

            People

              markus17 Markus Jelsma
              seth.griffin Seth Griffin