Nutch / NUTCH-1251

SolrDedup to use proper Lucene catch-all query

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: indexer
    • Labels: None
    • Environment: Any crawl where the number of URLs in Solr exceeds 1024 (the default maximum number of clauses in a Lucene Boolean query).

    Description

      Deletion of duplicates fails. The "get all" query used to get the Solr index size is "id:[* TO *]", which is a range query. Lucene tries to expand it into a Boolean query with one clause per id in the index. On a real index this exceeds the maximum clause count, and the query throws an exception.

      To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to "*:*", which is the standard Solr "get all" query.
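
A minimal sketch of the proposed change, for illustration only. It uses plain Java rather than SolrJ so that it runs without extra dependencies. The class name `GetAllQuerySketch`, the constant `OLD_GET_ALL_QUERY`, and the helper `selectUrl` are hypothetical; only `SOLR_GET_ALL_QUERY`, the two query strings, and the request parameters come from this report.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GetAllQuerySketch {
    // Query reported as failing: a range query over the id field.
    // (Hypothetical constant name, kept only for comparison.)
    static final String OLD_GET_ALL_QUERY = "id:[* TO *]";

    // Proposed replacement: the standard Solr match-all query.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    // Hypothetical helper: builds a select URL shaped like the one in the log.
    static String selectUrl(String baseUrl, String query, int start, int rows) {
        return baseUrl + "/select?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&fl=id,boost,tstamp,digest"
                + "&start=" + start
                + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // Base URL taken from the log extract; adjust for your own Solr instance.
        String base = "http://localhost:8081/arch";
        System.out.println("old: " + selectUrl(base, OLD_GET_ALL_QUERY, 0, 0));
        System.out.println("new: " + selectUrl(base, SOLR_GET_ALL_QUERY, 0, 0));
    }
}
```

Requesting `rows=0` here is an assumption: it would be enough to read the total hit count without fetching documents, whereas the failing request in the log asks for all rows.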

      Indexing log extract:

      java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
      Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
      at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
      ... 3 more
      Caused by: org.apache.solr.common.SolrException: Internal Server Error

      Internal Server Error

      request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
      ... 5 more

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Arkadi Kosmynin (arch)