Nutch / NUTCH-1251

SolrDedup to use proper Lucene catch-all query

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: indexer
    • Labels:
      None
    • Environment:

      Any crawl where the number of URLs in Solr exceeds 1024 (the default maximum number of clauses in a Lucene Boolean query).

      Description

      Deletion of duplicates fails. The "get all" query used to get the Solr index size is "id:[* TO *]", which is a range query. Lucene tries to expand it into a Boolean query with one clause per id in the index. In any realistically sized index this exceeds the clause limit, and the query throws an exception.
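
The failure mode described above can be illustrated with a toy model. This is only a sketch, not Lucene code: the class RangeExpansionSketch and its methods are hypothetical, and the limit of 1024 is taken from the Environment field of this issue.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeExpansionSketch {
    // Assumed limit, taken from the 1024 figure quoted in this issue.
    static final int MAX_CLAUSES = 1024;

    // Toy stand-in for rewriting a range query into one clause per matching id.
    static List<String> expandToClauses(List<String> ids) {
        List<String> clauses = new ArrayList<>();
        for (String id : ids) {
            if (clauses.size() >= MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses: limit is " + MAX_CLAUSES);
            }
            clauses.add("id:" + id);
        }
        return clauses;
    }

    // Hypothetical helper that fabricates n document ids for the demonstration.
    static List<String> makeIds(int n) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ids.add("http://example.org/page" + i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // An index at the limit still expands.
        System.out.println(expandToClauses(makeIds(1024)).size());
        // One more id pushes the expansion over the limit.
        try {
            expandToClauses(makeIds(1025));
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

In this model the cost of the query grows with the number of ids in the index, which is why the problem only shows up once a crawl is large enough.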

      To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to "*:*", which is the standard Solr "get all" query.
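
As a rough sketch of what the change means on the wire, the following builds the select URL for both queries. The class GetAllQuerySketch, the method selectUrl, and the constant OLD_GET_ALL_QUERY are hypothetical names used for illustration. The base URL and the fl parameter are copied from the log extract in this issue. It uses only the JDK (URLEncoder.encode with a Charset needs Java 10 or later), not SolrJ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GetAllQuerySketch {
    // The range query that fails on large indexes.
    static final String OLD_GET_ALL_QUERY = "id:[* TO *]";
    // The standard Solr "get all" query proposed in this issue.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    // Builds a select URL; the parameter layout mirrors the request in the log extract.
    static String selectUrl(String baseUrl, String query, int rows) {
        return baseUrl + "/select?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&fl=id,boost,tstamp,digest&start=0&rows=" + rows;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8081/arch"; // taken from the log extract
        System.out.println(selectUrl(base, OLD_GET_ALL_QUERY, 0));
        System.out.println(selectUrl(base, SOLR_GET_ALL_QUERY, 0));
    }
}
```

Only the value of the q parameter differs between the two requests, so the fix amounts to changing a single constant.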

      Indexing log extract:

      java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
      Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
      at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
      ... 3 more
      Caused by: org.apache.solr.common.SolrException: Internal Server Error

      Internal Server Error

      request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
      ... 5 more

            People

            • Assignee:
              markus17 Markus Jelsma
            • Reporter:
              arch Arkadi Kosmynin
            • Votes:
              0
            • Watchers:
              2
