Nutch / NUTCH-1251

SolrDedup to use proper Lucene catch-all query

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: indexer
    • Labels:
      None
    • Environment:

      Any crawl where the number of URLs in Solr exceeds 1024 (the default maximum number of clauses in a Lucene Boolean query).

      Description

      Deletion of duplicates fails. The "get all" query used to get the Solr index size is "id:[* TO *]", which is a range query. Lucene tries to expand it into a Boolean query with one clause per id in the index. In any real index this exceeds the clause limit, so Lucene throws an exception.
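
      To illustrate the failure mode, here is a toy model of the expansion. It is not Lucene code: the class, the method names and the exception type are made up for this sketch, and only the limit of 1024 comes from the Environment note above.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeExpansionSketch {
    // Default clause limit quoted in this issue's Environment field.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Toy stand-in for rewriting id:[* TO *]: one clause per indexed id.
    // Hypothetical helper, not part of Lucene or Nutch.
    static List<String> expandRange(List<String> indexedIds) {
        List<String> clauses = new ArrayList<>();
        for (String id : indexedIds) {
            if (clauses.size() >= MAX_CLAUSE_COUNT) {
                throw new IllegalStateException(
                    "too many clauses: limit is " + MAX_CLAUSE_COUNT);
            }
            clauses.add("id:" + id);
        }
        return clauses;
    }

    // Generates n placeholder ids for the demonstration.
    static List<String> makeIds(int n) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ids.add("http://example.org/page" + i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // 1024 ids still fit within the limit.
        System.out.println(expandRange(makeIds(1024)).size());
        // One more id pushes the expansion over the limit.
        try {
            expandRange(makeIds(1025));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      The point of the sketch is that the cost of the range query grows with the number of ids, which is why the failure only shows up on larger indexes.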

      To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to "*:*", which is the standard Solr "get all" query.
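
      A minimal sketch of the proposed change follows. The two query strings are the ones quoted in this issue; the class name, the field modifiers and the buildSelectUrl helper are assumptions made for illustration and may not match the actual declaration in SolrDeleteDuplicates.java.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class GetAllQuerySketch {
    // Before the fix (value quoted in this issue): a range query over every id.
    static final String OLD_GET_ALL_QUERY = "id:[* TO *]";

    // After the fix: the standard Solr "get all" query.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    // Hypothetical helper that mirrors the shape of the request URL in the
    // log extract; it is not part of Nutch or SolrJ.
    static String buildSelectUrl(String solrBase, String query, int rows) {
        try {
            String q = URLEncoder.encode(query, "UTF-8");
            return solrBase + "/select?q=" + q
                + "&fl=id,boost,tstamp,digest&start=0&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same base URL and row count as the failing request in the log.
        System.out.println(
            buildSelectUrl("http://localhost:8081/arch", SOLR_GET_ALL_QUERY, 82938));
    }
}
```

      Only the query string changes; the field list and paging parameters of the request stay as they were.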

      Indexing log extract:

      java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
      Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
      at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
      at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
      ... 3 more
      Caused by: org.apache.solr.common.SolrException: Internal Server Error

      Internal Server Error

      request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
      at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
      ... 5 more

        Activity

        Hudson added a comment -

        Integrated in nutch-trunk-maven #330 (See https://builds.apache.org/job/nutch-trunk-maven/330/)
        NUTCH-1251 SolrDedup to use proper Lucene catch-all query (Revision 1353857)

        Result = SUCCESS
        markus :
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
        Markus Jelsma added a comment -

        Committed for 1.6 in rev. 1353857.
        Thanks Arkadi!

        Arkadi Kosmynin added a comment -

        Thanks Markus!

        Markus Jelsma added a comment -

        20120304-push-1.6

        Arkadi Kosmynin added a comment -

        It is a one-line change: file org.apache.nutch.indexer.solr.SolrDeleteDuplicates.java, line 90.

        Markus Jelsma added a comment -

        Can you provide a patch for trunk?


          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Arkadi Kosmynin
          • Votes:
            0
            Watchers:
            2
