Ignite / IGNITE-9141

SQL: Trace and test query mapping problems

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6
    • Fix Version/s: 2.7
    • Component/s: sql

    Description

      One of the mandatory steps of SQL query execution is topology mapping: we select the nodes where the required caches are located and make sure that their partition distribution is valid for the given SQL query. Once the nodes are chosen, we try to reserve the partitions of interest on the mapper nodes so that they are not evicted during query execution.
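
      To illustrate the reservation step, here is a minimal sketch in plain Java. It is not Ignite code: ReservationSketch, Partition and reserveAll are hypothetical names, and the counter-based eviction guard is an assumption made for the example. It shows an all-or-nothing reservation that rolls back and reports a cause when any partition cannot be reserved.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch, not Ignite code: all-or-nothing partition reservation. */
public class ReservationSketch {
    /** Hypothetical partition guarded by a reservation counter (-1 means evicted). */
    static final class Partition {
        final int id;
        private final AtomicInteger reservations = new AtomicInteger();

        Partition(int id) {
            this.id = id;
        }

        /** Reserves the partition unless it has already been evicted. */
        boolean reserve() {
            for (;;) {
                int cur = reservations.get();

                if (cur < 0)
                    return false; // Already evicted, cannot be reserved.

                if (reservations.compareAndSet(cur, cur + 1))
                    return true;
            }
        }

        /** Releases one reservation. */
        void release() {
            reservations.decrementAndGet();
        }

        /** Eviction succeeds only while nobody holds a reservation. */
        boolean evict() {
            return reservations.compareAndSet(0, -1);
        }
    }

    /**
     * Tries to reserve every partition. On failure, releases what was already
     * reserved and returns a message describing the cause; returns null on success.
     */
    static String reserveAll(List<Partition> parts, List<Partition> reserved) {
        for (Partition p : parts) {
            if (!p.reserve()) {
                for (Partition r : reserved)
                    r.release();

                reserved.clear();

                return "Failed to reserve partition [part=" + p.id + ", state=EVICTED]";
            }

            reserved.add(p);
        }

        return null;
    }
}
```

      In this sketch a non-null return value is what would trigger a retry of the whole query, and the returned string is the kind of cause that should reach the user.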

      However, the mapping step may fail for many reasons, most often because of rebalancing or concurrent node failures. In this case we simply retry the whole query execution from scratch. In IGNITE-9114 we ensured that the retry cycle is not infinite and that the root cause of a remap is logged. However, the original root cause is not propagated to the client node, which makes the problem hard to debug for end users. We also do not have enough tests for remap events. Let's fix this.
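
      The desired behaviour can be sketched as a bounded retry loop that remembers the cause of the last remap and includes it in the exception thrown on timeout. This is an illustration only: RetrySketch, Attempt, QueryRetryException and runWithRetries are hypothetical names, not the classes used in Ignite.

```java
import java.util.function.Supplier;

/** Hypothetical sketch, not Ignite code: bounded retry loop that keeps the last remap cause. */
public class RetrySketch {
    /** Outcome of one mapping attempt: success, or a retry request with a cause. */
    static final class Attempt {
        final boolean retry;
        final String retryCause;

        private Attempt(boolean retry, String retryCause) {
            this.retry = retry;
            this.retryCause = retryCause;
        }

        static Attempt ok() {
            return new Attempt(false, null);
        }

        static Attempt retry(String cause) {
            return new Attempt(true, cause);
        }
    }

    /** Thrown to the caller once the retry budget is exhausted. */
    static final class QueryRetryException extends RuntimeException {
        QueryRetryException(String msg) {
            super(msg);
        }
    }

    /**
     * Repeats the attempt until it succeeds or the timeout elapses.
     * Returns the number of attempts made on success.
     */
    static int runWithRetries(Supplier<Attempt> attempt, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        String lastCause = null;
        int attempts = 0;

        do {
            attempts++;

            Attempt res = attempt.get();

            if (!res.retry)
                return attempts;

            lastCause = res.retryCause; // Remember why the last attempt was remapped.
        }
        while (System.nanoTime() - deadline < 0);

        throw new QueryRetryException("Failed to map SQL query to topology within " + timeoutMs +
            " ms [attempts=" + attempts + ", lastCause=" + lastCause + ']');
    }
}
```

      A real implementation would typically also pause between attempts; the sketch omits that to keep the cause propagation visible.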

      Proposed implementation flow:
      1) Add a retryCause: String field to GridQueryNextPageResponse, populated together with the retry field on the mapper node. See the GridMapQueryExecutor#sendRetry method to understand what may cause retries (failure to reserve partitions or failure to execute a non-collocated join). Make sure that these error messages are as verbose as possible and include all necessary details (root cause, cache names, affected partitions, etc.).
      2) Make sure that the root cause is set in ReduceQueryRun#state and then propagated to the user exception in case of a retry timeout.
      3) Evaluate all places inside org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query which may lead to a retry, and make sure that the root cause is verbose and propagated to the user exception in case of a retry timeout.
      4) Add tests covering all retry branches, and ensure that the query fails after the timeout and that the error message is correct.
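
      Steps 1 and 2 can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual patch: PageResponse stands in for GridQueryNextPageResponse (whose real fields and types are not reproduced here, and the retry flag is modelled as a plain boolean), ReduceState stands in for the reducer-side run state, and the message layout is only an example of a verbose cause.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Ignite code: carrying a retry cause from mapper to reducer. */
public class RetryCauseSketch {
    /** Simplified stand-in for a next-page response sent by a mapper node. */
    static final class PageResponse {
        final boolean retry;      // Assumption: modelled as a flag for this sketch.
        final String retryCause;  // Human-readable reason for the retry request.

        PageResponse(boolean retry, String retryCause) {
            this.retry = retry;
            this.retryCause = retryCause;
        }
    }

    /** Mapper side: builds a verbose cause with cache name, partitions and node. */
    static String reservationFailure(String cacheName, int[] parts, String nodeId, String reason) {
        return "Failed to reserve partitions for query [cache=" + cacheName +
            ", parts=" + Arrays.toString(parts) +
            ", node=" + nodeId +
            ", reason=" + reason + ']';
    }

    /** Reducer side: remembers the most recent cause reported by any mapper. */
    static final class ReduceState {
        private volatile String rootCause;

        void onResponse(PageResponse res) {
            if (res.retry)
                rootCause = res.retryCause;
        }

        /** Builds the exception shown to the user when the retry timeout expires. */
        RuntimeException timeoutError(long timeoutMs) {
            return new IllegalStateException("Failed to execute query: retry timeout of " +
                timeoutMs + " ms exceeded [rootCause=" + rootCause + ']');
        }
    }
}
```

      A test for step 4 could then provoke one retry branch, wait for the timeout, and assert that the user-facing message contains the cache name and the affected partitions.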

      NB: Once propagation of the error message to the reducer is implemented, we may remove the additional logging altogether.

      Attachments

        1. IGNITE-9141__Implemented_.patch
          52 kB
          Sergey Grimstad

            People

              Assignee: Sergey Grimstad (SGrimstad)
              Reporter: Vladimir Ozerov (vozerov)
              Votes: 0
              Watchers: 4
