Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-7480

[Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

Agile BoardAttach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.0.0
    • Rust, Rust - DataFusion
    • None

    Description

      There are two scenarios that cause problems but are related to the queries with aggregate expressions and the SQL planner. The aggregate_test_100 dataset is used for both of the queries. 

      At a high level, the issue is basically that queries containing aggregate expressions may generate the wrong schema.

       

      Scenario 1

      Columns are grouped by but not selected.

      Query:

      SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13

      Error:

      ArrowError(InvalidArgumentError("number of columns must match number of fields in schema"))

      While the error is an ArrowError, it actually looks like it is because the wrong schema is generated. In the src/sql/planner.rs file the impl for SqlToRel is defined. In the sql_to_rel method, it checks if the query contains aggregate expressions, and if it does it generates the schema from the columns included in group expressions and aggregate expressions.

      This in turn generates the following schema:

      Schema {
          fields: [
              Field {
                  name: "c1",
                  data_type: Utf8,
                  nullable: false,
              },
              Field {
                  name: "c13",
                  data_type: Utf8,
                  nullable: false,
              },
              Field {
                  name: "MIN",
                  data_type: Float64,
                  nullable: true,
              },
          ],
          metadata: {},
      }

      I am not super familiar with how DataFusion works under the hood, but I would assume that this schema is actually correct for the Aggregate logical plan, but isn't projecting the data correctly thus resulting in the wrong query result schema? 

       

      Senario 2

      Columns are selected, but not grouped or part of an aggregate function. This query actually will run, but the wrong schema is produced.

      Query: 

      SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1

      Schema generated:

      Schema {
          fields: [
              Field {
                  name: "c0",
                  data_type: Utf8,
                  nullable: true,
              },
              Field {
                  name: "c1",
                  data_type: Float64,
                  nullable: true,
              },
              Field {
                  name: "c1",
                  data_type: Float64,
                  nullable: true,
              },
          ],
          metadata: {},
      } 

      This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, Float64).

       


      Schema 2 is questionable since some DBMS will run the query (ex MySQL) but others (Postgres) will require that all the columns must be in the GROUP BY to be used in an aggregate function.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            andygrove Andy Grove Assign to me
            kylemccarthy Kyle McCarthy
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment