Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17298

Require explicit CROSS join for cartesian products by default

    Details

    • Type: Story
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations under the default configuration (spark.sql.crossJoin.enabled = false).
      By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S.

      If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown.
      Turning on the spark.sql.crossJoin.enabled configuration flag will disable this check and allow cartesian products without an explicit cross join.

        Issue Links

          Activity

          Hide
          vssrinath Srinath added a comment -

          I've updated the description. Hopefully it is clearer.
          Note that before this change, even with spark.sql.crossJoin.enabled = false,
          case 1.a may sometimes NOT throw an error (i.e. execute successfully) depending on the physical plan chosen.
          With the proposed change, it would always throw an error

          Show
          vssrinath Srinath added a comment - I've updated the description. Hopefully it is clearer. Note that before this change, even with spark.sql.crossJoin.enabled = false, case 1.a may sometimes NOT throw an error (i.e. execute successfully) depending on the physical plan chosen. With the proposed change, it would always throw an error
          Hide
          srowen Sean Owen added a comment -

          Agree, though spark.sql.crossJoin.enabled=false by default, so the queries result in an error right now. This disagrees with https://issues.apache.org/jira/browse/SPARK-17298?focusedCommentId=15446920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15446920

          This change allows it to work where it didn't before. I mean it's inaccurate to say that the change is to require this new syntax, because as before it can be allowed by changing the flag too.

          Show
          srowen Sean Owen added a comment - Agree, though spark.sql.crossJoin.enabled=false by default, so the queries result in an error right now. This disagrees with https://issues.apache.org/jira/browse/SPARK-17298?focusedCommentId=15446920&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15446920 This change allows it to work where it didn't before. I mean it's inaccurate to say that the change is to require this new syntax, because as before it can be allowed by changing the flag too.
          Hide
          sameerag Sameer Agarwal added a comment -

          Sean, if I understand correctly, here are the new semantics Srinath is proposing:

          1. Case 1: spark.sql.crossJoin.enabled = false
          (a) select * from A inner join B throws an error
          (b) select * from A cross join B doesn't throw an error
          2. Case 2: spark.sql.crossJoin.enabled = true
          (a) select * from A inner join B doesn't throw an error
          (b) select * from A cross join B doesn't throw an error

          1(a) and 2(a) confirm with the existing semantics in Spark. This PR proposes 1(b) and 2(b).

          Show
          sameerag Sameer Agarwal added a comment - Sean, if I understand correctly, here are the new semantics Srinath is proposing: 1. Case 1: spark.sql.crossJoin.enabled = false (a) select * from A inner join B throws an error (b) select * from A cross join B doesn't throw an error 2. Case 2: spark.sql.crossJoin.enabled = true (a) select * from A inner join B doesn't throw an error (b) select * from A cross join B doesn't throw an error 1(a) and 2(a) confirm with the existing semantics in Spark. This PR proposes 1(b) and 2(b).
          Hide
          srowen Sean Owen added a comment -

          This should be an error unless you set the property to allow cartesian joins. At least, it was for me when I tried it this week on 1.6. It's possible something changed in which case disregard this, if you've tested it.

          Show
          srowen Sean Owen added a comment - This should be an error unless you set the property to allow cartesian joins. At least, it was for me when I tried it this week on 1.6. It's possible something changed in which case disregard this, if you've tested it.
          Hide
          vssrinath Srinath added a comment -

          So if I do the following:

          create temporary view nt1 as select * from values
          ("one", 1),
          ("two", 2),
          ("three", 3)
          as nt1(k, v1);

          create temporary view nt2 as select * from values
          ("one", 1),
          ("two", 22),
          ("one", 5)
          as nt2(k, v2);

          SELECT * FROM nt1, nt2; – or
          select * FROM nt1 inner join nt2;

          The SELECT queries do not in fact result in an error. The proposed change would have them return an error

          Show
          vssrinath Srinath added a comment - So if I do the following: create temporary view nt1 as select * from values ("one", 1), ("two", 2), ("three", 3) as nt1(k, v1); create temporary view nt2 as select * from values ("one", 1), ("two", 22), ("one", 5) as nt2(k, v2); SELECT * FROM nt1, nt2; – or select * FROM nt1 inner join nt2; The SELECT queries do not in fact result in an error. The proposed change would have them return an error
          Hide
          srowen Sean Owen added a comment -

          This already results in an error. You mean that it will not result in an error if you specify it explicitly in the query right?

          Show
          srowen Sean Owen added a comment - This already results in an error. You mean that it will not result in an error if you specify it explicitly in the query right?
          Hide
          apachespark Apache Spark added a comment -

          User 'srinathshankar' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14866

          Show
          apachespark Apache Spark added a comment - User 'srinathshankar' has created a pull request for this issue: https://github.com/apache/spark/pull/14866
          Hide
          vssrinath Srinath added a comment -

          You are correct that with this change, queries of the form

          select * from A inner join B
          

          will now throw an error where previously they would not.
          The reason for this suggestion is that users may often forget to specify join conditions altogether, leading to incorrect, long-running queries. Requiring explicit cross joins helps clarify intent.

          Turning on the spark.sql.crossJoin.enabled flag will revert to previous behavior.

          Show
          vssrinath Srinath added a comment - You are correct that with this change, queries of the form select * from A inner join B will now throw an error where previously they would not. The reason for this suggestion is that users may often forget to specify join conditions altogether, leading to incorrect, long-running queries. Requiring explicit cross joins helps clarify intent. Turning on the spark.sql.crossJoin.enabled flag will revert to previous behavior.
          Hide
          srowen Sean Owen added a comment -

          Hm, aren't you suggesting that cartesian joins be allowed when explicitly requested, regardless of the global flag? that's not the same as requiring this syntax, which probably isn't feasible as it would break things.

          Show
          srowen Sean Owen added a comment - Hm, aren't you suggesting that cartesian joins be allowed when explicitly requested, regardless of the global flag? that's not the same as requiring this syntax, which probably isn't feasible as it would break things.

            People

            • Assignee:
              vssrinath Srinath
              Reporter:
              vssrinath Srinath
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development