Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25961

Random numbers are not supported when handling data skew

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.3.1
    • None
    • SQL
    • None
    • spark on yarn 2.3.1

    Description

      My SQL query uses two tables to join. One table join key has null value. I use random value instead of null;however, it has error. The error as follows:

      Error in query: nondeterministic expressions are only allowed in
      
      Project, Filter, Aggregate or Window, found
      
      scan spark source code is org.apache.spark.sql.catalyst.analysis.CheckAnalysis check sql, because the number of random variables is uncertain, it is prohibited
      
      case o if o.expressions.exists(!_.deterministic) &&
       !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
       !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
       // The rule above is used to check Aggregate operator.
       failAnalysis(
       s"""nondeterministic expressions are only allowed in
      |Project, Filter, Aggregate or Window, found:|
      |${o.expressions.map(_.sql).mkString(",")}|
      |in operator ${operator.simpleString}
       """.stripMargin)|
      

       
      Is it possible to add Join to this code? It's not yet tested.And whether there will be other effects

      case o if o.expressions.exists(!_.deterministic) &&
       !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
       !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] +{color:#d04437}&& !o.isInstanceOf[Join]{color}+ =>
       // The rule above is used to check Aggregate operator.
       failAnalysis(
       s"""nondeterministic expressions are only allowed in
      |Project, Filter, Aggregate or Window or Join, found:|
      |${o.expressions.map(_.sql).mkString(",")}|
      |in operator ${operator.simpleString}
       """.stripMargin)|
      

      This is my SQL query is as below:

      SELECT T1.CUST_NO AS CUST_NO,
             T3.CON_LAST_NAME AS CUST_NAME,
             T3.CON_SEX_MF AS SEX_CODE,
             T3.X_POSITION AS POST_LV_CODE
      FROM tmp.ICT_CUST_RANGE_INFO T1
      LEFT JOIN tmp.F_CUST_BASE_INFO_ALL T3 ON CASE
                                                   WHEN coalesce(T1.CUST_NO, '') ='' THEN concat('cust_no', RAND())
                                                   ELSE T1.CUST_NO
                                               END = T3.BECIF
      AND T3.DATE='20181105'
      WHERE T1.DATE='20181105'
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            zengxl zengxl
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: