  1. Spark
  2. SPARK-9131

Python UDFs change data values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: PySpark 1.4 and 1.4.1

    • Spark 1.5 release

    Description

      I am having trouble using a custom UDF on dataframes with PySpark 1.4.

      I rewrote the UDF to simplify the problem, and the behavior got even stranger. The UDFs I am using do absolutely nothing: they receive a value and return the same value in the same format.

      My code is below:

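      from pyspark.sql.functions import UserDefinedFunction
      from pyspark.sql.types import DateType

      # Join the two dataframes on their ID columns.
      c = a.join(b, a['ID'] == b['ID_new'], 'inner')

      c.filter(c['ID'] == '6000000002698917').show()

      # Identity UDFs: each returns its input unchanged, declared as DateType.
      udf_A = UserDefinedFunction(lambda x: x, DateType())
      udf_B = UserDefinedFunction(lambda x: x, DateType())
      udf_C = UserDefinedFunction(lambda x: x, DateType())

      # ta is t1 untouched; tb, tc and td pass t2, t1 and t2 through the UDFs.
      d = c.select(c['ID'], c['t1'].alias('ta'), udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'), udf_C(c['t2']).alias('td'))

      d.filter(d['ID'] == '6000000002698917').show()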
      

      Here are the results of the two show() calls:

      +----------------+----------------+----------+----------+
      |              ID|          ID_new|        t1|        t2|
      +----------------+----------------+----------+----------+
      |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
      |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
      |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
      |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
      |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
      |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
      |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
      |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
      +----------------+----------------+----------+----------+
      
      +----------------+----------+----------+----------+----------+
      |              ID|        ta|        tb|        tc|        td|
      +----------------+----------+----------+----------+----------+
      |6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
      |6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
      |6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
      |6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
      |6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
      |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
      |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
      |6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
      +----------------+----------+----------+----------+----------+
      

      The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the 't1' and 't2' values in dataframe 'c', even though my UDFs do nothing. It looks as if the values were somehow taken from other records (or simply invented). The results differ between executions, apparently at random.
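
Because every UDF here is an identity function, columns derived from the same source column should agree: 'tc' should equal 'ta' (both come from t1) and 'tb' should equal 'td' (both come from t2). A minimal sketch of that check in plain Python follows. It assumes the rows of 'd' have already been brought to the driver as tuples (for example via d.collect()), with dates kept as ISO strings for brevity; `count_mismatches` is a hypothetical helper, not part of PySpark.

```python
# Rows copied from the second output above, in the order (ID, ta, tb, tc, td).
rows = [
    ("6000000002698917", "2012-02-28", "2007-03-05", "2003-03-05", "2014-02-28"),
    ("6000000002698917", "2012-02-20", "2007-02-15", "2002-02-15", "2013-02-20"),
    ("6000000002698917", "2012-02-28", "2007-03-10", "2005-03-10", "2014-02-28"),
    ("6000000002698917", "2012-02-20", "2007-03-05", "2003-03-05", "2013-02-20"),
    ("6000000002698917", "2012-02-20", "2013-08-02", "2013-01-02", "2013-02-20"),
    ("6000000002698917", "2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("6000000002698917", "2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("6000000002698917", "2012-02-20", "2014-01-02", "2013-01-02", "2013-02-20"),
]


def count_mismatches(rows, left, right):
    """Count rows whose values at tuple positions `left` and `right` differ."""
    return sum(1 for row in rows if row[left] != row[right])


# tc (position 3) is udf_B(t1), so it should equal ta (position 1), which is t1.
tc_vs_ta = count_mismatches(rows, 3, 1)

# tb (position 2) is udf_A(t2), so it should equal td (position 4), which is udf_C(t2).
tb_vs_td = count_mismatches(rows, 2, 4)

print(tc_vs_ta, tb_vs_td)  # prints "8 8" for the rows above
```

With correct UDF behavior both counts would be 0. For the output shown in this report, every one of the eight rows disagrees on both pairs.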

      Thanks in advance

      Attachments

        1. testjson_jira9131.z01
          5.00 MB
          Luis Guerra
        2. testjson_jira9131.z06
          5.00 MB
          Luis Guerra
        3. testjson_jira9131.zip
          2.24 MB
          Luis Guerra
        4. testjson_jira9131.z04
          5.00 MB
          Luis Guerra
        5. testjson_jira9131.z05
          5.00 MB
          Luis Guerra
        6. testjson_jira9131.z03
          5.00 MB
          Luis Guerra
        7. testjson_jira9131.z02
          5.00 MB
          Luis Guerra

          People

            Assignee: Davies Liu (davies)
            Reporter: Luis Guerra (luispeguerra)
            Votes: 2
            Watchers: 5