[SPARK-9131] Python UDFs change data values - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.4.0, 1.4.1
Fix Version/s: 1.5.0
Component/s: PySpark, SQL
Labels:
None
Environment:

PySpark 1.4 and 1.4.1

Target Version/s:

1.5.0
Sprint:
Spark 1.5 release

Description

I am having trouble using a custom UDF on DataFrames with PySpark 1.4.

I rewrote the UDFs to simplify the problem, and the behavior got even stranger. The UDFs I am using do absolutely nothing: each one receives a value and returns the same value in the same format.

My code is below:

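from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

# a and b are existing DataFrames; join them on the ID columns.
c = a.join(b, a['ID'] == b['ID_new'], 'inner')

c.filter(c['ID'] == '6000000002698917').show()

# Identity UDFs: each returns its input unchanged, declared as DateType.
udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

# The original post referenced vinc_muestra here; it is assumed to be the joined DataFrame c.
d = c.select(c['ID'], c['t1'].alias('ta'), udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'), udf_C(c['t2']).alias('td'))

d.filter(d['ID'] == '6000000002698917').show()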

These are the outputs of the two show() calls, first for 'c' and then for 'd':

+----------------+----------------+----------+----------+
|              ID|          ID_new|        t1|        t2|
+----------------+----------------+----------+----------+
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
+----------------+----------------+----------+----------+

+----------------+----------+----------+----------+----------+
|              ID|        ta|        tb|        tc|        td|
+----------------+----------+----------+----------+----------+
|6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+----------------+----------+----------+----------+----------+

The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It looks as if the values were somehow taken from other records (or simply invented). The results differ between executions (apparently at random).
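To make the mismatch concrete, here is a small plain-Python sketch (it does not use Spark) that compares the two outputs above column by column. The date values are copied from the two tables. Pairing rows by position is an assumption, since show() does not guarantee row order, and count_mismatches is a hypothetical helper name used only for this illustration.

```python
# (t1, t2) pairs copied from the first table (dataframe c), in displayed order.
c_rows = [
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
]

# (ta, tb, tc, td) tuples copied from the second table (dataframe d).
d_rows = [
    ("2012-02-28", "2007-03-05", "2003-03-05", "2014-02-28"),
    ("2012-02-20", "2007-02-15", "2002-02-15", "2013-02-20"),
    ("2012-02-28", "2007-03-10", "2005-03-10", "2014-02-28"),
    ("2012-02-20", "2007-03-05", "2003-03-05", "2013-02-20"),
    ("2012-02-20", "2013-08-02", "2013-01-02", "2013-02-20"),
    ("2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("2012-02-20", "2014-01-02", "2013-01-02", "2013-02-20"),
]


def count_mismatches(c_rows, d_rows):
    """Count, per output column, the rows where d differs from its source column in c.

    Assumes rows are paired by position (an assumption, not a Spark guarantee).
    """
    counts = {"ta": 0, "tb": 0, "tc": 0, "td": 0}
    for (t1, t2), (ta, tb, tc, td) in zip(c_rows, d_rows):
        # ta is a plain alias of t1; tb, tc, td are identity UDFs over t2, t1, t2.
        expected = {"ta": t1, "tb": t2, "tc": t1, "td": t2}
        actual = {"ta": ta, "tb": tb, "tc": tc, "td": td}
        for name in counts:
            if actual[name] != expected[name]:
                counts[name] += 1
    return counts


mismatches = count_mismatches(c_rows, d_rows)
print(mismatches)  # {'ta': 0, 'tb': 8, 'tc': 8, 'td': 0}
```

Under that pairing, 'tb' and 'tc' are wrong in all eight rows of this particular run, while 'ta' (no UDF) and 'td' happen to match their sources. Since the results vary between executions, 'td' may well differ in other runs.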

Thanks in advance

Attachments

testjson_jira9131.z01
22/Jul/15 08:29
5.00 MB
Luis Guerra
testjson_jira9131.z06
22/Jul/15 08:29
5.00 MB
Luis Guerra
testjson_jira9131.zip
22/Jul/15 08:29
2.24 MB
Luis Guerra
testjson_jira9131.z04
22/Jul/15 08:29
5.00 MB
Luis Guerra
testjson_jira9131.z05
22/Jul/15 08:29
5.00 MB
Luis Guerra
testjson_jira9131.z03
22/Jul/15 08:29
5.00 MB
Luis Guerra
testjson_jira9131.z02
22/Jul/15 08:29
5.00 MB
Luis Guerra

People

Assignee: Davies Liu

Reporter: Luis Guerra

Votes: 2

Watchers: 5

Dates

Created: 17/Jul/15 07:41

Updated: 18/Sep/15 02:52

Resolved: 05/Aug/15 05:51