[SPARK-19223] InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.2.0
Component/s: PySpark, SQL
Labels: None

Description

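For data sources that are not based on FileFormat, such as spark-xml (which is built on BaseRelation and uses HadoopRDD or NewHadoopRDD), InputFileBlockHolder doesn't work with Python UDFs: input_file_name() returns the file path when selected directly, but the value comes back empty once it is passed through a Python UDF.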

To reproduce, run the following code with bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1:

from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    # Identity function: returns the path it is given.
    return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText', filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
df.select('file').show()  # works
df.select(sameText(df['file'])).show()  # returns empty content

a.xml:

<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
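
To make the reproduction self-contained, the a.xml fixture can be generated from Python before running the snippet. This is only a convenience sketch: the helper names write_sample_xml and read_fields are hypothetical and not part of Spark or spark-xml, and it assumes the working directory is writable and that Spark resolves 'a.xml' against the local filesystem.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Contents of a.xml exactly as given in the report.
SAMPLE_XML = """<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
"""


def write_sample_xml(path="a.xml"):
    """Write the sample document to `path` and return its absolute path."""
    target = Path(path)
    target.write_text(SAMPLE_XML, encoding="utf-8")
    return str(target.resolve())


def read_fields(path):
    """Parse the file with the standard library to confirm it is well-formed.

    Returns a dict mapping each child tag of <root> to its text.
    """
    root = ET.parse(path).getroot()
    return {child.tag: child.text for child in root}
```

Calling write_sample_xml() from the directory where bin/pyspark is launched should create the file that load('a.xml', rowTag='root') reads, under the assumptions above.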

Issue Links

links to

[Github] Pull Request #16585 (viirya)

People

Assignee: L. C. Hsieh

Reporter: L. C. Hsieh

Votes: 0

Watchers: 2

Dates

Created: 14/Jan/17 06:32

Updated: 18/Jan/17 15:08

Resolved: 18/Jan/17 15:08