Spark / SPARK-19223

InputFileBlockHolder doesn't work with Python UDF for datasource other than FileFormat


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      For data sources other than FileFormat, such as spark-xml, which is based on BaseRelation and uses HadoopRDD or NewHadoopRDD, InputFileBlockHolder doesn't work with Python UDFs: input_file_name() returns an empty value once its result is passed through a Python UDF.
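
      The report does not say why the value is lost. One possible explanation, offered here purely as an assumption, is that the holder keeps the current file name in per-thread state and the Python UDF path reads it from a different thread than the one that set it. The Spark-free sketch below illustrates only that general effect; set_input_file, get_input_file and read_from_worker are hypothetical names, not Spark APIs.

```python
import threading

# Hypothetical stand-in for a holder that stores the current input file
# name per thread. This is NOT Spark's InputFileBlockHolder.
_holder = threading.local()


def set_input_file(path):
    """Record the file name for the calling thread only."""
    _holder.path = path


def get_input_file():
    """Return the calling thread's file name, or '' if none was set."""
    return getattr(_holder, "path", "")


def read_from_worker():
    """Read the holder from a separate thread and return what it saw."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_input_file()))
    worker.start()
    worker.join()
    return seen[0]


set_input_file("a.xml")
same_thread = get_input_file()     # read by the thread that set the value
other_thread = read_from_worker()  # read by a thread that never set it
```

      In this sketch the setting thread sees the file name while the worker thread sees an empty string, which resembles the empty column described in this issue. Whether Spark behaves this way for the same reason is not established by the report.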

      To reproduce, run the following code with bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1:

      from pyspark.sql.functions import udf, input_file_name
      from pyspark.sql.types import StringType
      from pyspark.sql import SparkSession

      def filename(path):
          # Identity function: returns the path unchanged.
          return path

      session = SparkSession.builder.appName('APP').getOrCreate()

      # Register the function for SQL use and wrap it as a DataFrame UDF.
      session.udf.register('sameText', filename)
      sameText = udf(filename, StringType())

      df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
      df.select('file').show()  # works
      df.select(sameText(df['file'])).show()  # returns empty content
      

      a.xml:

      <root>
        <x>TEXT</x>
        <y>TEXT2</y>
      </root>
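
      To make the reproduction easier to run, the sample input can be written to disk before starting bin/pyspark. The following is a minimal sketch using only the Python standard library; the helper name write_sample and the constant SAMPLE_XML are illustrative, and the file is assumed to live in the current working directory so that load('a.xml', ...) can find it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Contents of a.xml exactly as given in this report.
SAMPLE_XML = """<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
"""


def write_sample(path="a.xml"):
    """Write the sample document to `path` and return the path."""
    target = Path(path)
    target.write_text(SAMPLE_XML, encoding="utf-8")
    return target


sample_path = write_sample()

# Sanity check: parse the file back and collect its child elements.
root = ET.parse(sample_path).getroot()
fields = {child.tag: child.text for child in root}
```

      The parse step only confirms that the file is well-formed XML with the expected root element and children; it does not exercise Spark or spark-xml.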
      

          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: L. C. Hsieh (viirya)