  1. Spark
  2. SPARK-41741

[SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.3.3, 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

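      Hello,
       
      I found a problem, and I know of two ways to work around it.
       
      When Parquet filter pushdown is enabled and a query uses a LIKE predicate with a prefix pattern (like '***%'), the prefix is pushed down as a StringStartsWith filter. The matching string is not encoded as UTF-8, so if the system default encoding is not UTF-8, it may cause an error.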
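      To illustrate why the encoding matters, here is a minimal, self-contained sketch. It is not Spark's actual filter code: the object name `PrefixEncodingDemo` and the helper `startsWithBytes` are hypothetical, and ISO-8859-1 is only a stand-in for "some default encoding that is not UTF-8". It relies on the fact that Parquet stores string values as UTF-8 bytes, so a prefix encoded with another charset will not line up with the stored bytes.

```scala
import java.nio.charset.StandardCharsets

object PrefixEncodingDemo {
  // Hypothetical helper: byte-wise prefix check, similar in spirit to what a
  // pushed-down starts-with filter has to do against stored column bytes.
  def startsWithBytes(value: Array[Byte], prefix: Array[Byte]): Boolean =
    value.length >= prefix.length && value.take(prefix.length).sameElements(prefix)

  // "啦啦乐乐" written with Unicode escapes so the example does not depend on
  // the encoding of the source file itself.
  val prefix: String = "\u5566\u5566\u4e50\u4e50"

  // Parquet string columns hold UTF-8 bytes.
  val stored: Array[Byte] = (prefix + "abc").getBytes(StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Prefix encoded explicitly as UTF-8: matches the stored bytes.
    val utf8Prefix = prefix.getBytes(StandardCharsets.UTF_8)
    // Prefix encoded with a non-UTF-8 charset (stand-in for a non-UTF-8
    // default encoding): the bytes differ, so the comparison fails.
    val otherPrefix = prefix.getBytes(StandardCharsets.ISO_8859_1)

    println(startsWithBytes(stored, utf8Prefix))  // true
    println(startsWithBytes(stored, otherPrefix)) // false
  }
}
```

      Calling getBytes without a charset argument uses the JVM default, which is why the result can change from one machine to another.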
       
      As far as I know, there are two ways to work around this problem:
      1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
      2. spark.sql.parquet.filterPushdown.string.startsWith=false
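      As a rough sketch of how the two settings above could be applied, the snippet below sets them when the session is built. The object name and application name are hypothetical, and this assumes the options are supplied before any executors start; passing the same keys with --conf on spark-submit should be equivalent. Either setting alone is meant to be enough; both are shown only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object StartsWithPushdownWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("startswith-pushdown-workaround") // hypothetical name
      // Workaround 1: make UTF-8 the default encoding of the executor JVMs.
      .config("spark.executor.extraJavaOptions", "-Dfile.encoding=UTF-8")
      // Workaround 2: do not push starts-with (LIKE 'prefix%') filters down to Parquet.
      .config("spark.sql.parquet.filterPushdown.string.startsWith", "false")
      .getOrCreate()

    // ... run the affected query here ...

    spark.stop()
  }
}
```

      The second setting trades the pushdown optimization for correctness, so Spark evaluates the LIKE predicate itself after reading the rows.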
       

      To reproduce the problem, use the sample Parquet file from the attachments:

      spark.read.parquet("file:///home/kylin/hjldir/part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
      spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) 


       
       
       
      I think the correct code should be:

      private val strToBinary = Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) 

      Attachments

        1. part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
          0.8 kB
          Jiale He
        2. image-2022-12-28-18-00-00-861.png
          276 kB
          Jiale He
        3. image-2022-12-28-18-00-21-586.png
          399 kB
          Jiale He
        4. image-2023-01-09-11-10-31-262.png
          277 kB
          Jiale He
        5. image-2023-01-09-18-27-53-479.png
          408 kB
          Jiale He

          People

            Assignee: Yuming Wang (yumwang)
            Reporter: Jiale He (jlelehe)
            Votes: 0
            Watchers: 6
