[SPARK-41741] [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.3.3, 3.4.0
Component/s: SQL
Labels: None

Description

Hello ~

I found a problem with Parquet filter pushdown, and I know of two ways to work around it.

When a query uses a like 'xxx%' predicate, the filter is pushed down to Parquet as a StringStartsWith filter. If the system default encoding is not UTF-8, the string to match is not encoded as UTF-8, which may cause an error.

As far as I know, there are two ways to bypass this problem:
1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
2. spark.sql.parquet.filterPushdown.string.startsWith=false
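
Before applying either workaround, it can help to confirm which charset the JVM actually falls back to. The following is a minimal sketch, not part of the report: the object name `DefaultCharsetCheck` and its members are hypothetical, and it only inspects the JVM it runs in (for example a driver or executor started with the same options).

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical diagnostic helper, for illustration only.
object DefaultCharsetCheck {
  // The charset that String.getBytes() with no argument uses on this JVM.
  def defaultCharset: Charset = Charset.defaultCharset()

  // True when no-argument getBytes() already produces UTF-8 bytes.
  def defaultIsUtf8: Boolean = defaultCharset == StandardCharsets.UTF_8

  def main(args: Array[String]): Unit = {
    println(s"default charset: $defaultCharset")
    println(s"default is UTF-8: $defaultIsUtf8")
  }
}
```

If the check reports a charset other than UTF-8, the first workaround forces UTF-8 on the executors, while the second avoids the pushed-down prefix filter altogether.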

The following steps reproduce the problem.

The sample Parquet file is attached.

spark.read.parquet("file:///home/kylin/hjldir/part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false)

I think the correct code should be:

private val strToBinary = Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8))
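
To illustrate why the explicit charset matters, here is a minimal sketch using only the JDK, not Spark's actual filter code. The names `PrefixEncodingSketch`, `startsWithBytes`, and `encodePrefix` are hypothetical, and ISO-8859-1 is an assumed stand-in for whatever non-UTF-8 default a given system might have. Parquet stores string values as UTF-8, so a prefix encoded with another charset is not a byte prefix of the stored value.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical illustration; names are not taken from Spark.
object PrefixEncodingSketch {
  // True if `value` begins with exactly the bytes in `prefix`.
  def startsWithBytes(value: Array[Byte], prefix: Array[Byte]): Boolean =
    value.length >= prefix.length && value.take(prefix.length).sameElements(prefix)

  // Encode the LIKE prefix with an explicit charset instead of the JVM default.
  def encodePrefix(prefix: String, charset: Charset): Array[Byte] =
    prefix.getBytes(charset)

  // "\u5566\u5566\u4e50\u4e50" is the prefix from the report, written with
  // escapes so the sketch does not depend on the source file encoding.
  val prefix: String = "\u5566\u5566\u4e50\u4e50"

  // A stored value as Parquet would hold it: UTF-8 bytes.
  val stored: Array[Byte] = (prefix + "123").getBytes(StandardCharsets.UTF_8)

  val utf8Prefix: Array[Byte] = encodePrefix(prefix, StandardCharsets.UTF_8)

  // Assumed non-UTF-8 default; unmappable characters become '?' bytes.
  val latin1Prefix: Array[Byte] = encodePrefix(prefix, StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    println(startsWithBytes(stored, utf8Prefix))   // UTF-8 prefix matches
    println(startsWithBytes(stored, latin1Prefix)) // non-UTF-8 prefix does not
  }
}
```

Passing `StandardCharsets.UTF_8` explicitly makes the encoded prefix independent of the `file.encoding` setting, which is the point of the proposed change.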

Attachments

part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet, 28/Dec/22 09:59, 0.8 kB, Jiale He
image-2022-12-28-18-00-00-861.png, 28/Dec/22 10:00, 276 kB, Jiale He
image-2022-12-28-18-00-21-586.png, 28/Dec/22 10:00, 399 kB, Jiale He
image-2023-01-09-11-10-31-262.png, 09/Jan/23 03:10, 277 kB, Jiale He
image-2023-01-09-18-27-53-479.png, 09/Jan/23 10:28, 408 kB, Jiale He

Issue Links

links to

[Github] Pull Request #40090 (wangyum)

People

Assignee: Yuming Wang

Reporter: Jiale He

Votes: 0

Watchers: 6

Dates

Created: 28/Dec/22 09:59

Updated: 20/Feb/23 11:34

Resolved: 20/Feb/23 11:34