[SPARK-20144] spark.read.parquet no longer maintains ordering of the data - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.0.2
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read Parquet files in 2.0.2, the ordering of rows in the resulting DataFrame is not the same as the ordering of rows in the DataFrame that the Parquet files were written from.

This happens because FileSourceStrategy.scala combines the Parquet files into fewer partitions and also reorders them. This breaks our workflows, which assume that the data keeps its original ordering.
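
As a rough illustration of the behavior described above, the following Python sketch models a scan planner that sorts files by size (largest first) and then greedily packs them into partitions. It is a simplified model and not Spark's code: the function name `pack_files`, the parameters `max_split_bytes` and `open_cost_bytes`, and the file names and sizes are all assumptions made for the example.

```python
from typing import List, Tuple


def pack_files(
    files: List[Tuple[str, int]],
    max_split_bytes: int,
    open_cost_bytes: int,
) -> List[List[str]]:
    """Greedily pack (name, size) pairs into partitions, largest file first.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    # Sorting by size is the step that discards the original file order.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)

    partitions: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in ordered:
        # Close the current partition if this file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(name)
        # Each file also carries a fixed (assumed) cost for opening it.
        current_size += size + open_cost_bytes

    if current:
        partitions.append(current)
    return partitions


# Four files listed in the order they were written (sizes are made up).
example_files = [
    ("part-00000", 10),
    ("part-00001", 40),
    ("part-00002", 20),
    ("part-00003", 30),
]

example_partitions = pack_files(example_files, max_split_bytes=64, open_cost_bytes=4)
# Under this model the four files end up in three partitions, read in the order
# part-00001, part-00003, part-00002, part-00000 rather than the written order.
```

Under this model, the order in which rows come back depends on file sizes and packing limits, not on the order in which the files were written. A workflow that needs a stable order would probably have to store an explicit ordering column and sort on it after reading.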

Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so I am not sure whether this is also an issue in 2.1.

Issue Links

links to

GitHub Pull Request #22673 (darabos)

People

Assignee: Unassigned

Reporter: Li Jin

Votes: 0

Watchers: 12

Dates

Created: 29/Mar/17 14:20

Updated: 11/Apr/19 20:27

Resolved: 04/Apr/17 11:02