Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version: 2.0.0
Environment: Ubuntu Linux 14.04
Description
Spark 2.0.0 seems to have problems reading a Parquet dataset generated by Spark 1.6.2.
In [80]: spark.read.parquet('/path/to/data')
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;'
The dataset is ~150G and partitioned by the _locality_code column. None of the partitions are empty. I have narrowed the failure down to the first 32 partitions of the data:
In [82]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'
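For context, subdirs is a list of the per-partition directories under the dataset root; the code that builds it is not shown above, but assuming the data sits on a local filesystem it could look roughly like this (a sketch, not the original code):

import os

root = '/path/to/data'
# Collect the per-partition directories, e.g. /path/to/data/_locality_code=AQ
subdirs = sorted(os.path.join(root, d) for d in os.listdir(root)
                 if d.startswith('_locality_code='))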
Interestingly, it works fine if you remove any one of the partitions from the list:
In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
Another strange thing is that the schemas of the first 31 and the last 31 partitions of the subset are identical:
In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
Out[84]: True
This got me interested, so I tried the following:
In [87]: spark.read.parquet(*([subdirs[0]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be specified manually;'

In [88]: spark.read.parquet(*([subdirs[15]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be specified manually;'

In [89]: spark.read.parquet(*([subdirs[31]] * 32))
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be specified manually;'
If I read the first partition, write it back out with Spark 2.0 and then read it in the same manner, everything is fine:
In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
I originally posted this to the user mailing list, but given these latest findings it clearly looks like a bug.
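As the error message hints, a possible workaround (a hedged sketch, not verified against this dataset) is to skip schema inference altogether and pass an explicit schema, for example one borrowed from a single partition that still reads correctly:

# Workaround sketch (assumption, not from the original report): read one healthy
# partition to obtain its schema, then reuse that schema for the full dataset.
schema = spark.read.parquet(subdirs[0]).schema
df = spark.read.schema(schema).parquet('/path/to/data')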