[SPARK-6917] Broken data returned to PySpark dataframe if any large numbers used in Scala land - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.3.0
Fix Version/s: 1.4.0
Component/s: PySpark, SQL
Labels: None
Environment: Spark 1.3, Python 2.7.6, Scala 2.10

Description

When trying to access data stored in a Parquet file with an INT96 column (read: TimestampType() encoded for Impala), if the INT96 column is included in the fetched data, other, smaller numeric types come back broken.

In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))

In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))

Note the {u'__class__': u'scala.runtime.BoxedUnit'} values returned for the int_col and long_col columns in the second call above. This only happens if I select date_col, which is stored as INT96.
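To make the symptom easy to spot, here is a minimal sketch of a check for the placeholder values shown above. It assumes the row has already been converted to a plain dict of field name to value; the helper name corrupted_fields is hypothetical and not part of PySpark.

```python
# Class name that appears in the broken values in the output above.
BOXED_UNIT = "scala.runtime.BoxedUnit"


def corrupted_fields(row_dict):
    """Return the sorted names of fields whose value is the BoxedUnit placeholder dict."""
    return sorted(
        name
        for name, value in row_dict.items()
        if isinstance(value, dict) and value.get("__class__") == BOXED_UNIT
    )


# A dict shaped like the broken row from the second call (date_col omitted).
row = {
    "long_col": {"__class__": "scala.runtime.BoxedUnit"},
    "str_col": "Hello!",
    "int_col": {"__class__": "scala.runtime.BoxedUnit"},
}
print(corrupted_fields(row))  # ['int_col', 'long_col']
```

A check like this only flags the symptom; it says nothing about where in the Scala-to-Python serialization path the values get replaced.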

I don't know much about Scala boxing, but I assume that somehow by including numeric columns that are bigger than a machine word I trigger some different, slower execution path somewhere that boxes stuff and causes this problem.
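For context on why the INT96 column is wider than a machine word: an Impala-style INT96 timestamp is 12 bytes, conventionally laid out as 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. The sketch below decodes one such value in plain Python. The function name decode_int96_timestamp is hypothetical, time zones are ignored, and this is not the code path Spark itself uses.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def decode_int96_timestamp(raw):
    """Decode 12 bytes: little-endian int64 nanos-of-day, then int32 Julian day."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes, got %d" % len(raw))
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so sub-microsecond nanos are dropped.
    return datetime(1970, 1, 1) + timedelta(days=days, microseconds=nanos_of_day // 1000)


# One hour past midnight on the day after the epoch.
sample = struct.pack("<qi", 3600 * 10**9, JULIAN_DAY_OF_UNIX_EPOCH + 1)
print(decode_int96_timestamp(sample))  # 1970-01-02 01:00:00
```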

If anyone could give me any pointers on where to get started fixing this I'd be happy to dive in!

Attachments

part-r-00001.parquet (1 kB), attached 14/Apr/15 23:26 by Harry Brundage

Issue Links

is related to: SPARK-7314 Upgrade Pyrolite to 4.4 (Resolved)

links to: [Github] Pull Request #6558 (davies)

People

Assignee: Davies Liu

Reporter: Harry Brundage

Votes: 0

Watchers: 6

Dates

Created: 14/Apr/15 23:25

Updated: 02/Jun/15 06:12

Resolved: 02/Jun/15 06:12