[SPARK-10731] The head() implementation of dataframe is very slow - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.1, 1.5.0
Fix Version/s: 1.5.1, 1.6.0
Component/s: PySpark
Labels:
- pyspark

Description

df=sqlContext.read.parquet("someparquetfiles")
df.head()

The above lines take over 15 minutes. It seems the DataFrame requires 3 stages to return the first row: it reads all the data (about 1 billion rows) and runs Limit twice. The take(1) implementation in the RDD performs much better.
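For context on why take(1) on an RDD can be so much cheaper, the RDD API documents take() as scanning one partition first and reading more only if that did not yield enough rows. The following is a minimal pure-Python sketch of that idea, not Spark's actual code: `incremental_take`, the `scale_up` factor, and the 1000-partition data set are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_take(
    partitions: Sequence[Callable[[], List[int]]], n: int, scale_up: int = 4
) -> Tuple[List[int], int]:
    """Return up to n rows plus the number of partitions actually read."""
    rows: List[int] = []
    scanned = 0
    batch = 1  # first attempt reads a single partition
    while len(rows) < n and scanned < len(partitions):
        # Read only the next batch of partitions, not the whole data set.
        for load in partitions[scanned:scanned + batch]:
            rows.extend(load())
        scanned = min(scanned + batch, len(partitions))
        batch *= scale_up  # widen the next attempt if rows are still missing
    return rows[:n], scanned


# Hypothetical data set: 1000 partitions of 10 rows each, loaded lazily.
partitions = [lambda i=i: list(range(i * 10, i * 10 + 10)) for i in range(1000)]
first_row, partitions_read = incremental_take(partitions, 1)
print(first_row, partitions_read)
```

In this sketch a single partition is loaded when the first one already holds a row, whereas the behavior described in the report corresponds to loading every partition before applying the limit.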

Issue Links

links to

[Github] Pull Request #8852 (davies)

[Github] Pull Request #8876 (rxin)

People

Assignee: Reynold Xin

Reporter: Jerry Lam

Votes: 0

Watchers: 5

Dates

Created: 21/Sep/15 17:41

Updated: 23/Sep/15 23:43

Resolved: 23/Sep/15 23:43