Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.1
Fix Version/s: None
Environment: Ubuntu 12.04, RAM: 6 GB, Spark 1.6.1 Standalone
Description
Any operation on a DataFrame created with SparkR::createDataFrame is very slow.
I have a CSV file of size ~6 MB. Below is a sample of its content:
12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
I created an R data.frame with r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",") and then converted it into a Spark DataFrame with sp_df <- createDataFrame(sqlContext, r_df).
Now count(sp_df) takes more than 30 seconds.
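For completeness, here is a minimal end-to-end script reproducing the slow path. The sparkR.init/sparkRSQL.init setup and the local[*] master URL are assumptions made for this sketch (the actual run was against a standalone cluster); the file path matches the one above.

# Minimal reproduction of the slow path (Spark 1.6.1 SparkR API).
library(SparkR)
sc <- sparkR.init(master = "local[*]", appName = "createDataFrame-slow")  # assumed session setup
sqlContext <- sparkRSQL.init(sc)

# Read the ~6 MB CSV into a local R data.frame.
r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",")

# Convert to a Spark DataFrame and time a simple action.
sp_df <- createDataFrame(sqlContext, r_df)
system.time(count(sp_df))  # 30+ seconds on the setup described above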
When I load the same CSV directly using spark-csv, direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = "com.databricks.spark.csv", inferSchema = "false", header = "true"), count(direct_df) takes less than 1 second.
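Timed the same way, the spark-csv path looks like this (continuing the session from the sketch above; the --packages coordinate in the comment is an assumption about how the package was put on the classpath, not something stated in this report):

# Direct load through the spark-csv package; it must be on the classpath,
# e.g. a session started with: sparkR --packages com.databricks:spark-csv_2.10:1.4.0
# (that coordinate is an assumption for this sketch).
direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
                     source = "com.databricks.spark.csv",
                     inferSchema = "false", header = "true")
system.time(count(direct_df))  # well under 1 second on the same setup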
I know createDataFrame performance was improved in Spark 1.6, but subsequent operations such as count() are still very slow.
How can I get rid of this performance issue?
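One workaround suggested by the timings above (a sketch only, not a confirmed fix) is to round-trip the local data.frame through a temporary CSV so the data enters Spark via spark-csv instead of createDataFrame's R-to-JVM serialization. The tempfile-based path and a filesystem shared by driver and executors are assumptions here:

# Hypothetical workaround: write the local data.frame out and let
# spark-csv read it back, avoiding createDataFrame entirely.
# Assumes the driver and executors can see the same filesystem.
tmp_csv <- tempfile(fileext = ".csv")
write.csv(r_df, file = tmp_csv, row.names = FALSE)
sp_df2 <- read.df(sqlContext, tmp_csv,
                  source = "com.databricks.spark.csv",
                  inferSchema = "false", header = "true")
count(sp_df2)  # should behave like direct_df above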