Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.1
Fix Version/s: None
Environment: Ubuntu 12.04, RAM: 6 GB, Spark 1.6.1 Standalone
Description
Any operation on a DataFrame created with SparkR::createDataFrame is very slow.
I have a CSV file of size ~6 MB. Below is a sample of its content:
12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
I created an R data.frame with r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",") and then converted it into a Spark DataFrame with sp_df <- createDataFrame(sqlContext, r_df).
Now count(sp_df) takes more than 30 seconds.
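For completeness, here is a minimal end-to-end script reproducing the slow path. The sparkR.init/sparkRSQL.init setup and the local[*] master URL are assumptions made for this sketch (the actual run was against a standalone cluster); the file path matches the one above.

# Minimal reproduction of the slow path (Spark 1.6.1 SparkR API).
library(SparkR)
sc <- sparkR.init(master = "local[*]", appName = "createDataFrame-slow")  # assumed session setup
sqlContext <- sparkRSQL.init(sc)

# Read the ~6 MB CSV into a local R data.frame.
r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",")

# Convert to a Spark DataFrame and time a simple action.
sp_df <- createDataFrame(sqlContext, r_df)
system.time(count(sp_df))  # 30+ seconds on the setup described above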
When I load the same CSV directly using spark-csv, direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = "com.databricks.spark.csv", inferSchema = "false", header = "true"), count(direct_df) takes less than 1 second.
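Timed the same way, the spark-csv path looks like this (continuing the session from the sketch above; the --packages coordinate in the comment is an assumption about how the package was put on the classpath, not something stated in this report):

# Direct load through the spark-csv package; it must be on the classpath,
# e.g. a session started with: sparkR --packages com.databricks:spark-csv_2.10:1.4.0
# (that coordinate is an assumption for this sketch).
direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
                     source = "com.databricks.spark.csv",
                     inferSchema = "false", header = "true")
system.time(count(direct_df))  # well under 1 second on the same setup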
I know createDataFrame performance was improved in Spark 1.6, but subsequent operations such as count() are still very slow.
How can I get rid of this performance issue?
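One workaround suggested by the timings above (a sketch only, not a confirmed fix) is to round-trip the local data.frame through a temporary CSV so the data enters Spark via spark-csv instead of createDataFrame's R-to-JVM serialization. The tempfile-based path and a filesystem shared by driver and executors are assumptions here:

# Hypothetical workaround: write the local data.frame out and let
# spark-csv read it back, avoiding createDataFrame entirely.
# Assumes the driver and executors can see the same filesystem.
tmp_csv <- tempfile(fileext = ".csv")
write.csv(r_df, file = tmp_csv, row.names = FALSE)
sp_df2 <- read.df(sqlContext, tmp_csv,
                  source = "com.databricks.spark.csv",
                  inferSchema = "false", header = "true")
count(sp_df2)  # should behave like direct_df above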