Spark / SPARK-14037

count(df) is very slow for dataframe constructed using SparkR::createDataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SparkR
    • Environment: Ubuntu 12.04, RAM: 6 GB, Spark 1.6.1 Standalone

    Description

      Any operation on a DataFrame created using SparkR::createDataFrame is very slow.

      I have a CSV file of about 6 MB. Below is a sample of its content:

      12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
      12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
      12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
      12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
      12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
      12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
      12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
      12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
      12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
      12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter

      I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, sep=","), and then converted it into a Spark DataFrame using sp_df <- createDataFrame(sqlContext, r_df).

      count(sp_df) then took more than 30 seconds.

      When I load the same CSV directly using spark-csv, with direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = "com.databricks.spark.csv", inferSchema = "false", header="true"), count(direct_df) takes less than 1 second.
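
      The comparison above could be reproduced with a small script along the following lines. This is only a sketch against the SparkR 1.6 API: the helper names elapsed_seconds and run_comparison are hypothetical, and the local[*] master and the spark-csv package coordinates are assumptions rather than the reporter's actual setup (the report uses a standalone cluster).

```r
# Wall-clock seconds taken to evaluate an expression. R evaluates arguments
# lazily, so the expression is actually run inside system.time().
elapsed_seconds <- function(expr) {
  system.time(expr)[["elapsed"]]
}

# Hypothetical helper: time count() on both DataFrames built from one CSV.
# Assumes the SparkR 1.6 package is on the library path and SPARK_HOME is set.
run_comparison <- function(csv_path) {
  library(SparkR)

  # Assumed settings: local master, spark-csv for Scala 2.10.
  sc <- sparkR.init(master = "local[*]",
                    sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
  on.exit(sparkR.stop())
  sqlContext <- sparkRSQL.init(sc)

  # Path 1: local R data.frame converted with createDataFrame.
  r_df <- read.csv(file = csv_path, header = TRUE, sep = ",")
  sp_df <- createDataFrame(sqlContext, r_df)

  # Path 2: CSV loaded directly by Spark through spark-csv.
  direct_df <- read.df(sqlContext, csv_path,
                       source = "com.databricks.spark.csv",
                       inferSchema = "false", header = "true")

  # Named vector of elapsed seconds for each count().
  c(createDataFrame = elapsed_seconds(count(sp_df)),
    spark_csv = elapsed_seconds(count(direct_df)))
}
```

      Calling run_comparison("/home/sam/tmp/csv/orig_content.csv") would return the two timings side by side, which makes it easier to attach comparable numbers to the report.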

      I know the performance of createDataFrame itself was improved in Spark 1.6, but subsequent operations such as count() are still very slow.

      How can I get rid of this performance issue?

      Attachments

        1. console.log
          8 kB
          Samuel Alexander
        2. spark_ui_ray.png
          150 kB
          Sun Rui
        3. spark_ui.png
          192 kB
          Samuel Alexander

          People

            Assignee: Unassigned
            Reporter: Samuel Alexander (samalexg)
            Votes: 0
            Watchers: 5
