Uploaded image for project: 'SystemDS'
  1. SystemDS
  2. SYSTEMDS-918

Severe performance drop when using csv matrix -> spark df -> systemml binary block

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Resolved
    • None
    • SystemML 1.3
    • Runtime
    • None

    Description

      Imran is trying to read in a csv file which encodes a matrix.
      The dimensions of the matrix are 10000 x 784
      Here is what the metadata looks like:
      {
      "data_type": "matrix",
      "value_type": "double",
      "rows": 10000,
      "cols": 784,
      "format": "csv",
      "header": false,
      "sep": ",",
      "description":

      { "author": "SystemML" }

      }

      There are two ways I read this matrix into the tSNE algorithm as variable X.

      1. The csv file is read into a dataframe and then the MLContext api is used to convert it to the binary block format. Here is the code used to create the dataframe:

      val X0 = {sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "false") // Use first line of all files as header
          .option("inferSchema", "true") // Automatically infer data types
          .load("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
               }
      // code for tsne ..........
      val X = X0.drop(X0.col("C784")) // Drop last column
      val tsneScript = dml(tsne).in("X", X).out("Y", "C")
      println(Calendar.getInstance().get(Calendar.MINUTE))
      val res = ml.execute(tsneScript)
      

      2. A DML read statement reads the csv file directly.

      Option 2 happens almost instantaneously. Option 1 takes about 8 minutes on a beefy machine with 40g allocated to the driver and 60g allocated to the executors on a 7 node cluster.

      Looking at the spark web UI and following the code path, it leads us to this function:
      https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L874-L879

      This needs some investigation.
      iyounus, fschueler

      Attachments

        1. stages.jpg
          1.01 MB
          Nakul Jindal

        Activity

          People

            Unassigned Unassigned
            nakul02 Nakul Jindal
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: