  1. Spark
  2. SPARK-13299

DataFrame limit operation is not consistent

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.3.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
    • None
    • None

    Description

      I ran into a problem using the limit method of the DataFrame API.
      I am trying to get the first 999 records from an Avro source that contains about 3.5K records.

      DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
      
      df = df.limit(999);
      

      After the save operation, the rows in the output are not in the same order as in the input data set. Occasionally the order is preserved, but usually it is not.

      df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
      

      Here is the Spark plan, which may help identify the cause of the issue:

      == Parsed Logical Plan ==
      Limit 999
       Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)
      
      == Analyzed Logical Plan ==
      mobileNumber: bigint, tariff: string, debit: float
      Limit 999
       Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)
      
      == Optimized Logical Plan ==
      Limit 999
       Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)
      
      == Physical Plan ==
      Limit 999
       Scan AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]
      
      Code Generation: true
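
      The plan above contains a Limit with no Sort beneath it, and Spark does not guarantee row order for a limit without an explicit sort. The following is a minimal plain-Java sketch of that effect, not Spark code: the class name, the two "partitions", and their arrival order are all hypothetical and only illustrate why an unsorted limit can return different rows from run to run while a sort followed by a limit cannot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this does not use or reproduce Spark internals.
public class LimitOrderSketch {

    // Concatenate partitions in whatever order they arrive, then keep the first n rows.
    static List<Long> limitUnordered(List<List<Long>> partitionsInArrivalOrder, int n) {
        List<Long> out = new ArrayList<>();
        for (List<Long> partition : partitionsInArrivalOrder) {
            for (Long row : partition) {
                if (out.size() == n) {
                    return out;
                }
                out.add(row);
            }
        }
        return out;
    }

    // Impose an explicit ordering on all rows first, then keep the first n rows.
    static List<Long> limitSorted(List<List<Long>> partitionsInArrivalOrder, int n) {
        List<Long> all = new ArrayList<>();
        for (List<Long> partition : partitionsInArrivalOrder) {
            all.addAll(partition);
        }
        Collections.sort(all);
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<Long> p0 = Arrays.asList(1L, 2L, 3L);
        List<Long> p1 = Arrays.asList(4L, 5L, 6L);

        // Same data, two possible arrival orders of the partitions.
        List<List<Long>> run1 = Arrays.asList(p0, p1);
        List<List<Long>> run2 = Arrays.asList(p1, p0);

        System.out.println(limitUnordered(run1, 4)); // [1, 2, 3, 4]
        System.out.println(limitUnordered(run2, 4)); // [4, 5, 6, 1]
        System.out.println(limitSorted(run1, 4));    // [1, 2, 3, 4]
        System.out.println(limitSorted(run2, 4));    // [1, 2, 3, 4]
    }
}
```

      In DataFrame terms, the analogous change would be to sort on a column before limiting, for example df.orderBy("mobileNumber").limit(999). Note that this orders by the chosen column, not by position in the input file; preserving the original file order would require an explicit ordering column in the data.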
      

      Attachments

        1. SparkLimitIssue.png
          117 kB
          Nazarii Balkovskyi

          People

            Assignee: Unassigned
            Reporter: Nazarii Balkovskyi (nazarii.balkovskii)
            Votes: 0
            Watchers: 1
