SPARK-18150

Spark 2.* fails to create partitions for Avro files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: DStreams, SQL
    • Labels: None
    • Target Version/s:

      Description

      I am using Apache Spark 2.0.1 to process an Avro file on grid HDFS, but I don't see Spark distributing the job into different tasks. Instead it uses a single task, and all the operations (read, load, filter, show) run sequentially in that one task.

      This means I am not able to leverage distributed parallel processing.
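
      A minimal sketch of the kind of pipeline described above (the app name, path, column and value are hypothetical, not from the actual job):

          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder().appName("AvroPartitionTest").getOrCreate()

          // spark-avro 3.0.1 registers under the "com.databricks.spark.avro" format
          val df = spark.read
            .format("com.databricks.spark.avro")
            .load("hdfs:///data/events.avro")   // hypothetical path

          // Read, load, filter and show all end up in a single task on the Avro input
          df.filter(df("status") === "active").show()   // hypothetical column/value

          // Quick check of how many partitions the scan produced
          println(s"Input partitions: ${df.rdd.getNumPartitions}")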

      I tried the same operations on a JSON file on HDFS and they work fine: the job gets distributed into multiple tasks and partitions, and I see parallelism.

      I then tested the same thing on Spark 1.6, and there the partitioning happens. It looks like there is a bug in the Spark 2.* versions. If not, can someone help me understand how to achieve the same for an Avro file? Do I need to do something special for Avro files?
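
      For what it is worth, an explicit repartition (a workaround sketch, assuming the DataFrame from the example above; the partition count is arbitrary) does spread the downstream work across tasks, though the initial scan still runs as a single task:

          // Force a shuffle so that stages after the scan run in parallel
          val repartitioned = df.repartition(32)   // 32 is an illustrative value
          repartitioned.filter(repartitioned("status") === "active").show()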

      Note:
      I explored the Spark settings "spark.default.parallelism", "spark.sql.files.maxPartitionBytes", "--num-executors" and "spark.sql.shuffle.partitions". They were not of much help: "spark.default.parallelism" ensured there were multiple tasks, but a single task still ended up performing all the operations.
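
      For reference, the session-level settings above were set roughly as follows (the values shown are illustrative, not the exact ones used); "--num-executors" was passed on the spark-submit command line:

          // Sketch of the configuration tried, with illustrative values
          val spark = SparkSession.builder()
            .appName("AvroPartitionTest")
            .config("spark.default.parallelism", "100")
            .config("spark.sql.files.maxPartitionBytes", "134217728")   // 128 MB, the default
            .config("spark.sql.shuffle.partitions", "100")
            .getOrCreate()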

      I am using com.databricks.spark.avro (3.0.1) with Spark 2.0.1.

      Thanks,
      Sunil

            People

            • Assignee: Unassigned
            • Reporter: sunilsbjoshi (Sunil Kumar)
            • Votes: 0
            • Watchers: 1
