  1. Spark
  2. SPARK-24974

Spark puts all file paths into SharedInMemoryCache, even for unused partitions.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      SharedInMemoryCache holds the FileStatus of every file, whether or not you specify partition columns. This causes long load times for queries that use only a couple of partitions, because Spark loads the paths of the files in all partitions.

      I partitioned the files by report_date and type, and I have a directory structure like

      /custom_path/report_date=2018-07-24/type=A/file_1.parquet
      

       

      I am trying to execute 

      val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count
      

       

      In my query I need to load only the files of type A, which is just a couple of files. But Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 seconds, and only after that discards the unused partitions.
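
      A possible workaround, sketched below, is to point the reader at the leaf partition directory and set the basePath option so that type is still inferred as a partition column. This assumes the file listing is rooted at the paths passed to parquet(), so only the type=A directory would be listed; it has not been verified against 2.2.1. The partitionPath helper and the object name are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PartitionPathSketch {
  // Hypothetical helper (not part of Spark): appends key=value segments
  // to a base directory to form a partition directory path.
  def partitionPath(base: String, partitions: Seq[(String, String)]): String =
    partitions.foldLeft(base.stripSuffix("/")) {
      case (path, (key, value)) => s"$path/$key=$value"
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-path-sketch")
      .getOrCreate()

    // Same root as in the original query.
    val base = "/custom_path/report_date=2018-07-24"
    // Resolves to /custom_path/report_date=2018-07-24/type=A
    val leaf = partitionPath(base, Seq("type" -> "A"))

    // basePath tells partition discovery where the table root is,
    // so `type` is still exposed as a column even though we read a
    // single leaf directory. Assumption: only `leaf` gets listed.
    val count = spark.read
      .option("basePath", base)
      .parquet(leaf)
      .count()

    println(s"count = $count")
    spark.stop()
  }
}
```

      This trades the filter for an explicit path, so it only helps when the needed partition values are known before the read.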

       

      This could be related to https://jira.apache.org/jira/browse/SPARK-17994 

            People

            • Assignee:
              Unassigned
              Reporter:
              stanand99 andrzej.stankevich@gmail.com
            • Votes:
              1
              Watchers:
              4
