SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When reading a single field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. This may degrade performance significantly when a nested column has many fields.
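
      To make the access pattern concrete, the scenario can be reproduced roughly as follows. This is a minimal sketch: the input paths are hypothetical and the SparkSession-based API is assumed; the table name Tweets and the nested User struct follow the description above.

      // Minimal reproduction sketch (hypothetical paths).
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("SPARK-4502 repro").getOrCreate()

      // Load the JSON tweets and persist them as Parquet so the nested `User`
      // struct (38 primitive fields) is stored as a group of Parquet columns.
      spark.read.json("/data/tweets.json").write.parquet("/data/tweets.parquet")

      // Register the Parquet data as a table so the SQL queries below can run against it.
      spark.read.parquet("/data/tweets.parquet").createOrReplaceTempView("Tweets")

      // Only one leaf of the struct is requested, yet all 38 fields of `User`
      // are assembled by the Parquet record reader (the behaviour reported here).
      spark.sql("SELECT User.contributors_enabled FROM Tweets").collect()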

      For example, I loaded JSON tweets data into Spark SQL and ran the following query:

      SELECT User.contributors_enabled FROM Tweets;

      User is a nested structure with 38 primitive fields (for the Tweets schema, see https://dev.twitter.com/overview/api/tweets). Here is the log message:

      14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms

      For comparison, I also ran:
      SELECT User FROM Tweets;

      And here is the log message:
      14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms

      So both queries load 38 columns from Parquet, while the first query only needs 1 column. I also measured the bytes read within Parquet: in both cases the same number of bytes (99365194) was read.
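
      As a follow-up to the fix that shipped in 2.4.0, a hedged way to verify the pruning is to enable nested schema pruning and inspect the read schema reported in the physical plan. This is a sketch under the assumption that the configuration key below is the flag introduced with the fix; its default value depends on the Spark version.

      // Verification sketch, assuming Spark 2.4+ and the nested-schema-pruning flag.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("Check nested pruning").getOrCreate()

      // Enable nested column pruning for Parquet reads (assumed off by default in 2.4).
      spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

      spark.read.parquet("/data/tweets.parquet").createOrReplaceTempView("Tweets")

      // With pruning in effect, the ReadSchema of the Parquet scan in the physical
      // plan should contain only User.contributors_enabled rather than all 38
      // fields of the User struct.
      spark.sql("SELECT User.contributors_enabled FROM Tweets").explain(true)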

            People

              Assignee: Michael MacFadden
              Reporter: Liwen Sun
              Votes: 47
              Watchers: 56
