SPARK-4413

Parquet support through datasource API

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      Right now there are several issues with our Parquet support. Specifically, the only way to access Parquet files through pure SQL is by including Hive, which has the following issues:

      • fairly verbose syntax
      • requires you to explicitly add partitions
      • does not support decimal types.
      • querying tables with many partitions results in metadata operations dominating the query time (even worse when reading from S3).

      It would be great to have better native support here through the new datasources API. Ideally, once that is in place, we can deprecate the existing ParquetRelation.

            People

            • Assignee:
              marmbrus Michael Armbrust
              Reporter:
              marmbrus Michael Armbrust
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved: