  1. Spark
  2. SPARK-10395

Simplify CatalystReadSupport

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels: None

    Description

      The Parquet ReadSupport API is more complicated than it needs to be, for historical reasons. In older versions of parquet-mr (1.6.0rc3 and earlier), ReadSupport had to be instantiated and initialized twice, once on the driver side and once on the executor side: init() handled driver-side initialization, while prepareForRead() handled the executor side. Starting with parquet-mr 1.6.0 this is no longer the case, and ReadSupport is instantiated and initialized only on the executor side. In principle, the two methods could now be combined into a single initialization method. The only reason (that I can think of) to keep both is backwards compatibility of the parquet-mr API.

      Because of this, we no longer need to rely on ReadContext to pass the requested schema from init() to prepareForRead(); a private `var` holding the requested schema in CatalystReadSupport is enough.
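
      A minimal sketch of this idea follows. It is not the actual CatalystReadSupport code: `Schema`, `ReadContext`, `Materializer`, `ReadSupportLike`, `SimplifiedReadSupport`, and the `example.requested_schema` key are simplified, hypothetical stand-ins for the parquet-mr and Spark types, used only so the example runs without dependencies. It assumes both methods are called on the same executor-side instance, with init() first.

```scala
// Hypothetical stand-ins for the parquet-mr types; NOT the real API.
final case class Schema(fields: List[String])
final case class ReadContext(requestedSchema: Schema)
final case class Materializer(schema: Schema)

// Mirrors the two-step shape of ReadSupport: init(), then prepareForRead().
trait ReadSupportLike {
  def init(fileSchema: Schema, conf: Map[String, String]): ReadContext
  def prepareForRead(fileSchema: Schema, context: ReadContext): Materializer
}

class SimplifiedReadSupport extends ReadSupportLike {
  // Set in init() and read in prepareForRead(). Because both calls happen on
  // the same instance, the schema does not need to travel via ReadContext.
  private var requestedSchema: Option[Schema] = None

  override def init(fileSchema: Schema, conf: Map[String, String]): ReadContext = {
    // Hypothetical configuration key holding a comma-separated column list.
    val columns = conf("example.requested_schema").split(",").toList
    val schema = Schema(columns)
    requestedSchema = Some(schema)
    ReadContext(schema)
  }

  override def prepareForRead(fileSchema: Schema, context: ReadContext): Materializer = {
    // The ReadContext argument is deliberately ignored here.
    val schema = requestedSchema.getOrElse(
      throw new IllegalStateException("init() must be called before prepareForRead()"))
    Materializer(schema)
  }
}
```

      In this sketch, prepareForRead() returns the schema recorded by init() even when the caller hands it an unrelated ReadContext, which is the property the private `var` is meant to provide.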

      In addition, now that the old Parquet support code has been removed, we always set the Catalyst requested schema properly when reading Parquet files, so all of the "fallback" logic in CatalystReadSupport is redundant.
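
      The difference can be illustrated with a small sketch. It assumes, purely for illustration, that the fallback meant "use the file's own schema when no requested schema is set"; the issue does not spell out the exact fallback behavior. The `RequestedSchema` object and the `example.requested_schema` key are hypothetical names, not Spark APIs.

```scala
object RequestedSchema {
  // Hypothetical configuration key; not the key Spark actually uses.
  val Key = "example.requested_schema"

  // Fallback style: if the key is missing, silently read every column
  // of the file instead.
  def withFallback(conf: Map[String, String], fileSchema: List[String]): List[String] =
    conf.get(Key).map(_.split(",").toList).getOrElse(fileSchema)

  // Simplified style: the key is always expected to be set, so a missing
  // key is treated as a bug and reported immediately.
  def required(conf: Map[String, String]): List[String] = {
    val raw = conf.getOrElse(
      Key,
      throw new IllegalArgumentException(s"$Key must be set before reading"))
    raw.split(",").toList
  }
}
```

      When the key is guaranteed to be present, the two functions return the same result, so the fallback branch is dead code and the stricter version is easier to reason about.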

          People

            Assignee: Cheng Lian
            Reporter: Cheng Lian
            Votes: 0
            Watchers: 3
