Spark / SPARK-13664

Simplify and Speedup HadoopFSRelation


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      A majority of Spark SQL queries likely run through HadoopFSRelation; however, this code path currently has several complexity and performance problems:

      • The class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data.
      • For very large tables, we are broadcasting the entire list of files to every executor (see SPARK-11441).
      • For partitioned tables, we always do an extra projection. This not only results in a copy, but also undoes much of the performance gain we expect from vectorized reads.

      This is an umbrella ticket to track a set of improvements to this code path; a minimal example of a query that exercises it is sketched below.
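
      As a point of reference, the sketch below (the path, column names, and object name are hypothetical) shows the kind of query that runs through HadoopFSRelation: a read of a partitioned Parquet table, which during planning must list the table's files and prune partitions.

      {code:scala}
      // Minimal sketch (hypothetical path and column names) of a query that
      // exercises the HadoopFSRelation scan path: a partitioned Parquet table
      // read with a filter on the partition column.
      import org.apache.spark.sql.SparkSession

      object HadoopFsRelationScanExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("HadoopFSRelation scan example")
            .getOrCreate()

          // Assume a layout like /data/events/date=2016-03-01/part-*.parquet;
          // `date` is a partition column discovered from the directory structure.
          val events = spark.read.parquet("/data/events")

          // Planning this query lists the table's files and prunes partitions
          // by `date`; both steps run through the code path this ticket targets.
          events.filter("date = '2016-03-01'")
            .groupBy("userId")
            .count()
            .show()

          spark.stop()
        }
      }
      {code}

      The improvements tracked here target the file listing, partition handling, and scan building behind such queries, not the user-facing API.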


          People

            Assignee:
            marmbrus Michael Armbrust
            Reporter:
            marmbrus Michael Armbrust
            Votes:
            1
            Watchers:
            9

            Dates

              Created:
              Updated:
              Resolved:
