Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17529

On highly skewed data, outer join merges are slow

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.6.2, 2.0.0
    • 2.1.0
    • Spark Core
    • None

    Description

      All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the same performance problem.

      My co-worker Yewei Zhang was investigating a performance problem with a highly skewed dataset.
      "Part of this query performs a full outer join over [an ID] on highly skewed data. On the left view, there is one record for id = 0 out of 2,272,486 records; On the right view there are 8,353,097 records for id = 0 out of
      12,685,073 records"

      The sub-query was taking 5.2 minutes. We discovered that snappy was responsible for some measure of this problem and incorporated the new snappy release. This brought the sub-query down to 2.4 minutes. A large percentage of the remaining time was spent in the merge code which I tracked down to a BitSet clearing issue. We have noted that you already have the snappy fix, this issue describes the problem with the BitSet.

      The BitSet grows to handle the largest matching set of keys and is used to track joins. The BitSet is re-used in all subsequent joins (unless it is too small)

      The skewing of our data caused a very large BitSet to be allocated on the very first row joined. Unfortunately, the entire BitSet is cleared on each re-use. For each of the remaining rows which likely match only a few rows on the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, this would happen roughly 6 million times. The fix I developed for this is to clear only the portion of the BitSet that is needed. After applying it, the sub-query dropped from 2.4 minutes to 29 seconds.

      Small (0 or negative) IDs are often used as place-holders for null, empty, or unknown data, so I expect this fix to be generally useful, if rarely encountered to this particular degree.

      Attachments

        Activity

          People

            dnavas David C. Navas
            davidnavas David C Navas
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: