  1. Spark
  2. SPARK-35743 Improve Parquet vectorized reader
  3. SPARK-35640

Refactor Parquet vectorized reader to remove duplicated code paths

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      The Parquet vectorized read path currently contains a great deal of duplicated code, such as the following:

        public void readIntegers(
            int total,
            WritableColumnVector c,
            int rowId,
            int level,
            VectorizedValuesReader data) throws IOException {
          int left = total;
          while (left > 0) {
            if (this.currentCount == 0) this.readNextGroup();
            int n = Math.min(left, this.currentCount);
            switch (mode) {
              case RLE:
                if (currentValue == level) {
                  data.readIntegers(n, c, rowId);
                } else {
                  c.putNulls(rowId, n);
                }
                break;
              case PACKED:
                for (int i = 0; i < n; ++i) {
                  if (currentBuffer[currentBufferIdx++] == level) {
                    c.putInt(rowId + i, data.readInteger());
                  } else {
                    c.putNull(rowId + i);
                  }
                }
                break;
            }
            rowId += n;
            left -= n;
            currentCount -= n;
          }
        }
      

      This makes the code hard to maintain, since any change to this logic has to be replicated in 20+ places. The problem will become more serious once we implement column index and complex type support for the vectorized path.

      The duplication was originally introduced for performance. However, JIT compilers nowadays tend to handle this well and will inline virtual calls as much as possible.
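One possible shape for the refactoring, as a minimal sketch only: keep a single copy of the RLE/PACKED loop and move the type-specific work behind a small interface. Every name here (`UpdaterSketch`, `Updater`, `Run`, `readBatch`) is hypothetical and chosen for illustration; a plain `Object[]` stands in for `WritableColumnVector`, an int iterator stands in for `VectorizedValuesReader`, and the runs are assumed to be decoded up front rather than fetched through `readNextGroup()`.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public class UpdaterSketch {

  /** Type-specific operations (hypothetical); the shared loop below never changes per type. */
  interface Updater {
    void readValue(int rowId, Object[] column, PrimitiveIterator.OfInt data);

    /** Batch variant; a real implementation could override this with a bulk copy. */
    default void readValues(int n, int rowId, Object[] column, PrimitiveIterator.OfInt data) {
      for (int i = 0; i < n; i++) {
        readValue(rowId + i, column, data);
      }
    }
  }

  // Two "types" that previously would each have needed their own copy of the loop.
  static final Updater INT_UPDATER = (rowId, column, data) -> column[rowId] = data.nextInt();
  static final Updater LONG_UPDATER = (rowId, column, data) -> column[rowId] = (long) data.nextInt();

  /** A run of definition levels: RLE repeats levels[0] `count` times, PACKED has one level per row. */
  static final class Run {
    final boolean rle;
    final int[] levels;
    final int count;

    private Run(boolean rle, int[] levels, int count) {
      this.rle = rle;
      this.levels = levels;
      this.count = count;
    }

    static Run rle(int level, int count) {
      return new Run(true, new int[] {level}, count);
    }

    static Run packed(int... levels) {
      return new Run(false, levels, levels.length);
    }
  }

  /** The single shared loop: a row is non-null only when its definition level equals `level`. */
  static void readBatch(Run[] runs, int level, Object[] column,
                        PrimitiveIterator.OfInt data, Updater updater) {
    int rowId = 0;
    for (Run run : runs) {
      if (run.rle) {
        if (run.levels[0] == level) {
          updater.readValues(run.count, rowId, column, data);
        } else {
          Arrays.fill(column, rowId, rowId + run.count, null);
        }
      } else {
        for (int i = 0; i < run.count; i++) {
          if (run.levels[i] == level) {
            updater.readValue(rowId + i, column, data);
          } else {
            column[rowId + i] = null;
          }
        }
      }
      rowId += run.count;
    }
  }

  public static void main(String[] args) {
    Run[] runs = {Run.rle(1, 2), Run.packed(1, 0, 1), Run.rle(0, 1)};
    Object[] column = new Object[6];
    readBatch(runs, 1, column, IntStream.of(10, 20, 30, 40).iterator(), INT_UPDATER);
    System.out.println(Arrays.toString(column)); // [10, 20, 30, null, 40, null]
  }
}
```

With this layout, supporting another value type means adding one `Updater` rather than another copy of the loop. Whether the JIT actually inlines the `updater` calls in a given workload is something to confirm with benchmarks rather than assume.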

          People

            Assignee: Chao Sun (csun)
            Reporter: Chao Sun (csun)