SPARK-4365: Remove unnecessary filter call on records returned from parquet library


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Since the parquet library has been updated, we no longer need to filter the records it returns for nulls; the library now skips those itself.

      From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

      public boolean nextKeyValue() throws IOException, InterruptedException {
        boolean recordFound = false;

        while (!recordFound) {
          // no more records left
          if (current >= total) {
            return false;
          }

          try {
            checkRead();
            currentValue = recordReader.read();
            current++;

            if (recordReader.shouldSkipCurrentRecord()) {
              // this record is being filtered via the filter2 package
              if (DEBUG) LOG.debug("skipping record");
              continue;
            }

            if (currentValue == null) {
              // only happens with FilteredRecordReader at end of block
              current = totalCountLoadedSoFar;
              if (DEBUG) LOG.debug("filtered record reader reached end of block");
              continue;
            }

            recordFound = true;

            if (DEBUG) LOG.debug("read value: " + currentValue);
          } catch (RuntimeException e) {
            throw new ParquetDecodingException(
                format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
          }
        }
        return true;
      }
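
      Given the nextKeyValue() behavior above, a per-record null check on the Spark SQL side can never match. A minimal Scala sketch of the change (hypothetical names, not the exact Spark source; the real call sits in Spark SQL's Parquet read path):

      object NullFilterExample {
        def main(args: Array[String]): Unit = {
          // Stand-in for the records handed back by the Parquet record reader;
          // with the updated library, nextKeyValue() never surfaces a null record.
          val records: Iterator[String] = Iterator("a", "b", "c")

          // Before: records were defensively filtered, roughly
          //   records.filter(_ != null)
          // After: the null check is redundant, so the iterator is consumed directly.
          records.foreach(println)
        }
      }

      Dropping the filter removes a per-record predicate invocation on every row scanned from Parquet.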


    People

    • Assignee: Unassigned
    • Reporter: Yash Datta (saucam)
    • Votes: 0
    • Watchers: 4
