Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.0, 0.11.2
    • Component/s: Storage
    • Labels: None

      Description

      parquet-mr has changed substantially since Parquet's graduation from the Apache Incubator, so Tajo's parquet-mr dependency should be upgraded to 1.8.1.

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-master-build #1082 (See https://builds.apache.org/job/Tajo-master-build/1082/)
          TAJO-2073: Upgrade parquet-mr to 1.8.1. (jhkim: rev ef43dfaa5a69de8d1f3df3c23598d7b92758bacd)

          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetFileWriter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java
          • tajo-storage/tajo-storage-hdfs/pom.xml
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetWriter.java
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestStorages.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordWriter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetReader.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ColumnChunkPageWriteStore.java
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/CodecFactory.java
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java
          • CHANGES
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java
          jhkim Jinho Kim added a comment -

          Committed. Thanks!

          hudson Hudson added a comment -

          FAILURE: Integrated in Tajo-master-CODEGEN-build #677 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/677/)
          TAJO-2073: Upgrade parquet-mr to 1.8.1. (jhkim: rev ef43dfaa5a69de8d1f3df3c23598d7b92758bacd)

          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetFileWriter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetWriter.java
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestStorages.java
          • tajo-storage/tajo-storage-hdfs/pom.xml
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ColumnChunkPageWriteStore.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/ParquetReader.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/CodecFactory.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java
          • CHANGES
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordWriter.java
          • tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java
          • tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tajo/pull/958

          githubbot ASF GitHub Bot added a comment -

          Github user jinossy commented on the pull request:

          https://github.com/apache/tajo/pull/958#issuecomment-184022027

          Thanks for your review!
          I'll commit it after reflecting your comments.

          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on the pull request:

          https://github.com/apache/tajo/pull/958#issuecomment-183168945

          +1. I left trivial comments. Please consider them before you commit.

          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/958#discussion_r52702301

          --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java ---
          @@ -138,24 +162,29 @@ public float getProgress() {
               return (float) current / total;
             }

          -  public void initialize(MessageType requestedSchema, MessageType fileSchema,
          -      Map<String, String> extraMetadata, Map<String, String> readSupportMetadata,
          +  public void initialize(MessageType fileSchema,
          --- End diff ---

          FileMetaData also includes the file schema.
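
          The reviewer's point can be illustrated with a minimal sketch: when a file-level metadata object already carries the schema, initialize() does not need a separate schema argument. SketchSchema, SketchFileMetaData, and SketchReader below are hypothetical stand-ins invented for this illustration. They are not the parquet-mr or Tajo classes, and the real initialize() signature is truncated in the diff, so its remaining parameters are not reproduced here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class InitializeSketch {

  // Hypothetical stand-in for a schema type; not parquet-mr's MessageType.
  static final class SketchSchema {
    final String name;

    SketchSchema(String name) {
      this.name = name;
    }
  }

  // Hypothetical stand-in for file-level metadata that already holds the schema.
  static final class SketchFileMetaData {
    private final SketchSchema schema;
    private final Map<String, String> keyValueMetaData;

    SketchFileMetaData(SketchSchema schema, Map<String, String> keyValueMetaData) {
      this.schema = schema;
      // Defensive copy so later changes to the caller's map are not visible here.
      this.keyValueMetaData = Collections.unmodifiableMap(new HashMap<>(keyValueMetaData));
    }

    SketchSchema getSchema() {
      return schema;
    }

    Map<String, String> getKeyValueMetaData() {
      return keyValueMetaData;
    }
  }

  // Hypothetical reader: the schema is taken from the metadata, not passed separately.
  static final class SketchReader {
    private SketchSchema fileSchema;
    private Map<String, String> extraMetadata;

    void initialize(SketchFileMetaData metaData) {
      this.fileSchema = metaData.getSchema();
      this.extraMetadata = metaData.getKeyValueMetaData();
    }

    SketchSchema getFileSchema() {
      return fileSchema;
    }

    Map<String, String> getExtraMetadata() {
      return extraMetadata;
    }
  }

  public static void main(String[] args) {
    Map<String, String> keyValues = new HashMap<>();
    keyValues.put("writer", "sketch");
    SketchReader reader = new SketchReader();
    reader.initialize(new SketchFileMetaData(new SketchSchema("table1"), keyValues));
    System.out.println(reader.getFileSchema().name);
  }
}
```

          Passing one metadata object keeps the schema and the key/value metadata from drifting apart, at the cost of coupling the reader to the metadata type.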

          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/958#discussion_r52702202

          --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java ---
          @@ -70,37 +81,50 @@
             private long totalCountLoadedSoFar = 0;

             private Path file;
          +  private UnmaterializableRecordCounter unmaterializableRecordCounter;
          +
          +  /**
          +   * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
          +   * @param filter for filtering individual records
          +   */
          +  public InternalParquetRecordReader(ReadSupport<T> readSupport, Filter filter) {
          +    this.readSupport = readSupport;
          +    this.filter = checkNotNull(filter, "filter");
          +  }

             /**
              * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
              */
             public InternalParquetRecordReader(ReadSupport<T> readSupport) {
          -    this(readSupport, null);
          +    this(readSupport, FilterCompat.NOOP);
             }

             /**
              * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
              * @param filter Optional filter for only returning matching records.
          +   * @deprecated use {@link #InternalParquetRecordReader(ReadSupport, Filter)}
              */
          -  public InternalParquetRecordReader(ReadSupport<T> readSupport, UnboundRecordFilter
          -      filter) {
          -    this.readSupport = readSupport;
          -    this.recordFilter = filter;
          +  @Deprecated
          +  public InternalParquetRecordReader(ReadSupport<T> readSupport, UnboundRecordFilter filter) {
          +    this(readSupport, FilterCompat.get(filter));
             }

             private void checkRead() throws IOException {
               if (current == totalCountLoadedSoFar) {
                 if (current != 0) {
          -        long timeAssembling = System.currentTimeMillis() - startedAssemblingCurrentBlockAt;
          -        totalTimeSpentProcessingRecords += timeAssembling;
          -        if (DEBUG) LOG.debug("Assembled and processed " + totalCountLoadedSoFar + " records from " + columnCount + " columns in " + totalTimeSpentProcessingRecords + " ms: " + ((float) totalCountLoadedSoFar / totalTimeSpentProcessingRecords) + " rec/ms, " + ((float) totalCountLoadedSoFar * columnCount / totalTimeSpentProcessingRecords) + " cell/ms");
          -        long totalTime = totalTimeSpentProcessingRecords + totalTimeSpentReadingBytes;
          -        long percentReading = 100 * totalTimeSpentReadingBytes / totalTime;
          -        long percentProcessing = 100 * totalTimeSpentProcessingRecords / totalTime;
          -        if (DEBUG) LOG.debug("time spent so far " + percentReading + "% reading (" + totalTimeSpentReadingBytes + " ms) and " + percentProcessing + "% processing (" + totalTimeSpentProcessingRecords + " ms)");
          +        totalTimeSpentProcessingRecords += (System.currentTimeMillis() - startedAssemblingCurrentBlockAt);
          +        if (Log.INFO) {
          --- End diff ---

          Although these logs appear to be printed only once per fully read row group, I'm concerned that there will be too many of them.
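
          One common way to address this kind of concern is to keep accumulating the timing counters unconditionally but build and emit the summary only when a finer log level is enabled. The sketch below is an illustration under assumptions, not the change that was committed: it uses java.util.logging rather than the Log class from the diff, and the class RowGroupLogSketch and its methods are hypothetical. It also guards against a zero total time, which the percentage arithmetic would otherwise divide by.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class RowGroupLogSketch {
  private static final Logger LOG = Logger.getLogger(RowGroupLogSketch.class.getName());

  private long totalTimeSpentProcessingRecords = 0;
  private long totalTimeSpentReadingBytes = 0;

  /**
   * Updates the counters for one row group and returns the summary text,
   * or null when detail logging is disabled or no time has been recorded yet.
   */
  String summarize(long processingMs, long readingMs, boolean detailEnabled) {
    totalTimeSpentProcessingRecords += processingMs;
    totalTimeSpentReadingBytes += readingMs;
    if (!detailEnabled) {
      return null; // skip string building entirely when nobody will read it
    }
    long totalTime = totalTimeSpentProcessingRecords + totalTimeSpentReadingBytes;
    if (totalTime == 0) {
      return null; // avoid dividing by zero
    }
    long percentReading = 100 * totalTimeSpentReadingBytes / totalTime;
    long percentProcessing = 100 * totalTimeSpentProcessingRecords / totalTime;
    return "time spent so far " + percentReading + "% reading ("
        + totalTimeSpentReadingBytes + " ms) and " + percentProcessing
        + "% processing (" + totalTimeSpentProcessingRecords + " ms)";
  }

  /** Called once per fully read row group; logs only at FINE or finer. */
  void onRowGroupRead(long processingMs, long readingMs) {
    String message = summarize(processingMs, readingMs, LOG.isLoggable(Level.FINE));
    if (message != null) {
      LOG.fine(message);
    }
  }

  public static void main(String[] args) {
    RowGroupLogSketch sketch = new RowGroupLogSketch();
    // With the default java.util.logging configuration (INFO), this logs nothing.
    sketch.onRowGroupRead(30, 10);
    // prints "time spent so far 25% reading (20 ms) and 75% processing (60 ms)"
    System.out.println(sketch.summarize(30, 10, true));
  }
}
```

          Whether per-row-group output belongs at an informational or a debug level depends on how many row groups a typical scan touches; the guard keeps the cost near zero when the level is off.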

          githubbot ASF GitHub Bot added a comment -

          Github user jihoonson commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/958#discussion_r52701715

          --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java ---
          @@ -78,6 +96,7 @@ public Tuple next() throws IOException {
              */
             @Override
             public void reset() throws IOException {
          +    throw new TajoRuntimeException(new UnsupportedException());
          --- End diff ---

          UnimplementedException looks more appropriate here.
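
          The distinction behind this suggestion can be sketched as follows: an "unsupported" error says the operation will never apply, while an "unimplemented" error says it is valid but not written yet. The classes below are hypothetical stand-ins with assumed constructors, named with a Sketch prefix to make clear they are not the Tajo exception classes mentioned in the diff, and reset() is simplified to declare no checked exceptions.

```java
public class ResetSketch {

  // Hypothetical stand-in: the feature is valid but has not been written yet.
  static class SketchUnimplementedException extends Exception {
    SketchUnimplementedException(String feature) {
      super("unimplemented feature: " + feature);
    }
  }

  // Hypothetical stand-in: wraps a checked exception so callers need not declare it.
  static class SketchRuntimeException extends RuntimeException {
    SketchRuntimeException(Exception cause) {
      super(cause.getMessage(), cause);
    }
  }

  // Hypothetical scanner whose reset() is not available yet.
  static class SketchScanner {
    void reset() {
      throw new SketchRuntimeException(new SketchUnimplementedException("reset"));
    }
  }

  public static void main(String[] args) {
    try {
      new SketchScanner().reset();
    } catch (SketchRuntimeException e) {
      // The cause tells the caller why the call failed.
      System.out.println(e.getCause().getClass().getSimpleName() + ": " + e.getMessage());
    }
  }
}
```

          Callers that inspect the cause can then tell "not yet available" apart from "never available" without parsing the message text.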

          githubbot ASF GitHub Bot added a comment -

          GitHub user jinossy opened a pull request:

          https://github.com/apache/tajo/pull/958

          TAJO-2073: Upgrade parquet-mr to 1.8.1.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/jinossy/tajo TAJO-2073

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tajo/pull/958.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #958


          commit 093bbd63c0c22bee51cdc1335f27e9766249af6e
          Author: Jinho Kim <jhkim@apache.org>
          Date: 2016-02-03T10:12:37Z

          TAJO-2073: Upgrade parquet-mr to 1.8.1.



            People

            • Assignee: Jinho Kim (jhkim)
            • Reporter: Jinho Kim (jhkim)
            • Votes: 0
            • Watchers: 3
