Apache Drill / DRILL-5009

Query with a simple join fails on Hive generated parquet

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.9.0
    • Component/s: Storage - Parquet
    • Labels: None
    • Commit ID: 5a439424594eb10d113163eaa1fdf8034f387235c
      1.9.0 SNAPSHOT - Nov 5 2016

    Description

      Query:

      SELECT *
      FROM store_sales ss, customer c
      WHERE  ss.ss_customer_sk = c.c_customer_sk 
      LIMIT 1; 
      

      Error:

      Error: SYSTEM ERROR: IOException: End of stream reached while initializing buffered reader.
      
      Fragment 2:0
      
      [Error Id: 93726aea-1d62-4e7c-a2bf-1d7cc1e834e4 on abhi1:31010]
      
        (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
      ...
      ...
       Caused By (org.apache.drill.common.exceptions.ExecutionSetupException) Error opening or reading metadata for parquet file at location: customer.parquet
          org.apache.drill.exec.store.parquet.columnreaders.PageReader.<init>():145
          org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.<init>():59
          org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.<init>():96
          org.apache.drill.exec.store.parquet.columnreaders.NullableColumnReader.<init>():39
          org.apache.drill.exec.store.parquet.columnreaders.NullableFixedByteAlignedReaders$NullableFixedByteAlignedReader.<init>():58
          org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory.getNullableColumnReader():252
          org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory.createFixedColumnReader():186
          org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.setup():402
          org.apache.drill.exec.physical.impl.ScanBatch.next():212
          org.apache.drill.exec.record.AbstractRecordBatch.next():119
          org.apache.drill.exec.record.AbstractRecordBatch.next():109
          org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
          org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
          org.apache.drill.exec.record.AbstractRecordBatch.next():162
          org.apache.drill.exec.physical.impl.BaseRootExec.next():104
          org.apache.drill.exec.physical.impl.broadcastsender.BroadcastSenderRootExec.innerNext():95
          org.apache.drill.exec.physical.impl.BaseRootExec.next():94
          org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
          org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
          java.security.AccessController.doPrivileged():-2
          javax.security.auth.Subject.doAs():415
          org.apache.hadoop.security.UserGroupInformation.doAs():1595
          org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
          org.apache.drill.common.SelfCleaningRunnable.run():38
          java.util.concurrent.ThreadPoolExecutor.runWorker():1145
          java.util.concurrent.ThreadPoolExecutor$Worker.run():615
          java.lang.Thread.run():745
      ...
      

      Log attached.

      Attachments

        1. DRILL-5009.log.txt
          1.84 MB
          Abhishek Girish

          Activity

            agirish Abhishek Girish added a comment - Dataset: https://s3-us-west-1.amazonaws.com/drill-public/bugs/DRILL-5009.tar.gz
            parthc Parth Chandra added a comment -

            agirish How did you get this file???
            The last row group in customer.parquet has this metadata -

            row group 1001: RC:0 TS:0 OFFSET:13107970
            ----------------------------------------------------------------------------------------------------
            c_customer_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_customer_id: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_current_cdemo_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_current_hdemo_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_current_addr_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_first_shipto_date_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_first_sales_date_sk: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_salutation: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_first_name: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_last_name: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_preferred_cust_flag: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_birth_day: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_birth_month: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_birth_year: INT32 UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_birth_country: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_login: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_email_address: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0
            c_last_review_date: BINARY UNCOMPRESSED DO:0 FPO:13107970 SZ:0/0/NaN VC:0

            This row group is empty (RC:0 => Row Count = 0).
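
As a rough illustration of the check described above, the header line of a row group in this dump can be parsed and tested for RC:0. This is a minimal sketch, not Drill or parquet-tools code: the class name RowGroupHeader and its fields are hypothetical, and reading TS as the total size and OFFSET as the starting byte offset is an assumption based on the dump. Only RC (row count) is confirmed in the comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parses one row-group header line of the dump shown above. */
final class RowGroupHeader {
    // Matches a line such as "row group 1001: RC:0 TS:0 OFFSET:13107970".
    private static final Pattern HEADER = Pattern.compile(
        "row group (\\d+):\\s+RC:(\\d+)\\s+TS:(\\d+)\\s+OFFSET:(\\d+)");

    final int index;       // row group number as printed in the dump
    final long rowCount;   // RC
    final long totalSize;  // TS (assumed meaning: total size in bytes)
    final long offset;     // OFFSET (assumed meaning: starting byte offset)

    private RowGroupHeader(int index, long rowCount, long totalSize, long offset) {
        this.index = index;
        this.rowCount = rowCount;
        this.totalSize = totalSize;
        this.offset = offset;
    }

    /** Parses a header line; rejects anything that does not match the expected shape. */
    static RowGroupHeader parse(String line) {
        Matcher m = HEADER.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a row group header: " + line);
        }
        return new RowGroupHeader(
            Integer.parseInt(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4)));
    }

    /** A row group with no rows has no pages for a reader to initialize from. */
    boolean isEmpty() {
        return rowCount == 0;
    }

    public static void main(String[] args) {
        RowGroupHeader h = parse("row group 1001: RC:0 TS:0 OFFSET:13107970");
        System.out.println("row group " + h.index + " empty: " + h.isEmpty());
    }
}
```

The SZ:0/0/NaN fields on the column lines are consistent with the same conclusion: a size ratio computed as 0/0 in floating-point arithmetic evaluates to NaN.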

            parthc Parth Chandra added a comment -

            After discussing with Jinfeng, we concluded that we should filter out the row groups that are empty. I've fixed this for the Parquet metadata path, but still need to fix it for the Hive native reader path.
            Additionally, I'll add a fix in the PageReader to handle this condition better.
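
The filtering decision above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the eventual patch: RowGroupInfo and EmptyRowGroupFilter are hypothetical stand-ins for whatever Drill's metadata classes expose, and the only property relied on is a per-row-group row count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for per-row-group metadata; not Drill's actual class. */
final class RowGroupInfo {
    final long start;     // byte offset of the row group in the file
    final long rowCount;  // number of rows recorded in the footer

    RowGroupInfo(long start, long rowCount) {
        this.start = start;
        this.rowCount = rowCount;
    }
}

/** Hypothetical filter applied while collecting row groups to scan. */
final class EmptyRowGroupFilter {
    /** Returns only the row groups that contain at least one row, preserving order. */
    static List<RowGroupInfo> nonEmpty(List<RowGroupInfo> rowGroups) {
        List<RowGroupInfo> kept = new ArrayList<>();
        for (RowGroupInfo rg : rowGroups) {
            if (rg.rowCount > 0) {
                kept.add(rg);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative values: one populated row group and one empty trailing row group.
        List<RowGroupInfo> all = Arrays.asList(
            new RowGroupInfo(4L, 100L),
            new RowGroupInfo(13107970L, 0L));
        System.out.println("row groups to scan: " + nonEmpty(all).size());
    }
}
```

Dropping empty row groups while the metadata is read means no scan is ever set up for them. A guard in the page reader would still be useful for any code path that bypasses this filter.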


            agirish Abhishek Girish added a comment -

            parthc, the data files were generated by rkins. These were created to test Drill with various Hive generated Parquet files.
            githubbot ASF GitHub Bot added a comment -

            GitHub user parthchandra opened a pull request:

            https://github.com/apache/drill/pull/651

            DRILL-5009: Skip reading of empty row groups while reading Parquet metadata.

            We will no longer attempt to scan such row groups.

            You can merge this pull request into a Git repository by running:

            $ git pull https://github.com/parthchandra/drill DRILL-5009

            Alternatively you can review and apply these changes as the patch at:

            https://github.com/apache/drill/pull/651.patch

            To close this pull request, make a commit to your master/trunk branch
            with (at least) the following in the commit message:

            This closes #651


            commit 6a5a6f0576a2c83617a5a980d61ae82a49162235
            Author: Parth Chandra <parthc@apache.org>
            Date: 2016-11-08T04:29:23Z

            DRILL-5009: Skip reading of empty row groups while reading Parquet metadata.
            We will no longer attempt to scan such row groups.


            githubbot ASF GitHub Bot added a comment -

            Github user sudheeshkatkam commented on the issue:

            https://github.com/apache/drill/pull/651

            +1

            githubbot ASF GitHub Bot added a comment -

            Github user asfgit closed the pull request at:

            https://github.com/apache/drill/pull/651


            sudheeshkatkam Sudheesh Katkam added a comment -

            Fixed in 4b1902c

            People

              Assignee: parthc Parth Chandra
              Reporter: agirish Abhishek Girish
              Votes: 0
              Watchers: 5
