Apache Drill: DRILL-1058

Unable to read or write nested/repeated data in PARQUET format

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: Storage - Writer
    • Labels: None
    • Environment: CentOS release 6.5

    Description

      =================================================
      DRILL WRITING A PARQUET TABLE WITH NESTED DATA
      =================================================
      I have a JSON file with nested data (a sample record is shown below):

      {"rownum":1,"name":"fred ovid","age":76,"gpa":1.55,"studentnum":692315658449,"create_time":"2014-05-27 00:26:07", "interests": [ "Reading", "Mountain Biking", "Hacking" ]}

      I am able to read this JSON file successfully from Drill and access the nested values. However, when I try to import this data and create a table in PARQUET format, the query fails:

      QUERY: create table test as select * from `/user/root/sample-data/nested_student.json`;

      ERROR: Query failed: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.[error_id: "3ce3dc1e-d920-4262-ae2d-28bd2d034597"
      endpoint {
      address: "perfnode154.perf.lab"
      user_port: 31010
      control_port: 31011
      data_port: 31012
      }
      error_type: 0
      message: "Failure while running fragment. < ParquetEncodingException:[ error starting field interests at 6 ] < ClassCastException:[ parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO ]"
      ]
      Error: exception while executing query (state=,code=0)

      2014-06-24 00:41:18,646 [b10db58d-8d4d-4d02-9fb5-a5081e5cb254:frag:0:0] ERROR o.a.d.e.w.f.AbstractStatusReporter - Error 48602de2-8306-47d2-875f-8ad2cd2e964a: Failure while running fragment.
      java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
              at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.startField(MessageColumnIO.java:171) ~[parquet-column-1.5.0-20140513.004024-1.jar:na]
              at org.apache.drill.exec.store.ParquetOutputRecordWriter.addRepeatedVarCharHolder(ParquetOutputRecordWriter.java:761) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.store.EventBasedRecordWriter$RepeatedVarCharFieldWriter.writeField(EventBasedRecordWriter.java:1156) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.store.EventBasedRecordWriter.write(EventBasedRecordWriter.java:150) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:111) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:91) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:72) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:65) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:45) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:94) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:91) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:56) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:85) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:46) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
              at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:100) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
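To narrow down which columns are involved in a writer failure like the one above, it can help to list the top-level fields whose values are arrays or objects. The following is a minimal sketch in Python, assuming one JSON record per line as in the sample; the function name `nested_fields` is hypothetical and not part of Drill.

```python
import json

def nested_fields(lines):
    """Map each top-level field holding a list or dict to its kind."""
    found = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for key, value in record.items():
            if isinstance(value, list):
                found[key] = "repeated"
            elif isinstance(value, dict):
                found[key] = "nested"
    return found

# The sample record from this report.
sample = [
    '{"rownum":1,"name":"fred ovid","age":76,"gpa":1.55,'
    '"studentnum":692315658449,"create_time":"2014-05-27 00:26:07", '
    '"interests": [ "Reading", "Mountain Biking", "Hacking" ]}'
]

print(nested_fields(sample))  # {'interests': 'repeated'}
```

For the sample record, only `interests` is reported, which matches the field named in the ParquetEncodingException message.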
      

      =================================================
      DRILL READING A PARQUET TABLE WITH NESTED DATA
      =================================================
      I generated a Parquet file by loading the JSON file below into Pig and storing it in Parquet format:
      {"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
      {"recipe":"TomatoSoup","ingredients":[{"name":"Tomatoes"},{"name":"Milk"}],"inventor":{"name":"Steve","age":23}}

      When I try to read this Parquet table in Drill, the query fails:

      QUERY: Select * from `/user/root/complex.parquet`;

      ERROR: Query failed: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.[error_id: "c2e735f4-e11c-4e10-a410-959b3880dce0"
      endpoint {
      address: "perfnode154.perf.lab"
      user_port: 31010
      control_port: 31011
      data_port: 31012
      }
      error_type: 0
      message: "Failure while running fragment. < UnsupportedOperationException:[ unsupported type: BINARY LIST ]"
      ]
      Error: exception while executing query (state=,code=0)

      2014-07-23 22:16:45,239 [d106ad59-595f-42e7-880a-ef9f6bff1ff0:frag:0:0] DEBUG o.a.d.e.w.fragment.FragmentExecutor - Failure while initializing operator tree
      java.lang.UnsupportedOperationException: unsupported type: BINARY LIST
      	at org.apache.drill.exec.store.parquet.ParquetRecordReader.toMajorType(ParquetRecordReader.java:446) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.store.parquet.ParquetRecordReader.setup(ParquetRecordReader.java:219) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ScanBatch.<init>(ScanBatch.java:93) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:126) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:47) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:62) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:39) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitSubScan(AbstractPhysicalVisitor.java:113) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.store.parquet.ParquetRowGroupScan.accept(ParquetRowGroupScan.java:113) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:74) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:62) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:39) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitIteratorValidator(AbstractPhysicalVisitor.java:196) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.config.IteratorValidator.accept(IteratorValidator.java:34) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:74) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:62) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:39) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitProducerConsumer(AbstractPhysicalVisitor.java:191) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.config.ProducerConsumer.accept(ProducerConsumer.java:42) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:74) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:62) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:39) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitIteratorValidator(AbstractPhysicalVisitor.java:196) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.config.IteratorValidator.accept(IteratorValidator.java:34) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:74) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:59) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.visitOp(ImplCreator.java:39) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitStore(AbstractPhysicalVisitor.java:118) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitScreen(AbstractPhysicalVisitor.java:176) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.config.Screen.accept(Screen.java:95) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:81) ~[drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at org.apache.drill.exec.work.WorkManager$RunnableWrapper.run(WorkManager.java:242) [drill-java-exec-1.0.0-m2-incubating-SNAPSHOT-rebuffed.jar:1.0.0-m2-incubating-SNAPSHOT]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
      	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
      

      I verified that the file contains repeated data by dumping it with parquet-tools:

      ./parquet-tools dump badpigparquet 
      row group 0 
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      recipe:       BINARY UNCOMPRESSED DO:0 FPO:4 SZ:85/85/1.00 VC:6 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
      ingredients: 
      .bag:        
      ..name:       BINARY UNCOMPRESSED DO:0 FPO:89 SZ:120/120/1.00 VC:15 ENC:RLE,PLAIN_DICTIONARY
      inventor:    
      .name:        BINARY UNCOMPRESSED DO:0 FPO:209 SZ:74/74/1.00 VC:6 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
      .age:         INT32 UNCOMPRESSED DO:0 FPO:283 SZ:64/64/1.00 VC:6 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
      
          recipe TV=6 RL=0 DL=1 DS:                2 DE:PLAIN_DICTIONARY
          -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:9 VC:6
      
          ingredients.bag.name TV=15 RL=1 DL=3 DS: 5 DE:PLAIN_DICTIONARY
          -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          page 0:                                   DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY SZ:21 VC:15
      
          inventor.name TV=6 RL=0 DL=2 DS:         2 DE:PLAIN_DICTIONARY
          -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:10 VC:6
      
          inventor.age TV=6 RL=0 DL=2 DS:          2 DE:PLAIN_DICTIONARY
          -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          page 0:                                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:10 VC:6
      
      BINARY recipe 
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 6 *** 
      value 1: R:0 D:1 V:Tacos
      value 2: R:0 D:1 V:TomatoSoup
      value 3: R:0 D:1 V:Tacos
      value 4: R:0 D:1 V:TomatoSoup
      value 5: R:0 D:1 V:Tacos
      value 6: R:0 D:1 V:TomatoSoup
      
      BINARY ingredients.bag.name 
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 15 *** 
      value 1:  R:0 D:3 V:Beef
      value 2:  R:1 D:3 V:Lettuce
      value 3:  R:1 D:3 V:Cheese
      value 4:  R:0 D:3 V:Tomatoes
      value 5:  R:1 D:3 V:Milk
      value 6:  R:0 D:3 V:Beef
      value 7:  R:1 D:3 V:Lettuce
      value 8:  R:1 D:3 V:Cheese
      value 9:  R:0 D:3 V:Tomatoes
      value 10: R:1 D:3 V:Milk
      value 11: R:0 D:3 V:Beef
      value 12: R:1 D:3 V:Lettuce
      value 13: R:1 D:3 V:Cheese
      value 14: R:0 D:3 V:Tomatoes
      value 15: R:1 D:3 V:Milk
      
      BINARY inventor.name 
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 6 *** 
      value 1: R:0 D:2 V:Alex
      value 2: R:0 D:2 V:Steve
      value 3: R:0 D:2 V:Alex
      value 4: R:0 D:2 V:Steve
      value 5: R:0 D:2 V:Alex
      value 6: R:0 D:2 V:Steve
      
      INT32 inventor.age 
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 6 *** 
      value 1: R:0 D:2 V:25
      value 2: R:0 D:2 V:23
      value 3: R:0 D:2 V:25
      value 4: R:0 D:2 V:23
      value 5: R:0 D:2 V:25
      value 6: R:0 D:2 V:23
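The repetition levels in the dump show how the repeated column is laid out: in `ingredients.bag.name`, a value with R:0 starts a new record and a value with R:1 continues the current list. The following Python sketch regroups the first five dumped values into per-record lists. It assumes every value is at the maximum definition level (D:3, as in the dump), so nulls and empty lists are not handled; `assemble_lists` is a hypothetical helper, not a parquet-tools or Drill API.

```python
def assemble_lists(values):
    """Group (repetition_level, value) pairs into one list per record.

    A repetition level of 0 starts a new record; a level of 1 appends
    to the list of the current record.
    """
    records = []
    for rep_level, value in values:
        if rep_level == 0 or not records:
            records.append([value])
        else:
            records[-1].append(value)
    return records

# Values 1 to 5 of ingredients.bag.name, copied from the dump above.
column = [
    (0, "Beef"),
    (1, "Lettuce"),
    (1, "Cheese"),
    (0, "Tomatoes"),
    (1, "Milk"),
]

print(assemble_lists(column))
# [['Beef', 'Lettuce', 'Cheese'], ['Tomatoes', 'Milk']]
```

The two resulting lists correspond to the ingredients of the Tacos and TomatoSoup records in the source JSON.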
      

      Attachments

        1. complex.parquet
          0.8 kB
          Amit Katti

          People

            Assignee: Unassigned
            Reporter: Amit Katti (amitskatti)