Hive / HIVE-4223

LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of a Hive table


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: None
    • Environment: Hive 0.9.0

    Description

      The LazySimpleSerDe will throw an IndexOutOfBoundsException if the column structure is a struct containing an array of structs.
      I have a table with one column defined like this:

      columnA
      array<
        struct<
          col1:primiType,
          col2:primiType,
          col3:primiType,
          col4:primiType,
          col5:primiType,
          col6:primiType,
          col7:primiType,
          col8:array<
            struct<
              col1:primiType,
              col2:primiType,
              col3:primiType,
              col4:primiType,
              col5:primiType,
              col6:primiType,
              col7:primiType,
              col8:primiType,
              col9:primiType
            >
          >
        >
      >

      In this example, the outer struct has 8 columns (including the array), and the inner struct has 9 columns. As long as the outer struct has fewer columns than the inner struct, I think we will get the following exception and stack trace in LazySimpleSerDe when it tries to serialize a row:

      Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
      at java.util.ArrayList.RangeCheck(ArrayList.java:547)
      at java.util.ArrayList.get(ArrayList.java:322)
      at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
      at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
      at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
      at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
      at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
      at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
      at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
      at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
      at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
      at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
      ... 9 more

      I am not completely sure of the exact cause of this problem. I believe that

        public static void serialize(ByteStream.Output out, Object obj,
            ObjectInspector objInspector, byte[] separators, int level,
            Text nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)

      invokes itself recursively when it encounters a nested structure. But for the nested struct structure, the list reference gets mixed up, and size() returns the wrong value.

      In the example case I faced above, for these two lines:

      List<? extends StructField> fields = soi.getAllStructFieldRefs();
      list = soi.getStructFieldsDataAsList(obj);

      my StructObjectInspector (soi) returns the CORRECT data from both getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, for one row of the outer 8-column struct, the inner array of structs has 2 elements, and each element has 9 columns (since the inner struct has 9 columns). At runtime, after I added more logging to LazySimpleSerDe, I see the following behavior in the logs:

      loop over the 8 outer columns
        loop over the 9 inner columns, serialize (first array element)
        loop over the 9 inner columns, serialize (second array element)

      The code breaks here: the outer loop then tries to access a 9th element, which does not exist in the outer struct, and the stack trace shows it accessing index 8 of a list of size 8.
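
      For context, the STRUCT branch of serialize() around line 481 looks roughly like this in 0.9.0 (a paraphrased sketch for readability, not an exact copy of the source; the two lines quoted above and the loop are the relevant parts):

        // paraphrased sketch of the STRUCT case in LazySimpleSerDe.serialize (Hive 0.9.0)
        case STRUCT: {
          char sep = (char) separators[level];
          StructObjectInspector soi = (StructObjectInspector) objInspector;
          List<? extends StructField> fields = soi.getAllStructFieldRefs();
          List<?> list = soi.getStructFieldsDataAsList(obj);
          if (list == null) {
            out.write(nullSequence.getBytes(), 0, nullSequence.getLength());
          } else {
            // around line 481: list.size() is re-evaluated on every iteration
            for (int i = 0; i < list.size(); i++) {
              if (i > 0) {
                out.write(sep);
              }
              // around line 485: the ArrayList.get in the stack trace comes from
              // one of the get(i) calls on this line once i reaches 8
              serialize(out, list.get(i), fields.get(i).getFieldObjectInspector(),
                  separators, level + 1, nullSequence, escaped, escapeChar, needsEscape);
            }
          }
          break;
        }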

      What I did was change the following line of code, which appears to fix the problem. I don't know whether it is the right way, but it did fix the issue for me, and I made the change against the Hive 0.9.0 code:

      481c481,482
      < for (int i = 0; i < list.size(); i++) {
      ---
      > int listSize = list.size();
      > for (int i = 0; i < listSize; i++) {

      I believe the cause of this bug is that, with the code written the current way,

        for (int i = 0; i < list.size(); i++)

      the list.size() method is invoked on every loop iteration. In the nested structure, list.size() returns a different result after the recursive call, and that is what causes the problem I am seeing.
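
      To make this concrete, here is a minimal, self-contained Java sketch (not Hive code; SharedListLoopDemo, getStructFieldsDataAsList and fieldRefs are stand-ins invented for illustration). It assumes the hypothesis above: the data list returned for a struct is a shared, reused list that the recursive call refills. Re-evaluating list.size() in the loop condition then reproduces the Index: 8, Size: 8 failure, while caching the size first avoids it:

        import java.util.ArrayList;
        import java.util.List;

        public class SharedListLoopDemo {

          // Stand-in for a cached list that an object inspector reuses for every
          // struct it inspects (hypothetical; only to illustrate the aliasing).
          private static final List<Object> sharedFieldData = new ArrayList<Object>();

          // Mimics soi.getStructFieldsDataAsList(obj): refills and returns the
          // SAME backing list on every call.
          private static List<Object> getStructFieldsDataAsList(int columns) {
            sharedFieldData.clear();
            for (int i = 0; i < columns; i++) {
              sharedFieldData.add("value" + i);
            }
            return sharedFieldData;
          }

          // Mimics soi.getAllStructFieldRefs(): a fresh list of the correct size.
          private static List<String> fieldRefs(int columns) {
            List<String> refs = new ArrayList<String>();
            for (int i = 0; i < columns; i++) {
              refs.add("col" + (i + 1));
            }
            return refs;
          }

          // Buggy pattern: list.size() is re-evaluated each iteration. The nested
          // call refills the shared list with 9 inner values, so the outer loop
          // reaches i = 8 and fields.get(8) throws an IndexOutOfBoundsException,
          // mirroring the Index: 8, Size: 8 in the stack trace above.
          private static void serializeBuggy(int columns, boolean lastFieldNested) {
            List<String> fields = fieldRefs(columns);
            List<Object> list = getStructFieldsDataAsList(columns);
            for (int i = 0; i < list.size(); i++) {
              System.out.println(fields.get(i) + " = " + list.get(i));
              if (lastFieldNested && i == columns - 1) {
                serializeBuggy(9, false);   // inner struct with 9 columns
              }
            }
          }

          // Workaround from the description: snapshot the size before the loop.
          private static void serializeFixed(int columns, boolean lastFieldNested) {
            List<String> fields = fieldRefs(columns);
            List<Object> list = getStructFieldsDataAsList(columns);
            int listSize = list.size();
            for (int i = 0; i < listSize; i++) {
              System.out.println(fields.get(i) + " = " + list.get(i));
              if (lastFieldNested && i == columns - 1) {
                serializeFixed(9, false);
              }
            }
          }

          public static void main(String[] args) {
            try {
              serializeBuggy(8, true);      // 8 outer columns, last one nests 9
            } catch (IndexOutOfBoundsException e) {
              System.out.println("buggy loop: " + e);
            }
            serializeFixed(8, true);
            System.out.println("fixed loop finished without an exception");
          }
        }

      Caching the size only hides the out-of-bounds access; if the underlying list really is shared across recursive calls, the outer struct may still read refilled inner values, so the snapshot is a workaround rather than a root-cause fix.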

      Thanks

      Yong Zhang

      Attachments

        1. nest_struct.data (0.9 kB, Chaoyu Tang)


            People

              Assignee: Unassigned
              Reporter: java8964 (Yong Zhang)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated: