Hive
  1. Hive
  2. HIVE-337

LazySimpleSerDe should support multi-level nested array, map, struct types

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.3.0
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      HIVE-337. LazySimpleSerDe to support multi-level nested array, map, struct types. (zshao)
      Show
      HIVE-337 . LazySimpleSerDe to support multi-level nested array, map, struct types. (zshao)

      Description

      Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

      1. HIVE-337.1.patch
        171 kB
        Zheng Shao
      2. HIVE-337.2.patch
        173 kB
        Zheng Shao
      3. HIVE-337.5.patch
        624 kB
        Zheng Shao

        Issue Links

          Activity

          Hide
          Zheng Shao added a comment -

          Committed revision 758089.

          Show
          Zheng Shao added a comment - Committed revision 758089.
          Hide
          Raghotham Murthy added a comment -

          btw can you attach the patch file with all the tests (just for completeness)?

          Show
          Raghotham Murthy added a comment - btw can you attach the patch file with all the tests (just for completeness)?
          Hide
          Raghotham Murthy added a comment -

          +1

          looks good

          Show
          Raghotham Murthy added a comment - +1 looks good
          Hide
          Zheng Shao added a comment - - edited

          Done with all the comments except 6.

          I also renamed the setAll() function to init() to make it clearer.

          Because we now pass TypeInfo around in LazyObject hierarchy, we don't even need to create the LazyObject for an array element if that element is never accessed (we can create it on demand when it's accessed).

          The current code works fine without the change of 6. The change of 6 requires either 12 bytes more storage per primitive object (by adding the byte[], int, int to the LazyPrimitive), or more complicated logic in removing the int start and int length from LazyNonPrimitive (we will have to parse the data right in init(..) but we don't have access to the separators because it's in the next-level ObjectInspectors - unless we add the pointers from LazyObject to ObjectInspector, but that's another overhead and complicates the data structure).

          After all, the implementation of init() is private to the class and I don't think there is a strong need to make the implementation the same across LazyPrimitive and LazyNonPrimitive. The fact that the parsing of LazyPrimitive does not require delimiters and LazyNonPrimitive requires is good enough for them to have different implementations.

          Future improvements include:
          1. Support escaping: HIVE-136;
          2. Columnar storage: HIVE-352;
          3. Use Writable/Text for values: HIVE-266;
          4. Short-circuit serialization: HIVE-358;
          5. Short-circuit expression evaluation: HIVE-359.
          6. Common expression evaluation: HIVE-364

          Show
          Zheng Shao added a comment - - edited Done with all the comments except 6. I also renamed the setAll() function to init() to make it clearer. Because we now pass TypeInfo around in LazyObject hierarchy, we don't even need to create the LazyObject for an array element if that element is never accessed (we can create it on demand when it's accessed). The current code works fine without the change of 6. The change of 6 requires either 12 bytes more storage per primitive object (by adding the byte[], int, int to the LazyPrimitive), or more complicated logic in removing the int start and int length from LazyNonPrimitive (we will have to parse the data right in init(..) but we don't have access to the separators because it's in the next-level ObjectInspectors - unless we add the pointers from LazyObject to ObjectInspector, but that's another overhead and complicates the data structure). After all, the implementation of init() is private to the class and I don't think there is a strong need to make the implementation the same across LazyPrimitive and LazyNonPrimitive. The fact that the parsing of LazyPrimitive does not require delimiters and LazyNonPrimitive requires is good enough for them to have different implementations. Future improvements include: 1. Support escaping: HIVE-136 ; 2. Columnar storage: HIVE-352 ; 3. Use Writable/Text for values: HIVE-266 ; 4. Short-circuit serialization: HIVE-358 ; 5. Short-circuit expression evaluation: HIVE-359 . 6. Common expression evaluation: HIVE-364
          Hide
          Zheng Shao added a comment -

          Comments from code review with Ashish, Raghu, and Namit:

          1. Move factory methods in LazyUtils to LazyFactory;

          2. LazyArray etc should use Factory methods (with TypeInfo as parameter) to create elements;

          3. Comment that LazySimpleSerDe only supports 10 levels of separators currently.

          4. Use a global pointer to byte[], and all LazyObjects should point to the global pointer.
          This helps with garbage collection.

          5. More Javadoc.

          6. Move the 3 fields (bytes, start, length) from LazyNonPrimitive to LazyObject, add "parsed" to LazyPrimitive.
          This makes the code cleaner because all LazyObjects now behave in the same way - nothing happens in setAll (but cached the values), and actual parsing happens later.
          The trade-off is that LazyPrimitive takes more memory - 12 bytes for bytes, start, and length.

          For 6, a better way would be to always directly parse the data when "setAll" is called (basically, convert the way LazyNonPrimitive works to the way that LazyPrimitive works.) The reason is that there is no point of "setAll" if we don't need to parse the data later, so why don't we directly parse the data in the "setAll" call?

          Show
          Zheng Shao added a comment - Comments from code review with Ashish, Raghu, and Namit: 1. Move factory methods in LazyUtils to LazyFactory; 2. LazyArray etc should use Factory methods (with TypeInfo as parameter) to create elements; 3. Comment that LazySimpleSerDe only supports 10 levels of separators currently. 4. Use a global pointer to byte[], and all LazyObjects should point to the global pointer. This helps with garbage collection. 5. More Javadoc. 6. Move the 3 fields (bytes, start, length) from LazyNonPrimitive to LazyObject, add "parsed" to LazyPrimitive. This makes the code cleaner because all LazyObjects now behave in the same way - nothing happens in setAll (but cached the values), and actual parsing happens later. The trade-off is that LazyPrimitive takes more memory - 12 bytes for bytes, start, and length. For 6, a better way would be to always directly parse the data when "setAll" is called (basically, convert the way LazyNonPrimitive works to the way that LazyPrimitive works.) The reason is that there is no point of "setAll" if we don't need to parse the data later, so why don't we directly parse the data in the "setAll" call?
          Hide
          Zheng Shao added a comment -

          Second cut. Fixed 2 problems popped up from the review:

          1. Make sure we don't do "setAll" on the field/element/value again if "setAll" has been called already.
          2. When getting the value out of a LazyMap using a key (which is assumed to be a primitive), we are now comparing the key directly with the keys deserialized from the map (instead of comparing the serialized version of keys to avoid matching problems between "01" and "1" as an int value).

          This will be the patch intended for check-in (except the test results are not overwritten yet, in order to help with the review).

          Show
          Zheng Shao added a comment - Second cut. Fixed 2 problems popped up from the review: 1. Make sure we don't do "setAll" on the field/element/value again if "setAll" has been called already. 2. When getting the value out of a LazyMap using a key (which is assumed to be a primitive), we are now comparing the key directly with the keys deserialized from the map (instead of comparing the serialized version of keys to avoid matching problems between "01" and "1" as an int value). This will be the patch intended for check-in (except the test results are not overwritten yet, in order to help with the review).
          Hide
          Zheng Shao added a comment -

          First cut.

          The code is written in a way to make sure it's very easy to extend to nested types.

          I also replaced DynamicSerDe/TBinaryProtocol for map-reduce value to LazySimpleSerDe. On a typical table with 8 string columns, we are seeing the amount of data passed from mappers to reducers decreased from 224MB to 148MB (roughly a 1/3 improvement).

          Escaping are not supported yet, but it shouldn't take too much time to add.

          Show
          Zheng Shao added a comment - First cut. The code is written in a way to make sure it's very easy to extend to nested types. I also replaced DynamicSerDe/TBinaryProtocol for map-reduce value to LazySimpleSerDe. On a typical table with 8 string columns, we are seeing the amount of data passed from mappers to reducers decreased from 224MB to 148MB (roughly a 1/3 improvement). Escaping are not supported yet, but it shouldn't take too much time to add.
          Hide
          Zheng Shao added a comment -

          Ok, let's discuss tomorrow whether our current focus is only to fix this blocker, or we want to add new features at the same time.

          Show
          Zheng Shao added a comment - Ok, let's discuss tomorrow whether our current focus is only to fix this blocker, or we want to add new features at the same time.
          Hide
          Raghotham Murthy added a comment -

          @Zheng. I agree with what you said except the last statement.

          > At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira.

          The reason is that if our current focus is only to support what is already supported by Meta..., we will have to visit the same problem again of creating a new SerDe to support arrays of arrays. Then another serde to support maps of arrays and so on. I'd rather we make the decision and the code changes now such that LazySerDe becomes the default general serde for nested maps and arrays. I am not saying that you need to make lazy serde support everything as part of this jira. But, what I am hoping for is the decision to add support for nested structures to LazySimpleSerDe rather than creating yet another wafer thin serde.

          Show
          Raghotham Murthy added a comment - @Zheng. I agree with what you said except the last statement. > At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira. The reason is that if our current focus is only to support what is already supported by Meta..., we will have to visit the same problem again of creating a new SerDe to support arrays of arrays. Then another serde to support maps of arrays and so on. I'd rather we make the decision and the code changes now such that LazySerDe becomes the default general serde for nested maps and arrays. I am not saying that you need to make lazy serde support everything as part of this jira. But, what I am hoping for is the decision to add support for nested structures to LazySimpleSerDe rather than creating yet another wafer thin serde.
          Hide
          Zheng Shao added a comment -

          It's true that LazySimpleSerDe does not share code with MetadataTypedColumnsetSerDe or DynamicSerDe, but the reason of that is because the design principle is completely different and there is no shared code at all.

          @Raghu: I thought you were saying automatically detect the format based on data (not metadata). If we can rely on metadata, then we are on the same track.

          @Namit: The current goal of LazySimpleSerDe is to replace MetaDataTypedColumnsetSerDe and TCTLSeparatedProtocol. LazySerDe is a thin class on top of a lot of utility classes that can be reused for any lazily-initialized serdes.

          I think overall we are on agreement:
          1. We should reuse code as much as possible.
          2. We should rely on metadata to find out the serialization format difference (instead of automatically figure it out from the data).

          The difference between parametering a class and 2 classes are so small - I can write a wrapper to wrap 2 classes. This is a minor issue.
          This is different from reusing the code, because most of the code are in utility classes that will be reused.

          At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira.

          Show
          Zheng Shao added a comment - It's true that LazySimpleSerDe does not share code with MetadataTypedColumnsetSerDe or DynamicSerDe, but the reason of that is because the design principle is completely different and there is no shared code at all. @Raghu: I thought you were saying automatically detect the format based on data (not metadata). If we can rely on metadata, then we are on the same track. @Namit: The current goal of LazySimpleSerDe is to replace MetaDataTypedColumnsetSerDe and TCTLSeparatedProtocol. LazySerDe is a thin class on top of a lot of utility classes that can be reused for any lazily-initialized serdes. I think overall we are on agreement: 1. We should reuse code as much as possible. 2. We should rely on metadata to find out the serialization format difference (instead of automatically figure it out from the data). The difference between parametering a class and 2 classes are so small - I can write a wrapper to wrap 2 classes. This is a minor issue. This is different from reusing the code, because most of the code are in utility classes that will be reused. At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira.
          Hide
          Namit Jain added a comment -

          Isnt the goal to use LazySerde as the default for all kinds of schemas ? If yes, we should support all nestings.

          If that is not the goal, then as Raghu said, LazySerde might be deprecated soon. It might be painful to have separate serde as a default based on the schema.

          Show
          Namit Jain added a comment - Isnt the goal to use LazySerde as the default for all kinds of schemas ? If yes, we should support all nestings. If that is not the goal, then as Raghu said, LazySerde might be deprecated soon. It might be painful to have separate serde as a default based on the schema.
          Hide
          Raghotham Murthy added a comment -

          I see that the trend is to keep writing new SerDes either for performance (MetadataTyped... and possibly others) or for expressibility (DynamicSerDe etc). Eventually, most of these SerDes are not used at all other than for backward compatibility with existing data. Isnt it better to find a balance between performance and and expressibility in a single serde which can be used in general. Of course, if people want more performance/expressibility they can use/write other serdes which use binary formats etc.

          Using array begin and end markers dont decrease human-readability imo (people are fine with reading json right?) and they allow expressing nested structures. I am not sure there is a disadvantage to this. What I was hoping was that LazySimpleSerDe becomes the default SerDe for most requirements.

          Specifically, I have a requirement for arrays of maps. If you dont provide support for that in LazySimpleSerDe (which is probably not a big change, i might be mistaken though), then for my requirement, we would have to go through the process of creating a new SerDe. And once we create that serde I'd rather use it than LazySimpleSerDe for all of my future requirements. I am guessing that pretty soon we would have to deprecate LazySimpleSerDe in favor of this new serde because of its expressibility.

          Regarding automatically detecting the serialization format for arrays in the data, maybe I am mistaken, but arent you already using some logic to create LazySimpleSerDe when the metastore has MetadataTypedColumnSetSerDe for that table? In that same logic, cant you add a parameter to the lazy serde to indicate which array serialization format to use?

          Again, I am not suggesting that we should add several serialization formats to the same SerDe. All I am suggesting is that there is a middle ground between proliferating class for each small feature difference and putting all features into a single class.

          Show
          Raghotham Murthy added a comment - I see that the trend is to keep writing new SerDes either for performance (MetadataTyped... and possibly others) or for expressibility (DynamicSerDe etc). Eventually, most of these SerDes are not used at all other than for backward compatibility with existing data. Isnt it better to find a balance between performance and and expressibility in a single serde which can be used in general. Of course, if people want more performance/expressibility they can use/write other serdes which use binary formats etc. Using array begin and end markers dont decrease human-readability imo (people are fine with reading json right?) and they allow expressing nested structures. I am not sure there is a disadvantage to this. What I was hoping was that LazySimpleSerDe becomes the default SerDe for most requirements. Specifically, I have a requirement for arrays of maps. If you dont provide support for that in LazySimpleSerDe (which is probably not a big change, i might be mistaken though), then for my requirement, we would have to go through the process of creating a new SerDe. And once we create that serde I'd rather use it than LazySimpleSerDe for all of my future requirements. I am guessing that pretty soon we would have to deprecate LazySimpleSerDe in favor of this new serde because of its expressibility. Regarding automatically detecting the serialization format for arrays in the data, maybe I am mistaken, but arent you already using some logic to create LazySimpleSerDe when the metastore has MetadataTypedColumnSetSerDe for that table? In that same logic, cant you add a parameter to the lazy serde to indicate which array serialization format to use? Again, I am not suggesting that we should add several serialization formats to the same SerDe. All I am suggesting is that there is a middle ground between proliferating class for each small feature difference and putting all features into a single class.
          Hide
          Zheng Shao added a comment -

          For the migration path, we can easily create new tables with new SerDes, and everything will work transparently. If you are suggesting letting LazySerDe automatically figure out the old/new format, I don't think that's even possible, and if it is, users will be easily confused by that.

          Delimited format is meant to be simple and human-readable, and it is only good for simple data. If the structure really gets complicated, we should store the data in binary format instead of delimited format. For example, we can use Thrift etc.

          If we really want to write a new SerDe that shares a lot with LazySimpleSerDe (with an extended delimited format), we can easily do that by reusing a lot of the classes introduced by LazySimpleSerDe. There is not much to copy - and if there is, it's better to factor the common code out, instead of pushing all logics (new format/old format) into the same class.

          Let's open another jira for discussions on new features like this.

          So the question here is that we have to make a choice between the two: whether treat "" to be an empty array or an array with an empty string as the only element (and the same question for NULL).

          Show
          Zheng Shao added a comment - For the migration path, we can easily create new tables with new SerDes, and everything will work transparently. If you are suggesting letting LazySerDe automatically figure out the old/new format, I don't think that's even possible, and if it is, users will be easily confused by that. Delimited format is meant to be simple and human-readable, and it is only good for simple data. If the structure really gets complicated, we should store the data in binary format instead of delimited format. For example, we can use Thrift etc. If we really want to write a new SerDe that shares a lot with LazySimpleSerDe (with an extended delimited format), we can easily do that by reusing a lot of the classes introduced by LazySimpleSerDe. There is not much to copy - and if there is, it's better to factor the common code out, instead of pushing all logics (new format/old format) into the same class. Let's open another jira for discussions on new features like this. So the question here is that we have to make a choice between the two: whether treat "" to be an empty array or an array with an empty string as the only element (and the same question for NULL).
          Hide
          Raghotham Murthy added a comment -

          I am not suggesting changing the serialization format for everything. just for maps and arrays. If I have to write a new SerDe I'd have to copy over code from LazySimpleSerDe for everything other than for maps and arrays. Can you make LazySimpleSerDe parameterizable so that it can deserialize the old format (just single delimiters) as well as the new format (with begin-end delimiters say) and always serialize into the new format? This will give us a migration path to a format which is compatible with nested arrays and maps. Thoughts?

          Show
          Raghotham Murthy added a comment - I am not suggesting changing the serialization format for everything. just for maps and arrays. If I have to write a new SerDe I'd have to copy over code from LazySimpleSerDe for everything other than for maps and arrays. Can you make LazySimpleSerDe parameterizable so that it can deserialize the old format (just single delimiters) as well as the new format (with begin-end delimiters say) and always serialize into the new format? This will give us a migration path to a format which is compatible with nested arrays and maps. Thoughts?
          Hide
          Zheng Shao added a comment -

          By null array, I mean case 1 (as you mentioned above).

          There is no plan for LazySerDe to support nested array at this moment - it is used to replace MetaDataTypeColumnsetSerDe and TCTLSeparatedProtocol by exactly the same serialization format.

          If we want to change the serialization format, that will be a new SerDe.

          Show
          Zheng Shao added a comment - By null array, I mean case 1 (as you mentioned above). There is no plan for LazySerDe to support nested array at this moment - it is used to replace MetaDataTypeColumnsetSerDe and TCTLSeparatedProtocol by exactly the same serialization format. If we want to change the serialization format, that will be a new SerDe.
          Hide
          Raghotham Murthy added a comment -

          I am not sure what you mean by null array.

          Given an array column, ideally, we should distinguish between the following cases (I am repeating them for clarity):

          1. NULL - array column is null (is this what you mean by null array?)
          2. [NULL] - array containing one element (NULL)
          3. [''] - array containing one element (empty string)
          4. [] - array containing no elements

          Is there are plan for LazySimpleSerDe to support nested arrays? If so, we cant really have a single delimiter for arrays and maps. We should introduce array begin and end markers in the serialization format. Alternatively, we could store the number of bytes in the array before the array column value itself.

          Show
          Raghotham Murthy added a comment - I am not sure what you mean by null array. Given an array column, ideally, we should distinguish between the following cases (I am repeating them for clarity): 1. NULL - array column is null (is this what you mean by null array?) 2. [NULL] - array containing one element (NULL) 3. [''] - array containing one element (empty string) 4. [] - array containing no elements Is there are plan for LazySimpleSerDe to support nested arrays? If so, we cant really have a single delimiter for arrays and maps. We should introduce array begin and end markers in the serialization format. Alternatively, we could store the number of bytes in the array before the array column value itself.
          Hide
          Zheng Shao added a comment -

          We use "
          N" to represent NULL, so we can still distinguish between empty array and null array.

          Show
          Zheng Shao added a comment - We use " N" to represent NULL, so we can still distinguish between empty array and null array.
          Hide
          Prasad Chakka added a comment -

          just to clarify what i meant,

          a null array, an empty array and an array with one empty string or null are all same. So users can't rely on being able to differentiate between the 4 cases. All of them are represented by an empty string in the serialized string of the array. such an array should be deserialzied to either null or empty array. i prefer a null array since it is easier to test for in conditions.

          Show
          Prasad Chakka added a comment - just to clarify what i meant, a null array, an empty array and an array with one empty string or null are all same. So users can't rely on being able to differentiate between the 4 cases. All of them are represented by an empty string in the serialized string of the array. such an array should be deserialzied to either null or empty array. i prefer a null array since it is easier to test for in conditions.
          Hide
          Zheng Shao added a comment -

          2 pitfalls for the delimited format:

          1. Empty array has exactly the same serialized format as an array with a single element which is an empty String.

          2. Null array has exactly the same serialized format as an array with a single element which is Null.

          We have to make a choice between the two. After some discussions with Prasad, we've got the consensus that we should support empty array and null array (which means we are NOT going to support an array with a single element that is empty or null.)

          This is the same as what TCTLSeparatedProtocol is doing.

          Show
          Zheng Shao added a comment - 2 pitfalls for the delimited format: 1. Empty array has exactly the same serialized format as an array with a single element which is an empty String. 2. Null array has exactly the same serialized format as an array with a single element which is Null. We have to make a choice between the two. After some discussions with Prasad, we've got the consensus that we should support empty array and null array (which means we are NOT going to support an array with a single element that is empty or null.) This is the same as what TCTLSeparatedProtocol is doing.

            People

            • Assignee:
              Zheng Shao
              Reporter:
              Zheng Shao
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development