Hive / HIVE-136

SerDe should escape some special characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
      Otherwise the data will look corrupted.

      We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
      So LazySimpleSerDe needs to have the capability of escaping and unescaping.

      1. HIVE-136.5.patch
        204 kB
        Zheng Shao
      2. HIVE-136.4.patch
        204 kB
        Zheng Shao
      3. HIVE-136.3.patch
        204 kB
        Zheng Shao
      4. HIVE-136.2.patch
        202 kB
        Zheng Shao
      5. HIVE-136.1.patch
        200 kB
        Zheng Shao


          Activity

          Namit Jain added a comment -

          Committed. Thanks Zheng

          Zheng Shao added a comment -

          Removed LazyString from GroupByOperator.

          Namit Jain added a comment -

          Other than that, it looks good.

          +1

          I will commit after that change if tests pass.

          Namit Jain added a comment -

          Also, remove the reference to LazyString from GroupByOperator.

          All primitives have 3 representations: Integer, IntWritable, LazyInteger
          Same goes for String.

          We know in groupby, since we have converted to Standard Writable, it cannot be LazyString.

          Zheng Shao added a comment -

          Added the missing functionality. Also added some comments in GroupByOperator.

          Namit Jain added a comment -

          LazyFactory does not support boolean

          Zheng Shao added a comment -

          Added some comments.

          Zheng Shao added a comment -

          This patch fixed all test failures (except the existing failure in trunk).

          Zheng Shao added a comment -

          This patch adds the capability of escaping, following the MySQL way.

          The command to create an escaped table is:

          CREATE TABLE table1 (a STRING, b STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\';
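          For illustration, here is a minimal Java sketch (not Hive's actual LazySimpleSerDe code) of what MySQL-style escaping with ESCAPED BY '\\' does on write: the field delimiter, the escape character itself, and newlines each get a backslash prefix, so a delimiter inside a value is no longer mistaken for a field boundary.

```java
// Hypothetical sketch of MySQL-style escaping on serialization.
// Not Hive's implementation; names here are illustrative only.
public class EscapeDemo {
    // Prefix the delimiter, the escape char, and newlines with the escape char.
    static String escape(String field, char delim, char esc) {
        StringBuilder sb = new StringBuilder();
        for (char c : field.toCharArray()) {
            if (c == delim || c == esc || c == '\n') {
                sb.append(esc);
                sb.append(c == '\n' ? 'n' : c); // newline becomes the two chars \n
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The embedded tab gets a backslash prefix; the newline becomes \n.
        System.out.println(escape("a\tb\nc", '\t', '\\'));
    }
}
```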
          
          Zheng Shao added a comment -

          In HIVE-270 I put a link to how MySQL does escaping. That seems to be a clean solution although it won't support string.split or even splitting lines using "\n".

          If we replace "\005" with "\n" in your case, it won't work with the symmetric escaping/unescaping logic either.

          What about this? We explicitly allow users to enable or disable escaping. If the user disables it, we do nothing (and the data may look corrupted to them - but that's all we can do). If the user enables it, we apply the escaping/unescaping logic I described (not symmetric, but it tries to guess what the user really means).

          Joydeep Sen Sarma added a comment -

          Sorry for dropping the ball on this:

          The case that concerned me was transform. Say we have an input file containing "\005" that was produced outside Hive. Based on what I understand, the proposal is to convert it to code point 5. I feel uncomfortable with this - I would rather pass this data through and let the user (either in the transform script or via an explicit UDF) deal with it. The same concern applies to \0. (Say I dump a directory listing from a Windows machine - a "\0" fragment would be a filename beginning with the character '0' - if instead we unescape it to character 0 and pass it to the script, it becomes difficult to analyze this kind of data. I am also not sure how different languages would handle a null character - some (like C) might drop everything beyond the null character altogether.)

          So my vote would be to keep the unescaping to the bare minimum required and provide other functions for any enhanced semantics.

          Zheng Shao added a comment -

          > if more aggressive unescaping is required - we can always provide unescape UDFs. That would be much better since we could have some standard semantics for the unescaping (json/html/xml etc)

          Let's say the content of the file contains something like "\\005", and also "\005". After unescaping, both of them become "\005", so a UDF won't be able to unescape it further.

          I think it's OK that the escaping and unescaping are not exactly symmetric, given that a string escaped using our logic will always unescape back to the original value.

          The extra logic in unescaping only deals with "impossible" cases - cases that can never occur with our own escaping logic. For those we can either throw an error or "guess" what the user means. My guess is that most other escaping schemes escape "\" as "\\" (if they escape anything with "\", they have to escape "\" too), and they don't special-case the actual separator (most of them escape all special characters). So if there actually is a "\005", it most likely means ASCII code 5.

          I am also OK if we just escape (and unescape, symmetrically) all characters outside the range 32-127, plus "\" and all separators. That makes the logic simpler. But the question remains what to do if we see an "impossible" sequence in the escaped stream, e.g. "\040" (0x20, 32, space).

          Joydeep Sen Sarma added a comment -

          Ignore the comment on '\n' -> '\\n' - looks like it's already covered. The JIRA is not showing the backslashes - but I happened to look at the unread email and it is much more readable.

          Joydeep Sen Sarma added a comment -

          For #1 - so if the row separator can be something other than newline, I am guessing you want to generalize to escaping whatever the row separator is (as opposed to specifically escaping newlines) - right?

          Also - if there is a literal '\n' sequence (backslash followed by 'n') in the original string - are we going to convert it into '\\n' during serialization and then match '\\' back during deserialization (so as to leave '\n' in the deserialized string)? Generalize to other separators as well, of course.

          #2 - I am uncomfortable unescaping more than what we escape. For example, let's say there's a \XXX symbol in some text file. Now we need to feed it into a transform script. We unescape when reading the text file - but we do not escape it back when sending to the script. So the user script does not see the raw data from the file.

          Presumably, if the user put \XXX in the file, then they knew how to handle it in their scripts and we are needlessly tampering with this data.

          If more aggressive unescaping is required, we can always provide unescape UDFs. That would be much better since we could have standard semantics for the unescaping (json/html/xml etc.).

          Zheng Shao added a comment -

          1. The code does allow ROW SEPARATOR to be different from '\n';

          2. For deserialization, we might deserialize user-generated data (data generated outside of Hive). If that data does escape a single backslash as a double backslash, then

          Basically, the principle of escaping in the serialization phase is to escape as little as possible, so in most cases the user does not need to care about escaping when reading the data (if the data does not contain these special characters). And if the user does do unescaping, he just needs to follow the deserialization steps mentioned above - without considering what the actual column separator is.

          The principle of unescaping in the deserialization phase is to unescape as much as possible (well, if the user writes a single backslash as a double backslash, then the only reason there is still a single backslash in the file is that the user escaped something). This also helps the user in that his escaping code does not need to change for different column separators.

          Joydeep Sen Sarma added a comment -

          Looks like it's headed in the right direction. Questions:

          • I like the logic of escaping row/column/key/item separators. Do we allow anything other than newline as the row separator?
          • For deserialization - shouldn't we only unescape what we might have escaped during serialization? (So we should not be unescaping octals and nulls, for example.)
          Zheng Shao added a comment -

          Proposal:

          1. For serialization:

          • \ -> \\
          • newline -> \n
          • carriage return -> \r

          The following characters are escaped only if they are column/item/key separators or quotations:

          • null character -> \000 (octal number)
          • ^A -> \001
          • ^B -> \002
            ...
          • tab -> \t
            ...

          2. For deserialization:

          • \\ -> \
          • \n -> newline
          • \r -> carriage return
          • \xxx (where xxx is an octal number from 000 to 177 (127 in decimal)) -> the character with that code
          • \0 -> null character
          • \ (without a match above) -> \

          In this proposal, we don't support quotation marks (" and '). Quotation will be required to read MySQL/Oracle-dumped data, but I hope to address it in a different SerDe since we need to distinguish the
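          The deserialization rules above can be sketched in Java as follows. This is a hypothetical illustration of the proposal, not Hive's actual code; the class and method names are made up.

```java
// Hypothetical sketch of the proposed unescaping rules:
// \\ -> backslash, \n -> newline, \r -> carriage return,
// \xxx (octal) -> that character, and a '\' with no match is kept as-is.
public class UnescapeDemo {
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i >= s.length()) { // ordinary char, or trailing '\'
                out.append(c);
                continue;
            }
            char next = s.charAt(i++);
            if (next == '\\') {
                out.append('\\');
            } else if (next == 'n') {
                out.append('\n');
            } else if (next == 'r') {
                out.append('\r');
            } else if (next >= '0' && next <= '7') {
                // Consume up to three octal digits total (\0, \05, \005, ...).
                int val = next - '0';
                int digits = 1;
                while (digits < 3 && i < s.length()
                        && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                    val = val * 8 + (s.charAt(i++) - '0');
                    digits++;
                }
                out.append((char) val);
            } else {
                out.append('\\').append(next); // no match: keep the backslash
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\005" (backslash + three octal digits) unescapes to ASCII code 5.
        System.out.println((int) UnescapeDemo.unescape("\\005").charAt(0));
    }
}
```

          Note how this reproduces the asymmetry discussed earlier: both "\\005" and "\005" in a file come out as recognizable values, but "\005" is guessed to mean ASCII code 5 even though Hive's own escaping would never have produced it.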

          Jeff Hammerbacher added a comment -

          Adding to "Serializers/Deserializers" component.

          Zheng Shao added a comment -

          The problem breaks down into 3 items:

          1. When loading the data, the user is responsible for providing data in the expected escaped format. For example, to represent in the text file a field containing a newline, it should be written as "\n". Sequence files do not have to escape "\n", but it is recommended. Both file formats need to escape the column/item/key separators.

          2. When the serde is reading the data, it should unescape the data. This means string.split needs to be replaced by a scan, which incurs a small performance loss.

          3. When the serde is writing the data, it should escape the data. This also means a scan of the string and conversion to a new string, which also incurs a small performance loss.
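          Point 2 - replacing string.split with a scan - can be sketched as follows. This is a hypothetical illustration (not Hive's LazySimpleSerDe): the scan copies escape sequences through verbatim, so an escaped separator stays inside its field instead of starting a new one, and the actual unescaping can happen in a later pass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a row on an unescaped separator in one scan.
public class EscapedSplit {
    static List<String> split(String row, char sep, char esc) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < row.length(); i++) {
            char c = row.charAt(i);
            if (c == esc && i + 1 < row.length()) {
                // Copy the escape char and the escaped char through verbatim;
                // unescaping is a separate, later pass.
                cur.append(c).append(row.charAt(++i));
            } else if (c == sep) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The escaped tab stays inside the first field; only the bare tab splits.
        System.out.println(split("a\\\tb\tc", '\t', '\\'));
    }
}
```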


            People

            • Assignee:
              Zheng Shao
            • Reporter:
              Zheng Shao
            • Votes:
              0
            • Watchers:
              2
