Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-8763

Support for use of enclosed quotes in LazySimpleSerde

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0, 0.12.0, 0.13.0, 0.13.1
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      many - verified on Centos / Redhat with CDH

    • Tags:
      CSV

      Description

      Currently the LazySimpleSerde does not support the use of quotes for delimited fields to allow use of separators within a quoted field - this means having to use alternatives for many common use cases for CSV style data.

      Key scenarios that do not work include:
      (3 column row for int, string, float delimited by ',')
      100,"3.5 inch hard drive, quantity 10",2650.30
      100,"3.5 \" hard drive, quantity 10",2650.30
      100, "3.5 "" hard drive, quantity 10", 2650.30
      100,"3.5 "" hard drive, quantity 10",2650.30

      There are a number of fixes that I have implemented support in the deserialization stage to a copy of the Lazy simple serde to address this:

      For serialization, the code is unchanged with the relevant embedded characters being escaped.

      Assuming a row with 3 fields - SKU ID, description, price, delimited by ','

      1) allow use of enclosed quotes around a string field
      For example

      100,"3.5 inch hard drive, quantity 10",2650.30

      2) support escaping of quotes within field to allow use of embedded quote
      100,"3.5 \" hard drive, quantity 10",2650.30

      3) support for old style CSV embedded quotes
      for example

      100,"3.5 "" hard drive, quantity 10",2650.30

      4) support for skipping of leading spaces in field
      For example (note space between first ',' and opening quote)

      100, "3.5 "" hard drive, quantity 10", 2650.30

      In each case, with the changes these are evaluated as though the delimiters and embedded quotes were escaped:
      e.g

      100, 3.5 \" hard drive\, quantity 10, 2650.30

      All of these are enabled or disabled using serde properties for quotechar, whether enclosed quotes is supported, whether double embedded quotes are treated as single quote (of same char type)

      1. HIVE-8763.1.patch
        72 kB
        ronan stokes

        Issue Links

          Activity

          Hide
          rstokes@cloudera.com ronan stokes added a comment -

          To avoid any performance issues, the SERDE modifications will not support embedded record delimiters in quoted strings . For example if the source data uses newline (UTF-8 0x0a) as the record delimiter, the modifications will not do anything specifically to handle that - nor will they disallow it.

          As handling of embedded record delimiters requires changes to the underlying input format, I am not proposing to handle embedded record delimiters with these modifications.

          Show
          rstokes@cloudera.com ronan stokes added a comment - To avoid any performance issues, the SERDE modifications will not support embedded record delimiters in quoted strings . For example if the source data uses newline (UTF-8 0x0a) as the record delimiter, the modifications will not do anything specifically to handle that - nor will they disallow it. As handling of embedded record delimiters requires changes to the underlying input format, I am not proposing to handle embedded record delimiters with these modifications.
          Hide
          rstokes@cloudera.com ronan stokes added a comment -

          Initial patch for changes

          Adds 4 new properties to LazySimpleSerde

          "field.doublequotes.as.quote" (true/false - default false)
          "field.rtrim"; - (true/false - default false)
          "field.ltrim" - (true/false - default false)

          and uses "quote.delim" to specify quote

          Currently support is for read only and applies to CHAR, VARCHAR and STRING inside in CSV style data

          ltrim and rtrim options will allow both string and non string data to be parsed correctly when leading or trailing spaces or tabs are present. For strings ltrim/rstrim will skip whitespace either side of quotes if quote enclosures are being used.

          Show
          rstokes@cloudera.com ronan stokes added a comment - Initial patch for changes Adds 4 new properties to LazySimpleSerde "field.doublequotes.as.quote" (true/false - default false) "field.rtrim"; - (true/false - default false) "field.ltrim" - (true/false - default false) and uses "quote.delim" to specify quote Currently support is for read only and applies to CHAR, VARCHAR and STRING inside in CSV style data ltrim and rtrim options will allow both string and non string data to be parsed correctly when leading or trailing spaces or tabs are present. For strings ltrim/rstrim will skip whitespace either side of quotes if quote enclosures are being used.
          Hide
          Ferd Ferdinand Xu added a comment -

          Hi Ronan Stokes, can you please create a review board entry for your patch?

          Show
          Ferd Ferdinand Xu added a comment - Hi Ronan Stokes , can you please create a review board entry for your patch?

            People

            • Assignee:
              Unassigned
              Reporter:
              rstokes@cloudera.com ronan stokes
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development