  1. Hive
  2. HIVE-22337

Improve and Expand Text-Based SerDes


Details

    Description

      • Add new SerDe package just for text-based formats: org.apache.hadoop.hive.serde2.text.*
      • Add new SerDe package just for text-based log formats: org.apache.hadoop.hive.serde2.text.log.*
      • Create a coherent hierarchy for processing delimited data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> DelimitedSerDe -> CsvTextSerDe
      • Create a coherent hierarchy for processing regex'ed data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> RegexSerDe -> CommonFormatLogSerDe
      • Create some standard text processors for super-quick out-of-the-box processing: TSV SerDe and CSV SerDe
      • Create some standard log processors for super-quick out-of-the-box processing: Apache Common Log Format and Apache Combined Log Format (Apache HTTP Server Log Parsers)
      • Better default behaviors for processing text
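As a rough illustration of the regex-based log processing proposed above, the sketch below parses one Apache Common Log Format line with a plain Java regular expression. It is not taken from the attached patches: the class name `CommonLogFormatSketch`, the `parse` helper, and the exact pattern are all hypothetical, and returning an all-null row for a non-matching line is an assumption made to match the ticket's goal of querying without failures.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the class from the HIVE-22337 patch.
public class CommonLogFormatSketch {

    // Common Log Format: host ident authuser [timestamp] "request" status bytes
    static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns one value per capture group, or a row of nulls if the line
    // does not match (assumed behavior, so a bad line never fails the query).
    static String[] parse(String line) {
        String[] columns = new String[7];
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            for (int i = 0; i < columns.length; i++) {
                columns[i] = m.group(i + 1);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        for (String column : parse(line)) {
            System.out.println(column);
        }
    }
}
```

A SerDe built on this idea would presumably map each capture group to one table column, which is why the pattern uses exactly one group per Common Log Format field.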

      The default behavior should let users quickly query data without any failures:

      1. When a blank line is encountered, insert a 'null' value for each column
      2. When there are fewer fields in the data than defined in the table schema, shift all available fields left, and fill in 'null' values for all remaining fields
      3. When there are too many fields in the data, the last field in the results will contain all remaining values. Currently, the extra data is silently swallowed and a warning is issued in the YARN logs. A normal user will never see this warning, especially if the job completes successfully. It is better to provide all of the data by default than to hide anything.
      CSV SerDe (examples for a table with three columns):
      "1,2,3"    = ["1","2","3"]
      "1,2,"     = ["1","2",null]
      ""         = [null,null,null]
      "1,2,3,4"  = ["1","2","3,4"]
      
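The three default rules and the CSV examples above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code from the attached patches: the class `LenientCsvSplit` and its `split` method are hypothetical, quoting and escaping are ignored, and treating every empty field as null (not just the trailing one shown in the examples) is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the lenient defaults; not the patch's CsvTextSerDe.
public class LenientCsvSplit {

    // Splits a line into exactly numColumns values:
    //  - blank line      -> all nulls
    //  - too few fields  -> remaining columns are null
    //  - too many fields -> the last column keeps the unsplit remainder
    static List<String> split(String line, int numColumns) {
        if (numColumns < 1) {
            throw new IllegalArgumentException("numColumns must be positive");
        }
        List<String> fields =
            new ArrayList<>(Collections.nCopies(numColumns, (String) null));
        if (line == null || line.isEmpty()) {
            return fields;
        }
        // A positive limit stops splitting after numColumns - 1 delimiters,
        // so any overflow stays in the final element.
        String[] parts = line.split(",", numColumns);
        for (int i = 0; i < parts.length; i++) {
            // Assumption: an empty field is reported as null.
            fields.set(i, parts[i].isEmpty() ? null : parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"1,2,3", "1,2,", "", "1,2,3,4"}) {
            System.out.println(split(line, 3));
        }
    }
}
```

Passing the column count as the `limit` argument of `String.split` is what keeps the overflow in the last field without any extra bookkeeping.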

      Attachments

        1. HIVE-22337.1.patch
          83 kB
          David Mollitor
        2. HIVE-22337.2.patch
          85 kB
          David Mollitor

            People

              belugabehr David Mollitor
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m