[HIVE-22337] Improve and Expand Text-Based SerDes - ASF JIRA

Add new SerDe package just for text-based formats: org.apache.hadoop.hive.serde2.text.*
Add new SerDe package just for text-based log formats: org.apache.hadoop.hive.serde2.text.log.*
Create a coherent hierarchy for processing delimited data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> DelimitedSerDe -> CsvTextSerDe
Create a coherent hierarchy for processing regex'ed data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> RegexSerDe -> CommonFormatLogSerDe
Create some standard text processors for super-quick out-of-the-box processing: TSV SerDe and CSV SerDe
Create some standard log processors for super-quick out-of-the-box processing: Apache Common Log Format and Apache Combined Log Format (Apache HTTP Server Log Parsers)
Better default behaviors for processing text
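
As a rough illustration of the regex-based log processing listed above, the following sketch parses one Apache Common Log Format line with a plain java.util.regex pattern. It is not the proposed CommonFormatLogSerDe: the class name, the method name, and the exact pattern are assumptions made for this example, and returning an all-null row for a non-matching line is likewise only one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hive.
class CommonLogFormatSketch {

    // Assumed pattern: host ident authuser [timestamp] "request" status bytes
    static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns one string per capture group, or an all-null row
    // when the line does not match (an assumption of this sketch).
    static String[] parse(String line) {
        Matcher m = CLF.matcher(line);
        String[] fields = new String[m.groupCount()];
        if (!m.matches()) {
            return fields; // every element is still null
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

A pattern like this would sit at the RegexSerDe level of the hierarchy, with the log-specific class supplying only the expression and the column names.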

The default behavior should allow users to quickly query data without any failures.

When a blank line is encountered, insert a 'null' value for each column
When there are fewer fields in the data than defined in the table schema, shift all available fields left, and fill in 'null' values for all remaining fields
When there are too many fields in the data, the last field in the results will contain all remaining values. Currently, the extra data is silently swallowed and a warning is issued in the YARN logs. A normal user will never see this warning, especially if the job completes successfully. It is better to provide all of the data by default than to hide anything.

CSV SerDe

"1,2,3"    = ["1","2","3"]
"1,2,"     = ["1","2",null]
""         = [null,null,null]
"1,2,3,4"  = ["1","2","3,4"]
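
The four cases above can be reproduced with a small sketch built on String.split with a positive limit. This is not the proposed CsvTextSerDe: the class and method names are hypothetical, quoting and escaping are ignored, and mapping an empty field to null is an assumption chosen so that the "1,2," case matches the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hive.
class LenientCsvSplit {

    // Splits a line into exactly columnCount values (columnCount >= 1).
    static List<String> split(String line, int columnCount) {
        List<String> row = new ArrayList<>(columnCount);
        // A positive limit caps the number of pieces, so any extra
        // delimiters stay inside the last field instead of being dropped.
        String[] parts = line.split(Pattern.quote(","), columnCount);
        for (String part : parts) {
            // Assumption: an empty field is reported as null.
            row.add(part.isEmpty() ? null : part);
        }
        // Too few fields: pad the remaining columns with null.
        while (row.size() < columnCount) {
            row.add(null);
        }
        return row;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"1,2,3", "1,2,", "", "1,2,3,4"}) {
            System.out.println(split(line, 3));
        }
    }
}
```

Because the limit is applied during the split, the surplus "3,4" is kept as a single trailing value rather than discarded, which is the visible-by-default behavior the issue argues for.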

relates to

HIVE-22360 MultiDelimitSerDe returns wrong results in last column when the loaded file has more columns than those in table schema

HIVE-15826 Add 'serialization.encoding' To All SerDes

links to

GitHub Pull Request #815

Estimated: Not Specified

Logged: 40m