Details
-
Improvement
-
Status: Resolved
-
Trivial
-
Resolution: Won't Do
-
Impala 1.1, Impala 2.5.0
-
None
-
None
Description
Currently, the precise way to create a CSV or TSV table is to specify:
row format delimited fields terminated by ',' stored as textfile
row format delimited fields terminated by '\t' stored as textfile
Some parts are optional (STORED AS TEXTFILE), others are redundant (is there any TEXTFILE that is not ROW FORMAT DELIMITED? could you have FIELDS TERMINATED BY without ROW FORMAT DELIMITED?) or confusing (could STORED AS or FIELDS TERMINATED BY go ahead of ROW FORMAT?).
Since the 3 clauses STORED AS TEXTFILE, ROW FORMAT DELIMITED, and FIELDS TERMINATED BY always go hand-in-hand, how about some shorthand notation for the most common cases, for example:
STORED AS CSV
STORED AS TSV
to allow creating tables for comma-separated and tab-separated data without the verbose and redundant clauses?
I did a little research online and all the other common separators do not have well-known acronyms. Sometimes space is a separator, but that is a recipe for trouble with most kinds of string data. Pipe is popular, but there is no tradition of a .psv file extension or PSV acronym.* Colon is sometimes used, but any acronym would conflict with CSV for comma.
Cf: http://en.wikipedia.org/wiki/Delimiter-separated_values
- Now that I've had some more experience with choosing delimiter characters, I'm voting for STORED AS PSV to make a table with pipe-separated values. Sometimes you need a character that's visible, so tab won't do. So pipe is the logical fallback choice if the data contains any commas.