Hive
  HIVE-634

ctrl-A is the only output delimiter used, regardless of the Hive table structure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.3.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      No matter what the table format, INSERT OVERWRITE LOCAL DIRECTORY will always use ctrl-A delimiters ('\001' ).

      INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines' SELECT * FROM daily_timelines;

      where daily_timelines is defined as tab delimited

      CREATE TABLE daily_timelines (
      page_id BIGINT,
      dates STRING,
      pageviews STRING,
      total_pageviews BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
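      The reported behavior can be confirmed by inspecting the raw bytes of an exported file. The sketch below simulates a file with Hive's ctrl-A ('\001') field delimiter and inspects it with od; the file name mimics Hive's output naming and the row values are made up:

      ```shell
      # Simulate a row as Hive exports it: fields joined by \001 (ctrl-A).
      printf '1\0012009-06-01\001500\001500\n' > 000000_0
      # od -c shows the 001 bytes between fields, confirming the delimiter
      # is ctrl-A even though the table was declared tab-delimited.
      od -c 000000_0
      ```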

      This page also indicates Hive uses a fixed delimiter, and should be updated:

      http://wiki.apache.org/hadoop/Hive/LanguageManual/DML

        Issue Links

          Activity

          Zheng Shao added a comment -

          Good observation. The LazySimpleSerDe that is used to serialize the data already supports a customized "serialization.null.format" - we just need to expose that functionality to our users.
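          As a sketch of what exposing that property might look like (hypothetical usage; the property name "serialization.null.format" is taken from the comment above, and whether SET SERDEPROPERTIES affects exports is an assumption):

          ```sql
          -- Hypothetical: override the null placeholder on the source table so
          -- serialized output contains 'NULL' instead of the default '\N'.
          ALTER TABLE daily_timelines
          SET SERDEPROPERTIES ('serialization.null.format' = 'NULL');
          ```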

          Hide
          Neal Richter added a comment -

          Note that both methods seem to use '\N' in the output files instead of 'NULL'.

          It would be better for the export to use 'NULL' for full compatibility with MySQL and Postgres. I tested '\N' and it is not recognized as NULL when using MySQL's LOAD DATA INFILE. Easy enough for sed to fix... at the cost of another pass through the data.

          Can one also accomplish this goal with a custom Transform?
          http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
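          The sed pass mentioned above can be sketched as follows ('\N' is the literal two-character sequence Hive writes for NULL columns; file names are hypothetical):

          ```shell
          # Sample tab-delimited export row with a NULL column, as Hive writes it.
          printf '1\t\\N\t42\n' > export.tsv
          # Rewrite the '\N' marker as the string 'NULL' for MySQL's LOAD DATA INFILE.
          sed 's/\\N/NULL/g' export.tsv > export_mysql.tsv
          cat export_mysql.tsv
          ```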

          Hide
          Zheng Shao added a comment -

          There won't be any difference - it's kind of syntactic sugar.

          Hide
          Neal Richter added a comment -

          What would be the output difference between the proposed feature and a two-step process as follows?

          DROP TABLE tmp_exports_daily_timelines;
          CREATE TABLE tmp_exports_daily_timelines (
          page_id BIGINT,
          dates STRING,
          pageviews STRING,
          total_pageviews BIGINT
          ) ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          LINES TERMINATED BY '\n'
          STORED AS TEXTFILE
          LOCATION '/tmp/exports/daily_timelines';

          INSERT OVERWRITE TABLE tmp_exports_daily_timelines
          SELECT page_id, dates, pageviews, total_pageviews
          FROM daily_timelines
          [ some WHERE clause here]

          Hide
          Ashish Thusoo added a comment -

          +1 to the proposal.

          Hide
          Zheng Shao added a comment -

          The format of the local file has nothing to do with the format of the source table.

          We should add the formatting sub clause for insert overwrite.

          INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines' 
          [ ROW FORMAT DELIMITED | SERDE ... ]
          [ FILE FORMAT ...]
          SELECT * FROM daily_timelines;
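          A hedged example of how the proposed clause might be used to produce a tab-delimited export (hypothetical syntax - the clause above is a proposal, not an implemented feature):

          ```sql
          -- Hypothetical usage of the proposed formatting sub-clause:
          INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines'
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          SELECT * FROM daily_timelines;
          ```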
          

            People

            • Assignee:
              Unassigned
              Reporter:
              Peter Skomoroch
            • Votes:
              9
              Watchers:
              14

              Dates

              • Created:
                Updated:
                Resolved:
