  1. Hive
  2. HIVE-5774

INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA

Details

    • hive insert overwrite partition dynamic static

    Description

      After several rounds of forensic analysis, we are convinced that there is a bug when rebuilding a table with dynamic partitioning over a range of more than 30 days: the row counts do not match.

      In detail:
      Part A – original_table
      2013-01-01; 394,755 rows
      2013-01-02; 424,448
      2013-01-03; 427,201
      ...
      2013-10-30; 3,234,472

      Part B – copy_of_original_table_new
      2013-01-01; 372,628 rows
      2013-01-02; 400,553
      2013-01-03; 403,495
      ...
      2013-10-30; 2,865,877

      The query used to populate original_table is the same one used to populate the "copy_of_original_table_new" table. When we rebuilt for 1 day, e.g. 2013-01-01, the row count of copy_of_original_table_new matched original_table exactly.
      When we rebuilt for 7 days, the row counts matched exactly.
      When we rebuilt for 15 days, the row counts matched exactly.
      When we rebuilt for 35 days, 80% of the dates matched exactly. The other 20% were off by hundreds to tens of thousands of rows (a variance of up to 3%).
      When we rebuilt for 303 days (10 months), nothing matched at all.

      In other words, the more days are covered by WHERE dt BETWEEN dateStart AND dateEnd, the more dates end up with a row count that does not match original_table.
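
      The per-date comparison described above can be sketched as a small script. This is only an illustration, not the tooling we actually used: it assumes the per-partition counts have already been exported from both tables (for example with SELECT dt, COUNT(*) ... GROUP BY dt) into two dictionaries. The function name compare_partition_counts is hypothetical; the sample numbers are the ones quoted in this report.

```python
from typing import Dict, List, Tuple


def compare_partition_counts(
    original: Dict[str, int], rebuilt: Dict[str, int]
) -> List[Tuple[str, int, int, float]]:
    """Return (dt, original_rows, rebuilt_rows, relative_diff) for each mismatching date."""
    mismatches = []
    for dt in sorted(original):
        expected = original[dt]
        actual = rebuilt.get(dt, 0)  # a missing partition counts as 0 rows
        if actual != expected:
            rel = abs(expected - actual) / expected if expected else float("inf")
            mismatches.append((dt, expected, actual, rel))
    return mismatches


# Row counts quoted in this report (only the dates that were listed).
original_table = {
    "2013-01-01": 394755,
    "2013-01-02": 424448,
    "2013-01-03": 427201,
    "2013-10-30": 3234472,
}
copy_of_original_table_new = {
    "2013-01-01": 372628,
    "2013-01-02": 400553,
    "2013-01-03": 403495,
    "2013-10-30": 2865877,
}

for dt, expected, actual, rel in compare_partition_counts(
    original_table, copy_of_original_table_new
):
    print(f"{dt}: original={expected:,} rebuilt={actual:,} "
          f"missing={expected - actual:,} ({rel:.1%})")
```

      Running the same comparison after each rebuild (1, 7, 15, 35 and 303 days) makes it easy to see at which range the mismatches start to appear.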

      However, when we rebuilt each of the mismatched 20% statically, one date at a time, the result was surprising: every one of them matched the original_table row count!
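
      As a rough sketch of that static, date-by-date rebuild, the script below generates one static-partition INSERT OVERWRITE statement per day. It is an assumption-laden illustration: the partition column is assumed to be dt (as in the WHERE clause above), and the SELECT body, source_table, col_a and col_b are hypothetical placeholders, not the real query from the pastebin link.

```python
from datetime import date, timedelta
from typing import Iterator, List


def daterange(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def static_insert_statements(
    target: str, select_body: str, start: date, end: date
) -> List[str]:
    """Build one static-partition INSERT OVERWRITE statement per date.

    select_body must not select the partition column itself, because the
    partition value is fixed in the PARTITION clause.
    """
    statements = []
    for day in daterange(start, end):
        dt = day.isoformat()  # e.g. '2013-01-01'
        statements.append(
            f"INSERT OVERWRITE TABLE {target} PARTITION (dt='{dt}') "
            f"{select_body} WHERE dt = '{dt}'"
        )
    return statements


stmts = static_insert_statements(
    "copy_of_original_table_new",
    "SELECT col_a, col_b FROM source_table",  # hypothetical placeholder query
    date(2013, 1, 1),
    date(2013, 1, 3),
)
for stmt in stmts:
    print(stmt + ";")
```

      This trades a single dynamic-partition job for many small static ones, so it is slower, but it matches the only approach that gave us correct row counts.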

      Apologies in advance if this is not technical enough, but I hope the message is clear. We believe there is a bug. We are not sure how to check our Hive version, but our Hadoop version is "Hadoop 2.0.0-cdh4.1.1".

      For a glimpse of the INSERT OVERWRITE sql, it's here – http://pastebin.com/g1qxsUm2

          People

            Assignee: Unassigned
            Reporter: Danny Teok (da789)