Hive / HIVE-5590

select returns duplicated records in Hive when a .deflate file larger than 64MB is loaded into a Hive table


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Environment:

      cdh4

    • Tags:
64M hive hdfs count(*) duplicated records

      Description

We occasionally have compressed files larger than 160MB in .deflate format. One such file was loaded into Hive through an external table, say table T_A.
When we run select count(*) from T_A, we get about 70% more records than "hadoop fs -text /xxxxx | wc -l" reports for the same file.
Any clue about this? How could it happen?

The oversized .deflate file was due to imperfect upstream processing; once we fixed that and produced files smaller than 64MB, the problem no longer occurred. But since it is not guaranteed that a larger file will never show up again, is there any way to avoid this issue?
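One plausible mechanism, sketched below with Python's zlib (an assumption, not a confirmed diagnosis of this CDH4 behavior): raw .deflate streams have no sync markers, so they are not splittable. If an input format nonetheless creates one split per HDFS block (64MB here) and each split's reader falls back to decompressing the stream from byte 0, every record is emitted once per split. A file under one block gets one split and a correct count; a larger file gets duplicated records, roughly in proportion to the number of extra splits.

```python
import zlib

# Build a "file" of 1000 newline-terminated records, deflate-compressed.
records = b"".join(b"row-%d\n" % i for i in range(1000))
compressed = zlib.compress(records)

def mapper(split_start, split_end):
    # Hypothetical faulty reader: deflate has no sync markers, so it
    # cannot seek to split_start inside the stream; it decompresses
    # the whole file from byte 0 regardless of the split it was given.
    data = zlib.decompress(compressed)
    return data.count(b"\n")

# One split (file smaller than the block size): correct count.
assert mapper(0, len(compressed)) == 1000

# Two splits (file larger than one block): every record counted twice.
total = mapper(0, 64) + mapper(64, len(compressed))
print(total)  # 2000 instead of 1000
```

Under this reading, the observed workaround (keeping files under 64MB) works because such files produce exactly one split. The standard Hadoop-side alternatives would be a splittable codec (e.g. bzip2) or a splittable container such as SequenceFile, so that file size no longer matters.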

      cheers!
      eye

        Attachments

          Activity

            People

• Assignee: Unassigned
• Reporter: eye eye
• Votes: 0
• Watchers: 1

              Dates

              • Created:
                Updated:

                Time Tracking

• Estimated: 48h
• Remaining: 48h
• Logged: Not Specified