Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
We encountered an interesting bug with Impala/Hive where Impala sometimes returns double the expected number of rows.
Here's what happened.
- Impala issues DROP TABLE, which moves the table to the trash directory.
- At the same time, Ozone's trash policy moves the trash directory .Trash/Current to .Trash/<timestamp>.
- HMS checks .Trash/Current, sees that it exists, and tries to move the table directory under .Trash/Current, which is being moved away at that moment.
- The rename fails, but Ozone's rename returns false instead of throwing an exception. The trash policy doesn't check the return value and simply assumes the rename succeeded.
- As a result, HMS never actually deletes the files, so they’re left behind in the standard location for this table.
- Impala creates a new table. This is purely a metadata operation; it doesn’t do anything in the filesystem. When a table is partitioned, partition existence is also metadata, so the table starts with 0 partitions.
- Because there are 0 partitions, the first select count doesn’t look for files at all. It immediately returns 0 rows.
- INSERT OVERWRITE adds a partition. That partition's location is the standard location, which happens to already contain the leftover files. The OVERWRITE part does nothing, because as far as Impala knew the partition did not exist before (this might be considered a bug, though it is a weird corner case). The INSERT part adds new files with unique names to the partition directory. We now have double the number of files.
- The next select count reads all the files in the partition, old and new, and returns double the expected count.
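
The sequence above can be replayed with a small in-memory model. This is only an illustrative sketch, not Ozone, HMS, or Impala code: `ToyFs`, `count_rows`, `insert_overwrite`, and `simulate` are hypothetical names, the paths, the timestamp, and the `.parq` file names are made up, each file stands for one row, and the table directory and its single partition directory are collapsed into one path for brevity.

```python
import uuid


class ToyFs:
    """In-memory stand-in for a filesystem: directory path -> set of file names."""

    def __init__(self):
        self.dirs = {}

    def exists(self, path):
        return path in self.dirs

    def rename(self, src, dst_parent):
        # Like the rename described above: failure is signalled by returning
        # False rather than by raising an exception.
        if src not in self.dirs or dst_parent not in self.dirs:
            return False
        name = src.rstrip("/").rsplit("/", 1)[-1]
        self.dirs[f"{dst_parent}/{name}"] = self.dirs.pop(src)
        return True


def count_rows(fs, partitions):
    # One "row" per file; only partitions registered in metadata are scanned.
    return sum(len(fs.dirs.get(loc, ())) for loc in partitions)


def insert_overwrite(fs, partitions, loc, n_files):
    if loc in partitions:
        # Only a partition already known to the metadata gets cleared.
        fs.dirs[loc] = set()
    else:
        partitions.append(loc)
        fs.dirs.setdefault(loc, set())
    # New files get unique names, so they never collide with leftovers.
    fs.dirs[loc].update(f"{uuid.uuid4().hex}.parq" for _ in range(n_files))


def simulate():
    fs = ToyFs()
    loc = "/warehouse/t/p=1"              # hypothetical data location
    current = "/user/hive/.Trash/Current"  # hypothetical trash location
    fs.dirs[loc] = {"old_0.parq", "old_1.parq"}
    fs.dirs[current] = set()
    partitions = [loc]

    # DROP TABLE races with the trash checkpoint.
    saw_current = fs.exists(current)  # the existence check passes
    fs.dirs["/user/hive/.Trash/1700000000"] = fs.dirs.pop(current)  # checkpoint wins
    if saw_current:
        fs.rename(loc, current)  # returns False; the result is ignored
    partitions = []  # the metadata is dropped regardless

    # CREATE TABLE is metadata only: zero partitions, files untouched.
    first_count = count_rows(fs, partitions)
    insert_overwrite(fs, partitions, loc, n_files=2)
    second_count = count_rows(fs, partitions)
    return first_count, second_count


print(simulate())  # (0, 4)
```

In this model, checking the value returned by `rename` (and failing the drop when it is `False`) would surface the problem at DROP TABLE time instead of letting the leftover files be counted later.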