Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
ghx-label-3
Description
Partition values are incorrectly URL-encoded in backend for unicode characters, e.g. '运营业务数据' is encoded to '�%FFFFFFBF�营业务数据' which is wrong.
To reproduce the issue, first create a partition table:
create table my_part_tbl (id int) partitioned by (p string) stored as parquet;
Then insert data into it using partition values containing '运'. They will fail:
[localhost:21050] default> insert into my_part_tbl partition(p='运营业务数据') values (0); Query: insert into my_part_tbl partition(p='运营业务数据') values (0) Query submitted at: 2022-08-16 10:03:56 (Coordinator: http://quanlong-OptiPlex-BJ:25000) Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=404ac3027c4b7169:39d16a2d00000000 ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d00000000/.404ac3027c4b7169-39d16a2d00000000_1475855322_dir/p=�%FFFFFFBF�营业务数据/404ac3027c4b7169-39d16a2d00000000_1585092794_data.0.parq Error(5): Input/output error [localhost:21050] default> insert into my_part_tbl partition(p='运') values (0); Query: insert into my_part_tbl partition(p='运') values (0) Query submitted at: 2022-08-16 10:04:22 (Coordinator: http://quanlong-OptiPlex-BJ:25000) Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=a64e5883473ec28d:86e7e33500000000 ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e33500000000/.a64e5883473ec28d-86e7e33500000000_1582623091_dir/p=�%FFFFFFBF�/a64e5883473ec28d-86e7e33500000000_163454510_data.0.parq Error(5): Input/output error
However, partition value without the character '运' is OK:
[localhost:21050] default> insert into my_part_tbl partition(p='营业务数据') values (0); Query: insert into my_part_tbl partition(p='营业务数据') values (0) Query submitted at: 2022-08-16 10:04:13 (Coordinator: http://quanlong-OptiPlex-BJ:25000) Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=b04894bfcfc3836a:b1ac903600000000 Modified 1 row(s) in 0.21s
Hive is able to execute all these statements.
I'm able to narrow down the issue into Backend, where we URL-encode the partition value in HdfsTableSink::InitOutputPartition():
string value_str; partition_key_expr_evals_[j]->PrintValue(value, &value_str); // Directory names containing partition-key values need to be UrlEncoded, in // particular to avoid problems when '/' is part of the key value (which might // occur, for example, with date strings). Hive will URL decode the value // transparently when Impala's frontend asks the metastore for partition key values, // which makes it particularly important that we use the same encoding as Hive. It's // also not necessary to encode the values when writing partition metadata. You can // check this with 'show partitions <tbl>' in Hive, followed by a select from a // decoded partition key value. string encoded_str; UrlEncode(value_str, &encoded_str, true); string part_key_value = (encoded_str.empty() ? table_desc_->null_partition_key_value() : encoded_str); // If the string is empty, map it to nullptr (mimicking Hive's behaviour) partition_name_ss << part_key_value; if (is_external_part) { external_partition_name_ss << part_key_value; }
Current code of UrlEncode seems not handling UTF-8 encoded strings correctly:
static inline void UrlEncode(const char* in, int in_len, string* out, bool hive_compat) { (*out).reserve(in_len); stringstream ss; for (int i = 0; i < in_len; ++i) { const char ch = in[i]; // Escape the character iff a) we are in Hive-compat mode and the // character is in the Hive whitelist or b) we are not in // Hive-compat mode, and the character is not alphanumeric or one // of the four commonly excluded characters. if ((hive_compat && HiveShouldEscape(ch)) || (!hive_compat && !(isalnum(ch) || ShouldNotEscape(ch)))) { ss << '%' << uppercase << hex << static_cast<uint32_t>(ch); } else { ss << ch; } } (*out) = ss.str(); }
With adding some more loggings, I can see "运营业务数据" is encoded into "�%FFFFFFBF�营业务数据" by this method.
Attachments
Issue Links
- is related to
-
IMPALA-12987 Errors with \0 character in partition values
- Open