Description
When exporting an ORC file with a schema like "struct<a:string,b:binary>", if the data in column "b" contains very long byte sequences (over 4MB), the process can crash with a segmentation fault, or the exported data in column "a" becomes empty strings.
Here is my attempt to explain the code; it may not be entirely correct, so please bear with me.
Following the code in CSVFileImport.cc, when writing to an ORC file, all string-type columns share one DataBuffer inside the function fillStringValues(). When a value is longer than the remaining buffer space, the buffer is resized. The resize() operation invalidates all references and iterators into buffer.data().
In this case, after field "a" has finished writing its data into the buffer, field "b" begins writing and triggers a resize, which invalidates the previous buffer.data(); the pointers in field "a"'s stringBatch, which point into the old buffer.data(), are therefore no longer valid.
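To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern, not the actual CSVFileImport.cc code: a plain std::vector<char> stands in for orc::DataBuffer<char>, and a bare char* stands in for a stringBatch->data entry; all names and sizes are illustrative.

{code:cpp}
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  std::vector<char> buffer(8);  // shared buffer, deliberately tiny to force a resize
  std::size_t offset = 0;

  // Field "a" copies its value in and keeps a raw pointer into the buffer,
  // just as fillStringValues() stores buffer.data() + offset into stringBatch->data[i].
  const char* a = "hello";
  std::memcpy(buffer.data() + offset, a, std::strlen(a));
  const char* aPtr = buffer.data() + offset;  // pointer retained for field "a"
  offset += std::strlen(a);

  // Field "b" does not fit, so the buffer grows. resize() reallocates once the
  // capacity is exceeded, invalidating every pointer taken from buffer.data().
  const char* b = "a value too large for the remaining space";
  while (buffer.size() - offset < std::strlen(b)) {
    buffer.resize(buffer.size() * 2);  // aPtr now dangles
  }
  std::memcpy(buffer.data() + offset, b, std::strlen(b));

  // Reading through aPtr is undefined behavior: depending on what the
  // allocator did, it may print garbage, an empty string, or segfault,
  // matching the symptoms described above.
  std::printf("field a = %.5s\n", aPtr);
  return 0;
}
{code}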
One workaround is to use a separate DataBuffer for each string-type column, though that requires allocating 4MB per column (as in the attached file). Alternatively, all previously assigned stringBatch pointers could be re-pointed to the new DataBuffer's address; a sketch of that idea follows below.
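Below is a hedged sketch of one safe way to realize the re-pointing idea; none of these names exist in CSVFileImport.cc. Instead of storing raw pointers while any column can still grow the shared buffer, record byte offsets first, then convert offsets to pointers in a single fix-up pass once all columns have been copied, so every pointer is taken from the buffer's final address.

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for orc::StringVectorBatch's data/length arrays.
struct StringBatchStub {
  std::vector<char*> data;
  std::vector<int64_t> length;
};

// Phase 1: copy one column's values into the shared buffer, remembering only
// offsets. Resizing here is harmless because no raw pointers into the buffer
// are being held yet.
std::vector<std::size_t> copyColumn(const std::vector<std::string>& values,
                                    std::vector<char>& buffer,
                                    std::size_t& offset) {
  std::vector<std::size_t> starts(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    while (buffer.size() - offset < values[i].size()) {
      buffer.resize(buffer.size() * 2);
    }
    std::memcpy(buffer.data() + offset, values[i].data(), values[i].size());
    starts[i] = offset;
    offset += values[i].size();
  }
  return starts;
}

// Phase 2: once every column has been copied and the buffer can no longer
// grow for this batch, convert offsets to pointers. This is the "re-point to
// the new DataBuffer's address" step, done exactly once at the end.
void assignPointers(StringBatchStub& batch,
                    const std::vector<std::string>& values,
                    const std::vector<std::size_t>& starts,
                    std::vector<char>& buffer) {
  batch.data.resize(values.size());
  batch.length.resize(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    batch.data[i] = buffer.data() + starts[i];
    batch.length[i] = static_cast<int64_t>(values[i].size());
  }
}
{code}

With this split, field "b"'s copy in phase 1 can resize the buffer freely, and field "a"'s pointers are only materialized in phase 2, after the last resize, so they always reference the live allocation.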
Attachments