[ANY23-413] CSV Extractor attempts to be too smart - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.3
Fix Version/s: 2.8
Component/s: extractors
Labels: None

Description

Currently, our CSV extractor tries to figure out what the datatype of each cell is simply by attempting to parse a float or integer from the cell and falling back on "string".

This is problematic because cells that look like numbers may not, in fact, be numbers.

Consider a column of version numbers, such as:
4
4.1
4.1.1
etc.

Currently, our CSV extractor will assign the following datatypes to this column:
4 -> integer
4.1 -> float
4.1.1 -> string
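
To illustrate, here is a minimal Python sketch of per-cell guessing as described above. It is not the actual Any23 code; `guess_cell_type` is a hypothetical name, and Python's `int()`/`float()` parsing rules stand in for whatever the extractor really uses.

```python
def guess_cell_type(cell: str) -> str:
    """Hypothetical per-cell guess: try integer, then float, else string."""
    value = cell.strip()
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


# The version-number column from the description.
column = ["4", "4.1", "4.1.1"]
guessed = {cell: guess_cell_type(cell) for cell in column}

for cell, datatype in guessed.items():
    print(f"{cell} -> {datatype}")
```

Each cell is typed in isolation, so a single logical column ends up with three different datatypes.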

We could improve this guessing ability by parsing the entire column before assigning a datatype, and then using the least-specific datatype encountered. However, this solution would also be problematic because then we'd have to hold the entire table in memory before generating any triples. And it still wouldn't guarantee correctness.
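
A rough Python sketch of that column-wide approach, under the assumption that datatypes are ordered integer < float < string from most to least specific. The names `RANK`, `guess_cell_type`, and `guess_column_type` are hypothetical and do not come from Any23.

```python
# Assumed ordering: a higher rank means a less specific datatype.
RANK = {"integer": 0, "float": 1, "string": 2}


def guess_cell_type(cell: str) -> str:
    """Hypothetical per-cell guess: try integer, then float, else string."""
    value = cell.strip()
    for datatype, parser in (("integer", int), ("float", float)):
        try:
            parser(value)
            return datatype
        except ValueError:
            continue
    return "string"


def guess_column_type(cells) -> str:
    """Return the least specific datatype seen anywhere in the column."""
    buffered = list(cells)  # the whole column must be held before any output
    if not buffered:
        return "string"
    return max((guess_cell_type(c) for c in buffered), key=RANK.__getitem__)


print(guess_column_type(["4", "4.1", "4.1.1"]))  # whole column typed as string
print(guess_column_type(["4", "4.1"]))           # still guessed as float
```

The second call shows why this does not guarantee correctness: a column of version numbers that happens to contain only "4" and "4.1" would still be typed as float.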

Without structured data telling us what the original datatype was, I don't think assigning any datatypes other than "string" to string values is worthwhile.

Another problem is that the extractor strips leading and trailing whitespace from all values, including string values. While this behavior probably wouldn't present a problem for most use cases, it does mean that the algorithm is lossy.
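
A small Python illustration of why stripping is lossy. The cell values are made up for the example, and `str.strip()` stands in for whatever trimming the extractor performs.

```python
# Three distinct cell values that differ only in surrounding whitespace.
original = ["  padded", "padded  ", "padded"]

# After stripping, the distinction between them is gone.
stripped = [cell.strip() for cell in original]

distinct_before = len(set(original))
distinct_after = len(set(stripped))

print(distinct_before, distinct_after)
```

Once the values have been stripped, there is no way to recover which of the original strings a given literal came from.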

Cf. ANY23-218

People

Assignee: Unassigned

Reporter: Hans Brende

Votes: 0

Watchers: 1

Dates

Created: 29/Oct/18 17:24

Updated: 21/Feb/22 18:24