Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
2.3
None
Description
Currently, our CSV extractor guesses the datatype of each cell simply by attempting to parse the cell as a float or an integer, falling back on "string" if both attempts fail.
This is problematic because cells that look like numbers may not, in fact, be numbers.
Consider a column of version numbers, such as:
4
4.1
4.1.1
etc.
Currently, our CSV extractor assigns the following datatypes to this column:
4 -> integer
4.1 -> float
4.1.1 -> string
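The per-cell guessing described above can be reproduced with a small sketch. This is not the extractor's actual code: the function name guess_cell_type and the regular expressions are hypothetical, chosen only to mimic "try integer, then float, else string".

```python
import re

# Hypothetical patterns standing in for the extractor's number parsing.
_INTEGER = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def guess_cell_type(cell: str) -> str:
    """Guess a datatype from a single cell, with no knowledge of its column."""
    if _INTEGER.fullmatch(cell):
        return "integer"
    if _FLOAT.fullmatch(cell):
        return "float"
    return "string"


# A column of version numbers ends up with three different datatypes.
column = ["4", "4.1", "4.1.1"]
guesses = [guess_cell_type(cell) for cell in column]
for cell, guess in zip(column, guesses):
    print(f"{cell} -> {guess}")
```

Because each cell is judged in isolation, nothing ties the three guesses together, even though all three values belong to the same column.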
We could improve the guess by parsing the entire column before assigning a datatype, and then using the least specific datatype encountered. However, this solution would also be problematic, because we would have to hold the entire table in memory before generating any triples. And it still wouldn't guarantee correctness: a column containing only 4 and 4.1 would be typed as float even if the values are version numbers.
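A minimal sketch of the column-wide approach, assuming the same hypothetical per-cell guesser and a hypothetical ordering integer < float < string from most to least specific. It illustrates both drawbacks: every cell of the column must be seen before a type can be assigned, and the result can still be wrong.

```python
import re
from typing import Iterable

# Hypothetical ordering, from most specific to least specific.
_SPECIFICITY = ["integer", "float", "string"]

_INTEGER = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def guess_cell_type(cell: str) -> str:
    """Per-cell guess: integer, then float, else string."""
    if _INTEGER.fullmatch(cell):
        return "integer"
    if _FLOAT.fullmatch(cell):
        return "float"
    return "string"


def guess_column_type(cells: Iterable[str]) -> str:
    """Return the least specific datatype found among the cells of a column."""
    widest = -1  # index into _SPECIFICITY; -1 means no cell has been seen
    for cell in cells:
        widest = max(widest, _SPECIFICITY.index(guess_cell_type(cell)))
        if widest == len(_SPECIFICITY) - 1:
            break  # "string" cannot be widened any further
    # An empty column carries no evidence, so fall back on "string".
    return _SPECIFICITY[widest] if widest >= 0 else "string"


full_column = guess_column_type(["4", "4.1", "4.1.1"])
partial_column = guess_column_type(["4", "4.1"])
print(full_column, partial_column)
```

The second call shows why this is still only a guess: without the 4.1.1 row, the same version-number column is typed as float.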
Without structured data telling us what the original datatype was, I don't think assigning any datatypes other than "string" to string values is worthwhile.
Another problem is that the extractor strips leading and trailing whitespace from all values, including string values. While this behavior probably wouldn't present a problem for most use cases, it does mean that the algorithm is lossy.
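The loss of information can be seen in a short sketch. Python's str.strip() stands in here for whatever trimming the extractor performs, and the sample values are made up for illustration.

```python
# Three distinct source values that differ only in surrounding whitespace.
raw_values = ["abc", " abc", "abc  "]

# str.strip() is used as a stand-in for the extractor's trimming step.
stripped_values = [value.strip() for value in raw_values]

distinct_before = len(set(raw_values))
distinct_after = len(set(stripped_values))
print(distinct_before, distinct_after)
```

Once the values have been trimmed, the original cells cannot be recovered from the output.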
Cf. ANY23-218