  1. Apache Any23 (Retired)
  2. ANY23-413

CSV Extractor attempts to be too smart

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.3
    • Fix Version/s: 2.8
    • Component/s: extractors
    • Labels: None

    Description

      Currently, our CSV extractor infers the datatype of each cell by attempting to parse the cell as an integer or a float, falling back on "string" when both attempts fail.

      This is problematic because cells that look like numbers may not, in fact, be numbers.

      Consider a column of version numbers, such as:
      4
      4.1
      4.1.1
      etc.

      Currently our csv extractor will assign the following datatypes to this column:
      4 -> integer
      4.1 -> float
      4.1.1 -> string
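
To make the per-cell behavior concrete, here is a minimal sketch of that kind of guessing. It is not the extractor's actual code: the class `CellTypeGuess`, the enum `Guess`, and the method `guess` are hypothetical names invented for this illustration, and the real extractor may order or implement its checks differently.

```java
class CellTypeGuess {
    // Hypothetical labels for the three outcomes described in this issue.
    enum Guess { INTEGER, FLOAT, STRING }

    // Guess a datatype from one cell in isolation, with no column context.
    static Guess guess(String cell) {
        String v = cell.trim();
        try {
            Integer.parseInt(v);
            return Guess.INTEGER;
        } catch (NumberFormatException e) {
            // not an integer, try the next candidate
        }
        try {
            Float.parseFloat(v);
            return Guess.FLOAT;
        } catch (NumberFormatException e) {
            // not a float either
        }
        return Guess.STRING;
    }

    public static void main(String[] args) {
        // The version-number column from the example above.
        for (String cell : new String[] {"4", "4.1", "4.1.1"}) {
            System.out.println(cell + " -> " + guess(cell));
        }
    }
}
```

Because each cell is judged on its own, three values from the same column end up with three different datatypes.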

      We could improve this guessing by parsing the entire column before assigning a datatype, and then using the least-specific datatype encountered. However, this approach has its own cost: we would have to hold the entire table in memory before generating any triples. And it still wouldn't guarantee correctness, since a column of version numbers that happens to contain only values like 4 and 4.1 would still be typed as float.
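
A rough sketch of the column-wide alternative, assuming the whole column has already been buffered. The names `ColumnTypeGuess`, `guessCell`, and `guessColumn` are hypothetical, and treating an empty column as string is a choice made only for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class ColumnTypeGuess {
    // Declared from most specific to least specific, so compareTo() orders them.
    enum Guess { INTEGER, FLOAT, STRING }

    static Guess guessCell(String cell) {
        String v = cell.trim();
        try {
            Integer.parseInt(v);
            return Guess.INTEGER;
        } catch (NumberFormatException e) {
            // fall through to the float check
        }
        try {
            Float.parseFloat(v);
            return Guess.FLOAT;
        } catch (NumberFormatException e) {
            return Guess.STRING;
        }
    }

    // Requires every cell of the column up front, which is the memory cost
    // mentioned above: nothing can be emitted until the last row is read.
    static Guess guessColumn(List<String> column) {
        if (column.isEmpty()) {
            return Guess.STRING; // assumption for this sketch
        }
        Guess widest = Guess.INTEGER;
        for (String cell : column) {
            Guess g = guessCell(cell);
            if (g.compareTo(widest) > 0) {
                widest = g; // keep the least-specific type seen so far
            }
        }
        return widest;
    }

    public static void main(String[] args) {
        System.out.println(guessColumn(Arrays.asList("4", "4.1", "4.1.1")));
        System.out.println(guessColumn(Arrays.asList("4", "4.1")));
    }
}
```

The second call in `main` shows the remaining weakness: a version column holding only 4 and 4.1 is still classified as float, and as floats the distinct versions 4.1 and 4.10 parse to the same value.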

      Without structured data telling us what the original datatype was, I don't think assigning any datatypes other than "string" to string values is worthwhile.

      Another problem is that the extractor strips leading and trailing whitespace from all values, including string values. While this behavior probably wouldn't present a problem for most use cases, it does mean that the algorithm is lossy.
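
The loss can be shown in a few lines. This sketch assumes the stripping behaves like Java's `String.trim()`, which may not be exactly what the extractor does; `TrimLoss` and `sameAfterTrim` are hypothetical names used only here.

```java
class TrimLoss {
    // True when two cells become indistinguishable once whitespace is stripped.
    static boolean sameAfterTrim(String a, String b) {
        return a.trim().equals(b.trim());
    }

    public static void main(String[] args) {
        String left = " 4.1";   // leading space in the source file
        String right = "4.1 ";  // trailing space in the source file
        System.out.println(left.equals(right));         // distinct before stripping
        System.out.println(sameAfterTrim(left, right)); // identical afterwards
    }
}
```

Once the values are stripped, the original cell contents cannot be recovered from the generated triples.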

      Cf. ANY23-218

          People

            Assignee: Unassigned
            Reporter: Hans Brende (hansbrende)