Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-6842

Export to CSV using CREATE TABLE AS (CTAS) wrong parsed

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0
    • Fix Version/s: None
    • Labels:
    • Environment:

      Description

      When you export to a CSV using CTAS the result of a query, most of the time the generated file is OK, but if you have in the results text columns with "," characters, the resulting CSV file is broken, because does not enclose the cells with commas inside with the " character.

      Steps to reproduce the bug:

      Lets say you have the following table in some source of data, maybe a CSV file too:

      /tmp/input.csv
      product_ean,product_name,product_brand
      12345678900,IPhone X,Apple
      99999911100,"Samsung S9, Black",Samsung
      11111223456,Smartwatch XY,Some Brand
      

      Note that the second row of data, in the column "product_name", it has a value with a comma inside (Samsung S9, Black), so all the cell value is enclosed with " characters, while the rest of the column cells aren't, despite they could be enclosed too.

      So if you query this file, Drill will interpret correctly the file and does not interpret that comma inside the cell as a separator like the rest of the commas in the file:

      0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
      +--------------+--------------------+----------------+
      | product_ean  |    product_name    | product_brand  |
      +--------------+--------------------+----------------+
      | 12345678900  | IPhone X           | Apple          |
      | 99999911100  | Samsung S9, Black  | Samsung        |
      | 11111223456  | Smartwatch XY      | Some Brand     |
      +--------------+--------------------+----------------+
      3 rows selected (1.874 seconds)
      

      But now, if you want to query the file and export the result as CSV using the CTAS feature, using the following steps:

      0: jdbc:drill:zk=local> USE dfs.tmp;
      +-------+--------------------------------------+
      |  ok   |               summary                |
      +-------+--------------------------------------+
      | true  | Default schema changed to [dfs.tmp]  |
      +-------+--------------------------------------+
      1 row selected (0.13 seconds)
      0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
      +-------+------------------------+
      |  ok   |        summary         |
      +-------+------------------------+
      | true  | store.format updated.  |
      +-------+------------------------+
      1 row selected (0.094 seconds)
      0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM dfs.`/tmp/input.csv`;
      +-----------+----------------------------+
      | Fragment  | Number of records written  |
      +-----------+----------------------------+
      | 0_0       | 3                          |
      +-----------+----------------------------+
      1 row selected (0.453 seconds)
      

      The output file is this:

      /tmp/my_output/0_0_0.csv
      product_ean,product_name,product_brand
      12345678900,IPhone X,Apple
      99999911100,Samsung S9, Black,Samsung
      11111223456,Smartwatch XY,Some Brand
      

      The text Samsung S9, Black in the cell is not quoted, so any CSV interpreter like an office tool, a Java/Python/... library will interpret it as two cell instead of one. Even Apache Drill will interpret it wrong:

      0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
      +--------------+----------------+----------------+
      | product_ean  |  product_name  | product_brand  |
      +--------------+----------------+----------------+
      | 12345678900  | IPhone X       | Apple          |
      | 99999911100  | Samsung S9     |  Black         |
      | 11111223456  | Smartwatch XY  | Some Brand     |
      +--------------+----------------+----------------+
      3 rows selected (0.175 seconds)
      

      Note that the ending part _ Black_ was interpreted as a following cell, and the real following cell is not showed, but it's not an error in the Drill interpreter, it's an error of how Drill exported the result that now in the last query was used as input.

      Here is how the file is interpreted by LibreOffice Calc:

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              mrsarm Mariano Ruiz
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: