Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7389

JSON empty list avoid Parquet creation

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      With a JSON file with only one row with an empty list as below, it's possible to request the file but there is an error when trying to create a Parquet
      File ANIMALS_1.json:

      {"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
      
      SELECT * FROM ....`ANIMALS_1.json`;
      +-------------------------------------------------------------------------------+
      |                                animals                                        |
      +-------------------------------------------------------------------------------+
      | {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
      +-------------------------------------------------------------------------------+
      
      CREATE TABLE ....`ANIMALS_1_pqt` AS 
      (SELECT * FROM ....`ANIMALS_1.json`);
      =>
      Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an empty group: optional group Horse {}
      

       
      But if the json file contains a second line with a non-empty list for "Horse", it's possible to request file and create the Parquet
      File ANIMALS_2.json:

      {"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
      {"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
      
      SELECT * FROM ....`ANIMALS_2.json`;
      +-------------------------------------------------------------------------------+
      |                                     animals                          +-------------------------------------------------------------------------------+
      | {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
      | {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}} |
      +-------------------------------------------------------------------------------+
      
      CREATE TABLE ....`ANIMALS_2_pqt` AS 
      (SELECT * FROM ....`ANIMALS_2.json`);
      +----------+---------------------------+
      | Fragment | Number of records written |
      +----------+---------------------------+
      | 0_0      | 2                         |
      +----------+---------------------------+
      

       
      Many problems appears with this when manipulating multiple JSON with "rare" value (and when do not master the generation).

      It's very annoying to have no possibility push data in parquet where there is missing/null value in JSON.
      The possibility to cast in varchar (DRILL-7375) the data could allow the parquet storage

      In the simple case of the example discussed here, it's possible to change the type of the input file from JSON to CSV and it will work. But it does not answer all the problems and it doesn't allow to keep some part in "json" and some other in "text"

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              benj641 benj
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: