Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7113

Issue with filtering null values from MapRDB-JSON

    XMLWordPrintableJSON

Details

    Description

      When the Drill is querying documents from MapRDBJSON that contain fields with null value, it returns the wrong result.
      The issue is locally reproduced.

      Please find the repro steps:
      [1] Create a MaprDBJSON table. Say '/tmp/dmdb2/'.

      [2] Insert the following sample records to table:

      insert --table /tmp/dmdb2/ --value '{"_id": "1", "label": "person", "confidence": 0.24}'
      insert --table /tmp/dmdb2/ --value '{"_id": "2", "label": "person2"}'
      insert --table /tmp/dmdb2/ --value '{"_id": "3", "label": "person3", "confidence": 0.54}'
      insert --table /tmp/dmdb2/ --value '{"_id": "4", "label": "person4", "confidence": null}'
      

      We can see that for field 'confidence' document 1 has value 0.24, document 3 has value 0.54, document 2 does not have the field and document 4 has the field with value null.

      [3] Query the table from DRILL.
      Query 1:

      0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person   | 0.24        |
      | person2  | null        |
      | person3  | 0.54        |
      | person4  | null        |
      +----------+-------------+
      4 rows selected (0.2 seconds)
      
      

      Query 2:

      0: jdbc:drill:> select * from dfs.tmp.dmdb2;
      +------+-------------+----------+
      | _id  | confidence  |  label   |
      +------+-------------+----------+
      | 1    | 0.24        | person   |
      | 2    | null        | person2  |
      | 3    | 0.54        | person3  |
      | 4    | null        | person4  |
      +------+-------------+----------+
      4 rows selected (0.174 seconds)
      
      

      Query 3:

      0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence is not null;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person   | 0.24        |
      | person3  | 0.54        |
      | person4  | null        |
      +----------+-------------+
      3 rows selected (0.192 seconds)
      
      

      Query 4:

      0: jdbc:drill:> select label,confidence from dfs.tmp.dmdb2 where confidence is  null;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person2  | null        |
      +----------+-------------+
      1 row selected (0.262 seconds)
      
      

      As you can see, Query 3 which queries for all documents with confidence value 'is not null', returns a document with null value.

      Other observation:
      Querying the same data using DRILL without MapRDB provides the correct result.
      For example, create 4 different JSON files with following data:

      {"label": "person", "confidence": 0.24}

      {"label": "person2"} {"label": "person3", "confidence": 0.54} {"label": "person4", "confidence": null}

      Query it directly using DRILL:

      Query 5:

      0: jdbc:drill:> select label,confidence from dfs.tmp.t2;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person4  | null        |
      | person3  | 0.54        |
      | person2  | null        |
      | person   | 0.24        |
      +----------+-------------+
      4 rows selected (0.203 seconds)
      
      

      Query 6:

      0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is null;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person4  | null        |
      | person2  | null        |
      +----------+-------------+
      2 rows selected (0.352 seconds)
      
      

      Query 7:

      0: jdbc:drill:> select label,confidence from dfs.tmp.t2 where confidence is not null;
      +----------+-------------+
      |  label   | confidence  |
      +----------+-------------+
      | person3  | 0.54        |
      | person   | 0.24        |
      +----------+-------------+
      2 rows selected (0.265 seconds)
      
      

      As seen in query 6 & 7, it returns the correct result.

      I believe the issue is at the MapRDB layer where it is fetching the results.

      Attachments

        Issue Links

          Activity

            People

              amansinha100 Aman Sinha
              hanu.ncr Hanumath Rao Maduri
              Hanumath Rao Maduri Hanumath Rao Maduri
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: