Tajo / TAJO-1315

Invalid results are returned when a source table consists of multiple csv files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Storage
    • Labels: None

      Description

      See the title.
      The following session reproduces the problem.

      default> \dfs -ls /customer.tbl
      Found 19 items
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000001
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000002
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000003
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000004
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000005
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000006
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000007
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000008
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000009
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000010
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000011
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000012
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000013
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000014
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000015
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:25 /customer.tbl/000016
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:26 /customer.tbl/000017
      -rw-r--r--   3 hadoop supergroup  134217728 2015-01-26 20:26 /customer.tbl/000018
      -rw-r--r--   3 hadoop supergroup   47571167 2015-01-26 20:26 /customer.tbl/000019
      
      default> create external table test (C_CUSTKEY bigint, C_NAME text, C_ADDRESS text, C_NATIONKEY bigint, C_PHONE text, C_ACCTBAL double, C_MKTSEGMENT text, C_COMMENT text) using csv with ('csvfile.delimiter'='|') location 'hdfs://192.168.0.1:7020/customer.tbl';
      OK
      default> \d test
      
      table name: tpch_swift.test
      table path: hdfs://192.168.0.1:7020/customer.tbl
      store type: CSV
      number of rows: unknown
      volume: 2.5 GB
      Options: 
      	'text.delimiter'='|'
      
      schema: 
      c_custkey	INT8
      c_name	TEXT
      c_address	TEXT
      c_nationkey	INT8
      c_phone	TEXT
      c_acctbal	FLOAT8
      c_mktsegment	TEXT
      c_comment	TEXT
      
      default> select count(*) from test;
      ?count
      -------------------------------
      15000017
      (1 rows, 3.2 sec, 9 B selected)
      

      The expected count is 15000000, but the query returned 15000017, which is 17 rows too many.

      To find the erroneous tuples, I ran the following queries (here against a table named customer2).

      default> select c_custkey, count(*) as cnt from customer2 group by c_custkey having cnt > 1;
      c_custkey,  cnt
      -------------------------------
      ,  14
      114575,  2
      14711665,  2
      34,  2
      (4 rows, 16.681 sec, 29 B selected)
      
      default> select * from customer2 where c_custkey is null or c_custkey = 114575 or c_custkey = 14711665 or c_custkey = 34;
      c_custkey,  c_name,  c_address,  c_nationkey,  c_phone,  c_acctbal,  c_mktsegment,  c_comment
      -------------------------------
      34,  Customer#000000034,  Q6G9wZ6dnczmtOx509xgE,M2KV,  15,  25-344-968-5422,  8589.7,  HOUSEHOLD,  nder against the even, pending accounts. even
      114575,  Customer#000114575,  xqLzTzY0,QvqwlSPI8OLxjRQ4s2W7pkSWwK,  16,  26-303-921-2836,  6663.68,  AUTOMOBILE,  le fluffily final deposits. furiously regu
      ,  21,  31-264-911-5053,  ,  HOUSEHOLD,  0.0,  ,  
      ,  IexCQQNp7tsMK63QKrGw37H3JJXGPaXBk,  18,  ,  4313.01,  0.0,   the never pending accounts. slyly fluffy pinto beans run fluffily. furiously ,  
      ,  ,  ,  ,  ,  ,  ,  
      ,  152.95,  MACHINERY,  ,  ,  ,  ,  
      ,  t the ironic, close accounts are careful,  ,  ,  ,  ,  ,  
      ,  20,  30-481-475-8163,  ,  AUTOMOBILE,  0.0,  ,  
      ,  ,  ,  ,  ,  ,  ,  
      ,  MACHINERY,  ts use slyly even dependencie,  ,  ,  ,  ,  
      ,  ,  ,  ,  ,  ,  ,  
      ,  24,  34-639-456-9692,  ,  FURNITURE,  0.0,  ,  
      ,  ,  ,  ,  ,  ,  ,  
      114575,  ,  ,  ,  ,  ,  ,  
      34,  Customer#011457534,  wFUkCU67OxuxvfQeSdvSMDtMB7DWt7jiw,  2,  12-145-168-8442,  145.78,  MACHINERY,  ic accounts. ironic, final ideas sleep qu
      ,  XPP8pRDTDs4MFMP7SSlv,  17,  ,  5437.09,  0.0,  egular requests cajole slyly after the ,  
      ,  blithely along the regular, daring deposits. ironic acco,  ,  ,  ,  ,  ,  
      ,  12,  22-656-233-3821,  ,  HOUSEHOLD,  0.0,  ,  
      14711665,  Customer#0,  ,  ,  ,  ,  ,  
      14711665,  QKTarsTkX7,  19,  ,  7017.62,  0.0,  ly after the carefully ironic theodolites. pending requests are slyly across the deposits. even accounts boost. fina,  
      (20 rows, 8.964 sec, 1.2 KiB selected)
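
      The numbers above appear to be consistent with each other. The following is a minimal sketch that only redoes the arithmetic from the two query results. It assumes that every customer key occurs at least once in the table and that the HAVING query lists every key that occurs more than once; the variable names are illustrative and not part of Tajo.

```python
# Counts copied from the GROUP BY ... HAVING output above (c_custkey -> cnt).
# None stands for the NULL key.
dup_counts = {None: 14, 114575: 2, 14711665: 2, 34: 2}

expected_rows = 15_000_000   # expected count(*) stated in the report
reported_rows = 15_000_017   # count(*) actually returned

# Total rows that fall into the listed groups; this should match the
# "20 rows" reported by the SELECT * query.
rows_in_listed_groups = sum(dup_counts.values())

# Assumption: each non-NULL key should account for exactly one row.
rows_that_should_exist = sum(1 for key in dup_counts if key is not None)

surplus_rows = rows_in_listed_groups - rows_that_should_exist

print(rows_in_listed_groups)             # rows in the listed groups
print(surplus_rows)                      # rows beyond one per real key
print(reported_rows - expected_rows)     # surplus seen by count(*)
```

      Under those assumptions, the surplus computed from the grouped result equals the surplus seen by count(*), which suggests that the 20 rows shown above account for the whole discrepancy.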
      

        Activity

        jihoonson Jihoon Son added a comment -

        I tested the above query with HDFS, LFS, and Swift, and found the same bug.

        jihoonson Jihoon Son added a comment -

        This problem was caused by broken input data, not by a defect in Tajo.
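
        The issue does not say how the input was broken. As a rough sketch of how such data could be detected before loading, the following checks each line of a '|'-delimited file for the expected number of fields and an integer key in the first field. The function name find_malformed_lines, the tolerance for one trailing delimiter, and the sample lines are assumptions made for illustration; they are not taken from Tajo or from the actual files.

```python
def find_malformed_lines(lines, expected_fields=8, delimiter="|"):
    """Return (line_number, reason) for each line that is not a full record."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Assumption: a single trailing delimiter is tolerated and ignored.
        if line.endswith(delimiter):
            line = line[: -len(delimiter)]
        fields = line.split(delimiter)
        if len(fields) != expected_fields:
            problems.append(
                (number, f"expected {expected_fields} fields, found {len(fields)}")
            )
        elif not fields[0].isdigit():
            problems.append((number, "first field is not an integer key"))
    return problems


# Illustrative lines only: one complete record and two fragments shaped
# like the broken rows in the query output above.
sample = [
    "34|Customer#000000034|Q6G9wZ6dnczmtOx509xgE,M2KV|15|25-344-968-5422"
    "|8589.7|HOUSEHOLD|nder against the even, pending accounts. even|",
    "14711665|Customer#0",
    "|21|31-264-911-5053",
]

for number, reason in find_malformed_lines(sample):
    print(number, reason)
```

        To check a real file, the same function can be given an open file object, for example `find_malformed_lines(open(path, encoding="utf-8"))`, where path is a local copy of one of the table's files.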


          People

          • Assignee: Unassigned
          • Reporter: jihoonson Jihoon Son