Uploaded image for project: 'Phoenix'
  1. Phoenix
  2. PHOENIX-2521

Support duplicate rows in CSV Bulk Loader

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 4.5.2
    • None
    • None
    • None

    Description

      found out the map reduce csv bulk load tool doesn't behave the same as UPSERTs. Is it by design or a bug?

      Here is the queries for creating table and index:

       CREATE TABLE mySchema.mainTable (
      id varchar NOT NULL,
      name varchar,
      address varchar
      CONSTRAINT pk PRIMARY KEY (id)); 
       CREATE INDEX myIndex 
      ON mySchema.mainTable  (name, id) 
      INCLUDE (address); 

      if I execute two upserts where the second one update the name (which is the key for index), everything works fine (the record will be updated in both table and index table)

       UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'john', 'Montreal');
      UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'jack', 'Montreal');
      SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'jack'; 

      ==> one record

      SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'john';  

      ==> zero records

      But, if I load the date using org.apache.phoenix.mapreduce.CsvBulkLoadTool to the main table, it behaves different. The main table will be updated, but the new record will be appended to the index table:

      HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol-1.1.2.jar:/etc/hbase/conf hadoop jar /usr/lib/hbase/phoenix-4.5.2-HBase-1.1-bin/phoenix-4.5.2-HBase-1.1-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -d',' -s mySchema -t mainTable -i /tmp/input.txt

      input.txt:
      2,tomas,montreal
      2,george,montreal

      (I have tried it both with/without -it and got the same result)

      SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'tomas' 

      ==> one record;

       SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'george' 

      ==> one record;

      Attachments

        Activity

          People

            Unassigned Unassigned
            moazami.afshin@gmail.com Afshin Moazami
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: