Hive / HIVE-1515

Archive does not work when multiple partitions of one table are archived.


    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      set hive.exec.compress.output = true;
      set mapred.min.split.size=256;
      set mapred.min.split.size.per.node=256;
      set mapred.min.split.size.per.rack=256;
      set mapred.max.split.size=256;

      set hive.archive.enabled = true;

      drop table combine_3_srcpart_seq_rc;

      create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;

      insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;

      insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;

      ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
      ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");

      select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;

      drop table combine_3_srcpart_seq_rc;

      The select query above fails with: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

      It fails because the query above has two input paths, one for each partition:
      1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
      2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
      However, calling path.getFileSystem() on these two input paths returns the same FileSystem instance for both, and that instance points to the archive of the first caller, which in this case is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

      This happens because Hadoop's FileSystem keeps a global cache. When it loads a FileSystem instance for a given path, it looks up the cache using only the path's scheme, authority, and username, not the path itself. Both har paths share the same scheme and user and have no authority, so when we call Path.getFileSystem for the second har path, it actually returns the file system handle that was created for the first path.
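
      The cache collision described above can be illustrated with a small stand-alone sketch. This is not Hadoop code: `FsCacheSketch`, `FakeFileSystem`, `Key`, and `Cache` are hypothetical names, the paths are shortened placeholders, and the key is assumed to consist of scheme, authority, and user name only. It merely shows how a cache that ignores the path hands the second archive the handle created for the first one.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical simulation of a file system cache whose key ignores the path.
class FsCacheSketch {

    /** Stand-in for a file system handle; remembers the URI it was created for. */
    static final class FakeFileSystem {
        final URI initializedFor;

        FakeFileSystem(URI uri) {
            this.initializedFor = uri;
        }
    }

    /** Cache key built from scheme, authority, and user only (an assumption). */
    static final class Key {
        final String scheme;
        final String authority;
        final String user;

        Key(URI uri, String user) {
            this.scheme = uri.getScheme();
            this.authority = uri.getAuthority(); // null for "har:/..." URIs
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return Objects.equals(scheme, other.scheme)
                    && Objects.equals(authority, other.authority)
                    && Objects.equals(user, other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, user);
        }
    }

    /** Global-style cache: the first caller for a key decides which handle is stored. */
    static final class Cache {
        private final Map<Key, FakeFileSystem> map = new HashMap<>();

        FakeFileSystem get(URI uri, String user) {
            return map.computeIfAbsent(new Key(uri, user), k -> new FakeFileSystem(uri));
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        // Shortened placeholder paths, one archive per partition.
        URI first = URI.create("har:/warehouse/t/ds=2010-08-03/hr=00/data.har");
        URI second = URI.create("har:/warehouse/t/ds=2010-08-03/hr=001/data.har");

        FakeFileSystem a = cache.get(first, "someuser");
        FakeFileSystem b = cache.get(second, "someuser");

        // Both lookups yield the same handle, bound to the first archive.
        System.out.println("same instance: " + (a == b));
        System.out.println("second lookup is bound to: " + b.initializedFor);
    }
}
```

      In this sketch the second lookup returns a handle that still refers to the hr=00 archive, which mirrors the "Invalid file name ... in har:.../hr=00/data.har" error reported for the hr=001 path.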

      1. hive-1515.2.patch
        7 kB
        He Yongqiang
      2. hive-1515.1.patch
        10 kB
        He Yongqiang


        He Yongqiang created issue -
        He Yongqiang made changes -
        Field Original Value New Value
        Assignee He Yongqiang [ he yongqiang ]
        He Yongqiang made changes -
        Attachment hive-1515.1.patch [ 12451450 ]
        He Yongqiang made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.7.0 [ 12315150 ]
        He Yongqiang made changes -
        Attachment hive-1515.2.patch [ 12451690 ]
        Paul Yang made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        He Yongqiang made changes -
        Attachment hive-1515.2.patch [ 12451974 ]
        He Yongqiang made changes -
        Assignee He Yongqiang [ he yongqiang ]


          • Assignee:
            He Yongqiang
          • Votes:
            0
          • Watchers:
            2


            • Created: