Hive / HIVE-21924

Split text files even if header/footer exists

Details

    Description

      https://github.com/apache/hive/blob/967a1cc98beede8e6568ce750ebeb6e0d048b8ea/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494-L503

          int headerCount = 0;
          int footerCount = 0;
          if (table != null) {
            headerCount = Utilities.getHeaderCount(table);
            footerCount = Utilities.getFooterCount(table, conf);
            if (headerCount != 0 || footerCount != 0) {
              // Input file has header or footer, cannot be splitted.
              HiveConf.setLongVar(conf, ConfVars.MAPREDMINSPLITSIZE, Long.MAX_VALUE);
            }
          }
      

      This code makes CSV files (or any text files with a header or footer) unsplittable whenever a header or footer is present: raising the minimum split size to Long.MAX_VALUE forces each file into a single split.
      If only a header is present, we can find the offset just after the first line break and start splitting from there. Similarly, for a footer we can read a few KB of data at the end of the file and find the offset of the last line break. Those two offsets determine the data range that can be split. A few extra reads during split generation are cheaper than not splitting the file at all.
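
The offset lookup proposed above could look roughly like the following. This is only a sketch, not the code from the attached patches: it assumes a local file read through java.nio rather than Hadoop's FileSystem, '\n' as the line terminator (which also covers "\r\n"), and a footer that fits inside the tail window. The class name HeaderFooterRange and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: finds the byte range between header and footer lines.
final class HeaderFooterRange {

    /** Offset of the first byte after the first headerCount lines (file size if there are fewer lines). */
    static long dataStart(SeekableByteChannel ch, int headerCount) throws IOException {
        if (headerCount <= 0) return 0;
        ch.position(0);
        ByteBuffer buf = ByteBuffer.allocate(4096);
        long pos = 0;
        int seen = 0;
        while (true) {
            buf.clear();
            if (ch.read(buf) < 0) return ch.size();   // EOF before enough line breaks
            buf.flip();
            while (buf.hasRemaining()) {
                pos++;
                if (buf.get() == '\n' && ++seen == headerCount) return pos;
            }
        }
    }

    /**
     * Offset where the last footerCount lines begin, found by scanning only the
     * last `window` bytes. Returns -1 if the footer does not fit in the window.
     */
    static long dataEnd(SeekableByteChannel ch, int footerCount, int window) throws IOException {
        long size = ch.size();
        if (footerCount <= 0 || size == 0) return size;
        int len = (int) Math.min(size, window);
        long base = size - len;
        ByteBuffer buf = ByteBuffer.allocate(len);
        ch.position(base);
        while (buf.hasRemaining() && ch.read(buf) >= 0) { /* fill the tail buffer */ }
        int needed = footerCount;
        int i = buf.position() - 1;
        // A trailing '\n' terminates the last footer line, so it is not counted.
        if (i >= 0 && buf.get(i) == '\n') i--;
        for (; i >= 0; i--) {
            if (buf.get(i) == '\n' && --needed == 0) return base + i + 1;
        }
        // Whole file scanned: everything is footer. Otherwise the window was too small.
        return base == 0 ? 0 : -1;
    }

    /** Returns {start, end} of the splittable data, or null if the footer was not located. */
    static long[] dataRange(Path file, int headerCount, int footerCount, int window) throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(file)) {
            long start = dataStart(ch, headerCount);
            long end = dataEnd(ch, footerCount, window);
            if (end < 0) return null;   // caller would fall back to not splitting
            return new long[] {start, Math.max(start, end)};
        }
    }
}
```

Under these assumptions, splits would be generated only over the returned range, and a null result would keep the current unsplit behaviour for that file.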

      Attachments

        1. HIVE-21924.patch
          46 kB
          Mustafa İman
        2. HIVE-21924.2.patch
          51 kB
          Mustafa İman
        3. HIVE-21924.3.patch
          60 kB
          Mustafa İman
        4. HIVE-21924.4.patch
          60 kB
          Mustafa İman
        5. HIVE-21924.5.patch
          61 kB
          Mustafa İman
        6. HIVE-21924.6.patch
          61 kB
          Mustafa İman

            People

              mustafaiman Mustafa İman
              prasanth_j Prasanth Jayachandran
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 40m