Hive / HIVE-21924

Split text files even if header/footer exists

    Details

      Description

      https://github.com/apache/hive/blob/967a1cc98beede8e6568ce750ebeb6e0d048b8ea/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494-L503

          int headerCount = 0;
          int footerCount = 0;
          if (table != null) {
            headerCount = Utilities.getHeaderCount(table);
            footerCount = Utilities.getFooterCount(table, conf);
            if (headerCount != 0 || footerCount != 0) {
              // Input file has header or footer, cannot be splitted.
              HiveConf.setLongVar(conf, ConfVars.MAPREDMINSPLITSIZE, Long.MAX_VALUE);
            }
          }
      

      This piece of code makes CSV files (or any text files with a header/footer) non-splittable whenever a header or footer is present.
      If only a header is present, we can find the offset just after the first line break and start splitting from there. Similarly, for a footer, we could read a few KB of data at the end of the file and find the offset of the last line break. Those two offsets determine the data range that can be split. A few extra reads during split generation are cheaper than not splitting the file at all.
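
      The idea above can be sketched as follows. This is only an illustration, not the approach taken in the attached patches: the class `SplittableRange` and its methods `dataStart` and `dataEnd` are hypothetical names, and the sketch assumes an uncompressed local file with `\n` line terminators and no line breaks embedded inside quoted fields.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical helper, not part of Hive: computes the byte range [dataStart, dataEnd)
// that holds data rows once header and footer lines are excluded.
class SplittableRange {
    private static final int CHUNK = 4096; // read a few KB at a time

    /** Offset of the first byte after the first headerCount lines. */
    public static long dataStart(Path file, int headerCount) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[CHUNK];
            int remaining = headerCount;
            long pos = 0;
            int n;
            while (remaining > 0 && (n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n' && --remaining == 0) {
                        return pos + i + 1;
                    }
                }
                pos += n;
            }
            // No header requested: data starts at 0. Fewer lines than headerCount: no data.
            return remaining > 0 ? in.length() : 0;
        }
    }

    /** Offset one past the last data byte, excluding the last footerCount lines. */
    public static long dataEnd(Path file, int footerCount) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            if (footerCount <= 0 || length == 0) {
                return length;
            }
            // A trailing newline terminates the last footer line, so one more
            // line break has to be skipped when scanning backwards.
            in.seek(length - 1);
            int remaining = footerCount + (in.read() == '\n' ? 1 : 0);
            byte[] buf = new byte[CHUNK];
            long end = length;
            while (end > 0) {
                int n = (int) Math.min(CHUNK, end);
                long start = end - n;
                in.seek(start);
                in.readFully(buf, 0, n);
                for (int i = n - 1; i >= 0; i--) {
                    if (buf[i] == '\n' && --remaining == 0) {
                        return start + i + 1;
                    }
                }
                end = start;
            }
            return 0; // the whole file is footer
        }
    }
}
```

      With these two offsets, split generation could in principle proceed as usual over the range between them instead of forcing a single split per file. Compressed or otherwise non-seekable inputs would still need the existing fallback.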

        Attachments

        1. HIVE-21924.2.patch
          51 kB
          Mustafa İman
        2. HIVE-21924.3.patch
          60 kB
          Mustafa İman
        3. HIVE-21924.4.patch
          60 kB
          Mustafa İman
        4. HIVE-21924.5.patch
          61 kB
          Mustafa İman
        5. HIVE-21924.6.patch
          61 kB
          Mustafa İman
        6. HIVE-21924.patch
          46 kB
          Mustafa İman

              People

              • Assignee:
                Mustafa İman (mustafaiman)
                Reporter:
                Prasanth Jayachandran (prasanth_j)

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 40m