Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: UDF
    • Labels:
      None

      Description

      Paraphrasing what was reported by Carter Shanklin -

      I used the attached Perl script to generate 500 million two-character strings which always included a space. I loaded it using:
      create table letters (l string);
      load data local inpath '/home/sandbox/data.csv' overwrite into table letters;
      Then I ran this SQL script:
      select count(l) from letters where l = 'l ';
      select count(l) from letters where trim(l) = 'l';

      First query = 170 seconds
      Second query = 514 seconds
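Since data.csv itself is not attached here and the contents of temp.pl are not shown, the following is a hypothetical Java stand-in (not the reporter's actual script) that generates rows of the shape described: two-character strings that always include a space.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

public class GenLetters {
    // Build one two-character row: a lowercase letter plus a space,
    // with the space randomly on either side.
    static String makeRow(Random rnd) {
        char letter = (char) ('a' + rnd.nextInt(26));
        return rnd.nextBoolean() ? letter + " " : " " + letter;
    }

    public static void main(String[] args) throws IOException {
        // Small default so the sketch runs quickly; the report used 500 million rows.
        long rows = args.length > 0 ? Long.parseLong(args[0]) : 1000;
        Random rnd = new Random(42);
        try (BufferedWriter out = new BufferedWriter(new FileWriter("data.csv"))) {
            for (long i = 0; i < rows; i++) {
                out.write(makeRow(rnd));
                out.newLine();
            }
        }
    }
}
```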

      1. temp.pl (0.1 kB) - Thejas M Nair

        Activity

        Anandha L Ranganathan added a comment -

        Thejas M Nair/Carter Shanklin

        Could you provide the data.csv file that caused the problem? Otherwise, please provide an example of the data.

        Thejas M Nair added a comment -

        temp.pl - perl file for generating data

        Eric Hanson added a comment -

        This may not be relevant for you, but if you can use ORC then you can enable vectorized execution, and benefit from the vectorized implementation of TRIM, which should be much faster. See org.apache.hadoop.hive.ql.exec.vector.expressions.StringTrim.
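To illustrate the idea behind this suggestion (this is not Hive's actual StringTrim implementation, just the pattern): vectorized execution applies the operation to a whole column batch in one tight loop, amortizing per-row call and dispatch overhead. A rough Java sketch of batch versus row-at-a-time trimming:

```java
import java.util.Arrays;

public class BatchTrim {
    // Row-at-a-time style: one call per value.
    static String trimOne(String s) {
        return s.trim();
    }

    // Vectorized style: one call per batch, trimming the whole
    // column vector in a tight loop with no per-row dispatch.
    static void trimBatch(String[] col, int size) {
        for (int i = 0; i < size; i++) {
            col[i] = col[i].trim();
        }
    }

    public static void main(String[] args) {
        String[] col = { "l ", " l", "x " };
        trimBatch(col, col.length);
        System.out.println(Arrays.toString(col)); // prints [l, l, x]
    }
}
```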

        Anandha L Ranganathan added a comment -

        Here is the system configuration:
        4 cores, 8 GB RAM
        File format: Text
        Compression: NONE

        1) select count(l) from letters where l = 'l ';
        around 100 seconds

        2) select count(l) from letters where trim(l) = 'l';
        230 seconds

        3) I created a GenericUDF for trim; the result was:
        select count(l) from letters where gentrim(l) = 'l';
        220 seconds

        The evaluate function takes around 1500 nanoseconds per record. That per-record cost accumulates to roughly 230 seconds when the UDF is applied to 500M records.

        This is the code used in evaluate:

        if (arguments[0].get() == null) {
            return null;
        }
        input = (Text) converters[0].convert(arguments[0].get());
        input.set(input.toString().trim());

        Eric Hanson
        I haven't tried the ORC file format. I will try it later.
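One plausible contributor to the per-record cost measured above is that input.toString().trim() allocates new String objects for every row. A hedged sketch of an alternative (an assumption of this note, not Hive's implementation): find the trimmed span at the byte level, so the caller could reuse the Text buffer with something like text.set(bytes, start, end - start). Note this sketch handles only common ASCII whitespace, whereas String.trim() strips every code point up to U+0020.

```java
import java.nio.charset.StandardCharsets;

public class ByteTrim {
    // Returns {start, end} of the content span after trimming ASCII
    // whitespace, so the caller can reset the Text buffer in place
    // instead of allocating a trimmed String per record.
    static int[] trimmedSpan(byte[] buf, int len) {
        int start = 0, end = len;
        while (start < end && isSpace(buf[start])) start++;
        while (end > start && isSpace(buf[end - 1])) end--;
        return new int[] { start, end };
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    public static void main(String[] args) {
        byte[] b = "l ".getBytes(StandardCharsets.US_ASCII);
        int[] span = trimmedSpan(b, b.length);
        // prints "l"
        System.out.println(new String(b, span[0], span[1] - span[0], StandardCharsets.US_ASCII));
    }
}
```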


          People

          • Assignee:
            Anandha L Ranganathan
          • Reporter:
            Thejas M Nair
          • Votes:
            0
          • Watchers:
            3
