Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: UDF
    • Labels:

      Description

      Min Zhou added a comment - 21/Jul/09 07:34 AM
      It's very useful for us .
      some comments:

      1. Can you implement it directly with Text ? Avoiding string decoding and encoding would be faster. Of course that trick may lead to another problem, as String.split uses a regular expression for splitting.
      2. getDisplayString() always return a string in lowercase.

      [ Show » ]
      Min Zhou added a comment - 21/Jul/09 07:34 AM It's very useful for us . some comments:

      1. Can you implement it directly with Text ? Avoiding string decoding and encoding would be faster. Of course that trick may lead to another problem, as String.split uses a regular expression for splitting.
      2. getDisplayString() always return a string in lowercase.

      [ Permlink | « Hide ]
      Namit Jain added a comment - 21/Jul/09 09:22 AM
      Committed. Thanks Emil
      [ Show » ]
      Namit Jain added a comment - 21/Jul/09 09:22 AM Committed. Thanks Emil

      [ Permlink | « Hide ]
      Emil Ibrishimov added a comment - 21/Jul/09 10:48 AM
      There are some easy (compromise) ways to optimize split:

      1. Check if the regex argument actually contains some "regex specific characters" and if it doesn't, do a straightforward split without converting to strings.
      2. Assume some default value for the second argument (for example - split(str) to be equivalent to split(str, ' ') and optimize for this value
      3. Have two separate split functions - one that does regex and one that splits around plain text.

      I think that 1 is a good choice and can be done rather quickly.
      [ Show » ]
      Emil Ibrishimov added a comment - 21/Jul/09 10:48 AM There are some easy (compromise) ways to optimize split: 1. Check if the regex argument actually contains some "regex specific characters" and if it doesn't, do a straightforward split without converting to strings. 2. Assume some default value for the second argument (for example - split(str) to be equivalent to split(str, ' ') and optimize for this value 3. Have two separate split functions - one that does regex and one that splits around plain text. I think that 1 is a good choice and can be done rather quickly.

      1. HIVE-664.3.patch.txt
        6 kB
        Teddy Choi
      2. HIVE-664.2.patch.txt
        6 kB
        Teddy Choi
      3. HIVE-664.1.patch.txt
        6 kB
        Teddy Choi

        Issue Links

          Activity

          Hide
          Navis added a comment -

          Ran a simple micro test on splitting only and found it's not faster significantly (max 15%?) than current implementation (even slower for sometimes). But reusing previous pattern string seemed good idea. Furthermore, if OI for regex is constant type, comparing itself can be ignored. Could you do that too?

          Show
          Navis added a comment - Ran a simple micro test on splitting only and found it's not faster significantly (max 15%?) than current implementation (even slower for sometimes). But reusing previous pattern string seemed good idea. Furthermore, if OI for regex is constant type, comparing itself can be ignored. Could you do that too?
          Hide
          Teddy Choi added a comment -

          mapreduce_stack_trace_hadoop20.q passed on my computer. The test doesn't call split(), so it is not related with this patch.

          Show
          Teddy Choi added a comment - mapreduce_stack_trace_hadoop20.q passed on my computer. The test doesn't call split(), so it is not related with this patch.
          Hide
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12607946/HIVE-664.3.patch.txt

          ERROR: -1 due to 1 failed/errored test(s), 4392 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20
          

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1105/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1105/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests failed with: TestsFailedException: 1 tests failed
          

          This message is automatically generated.

          Show
          Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12607946/HIVE-664.3.patch.txt ERROR: -1 due to 1 failed/errored test(s), 4392 tests executed Failed tests: org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_mapreduce_stack_trace_hadoop20 Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1105/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1105/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests failed with: TestsFailedException: 1 tests failed This message is automatically generated.
          Hide
          Teddy Choi added a comment -

          I wrote 2nd patch while assuming Text#getBytes returns entire valid bytes, but it was wrong. Text#getBytes returns valid bytes up to Text#getLength length. Following bytes are invalid. I fixed code to use Text#getLength to ensure data validity. This patch passes global_limit.q and udf_split.q tests.

          Show
          Teddy Choi added a comment - I wrote 2nd patch while assuming Text#getBytes returns entire valid bytes, but it was wrong. Text#getBytes returns valid bytes up to Text#getLength length. Following bytes are invalid. I fixed code to use Text#getLength to ensure data validity. This patch passes global_limit.q and udf_split.q tests.
          Hide
          Ashutosh Chauhan added a comment -

          Test global_limit.q failed.

          Show
          Ashutosh Chauhan added a comment - Test global_limit.q failed.
          Hide
          Ashutosh Chauhan added a comment -

          +1

          Show
          Ashutosh Chauhan added a comment - +1
          Hide
          Teddy Choi added a comment -
          Show
          Teddy Choi added a comment - Review request on https://reviews.apache.org/r/13702/
          Hide
          Teddy Choi added a comment -

          I implemented 1 and 3. Additionally, it caches a compiled Pattern object to reuse.

          Show
          Teddy Choi added a comment - I implemented 1 and 3. Additionally, it caches a compiled Pattern object to reuse.

            People

            • Assignee:
              Teddy Choi
              Reporter:
              Namit Jain
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development