Tika
  1. Tika
  2. TIKA-529

IBM420 charset detection's isLamAlef is allocation-happy

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      Two IBM420 charset detectors (rtl and ltr) run isLamAlef() for each byte of detection buffer.

      The code is allocating and filling a bytes array every time it runs, which makes it responsible for approximately 70% of all object allocations in my current test case (many text files).

      Since array is identical every time, and the entire thing can be achieved without any array, this is wasteful.

        Activity

        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.1 [ 12318849 ]
        Resolution Fixed [ 1 ]
        Hide
        Michael McCandless added a comment -

        I committed the patch (swapping the first 2 checks) and added a comment explaining the negative numbers...

        Show
        Michael McCandless added a comment - I committed the patch (swapping the first 2 checks) and added a comment explaining the negative numbers...
        Michael McCandless made changes -
        Assignee Ken Krugler [ kkrugler ] Michael McCandless [ mikemccand ]
        Hide
        Michael McCandless added a comment -

        This patch looks safe, and avoids crazy allocations inside this detector.... can we commit it (reversing the first 2 conditions)?

        Show
        Michael McCandless added a comment - This patch looks safe, and avoids crazy allocations inside this detector.... can we commit it (reversing the first 2 conditions)?
        Hide
        Radek added a comment -

        Actually, if you accept the patch, could you reverse the two first conditions. This way ascii characters would immediately fail first test (faster). I meant to code it this way but got lost in all the negative numbers involved.

        Show
        Radek added a comment - Actually, if you accept the patch, could you reverse the two first conditions. This way ascii characters would immediately fail first test (faster). I meant to code it this way but got lost in all the negative numbers involved.
        Radek made changes -
        Attachment isLamAlef.diff [ 12456917 ]
        Radek made changes -
        Attachment isLamAlef.diff [ 12456925 ]
        Hide
        Radek added a comment -

        real real patch.
        Sorry >_< my latest "real patch" was the very same file as the wrong patch...

        Show
        Radek added a comment - real real patch. Sorry >_< my latest "real patch" was the very same file as the wrong patch...
        Ken Krugler made changes -
        Assignee Ken Krugler [ kkrugler ]
        Radek made changes -
        Attachment isLamAlef.diff [ 12456918 ]
        Radek made changes -
        Attachment isLamAlef.diff [ 12456918 ]
        Hide
        Radek added a comment -

        real patch
        sign extension comprehension fail on my part, sorry for bugspam

        Show
        Radek added a comment - real patch sign extension comprehension fail on my part, sorry for bugspam
        Radek made changes -
        Field Original Value New Value
        Attachment isLamAlef.diff [ 12456917 ]
        Hide
        Radek added a comment -

        proposed patch

        Show
        Radek added a comment - proposed patch
        Radek created issue -

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Radek
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development