Tika / TIKA-780

Optimize loading of the media type registry

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels:
      None

      Description

      Parsing our fairly large media type registry takes hundreds of milliseconds, which can be a problem for some applications. There are many ways in which we could optimize the loading of the type registry.

        Activity

        Transition        Time In Source Status  Execution Times  Last Executer  Last Execution Date
        Open -> Resolved  19h 37m                1                Jukka Zitting  11/Nov/11 12:30
        Jukka Zitting made changes -
        Field          Original Value  New Value
        Status         Open [ 1 ]      Resolved [ 5 ]
        Assignee                       Jukka Zitting [ jukkaz ]
        Fix Version/s                  1.1 [ 12318849 ]
        Resolution                     Fixed [ 1 ]
        Jukka Zitting added a comment -

        With various refactorings I was able to significantly speed up the following benchmark:

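        // Requires: import org.apache.tika.Tika;
        long a = System.nanoTime();
        new Tika();                      // first instance: loads the default configuration
        long b = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            new Tika();                  // a hundred more default instances
        }
        long c = System.nanoTime();
        // b - a: initial load; c - b: creating the hundred additional instances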

        The average time between a and b (i.e. initial loading of the default configuration) is down from 655ms to 377ms on my computer. It looks like any further improvements would probably require precompiling the tika-mimetypes.xml file to another format to avoid the XML parsing overhead. That's a topic for another issue.

        And because the default media type registry is now memoized at first load, the average time for creating a hundred more default Tika instances went down from 4277ms to just 43ms!
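
        The effect of memoizing the default registry can be illustrated with a minimal sketch. This is not Tika's actual implementation: `TypeRegistry`, `DetectorFacade`, and `MemoizationDemo` are hypothetical stand-ins, the "expensive" load is simulated with a small in-memory map, and the initialization-on-demand holder idiom is just one assumed way to get a load-once, thread-safe default.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a media type registry; not Tika's actual class. */
final class TypeRegistry {
    /** Counts how often the expensive load runs (for illustration only). */
    static final AtomicInteger LOADS = new AtomicInteger();

    private final Map<String, String> typesByExtension;

    private TypeRegistry(Map<String, String> typesByExtension) {
        this.typesByExtension = typesByExtension;
    }

    /** Simulates the costly step, e.g. parsing a large XML definition file. */
    private static TypeRegistry load() {
        LOADS.incrementAndGet();
        Map<String, String> types = new HashMap<>();
        types.put("txt", "text/plain");
        types.put("pdf", "application/pdf");
        return new TypeRegistry(Collections.unmodifiableMap(types));
    }

    /** Holder idiom: the JVM initializes DEFAULT once, on first access. */
    private static final class Holder {
        static final TypeRegistry DEFAULT = load();
    }

    /** Returns the shared default registry, loading it only on the first call. */
    static TypeRegistry getDefault() {
        return Holder.DEFAULT;
    }

    String typeFor(String extension) {
        return typesByExtension.getOrDefault(extension, "application/octet-stream");
    }
}

/** Hypothetical facade: every instance reuses the shared default registry. */
final class DetectorFacade {
    private final TypeRegistry registry = TypeRegistry.getDefault();

    String detect(String extension) {
        return registry.typeFor(extension);
    }
}

final class MemoizationDemo {
    public static void main(String[] args) {
        new DetectorFacade();                 // first instance triggers the load
        for (int i = 0; i < 100; i++) {
            new DetectorFacade();             // later instances reuse the cached registry
        }
        System.out.println("loads = " + TypeRegistry.LOADS.get()); // prints "loads = 1"
    }
}
```

        With this structure only the first facade pays the loading cost; the remaining hundred instances just read a static field, which is consistent with the drop reported above. The shared registry is kept immutable here so that reusing it across instances is safe.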

        Jukka Zitting created issue -

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jukka Zitting
          • Votes:
            0
            Watchers:
            0