Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1689

Parser sort order change in TIKA-1517 breaks parser override capability

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.10
    • Component/s: core
    • Labels:
      None

      Description

      In Tika 1.9, the comparator used to sort parsers (in ServiceLoaderUtils) now returns them in the reverse order from how they were returned in prior versions, when the comparator was in DefaultParser. This work was done under TIKA-1517.

      This change broke one of our customizations in which we use our own parser instead of Tika's HtmlParser to process html. We use the service loader logic (creating our own META-INF/services/org.apache.tika.parser.Parser file) and rely on the order in which the list returned by DefaultParser.getDefaultParsers() is evaluated. Expecting that when Tika builds the map of mime types to parsers it first puts in entries for HtmlParser, then overwrites these with our custom parser.

      I realize relying on this is brittle. And I found a valid workaround to the problem in Tika 1.9 is to blacklist HtmlParser. However, in case this parser ordering change was not intentional, I figured I'd mention it.

        Activity

        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #800 (See https://builds.apache.org/job/tika-trunk-jdk1.7/800/)
        TIKA-1689: with mention in Changes.txt (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692565)

        • /tika/trunk/CHANGES.txt
          TIKA-1689: revert mistakenly flipped sort order of parsers from r1677328 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692564)
        • /tika/trunk/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/utils
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/utils/ServiceLoaderUtilsTest.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in tika-trunk-jdk1.7 #800 (See https://builds.apache.org/job/tika-trunk-jdk1.7/800/ ) TIKA-1689 : with mention in Changes.txt (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692565 ) /tika/trunk/CHANGES.txt TIKA-1689 : revert mistakenly flipped sort order of parsers from r1677328 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692564 ) /tika/trunk/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java /tika/trunk/tika-parsers/src/test/java/org/apache/tika/utils /tika/trunk/tika-parsers/src/test/java/org/apache/tika/utils/ServiceLoaderUtilsTest.java
        Hide
        tallison@mitre.org Tim Allison added a comment - - edited

        In r1692564, I flipped the sort order back to what it was before r1677328. If that change in sort order was made intentionally, please let us know.

        Show
        tallison@mitre.org Tim Allison added a comment - - edited In r1692564, I flipped the sort order back to what it was before r1677328. If that change in sort order was made intentionally, please let us know.
        Hide
        tallison@mitre.org Tim Allison added a comment -

        To confirm David Warren's point and to supplement with detail...it turns out this change happened in some cleanup to TIKA-1517 (r1677328). Before that patch we had in DefaultParser:

                    public int compare(Parser p1, Parser p2) {
                        String n1 = p1.getClass().getName();
                        String n2 = p2.getClass().getName();
                        boolean t1 = n1.startsWith("org.apache.tika.");
                        boolean t2 = n2.startsWith("org.apache.tika.");
                        if (t1 == t2) {
                            return n1.compareTo(n2);
                        } else if (t1) {
                            return -1;
                        } else {
                            return 1;
                        }
                    }
        

        When that was pulled into ServiceLoaderUtils, it became:

                     public int compare(T c1, T c2) {
         
                         String n1 = c1.getClass().getName(); 
                         String n2 = c2.getClass().getName();
         
                         boolean t1 = n1.startsWith("org.apache.tika."); 
                         boolean t2 = n2.startsWith("org.apache.tika."); 
                         if (t1 == t2) { 
                             return n1.compareTo(n2); 
                         } else if (t1) {
                             return 1; 
                         } else {
                             return -1; 
                        }
                      } 
        

        Nick Burch, can you think of any problems if I flip the -1/1 back to where they were?

        Show
        tallison@mitre.org Tim Allison added a comment - To confirm David Warren 's point and to supplement with detail...it turns out this change happened in some cleanup to TIKA-1517 ( r1677328 ). Before that patch we had in DefaultParser: public int compare(Parser p1, Parser p2) { String n1 = p1.getClass().getName(); String n2 = p2.getClass().getName(); boolean t1 = n1.startsWith("org.apache.tika."); boolean t2 = n2.startsWith("org.apache.tika."); if (t1 == t2) { return n1.compareTo(n2); } else if (t1) { return -1; } else { return 1; } } When that was pulled into ServiceLoaderUtils, it became: public int compare(T c1, T c2) { String n1 = c1.getClass().getName(); String n2 = c2.getClass().getName(); boolean t1 = n1.startsWith("org.apache.tika."); boolean t2 = n2.startsWith("org.apache.tika."); if (t1 == t2) { return n1.compareTo(n2); } else if (t1) { return 1; } else { return -1; } } Nick Burch , can you think of any problems if I flip the -1/1 back to where they were?
        Hide
        gagravarr Nick Burch added a comment -

        I believe we should still be preferring non-Tika parsers/detectors over Tika ones, even without the much-discussed proposed changes to let ordering be config-controlled for those who need full control.

        If we really have accidentally swapped the ordering, I'd say we should change it back + add a unit test to ensure that non-Tika ones come first. (The Ogg parsers are always included, but in a non-Tika namespace, so can be used for checking this logic with)

        Show
        gagravarr Nick Burch added a comment - I believe we should still be preferring non-Tika parsers/detectors over Tika ones, even without the much-discussed proposed changes to let ordering be config-controlled for those who need full control. If we really have accidentally swapped the ordering, I'd say we should change it back + add a unit test to ensure that non-Tika ones come first. (The Ogg parsers are always included, but in a non-Tika namespace, so can be used for checking this logic with)
        Hide
        tallison@mitre.org Tim Allison added a comment -

        David Warren, thank you for raising this.

        Chris A. Mattmann and Luke sh, this seems like a pretty serious change in the API, no?

        Show
        tallison@mitre.org Tim Allison added a comment - David Warren , thank you for raising this. Chris A. Mattmann and Luke sh , this seems like a pretty serious change in the API, no?

          People

          • Assignee:
            Unassigned
            Reporter:
            dwarren David Warren
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development