Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None
    • Environment:

      All

      Description

      It might be a good idea to support the Windows CHM file format. Some information is available at http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. A CHM file contains HTML files, which Tika can already parse, so the "only" problem is extracting the data from the CHM file.

      1. TIKA-245.oleg.20110806.PATCH
        258 kB
        Oleg Tikhonov
      2. TIKA-245.tikhonov.04082011.patch.txt
        178 kB
        Oleg Tikhonov
      3. TIKA-245.tikhonov.20103107.patch.txt
        32 kB
        Oleg Tikhonov
      4. TIKA-245.tikhonov.20112603.txt
        243 kB
        Oleg Tikhonov
      5. TIKA-245.tikhonov.20112703.txt
        223 kB
        Oleg Tikhonov

        Activity

        Jukka Zitting added a comment -

        See http://www.russotto.net/chm/chmformat.html for a description of the CHM format. Quick browsing didn't reveal any Java-based parser libraries that we could use to parse CHM files.
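A minimal sketch of the very first check a CHM reader can make, assuming the four-byte "ITSF" signature that opens the file header in the format description linked above. The class and method names are hypothetical and are not part of Tika or jchm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper, not part of Tika: checks for the CHM "ITSF" signature. */
final class ChmSignature {
    // The first four bytes of a CHM file are the ASCII characters "ITSF".
    private static final byte[] MAGIC = "ITSF".getBytes(StandardCharsets.US_ASCII);

    private ChmSignature() {}

    /** Returns true if the buffer starts with the four signature bytes. */
    static boolean looksLikeChm(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    /** Reads at most the first four bytes of a file and checks them. */
    static boolean looksLikeChm(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeChm(in.readNBytes(MAGIC.length));
        }
    }
}
```

A signature match only says the file claims to be CHM; the directory listing and the compressed content still have to be decoded before any HTML can be handed to Tika.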

        Luciano Leggieri added a comment -

        Hi, I've started to use Tika to parse some files I have, and sadly several of them are CHM. Have you tried http://sourceforge.net/projects/jchm/ to see if it works?

        Jukka Zitting added a comment -

        jchm looks promising, thanks for the pointer!

        Is anyone interested in implementing a Tika Parser wrapper for jchm? As a starting point, it would be nice if the jchm jar were made available on Maven Central.

        Oleg Tikhonov added a comment - - edited

        TIKA-245 patch: a CHM parser using jchm.jar.

        Oleg Tikhonov added a comment - - edited

        Hi, I've implemented a CHM parser; please review it and share what you think.
        Here is a link: https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12427752

        There are some open issues I would like to discuss.
        1. Metadata - a CHM file contains many different kinds of files: images, HTML, CSS, JS, etc.
        2. Currently it does not support multi-threaded execution.
        3. jchm itself has bugs. I fixed one, an ArrayIndexOutOfBoundsException; the question is how to get the changes merged and published.

        I've written to the author, Feng Yu (yfbio@hotmail.com), but have had no answer yet.

        I would like to get your feedback.
        Thanks,
        Oleg.

        Nick Burch added a comment -

        JCHM seems to be under the CDDL license, so we're fine to use the Jar as a runtime dependency as per http://www.apache.org/legal/3party.html

        However, before we can use it, we'll need to get the jar into Maven Central. One problem would appear to be the lack of response from the authors, as seen from the lack of recent commits and the problems you've had getting your bug fixes applied.

        Personally, I'd suggest you try a bit more to get hold of the original authors - try emails, SourceForge trackers, etc. If you can get in touch, hopefully they'll make you a maintainer. That would allow you to apply patches and request the Maven Central upload.

        Otherwise, I guess the only option is for you to fork the project, probably on sourceforge (the license doesn't permit it to be hosted by apache). You can apply the patches to your fork, and have it uploaded into maven central.

        Once a patched version is in maven, we can add the dependency to Tika, and apply your parser patch!

        Oleg Tikhonov added a comment - - edited

        A couple of weeks ago I received the answer from SourceForge.net:
        "My apologies for not passing this message on sooner, however the project admin has responded that he is not willing to give up this project at this time. As such, we are not fulfilling this takeover request."

        The library as it is today contains critical bugs, and because the project is abandoned I cannot get them fixed, so I would exclude it as an option.

        Another option is 7-Zip-JBinding (http://sourceforge.net/projects/sevenzipjbind/develop/). I've implemented a CHM parser using this library and it works pretty well; the throughput of HTML extraction is about 5mb/sec. However, it's licensed under the LGPL. I've asked Boris Brodski (the developer of that library) if he could re-license it for us. Here is a link to the discussion between him and Igor Pavlov (the author of 7-Zip).
        http://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/3983892

        What do you think?

        BR,
        Oleg

        Oleg Tikhonov added a comment -

        I've implemented a CHM extractor based on the same extraction algorithm as chmlib. Before committing the patch, I would like to discuss packaging. How do you think we should package it: as a stand-alone library, or as part of the CHM parser? Or is there a more appropriate way of doing this?

        Nick Burch added a comment -

        Eventually we might need to separate out some of the file format parts from the tika parser package. For now though, I'd vote for doing the same as we've done for MP3, AutoCad etc, and put the file format reader in with the parser in the parser package.

        Oleg Tikhonov added a comment -

        Here is a patch.
        Things to be done:
        i. Improve extracting capabilities;
        ii. Support multi-threading execution;
        iii. Add some mechanism for handling entry names that come out garbled, for example: /graphics/122fig01.jpgáÄ5?¼%3/032133678X/images/032133678X/graphics/125fig01.jpgã€Z‚?A
        iv. Find an approach for extracting data from block types: 3 - 7. Originally it's marked as invalid blocks;
        v. Improve the caching;
        vi. Change/add/refactor classes/methods/messages
        vii. Improve performance - depends on amount of extracted items.

        Oleg Tikhonov added a comment -

        Improved the address block calculation.
        Single-thread performance on my dual-core laptop is about 2Mb/sec.

        Tran Nam Quang added a comment -

        Hello guys,

        Here's another CHM library for Java, licensed under the LGPL: http://sourceforge.net/projects/chm4j/

        Best regards
        Tran Nam Quang

        Chris A. Mattmann added a comment -

        Guys, we'd prefer to not use LGPL (and there are serious limitations and impacts for using it at Apache), so LGPL is a no-go.

        Oleg Tikhonov added a comment -

        1. Changed the CHM parser implementation to follow AbstractParser.
        2. Improved directory listing enumeration.

        Chris A. Mattmann added a comment -

        Hi Oleg,

        Looking over this patch, I have a few recommendations:

        1. The patch should be applied to the Tika source tree format (e.g., tika-parsers/src/main/java/org/apache/tika/parsers/chm).
        2. Many of the class-top-level comments can probably be removed and put up on the Tika Wiki.
        3. It would be nice to include at least a unit test or two to know this is working. It's a huge patch, and I don't have a lot of CHM files to test it out on (being a Mac guy).

        Cheers,
        Chris

        Chris A. Mattmann added a comment -

        OK, in r1133047, 1133048, 1133049 and 1133050, I put my money where my mouth was

        I think we're good here, no?

        Oleg, first off: awesome patch. I didn't realize how much code you actually put together to do this. There is one more update I am going to make which is to add the CHM parser to the list of Tika parsers in the SPI text file, but I think we're good here as an initial revision.

        Jukka, thanks for the motivation!

        Jukka Zitting added a comment -

        Nice!

        Next time, Oleg, feel free to commit your changes directly without waiting (too long) for review. It's much easier for people to find problems or to suggest improvements when the code is already in svn.

        Oleg Tikhonov added a comment -

        support of Java 5

        Oleg Tikhonov added a comment -

        Committed revision 1133556.

        Mattmann, Chris A (388J) added a comment -

        Awesome!


        Tran Nam Quang added a comment -

        @ Oleg
        I tested the CHM parser from Tika 0.10 on a few sample CHM files and found that many valid CHM entries are skipped. For comparison, I ran the same test with the chm4j library, which does not skip these entries. Do you know about this problem?

        Tejas Patil added a comment -

        I am working on NUTCH-1454 and am observing that Tika is not able to extract content from CHM documents (I tried several CHM files, but it worked for none). A CHM viewer, however, could show the entire contents of the files. I am not the only one facing this issue (see here).

        Jukka Zitting added a comment -

        "tika is not able to extract contents from chm documents"

        This was probably due to TIKA-1110, now fixed.

        Prashanth Ramaswamy added a comment -

        Hi, I still get the Array index exception in trying to parse CHM files.

        Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: -1
        at java.util.ArrayList.elementData(ArrayList.java:382)
        at java.util.ArrayList.get(ArrayList.java:395)
        at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:178)

        An older comment said this was fixed. Is that so, or is the bug still there?
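The trace above shows ArrayList.get being called with -1. One plausible cause, which is an assumption since the failing file is not available, is that the result of a failed lookup is passed to get() unchecked. A minimal sketch of guarding such a lookup follows; the class, the method names, and the sample entry names are hypothetical and are not taken from ChmExtractor.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration; these names are not taken from ChmExtractor. */
final class EntryLookup {
    private EntryLookup() {}

    /** Returns the index of the first entry whose name contains the marker, or -1 if none does. */
    static int indexOfEntry(List<String> names, String marker) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(marker)) {
                return i;
            }
        }
        return -1;
    }

    /** Reports a missing entry as an empty Optional instead of passing -1 to get(). */
    static Optional<String> findEntry(List<String> names, String marker) {
        int index = indexOfEntry(names, marker);
        return index < 0 ? Optional.empty() : Optional.of(names.get(index));
    }
}
```

With a guard like this, a file that lacks an expected entry can be rejected with a descriptive error rather than failing inside ArrayList.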

        Nick Burch added a comment -

        Prashanth - you might be best off opening a new bug for this problem, and uploading a problematic file which shows the issue

        Prashanth Ramaswamy added a comment -

        Nick, Thanks for your response. Unfortunately, I am constrained from uploading the chm file for which I'm encountering the exception. I may have to see if there are other chm files for which the same exception gets thrown.


          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Karl Heinz Marbaise
          • Votes:
            1
            Watchers:
            6
