Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD. XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.

      In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (<, ", &) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib. The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "<" and the offsets could be wrong. The upgraded version is in the patch.

      1. SOLR-2597.patch
        26 kB
        Mike Sokolov
      2. SOLR-2597.patch
        31 kB
        Mike Sokolov

        Activity

        Hide
        Mike Sokolov added a comment -

        I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually. What's needed are: http://repository.codehaus.org/org/codehaus/woodstox/woodstox-core-asl/4.1.1/woodstox-core-asl-4.1.1.jar and http://repository.codehaus.org/org/codehaus/woodstox/stax2-api/3.1.1/stax2-api-3.1.1.jar
        which replace the existing wstx-asl-xxx.jar.

        Show
        Mike Sokolov added a comment - I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually. What's needed are: http://repository.codehaus.org/org/codehaus/woodstox/woodstox-core-asl/4.1.1/woodstox-core-asl-4.1.1.jar and http://repository.codehaus.org/org/codehaus/woodstox/stax2-api/3.1.1/stax2-api-3.1.1.jar which replace the existing wstx-asl-xxx.jar.
        Hide
        Hoss Man added a comment -

        Mike: thanks for the patch!

        as Koji mentioned on the mailing list, might want to consider naming this XmlStripCharFilter ... that was my first opinion, but reading the docs the "include" and "exclude" options definitely make it a bit more generic, so i'm leaning towards the opinion that XmlCharFilter is better.

        (there's an argument to be made that we should have an XmlStripCharFilter that only removes pi/comments/whitespace and resolves entities, and then a distinct XmlTagCharFilter that does the include/exclude – but i'm guessing that would be less efficient since this makes it possible to do in one pass, and anyone who wants include/exclude at the "tag" level is almost certainly going to want the striping/entities as well)

        skiming the patch i'm +1 except for the "new Random" in the test case ... if you take a look at the existing test cases you'll see how you can hook into the solr test framework to get random values that are consistent with a global seed – that way if a test fails, it will report which seed was used and people can reproduce it using system properties.

        would also be nice to have a test of the Factory (using a schema.xml declaration) but that's not nearly as important.

        and of course: would be great if "the xml policeman" uwe could review.

        I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually.

        it's not possible, so don't worry about it. the important thing is noting in a comment (like you did) exactly what new/upgraded jars are needed.

        Show
        Hoss Man added a comment - Mike: thanks for the patch! as Koji mentioned on the mailing list, might want to consider naming this XmlStripCharFilter ... that was my first opinion, but reading the docs the "include" and "exclude" options definitely make it a bit more generic, so i'm leaning towards the opinion that XmlCharFilter is better. (there's an argument to be made that we should have an XmlStripCharFilter that only removes pi/comments/whitespace and resolves entities, and then a distinct XmlTagCharFilter that does the include/exclude – but i'm guessing that would be less efficient since this makes it possible to do in one pass, and anyone who wants include/exclude at the "tag" level is almost certainly going to want the striping/entities as well) skiming the patch i'm +1 except for the "new Random" in the test case ... if you take a look at the existing test cases you'll see how you can hook into the solr test framework to get random values that are consistent with a global seed – that way if a test fails, it will report which seed was used and people can reproduce it using system properties. would also be nice to have a test of the Factory (using a schema.xml declaration) but that's not nearly as important. and of course: would be great if "the xml policeman" uwe could review. I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually. it's not possible, so don't worry about it. the important thing is noting in a comment (like you did) exactly what new/upgraded jars are needed.
        Hide
        Robert Muir added a comment -

        just one comment: taking a look at the patch, it currently won't compile because the analysis module has no dependencies and thus no woodstox or whatever.
        (but, thanks for trying to integrate it here!!!)

        One step would be, rather than have this thing static, can we just have the ctors to this thing take a general XMLInputFactory instead, e.g.

        public XmlCharFilter (CharStream reader, XMLInputFactory inputFactory) {
        

        The corresponding Solr CharFilterFactory could then configure it with all the woodstox-specific parameters.
        But, this still wouldn't solve the issue that all of lucene and modules are on java5 (and it looks like this uses java6-specific APIs).

        I don't think it makes sense to block the patch for these issues, so one workaround would be to just add it to Solr-only.
        If/when we ever move to java 6 in lucene we could then move it into the analysis module.
        Another option would be if the XML policeman knows some workaround (sorry, not my thing).

        Show
        Robert Muir added a comment - just one comment: taking a look at the patch, it currently won't compile because the analysis module has no dependencies and thus no woodstox or whatever. (but, thanks for trying to integrate it here!!!) One step would be, rather than have this thing static, can we just have the ctors to this thing take a general XMLInputFactory instead, e.g. public XmlCharFilter (CharStream reader, XMLInputFactory inputFactory) { The corresponding Solr CharFilterFactory could then configure it with all the woodstox-specific parameters. But, this still wouldn't solve the issue that all of lucene and modules are on java5 (and it looks like this uses java6-specific APIs). I don't think it makes sense to block the patch for these issues, so one workaround would be to just add it to Solr-only. If/when we ever move to java 6 in lucene we could then move it into the analysis module. Another option would be if the XML policeman knows some workaround (sorry, not my thing).
        Hide
        Mike Sokolov added a comment -

        OK - I can extend LuceneTestCase, use its random, add can certainly a test for the Factory.

        I'm not sure what the right package for this code is; working in Eclipse of course, all the jars get mushed into one giant classpath. I guess I should build w/ant to see the dependency issues? But it does sound as if it needs to move somewhere where solr/lib contents can be a dependent.

        Apparently there is another jar you can get (http://woodstox.codehaus.org/stax-api-1.0.1.jar) to provide the javax.xml.stream package (StaX) for Java 5, but it doesn't sound as if it would be worth the trouble if this moves into solr land - is that right, can we rely on Java 6 there?

        I agree that having a static parser is distasteful, but it's a performance optimization. It tends to be expensive to instantiate these parsers. I'm not clear on what the object lifecycle for the XmlCharFilter is exactly - Robert are you saying the factory is long-lived, but the filter is not?

        Show
        Mike Sokolov added a comment - OK - I can extend LuceneTestCase, use its random, add can certainly a test for the Factory. I'm not sure what the right package for this code is; working in Eclipse of course, all the jars get mushed into one giant classpath. I guess I should build w/ant to see the dependency issues? But it does sound as if it needs to move somewhere where solr/lib contents can be a dependent. Apparently there is another jar you can get ( http://woodstox.codehaus.org/stax-api-1.0.1.jar ) to provide the javax.xml.stream package (StaX) for Java 5, but it doesn't sound as if it would be worth the trouble if this moves into solr land - is that right, can we rely on Java 6 there? I agree that having a static parser is distasteful, but it's a performance optimization. It tends to be expensive to instantiate these parsers. I'm not clear on what the object lifecycle for the XmlCharFilter is exactly - Robert are you saying the factory is long-lived, but the filter is not?
        Hide
        Robert Muir added a comment -

        yes, the factories are long-lived and do expensive things up-front to configure themselves (parsing files etc)

        Show
        Robert Muir added a comment - yes, the factories are long-lived and do expensive things up-front to configure themselves (parsing files etc)
        Hide
        Mike Sokolov added a comment -

        Updated patch addresses (most of) Robert and Hoss' comments (thanks for the speedy review!):

        Test now uses the random in the test framework

        I added a test for the factory (actually all the tests now use the factory since it is now used to create the parser), but I haven't plumbed this all the way through to a schema declaration.

        Moved to org.apache.solr.analysis: I don't know if this is the right place for this, but at least it should resolve any jar and java 1.6 dependency problems - I think? - at least I can compile and run the tests from both eclipse and ant command line although I'm not sure what that proves exactly.

        The parser is now created in the factory rather than being maintained as a static in the reader class.

        Show
        Mike Sokolov added a comment - Updated patch addresses (most of) Robert and Hoss' comments (thanks for the speedy review!): Test now uses the random in the test framework I added a test for the factory (actually all the tests now use the factory since it is now used to create the parser), but I haven't plumbed this all the way through to a schema declaration. Moved to org.apache.solr.analysis: I don't know if this is the right place for this, but at least it should resolve any jar and java 1.6 dependency problems - I think? - at least I can compile and run the tests from both eclipse and ant command line although I'm not sure what that proves exactly. The parser is now created in the factory rather than being maintained as a static in the reader class.

          People

          • Assignee:
            Unassigned
            Reporter:
            Mike Sokolov
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development