Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: indexer, parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.

      Your Data --> Parse-html plugin or TIKA plugin --> DOM structure --> XSLT plugin

      The main advantages are:

      • You won't have to produce any Java code, only XSLT and configuration
      • It can process the DOM structure from a DocumentFragment (see NekoHtml and TagSoup)
      • It is HtmlParseFilter-compatible and can be plugged in like any other plugin (parse-js, parse-swf, etc.)
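
      For illustration only, a minimal sketch of the transformation step (generic javax.xml.transform usage, not this plugin's actual code; the stylesheet name "transformer.xsl" is a placeholder): the DOM produced by parse-html or parse-tika is fed to an XSL transformation whose output carries the extracted metadata.

        import javax.xml.transform.Transformer;
        import javax.xml.transform.TransformerException;
        import javax.xml.transform.TransformerFactory;
        import javax.xml.transform.dom.DOMResult;
        import javax.xml.transform.dom.DOMSource;
        import javax.xml.transform.stream.StreamSource;
        import org.w3c.dom.DocumentFragment;
        import org.w3c.dom.Node;

        // Hypothetical sketch of the pipeline step "DOM structure --> XSLT plugin".
        public class XslExtractionSketch {

          /** Applies a stylesheet (placeholder name) to the DOM built by the HTML parser. */
          public static Node extract(DocumentFragment htmlDom) throws TransformerException {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("transformer.xsl"));
            DOMResult result = new DOMResult();
            // DOMSource accepts any org.w3c.dom.Node, including a DocumentFragment.
            transformer.transform(new DOMSource(htmlDom), result);
            // The caller reads the extracted metadata fields from the result tree.
            return result.getNode();
          }
        }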

      This topic has been discussed on http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html

      1. xsl-parse-plugin2.patch
        68 kB
        Albinscode
      2. xsl-parse-plugin.patch
        273 kB
        Albinscode
      3. nutch-site.xml
        1 kB
        Albinscode
      4. NUTCH-1870-trunk-v4.patch
        55 kB
        Sebastian Nagel
      5. NUTCH-1870-trunk-v3.patch
        50 kB
        Sebastian Nagel

          Activity

          Albinscode Albinscode added a comment -

          As suggested by Sebastian Nagel and Chris Mattmann, I'm providing this small plugin as a patch to see if it is valuable for the Nutch community.

          To keep the patch ASCII-only, I've disabled in build.xml the generation of Java classes with JAXB (we still have to see how to integrate them).

          As you will see, some unit tests are strongly tied to sites I'm crawling. If they are too specific, I can take time to crawl some more relevant sites and provide more examples.

          wastl-nagel Sebastian Nagel added a comment -

          Thanks, Albinscode, for the patch. Looks nice, the code is well formatted, ... I'll continue testing, but a few first comments:

          • resources could be loaded in setConf(conf) instead of on demand in the filter() method:
            • setConf() is called early, so failures in reading configuration resources are reported soon
            • filter() may be called concurrently because for every plugin only one instance is held per extension point
          • thread-safety: the filter() method must be thread-safe, and so must be all object instances it uses. Transformer instances are not safe and may not be shared by threads. That's also true for other DOM/XML related classes, cf. 1, 2, or NUTCH-1596. Possible solutions are, e.g., to make these variables local or thread-local (see the sketch below).
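
          For illustration, a rough sketch of how both suggestions could fit together (class, field, and property names are invented, not taken from the patch): the configuration resource is read once in setConf(), so a broken setup fails at startup, while the non-shareable Transformer is only ever created as a local variable inside filter().

            import java.io.BufferedReader;
            import java.io.StringReader;
            import javax.xml.transform.Transformer;
            import javax.xml.transform.TransformerFactory;
            import javax.xml.transform.dom.DOMResult;
            import javax.xml.transform.dom.DOMSource;
            import javax.xml.transform.stream.StreamSource;
            import org.apache.hadoop.conf.Configuration;
            import org.w3c.dom.DocumentFragment;

            // Hypothetical skeleton, not the patch itself.
            public class XslParseFilterSketch {

              private String stylesheet; // immutable data only: safe to share between threads

              public void setConf(Configuration conf) {
                String xslFile = conf.get("parse.xsl.file", "transformer.xsl"); // assumed property name
                try (BufferedReader reader = new BufferedReader(conf.getConfResourceAsReader(xslFile))) {
                  StringBuilder sb = new StringBuilder();
                  for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                    sb.append(line).append('\n');
                  }
                  stylesheet = sb.toString();
                } catch (Exception e) {
                  // Reported at configuration time, long before any document is parsed.
                  throw new RuntimeException("Cannot read XSL resource " + xslFile, e);
                }
              }

              public void filter(DocumentFragment doc) throws Exception {
                // Transformer is not thread-safe, so it stays a local variable.
                Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
                transformer.transform(new DOMSource(doc), new DOMResult());
              }
            }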

          > there are some unit tests strongly related to sites I'm crawling

          • it would be better to take sample pages where we are sure not to violate any copyright
          Albinscode Albinscode added a comment -

          Sebastian Nagel, I've updated the code according to your (very accurate) review:

          • all properties resources are now loaded outside of the setConf(conf) method
          • thread safety: the RulesManager is no longer a singleton (so all required xsl transformers are instantiated for the current plugin instance)

          Next I'll remove my site-specific tests and add some sample data related to crawling a library (I think it is a common use case).

          I'll keep you in touch.

          Albin

          jnioche Julien Nioche added a comment -

          Hi Albin

          Could you please generate your patch against the trunk? see https://wiki.apache.org/nutch/HowToContribute#Creating_a_patch

          @author tags are not encouraged for contributions to Apache projects - I don't have a specific reference to any discussion on this, but it has been mentioned before.

          Apologies if I've missed the explanation before but why do you need an indexingfilter as well as a HtmlParseFilter? If the extracted data are in the parse metadata then the index-metadata plugin should be able to do the indexing.

          Am I right in thinking that the selection of the rule is based on a regexp on the URL? Could the mechanism be extended so that the selection is done via an arbitrary metadatum? For instance, if a document has a given key-value in its metadata, which could be set at injection time or obtained from a source URL, we would be able to select the extraction patterns based on it.

          Thanks

          Albinscode Albinscode added a comment -

          Hello Julien,

          1. Yes, I'll regenerate a patch soon. If Sebastian has nothing more to add, I'll provide it at the end of the week.
          2. You're totally right. I've removed some additional @author tags, but yeah, I would say it's hard to give away my baby.
          3. It is a good point. I've created a specific indexer to index all metadata provided in the xsl used for the transformation. It lets people avoid specifying a second time, in the global Nutch conf, which metadata to index, since this is already specified in the xsl file. It is really a matter of philosophy: if you find it redundant, and clearer to explicitly list the metadata to index in the global conf, we can remove it.
          4. This is another good point and a very interesting approach. We could for example specify a rule "method" attribute (with value "url" or "field"). I'll write it down in my TODO file!
          Thanks a lot for all these remarks!

          wastl-nagel Sebastian Nagel added a comment -

          > 1. I'll regenerate a patch soon. If Sebastian has nothing more to add, I'll provide it at the end of the week.
          Please go on. There are enough changes to wait for...
          > 2. it's hard to give away my baby
          You'll get mentioned in CHANGES.txt.
          > 3. (implement own indexing filter instead of using index-metadata to add extracted fields)
          It's better if plugins do not depend on other plugins. Parse-xsl is more powerful than parse-metatags (but more difficult to configure). So if you need parse-xsl, you'll probably also use it for simple meta tag extraction.

          Neko converts element names to uppercase, that's why xpath = xpath.toUpperCase(), right? However, that breaks XPath statements containing attributes (canonically lowercase, cf. [1]). Selectively converting only element names in XPath statements would require parsing them - hard (if not impossible) without a library. Also, .toUpperCase() without an explicit locale (here: Locale.ENGLISH) is sensitive to the system's locale, see NUTCH-1807.
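
          As a generic aside (not code from the patch), the locale pitfall boils down to this: with a Turkish default locale, "i" upper-cases to the dotted capital "İ" (U+0130), so upper-cased tag names or XPath fragments no longer match; passing an explicit locale makes the result stable.

            import java.util.Locale;

            public class UpperCaseLocaleDemo {
              public static void main(String[] args) {
                String tag = "title";
                // Result depends on the JVM default locale: under tr_TR this prints "TİTLE".
                System.out.println(tag.toUpperCase());
                // Locale-independent: always prints "TITLE".
                System.out.println(tag.toUpperCase(Locale.ENGLISH));
              }
            }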

          Albinscode Albinscode added a comment -

          @Sebastian, regarding the toUpperCase remark: you are right, but if you look at the code it is only applied to the "html" string, so there is no attribute problem. It is only a trick to get the document at the "html" tag level (it seems that Neko embeds it within other tags).

          I've removed the @author tags and I've created a new set of sample tests dedicated only to Nutch, to avoid crawling copyrighted sites.

          I'm releasing the patch as soon as possible.

          Thank you everyone.

          Albinscode Albinscode added a comment -

          New patch that strips @author tags, provides non-copyrighted tests, and is thread-safe.

          Albinscode Albinscode added a comment - edited

          @Sebastian, the version of the JAXB implementation is 2.2.7. Hope this helps; otherwise let me know.

          As an attachment, I've provided a sample conf that works with the sample files provided in src/tests/files/sample1.

          Albinscode Albinscode added a comment -

          Sample conf file.

          wastl-nagel Sebastian Nagel added a comment -

          Hi Albinscode, simple and funny example!

          I've added a patch which

          • includes boilerplate to build, test, generate javadoc
          • makes the tests run (but only from src/plugin/parse-xsl via "ant test")
          • various minor changes
          • javadoc
            • added package-info in org.apache.nutch.parse.xsl
            • auto-generated JAXB packages are suppressed. Or do we need javadocs for these classes?
          • attribute "filterUrlsWithNoRule" belongs to the element "rules", right? -> changed in the sample

          The plugin is working now! I'll continue testing with more complex transforms (to get the full power of XSL).

          Meanwhile a few points which could require review or rework:

          • load all configuration files from class path, e.g.
            Reader reader = conf.getConfResourceAsReader(rulesFile);
            

            That's important if Nutch is run via Hadoop: class and configuration files are wrapped into one single job file. There are no "real" files which can be loaded.
            This also applies to running the unit tests: we cannot rely on their being executed from a specific working directory.

          • reading config files on-demand and multiple times is not really efficient. It's better to read and parse all configuration files during setConf(). Sorry, maybe my comment before was not 100% clear at this point, but setConf() should be the best place:
            • errors in the configuration are caught early, and are less likely to be overlooked than if they happen somewhere in the middle of parsing a segment
            • inside setConf() you do not have to care about thread-safety
            • setConf() is called only once
            • parsing should be fast and there is a strict timeout (30 sec. by default)
          • regarding thread-safety: the trade-off should be minimal. Making RulesManager a local variable seems too much and is in contradiction to the previous point (loading config files). Wouldn't it be sufficient to make only those objects thread-local which are unsafe and need to be used from filter()? E.g., javax.xml.transform.Transformer is definitely not thread-safe (we need to check other javax classes). But it should be possible to get a Transformer without reading the xsl file again every time (see the sketch after this list).
          • what about fields with multiple values? An expression can match multiple times, but it looks like only the first match is extracted.
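
          A possible shape of that trade-off, as a sketch under assumptions (field and property names are invented, not from the patch): the stylesheet is compiled once in setConf() into a javax.xml.transform.Templates, which is thread-safe, and each thread obtains its own Transformer from it via a ThreadLocal, so the xsl file is read exactly once.

            import javax.xml.transform.Templates;
            import javax.xml.transform.Transformer;
            import javax.xml.transform.TransformerConfigurationException;
            import javax.xml.transform.TransformerFactory;
            import javax.xml.transform.stream.StreamSource;
            import org.apache.hadoop.conf.Configuration;

            // Hypothetical sketch, not the patch itself.
            public class ThreadLocalTransformerSketch {

              private Templates templates; // compiled stylesheet, safe to share across threads

              // One Transformer per thread, created lazily from the shared Templates.
              private final ThreadLocal<Transformer> transformer = new ThreadLocal<Transformer>() {
                @Override
                protected Transformer initialValue() {
                  try {
                    return templates.newTransformer();
                  } catch (TransformerConfigurationException e) {
                    throw new RuntimeException(e);
                  }
                }
              };

              public void setConf(Configuration conf) {
                String xslFile = conf.get("parse.xsl.file", "transformer.xsl"); // assumed property name
                try {
                  // Read from the class path (the job file) exactly once, at configuration time.
                  templates = TransformerFactory.newInstance()
                      .newTemplates(new StreamSource(conf.getConfResourceAsReader(xslFile)));
                } catch (Exception e) {
                  throw new RuntimeException("Cannot compile XSL resource " + xslFile, e);
                }
              }

              public Transformer getTransformer() {
                return transformer.get(); // safe to call from concurrent filter() invocations
              }
            }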
          Albinscode Albinscode added a comment -

          OK Sebastian Nagel, I think I have to go over these points a bit more calmly, because I didn't figure out all the modifications to make in my first update. As I'm running Nutch in standalone mode (without Hadoop), I didn't really think about thread safety.

          I'll keep you in touch.

          wastl-nagel Sebastian Nagel added a comment -

          Hi Albinscode, any news or progress? Alternatively, I could try to get the last things done to make this really useful plugin available.

          Albinscode Albinscode added a comment -

          Hi Sebastian Nagel,

          I really have done nothing since I changed my job last year, and it is really a shame.
          Furthermore, I'm not really comfortable with thread safety. If I can help, I think I would be more productive fixing the configuration points.

          wastl-nagel Sebastian Nagel added a comment -

          New patch including:

          • load all configuration files from class path
          • load once initially via setConf(conf)
          • made javax.xml.transform.Transformer thread-local
          • renamed ParseTechnicalTests so that they are run by ant (-> TestParseTechnical)

          Further / open points:

          • verified (I was wrong in my last comment): it's possible to fill fields with multiple values if an expression matches more than once
          • in the transformer file (cf., in src/plugin/parse-xsl/: conf/documents.xsd or sample/sample1/transformer_book.xsl) there is a "documents" node containing an open-ended number of "document" nodes. This could also be used to (optionally) add multiple "subdocuments" (cf. NUTCH-443).
          • selection of the root node via the XPath "html" has one disadvantage: it makes it impossible to use parse-xsl to make pure XML documents indexable and to extract documents and fields (key-value pairs) from arbitrary XML. Also, the special treatment of Neko (parse-html) and TagSoup (parse-tika) via lower/upper-cased tag names could become needless after NUTCH-1592.
          • there is still some extra debugging code (saveDOMOutput(), displayMemoryUsage, timing) which could be moved to some general utility classes; it could also be useful elsewhere.
          • possibly we should simplify how the parsers are called in AbstractCrawlTest: just call ParseUtil.parse(). This would also make the parse-xsl tests sensitive to changes in parse-html (or parse-tika).
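
          For reference, a hedged sketch of the "just call ParseUtil.parse()" approach in a test helper (URL, content type, and byte source are made up for illustration; the ParseResult lookup by URL follows the pattern used in other Nutch parser tests):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.nutch.metadata.Metadata;
            import org.apache.nutch.parse.Parse;
            import org.apache.nutch.parse.ParseResult;
            import org.apache.nutch.parse.ParseUtil;
            import org.apache.nutch.protocol.Content;
            import org.apache.nutch.util.NutchConfiguration;

            // Hypothetical test helper: ParseUtil picks parse-html or parse-tika
            // (plus parse-xsl as an HtmlParseFilter) according to plugin.includes,
            // instead of the test instantiating the parsers by hand.
            public class ParseUtilSketch {

              public static Parse parseSample(String url, byte[] html) throws Exception {
                Configuration conf = NutchConfiguration.create();
                Content content = new Content(url, url, html, "text/html", new Metadata(), conf);
                ParseResult parseResult = new ParseUtil(conf).parse(content);
                return parseResult.get(url);
              }
            }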
          chrismattmann Chris A. Mattmann added a comment -

          Hi Seb:

          Possibly we should simplify how the parsers are called in AbstractCrawlTest: just call ParseUtil.parse(). This would also make the parse-xsl tests sensitive to changes in parse-html (or parse-tika).

          Totally agree! I remember whipping up ParseUtil a long time ago and I think it's still a good place to encapsulate calls to parsers.

          lewismc Lewis John McGibbney added a comment -

          It would be really nice to get your patch as a GitHub PR, Sebastian Nagel. Are you able to do it, or do you want me to?


            People

            • Assignee:
              Unassigned
            • Reporter:
              Albinscode
            • Votes:
              0
            • Watchers:
              5
