Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This plugin should build on the Any23 library to extract RDF data from HTTP and file resources. Although at the time of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.

      1. NUTCH-1129.patch
        165 kB
        Lewis John McGibbney

        Activity

        Julien Nioche added a comment -

        Any23 might graduate into a Tika subproject; if not, it should be available as a Tika parser and we'll get it automatically.

        Lewis John McGibbney added a comment -

        Thanks Julien. To be honest, it would be nice if the latter of your two options materialised. I'll keep this issue open to track the progress.

        Hudson added a comment -

        Integrated in nutch-trunk-maven #69 (See https://builds.apache.org/job/nutch-trunk-maven/69/)
        NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

        markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/conf/log4j.properties
        Hudson added a comment -

        Integrated in Nutch-trunk #1699 (See https://builds.apache.org/job/Nutch-trunk/1699/)
        NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

        markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/conf/log4j.properties
        Markus Jelsma added a comment -

        Hi guys, anything new on this one?

        Lewis John McGibbney added a comment -

        Hi Markus. I'm really gutted about this one, I've not had time to sort it out. I want to say the following things though.

        • Any23 is now available on repository.apache.org [1], however I think we need to change our ivy resolver to fetch these 0.7.0-snapshots. Should be pretty trivial though.
        • Any23 already has a crawler plugin implementation (nothing like the stuff we offer in Nutch ;0)). I'm not familiar with the code, but it might be worth a swatch? [2] Unfortunately the documentation is not great at all, as I'm sure you'll agree.

        [1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
        [2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/

        Lewis John McGibbney added a comment -

        This is a first ditch attempt at the parse-any23 plugin. In all honesty the patch is a monster due to a hugely excessive test suite. This will be cut down once I get the code implementation written properly.

        Markus Jelsma added a comment -

        This is a parser plugin, right? How will this work if we, for example, would like to parse microdata with Any23 and use Tika's BoilerpipeContentHandler for extraction? In the current BP patch we use multiple content handlers to parse everything in one go, so I wonder if this could be implemented the same way.

        Please correct me if I'm wrong.
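
The "multiple content handlers in one go" idea can be illustrated with a minimal sketch that uses only the JDK's SAX API. This is not the code from the Boilerpipe patch or from the Any23 patch: `SinglePassDemo`, `FanOutHandler`, `TextCollector` and `ItempropCollector` are hypothetical names, and the two collectors are only stand-ins for a text extractor and a structured-data extractor.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SinglePassDemo {

    /** Forwards each SAX event it receives to every delegate handler (hypothetical helper). */
    static class FanOutHandler extends DefaultHandler {
        private final List<DefaultHandler> delegates;

        FanOutHandler(List<DefaultHandler> delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) {
                d.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (DefaultHandler d : delegates) {
                d.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) {
                d.characters(ch, start, length);
            }
        }
    }

    /** Stand-in for a content extractor: collects all character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Stand-in for a structured-data extractor: collects itemprop attribute values. */
    static class ItempropCollector extends DefaultHandler {
        final List<String> props = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String prop = atts.getValue("itemprop");
            if (prop != null) {
                props.add(prop);
            }
        }
    }

    /** Parses the document once and feeds the same event stream to all handlers. */
    static void parseOnce(String xhtml, DefaultHandler... handlers) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                new FanOutHandler(List.of(handlers)));
    }

    public static void main(String[] args) throws Exception {
        String doc = "<div><span itemprop=\"name\">Nutch</span> "
                + "<span itemprop=\"url\">nutch.apache.org</span></div>";
        TextCollector text = new TextCollector();
        ItempropCollector props = new ItempropCollector();
        parseOnce(doc, text, props);
        System.out.println(text.text);   // character data seen by the first handler
        System.out.println(props.props); // itemprop values seen by the second handler
    }
}
```

The point of the design is that the input is read and tokenised only once; whether Any23's extractors can be driven from a shared SAX event stream like this is exactly the open question raised in the comment above.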

        Lewis John McGibbney added a comment -

        Yeah, you're right Markus. The Any23 libraries are parsers for extracting stuff like microdata; we would rely upon Tika for content extraction. Currently in Any23 I think we're stuck way back at 0.6 or something, so there is obviously work to be done here. I've been looking at https://svn.apache.org/viewvc/nutch/trunk/src/plugin/microformats-reltag/
        I'll work towards reusing as much of the Tika stuff we have.

        Lewis John McGibbney added a comment -

        I missed the boat on this one as we were focusing too much on actually getting Any23 moving... which did not happen.
        We are however moving Any23 over to Tika so the goodies will be coming once the transition is finished.

        Lewis John McGibbney added a comment -

        There has recently been a change of heart down in Any23land.
        I feel that the project has taken a turn for the better and things are looking much brighter for Any23.

        Lewis John McGibbney added a comment -

        First pass at this for 2.x HEAD.
        Some tests covering RDFa and Microdata extraction.
        I've documented the patch everywhere I could to make the Any23 functionality as clear as possible.

        For those wanting to test out this patch, please turn logging to debug and you will see a nice extractor report in with your logs. This is great for seeing which Any23 extractors were activated and used as well as how many triples were extracted and how long it took to do the job!

        There are some cons which I would like to address. Right now, by default, we (the Any23 code base) print out a rather bulky configuration message which is really unappealing as far as logging goes. I need to find a way of turning this off. It can maybe be done through configuration, but I may also need to add a switch down in Any23 for it.

        So anyway, here is a first pass. If you are able to comment it would be great.
        Thanks

        Lewis John McGibbney added a comment -

        During ApacheCon I'll port this to trunk, unless someone else wishes to do so.

        Lewis John McGibbney added a comment -

        Did anyone get an opportunity to try this out on 2.x?

        Sebastian Nagel added a comment -

        Hi Lewis John McGibbney, not yet. But I had a look at the patch. Looks good, in general! Some comments:

        • The dependency on the any23 jar is also in ivy/ivy.xml. Is a global dependency required? We recently had a discussion about that topic @user.
        • All extracted triples are finally stored in one multi-valued field, each triple represented as a string. That's not an optimal representation for two (are there more?) possible use cases: extracting and indexing key-value pairs as structured content (cf. @dev), and indexing into some triple store (as a new indexer back-end).
        • Similarly: isn't there a more efficient way to pass triples from the parse filter to the indexing filter than tab-separated in one huge string? (There may be many triples in one document!)

        The latter two points aren't blockers by any means. But we should think about evolving the plugin to make it really usable.
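
To make the key-value use case concrete, here is a minimal sketch of turning string-encoded triples into per-predicate fields. It assumes each triple is encoded as `subject<TAB>predicate<TAB>object`; the exact encoding used by the patch is not shown in this issue, and `TripleFields`, `Triple` and `toKeyValues` are hypothetical names, not part of the patch or of Any23.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TripleFields {

    /** One RDF statement as a typed value instead of a flat string (hypothetical type). */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        /** Assumed encoding: subject, predicate and object separated by tabs. */
        static Triple fromTabSeparated(String line) {
            String[] parts = line.split("\t", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 tab-separated values: " + line);
            }
            return new Triple(parts[0], parts[1], parts[2]);
        }
    }

    /** Groups object values by predicate so each predicate can become its own index field. */
    static Map<String, List<String>> toKeyValues(List<String> tabSeparatedTriples) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : tabSeparatedTriples) {
            Triple t = Triple.fromTabSeparated(line);
            fields.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> triples = List.of(
                "http://example.org/page\thttp://schema.org/name\tNutch",
                "http://example.org/page\thttp://schema.org/url\thttp://nutch.apache.org/");
        // Each predicate maps to the list of object values found for it.
        System.out.println(toKeyValues(triples));
    }
}
```

Grouping by predicate drops the subject, which is acceptable for simple field indexing but not for a triple-store back-end; that use case would need the full `Triple` values passed through.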


          People

          • Assignee:
            Lewis John McGibbney
            Reporter:
            Lewis John McGibbney
          • Votes:
            2
            Watchers:
            3
