Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This plugin should build on the Any23 library to extract RDF data from HTTP and file resources. Although at the time of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.

      1. NUTCH-1129.patch
        165 kB
        Lewis John McGibbney

        Activity

        Lewis John McGibbney created issue -
        Julien Nioche added a comment -

        Any23 might graduate into a Tika subproject; if not, it should be available as a Tika parser and we'll get it automatically.

        Lewis John McGibbney added a comment -

        Thanks Julien. To be honest, it would be nice for the latter of your comments to materialise. I'll keep this issue open to track progress.

        Julien Nioche made changes -
        Field Original Value New Value
        Fix Version/s 1.5 [ 12318246 ]
        Fix Version/s 1.4 [ 12316519 ]
        Affects Version/s 1.4 [ 12316519 ]
        Hudson added a comment -

        Integrated in nutch-trunk-maven #69 (See https://builds.apache.org/job/nutch-trunk-maven/69/)
        NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

        markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/conf/log4j.properties
        Hudson added a comment -

        Integrated in Nutch-trunk #1699 (See https://builds.apache.org/job/Nutch-trunk/1699/)
        NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

        markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/conf/log4j.properties
        Markus Jelsma added a comment -

        Hi guys, anything new on this one?

        Lewis John McGibbney added a comment -

        Hi Markus. I'm really gutted about this one; I've not had time to sort it out. I want to say the following things though.

        • Any23 is now available on repository.apache.org [1]; however, I think we need to change our Ivy resolver to fetch these 0.7.0-snapshots. Should be pretty trivial though.
        • Any23 already has a crawler plugin implementation (nothing like the stuff we offer in Nutch ;0)). I'm not familiar with the code, but it might be worth a look [2]. Unfortunately, the documentation is not great at all, as I'm sure you'll agree.

        [1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
        [2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/

        Lewis John McGibbney made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Lewis John McGibbney added a comment -

        This is a first attempt at the parse-any23 plugin. In all honesty, the patch is a monster due to a hugely excessive test suite, which will be cut down once I get the implementation written properly.

        Lewis John McGibbney made changes -
        Attachment NUTCH-1129.patch [ 12514570 ]
        Markus Jelsma added a comment -

        This is a parser plugin, right? How will this work if, for example, we would like to parse microdata with Any23 and use Tika's BoilerpipeContentHandler for extraction? In the current BP patch we use multiple content handlers to parse everything in one go, so I wonder if this could be implemented in the same way.

        Please correct me if I'm wrong.
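
As an illustration of the "multiple content handlers in one go" idea, here is a minimal sketch that uses only the JDK's SAX API. It is not code from the Boilerpipe patch, Tika, or Any23: the class names `OnePassDemo`, `TextHandler`, `ItempropHandler` and `FanOutHandler` are hypothetical, and the two delegate handlers are simple stand-ins for a text extractor and a microdata extractor.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class OnePassDemo {

  /** Collects character data; a stand-in for a text-extraction handler. */
  static class TextHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }
  }

  /** Collects itemprop attribute values; a stand-in for a microdata extractor. */
  static class ItempropHandler extends DefaultHandler {
    final List<String> props = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      String prop = atts.getValue("itemprop");
      if (prop != null) {
        props.add(prop);
      }
    }
  }

  /** Forwards the two SAX events used in this sketch to every delegate. */
  static class FanOutHandler extends DefaultHandler {
    private final DefaultHandler[] delegates;

    FanOutHandler(DefaultHandler... delegates) {
      this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
      for (DefaultHandler d : delegates) {
        d.startElement(uri, localName, qName, atts);
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
      for (DefaultHandler d : delegates) {
        d.characters(ch, start, length);
      }
    }
  }

  /** Parses the document once and feeds every handler from the same event stream. */
  static void parseOnce(String xml, DefaultHandler... handlers) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), new FanOutHandler(handlers));
  }

  public static void main(String[] args) throws Exception {
    TextHandler text = new TextHandler();
    ItempropHandler microdata = new ItempropHandler();
    parseOnce("<div><span itemprop=\"name\">Nutch</span> "
        + "<span itemprop=\"url\">nutch.apache.org</span></div>", text, microdata);
    System.out.println(text.text);
    System.out.println(microdata.props);
  }
}
```

The point of the fan-out is that the input is read and tokenised once, however many extractors consume it. Whether Any23's extractors can be driven from a SAX event stream in this way is an open question here, not something this sketch establishes.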

        Lewis John McGibbney added a comment -

        Yeah, you're right Markus. The Any23 libraries are parsers for extracting stuff like microdata; we would rely upon Tika for content extraction. Currently in Any23 I think we're stuck way back at 0.6 or something, so there is obviously work to be done here. I've been looking at https://svn.apache.org/viewvc/nutch/trunk/src/plugin/microformats-reltag/
        I'll work towards reusing as much of the Tika stuff we have as possible.

        Markus Jelsma made changes -
        Fix Version/s 1.6 [ 12319941 ]
        Fix Version/s 1.5 [ 12318246 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.7 [ 12323281 ]
        Fix Version/s 1.6 [ 12319941 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.9 [ 12324611 ]
        Fix Version/s 1.7 [ 12323281 ]
        Lewis John McGibbney added a comment -

        I missed the boat on this one as we were focusing too much on actually getting Any23 moving... which did not happen.
        We are however moving Any23 over to Tika so the goodies will be coming once the transition is finished.

        Lewis John McGibbney made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Lewis John McGibbney added a comment -

        There has been a change of heart recently down in Any23land.
        I feel that the project has taken a turn for the better and things are looking much brighter for Any23.

        Lewis John McGibbney made changes -
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Lewis John McGibbney made changes -
        Fix Version/s 2.3 [ 12324325 ]
        Lewis John McGibbney made changes -
        Status Reopened [ 4 ] In Progress [ 3 ]
        Lewis John McGibbney added a comment -

        First pass at this for 2.x HEAD.
        Some tests covering RDFa and Microdata extraction.
        I've documented the patch everywhere I could to make the Any23 functionality as clear as possible.

        For those wanting to test out this patch, please turn logging to debug and you will see a nice extractor report in your logs. This is great for seeing which Any23 extractors were activated and used, as well as how many triples were extracted and how long the job took!

        Some cons which I would like to address: right now, by default, we (the Any23 code base) print out a rather bulky configuration message which is really unappealing as far as logging goes. I need to find a way of turning this off. It can maybe be done through configuration, but I may also need to add a switch down in Any23 for it.

        So anyway, here is a first pass. If you are able to comment it would be great.
        Thanks

        Lewis John McGibbney made changes -
        Attachment NUTCH-1129.patch [ 12637926 ]
        Lewis John McGibbney made changes -
        Attachment NUTCH-1129.patch [ 12514570 ]
        Lewis John McGibbney made changes -
        Patch Info Patch Available [ 10042 ]
        Lewis John McGibbney added a comment -

        During ApacheCon I'll port this to trunk, unless someone else wishes to do so.

        Lewis John McGibbney added a comment -

        Did anyone get an opportunity to try this out on 2.x?

        Sebastian Nagel added a comment -

        Hi Lewis John McGibbney, not yet. But I had a look at the patch. Looks good, in general! Some comments:

        • The dependency on the any23 jar is also in ivy/ivy.xml. Is a global dependency required? We recently had a discussion about that topic @user.
        • All extracted triples are finally stored in one multi-valued field, each triple represented as a string. That's not an optimal representation regarding two (are there more?) possible use cases: extracting and indexing key-value pairs as structured content (cf. @dev), and indexing into some triple store (as a new indexer back-end).
        • Similarly: isn't there a more efficient way to pass triples from parse to the indexing filter than tab-separated in a huge string? (There may be many triples in one document!)

        The latter two points aren't blockers by any means. But we should think about evolving the plugin and making it really usable.
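
To make the representation question concrete, here is a minimal sketch contrasting a flat tab-separated string per triple with a structured view keyed by predicate. It is not taken from the patch: the `TripleFields` and `Triple` names are hypothetical, and the flat layout (subject, predicate, object separated by tabs) is an assumption about how the strings might be laid out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TripleFields {

  /** One RDF statement; a hypothetical holder, not a class from the patch. */
  static final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
      this.subject = subject;
      this.predicate = predicate;
      this.object = object;
    }
  }

  /** Flat form: subject, predicate and object joined by tabs (assumed layout). */
  static String toFlat(Triple t) {
    return t.subject + "\t" + t.predicate + "\t" + t.object;
  }

  /** Parses the flat form; the limit of 3 keeps any tabs inside the object literal. */
  static Triple fromFlat(String line) {
    String[] parts = line.split("\t", 3);
    if (parts.length != 3) {
      throw new IllegalArgumentException("Expected 3 tab-separated parts: " + line);
    }
    return new Triple(parts[0], parts[1], parts[2]);
  }

  /** Structured form: object values grouped by predicate, usable as key-value fields. */
  static Map<String, List<String>> byPredicate(List<Triple> triples) {
    Map<String, List<String>> fields = new LinkedHashMap<>();
    for (Triple t : triples) {
      fields.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
    }
    return fields;
  }

  public static void main(String[] args) {
    List<Triple> triples = Arrays.asList(
        new Triple("http://example.org/page", "http://schema.org/name", "Apache Nutch"),
        new Triple("http://example.org/page", "http://schema.org/url", "http://nutch.apache.org/"));
    for (Triple t : triples) {
      System.out.println(toFlat(t));
    }
    System.out.println(byPredicate(triples));
  }
}
```

The flat form forces every consumer to re-split strings and to cope with separators inside literals, whereas the grouped form maps directly onto per-predicate index fields. A triple-store back-end would more likely want the `Triple` objects themselves.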

        Lewis John McGibbney made changes -
        Fix Version/s 2.4 [ 12324540 ]
        Fix Version/s 2.3 [ 12324325 ]
        Fix Version/s 1.9 [ 12324611 ]
        Transition                Time In Source Status   Execution Times   Last Executer          Last Execution Date
        Open → In Progress        143d 1h 55m             1                 Lewis John McGibbney   14/Feb/12 23:09
        In Progress → Resolved    667d 17h 53m            1                 Lewis John McGibbney   13/Dec/13 17:02
        Resolved → Reopened       108d 4h 31m             1                 Lewis John McGibbney   31/Mar/14 21:33
        Reopened → In Progress    51s                     1                 Lewis John McGibbney   31/Mar/14 21:34

          People

          • Assignee:
            Lewis John McGibbney
            Reporter:
            Lewis John McGibbney
          • Votes:
            2
            Watchers:
            3

            Dates

            • Created:
              Updated:

              Development