Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.5
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

      Description

      This plugin should build on the Any23 library to extract RDF data from HTTP and file resources. Although at the time of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.

      1. NUTCH-1129.patch
        165 kB
        Lewis John McGibbney

        Issue Links

          Activity

          jnioche Julien Nioche added a comment -

          Any23 might graduate into a Tika subproject; if not, it should be available as a Tika parser and we'll get it automatically.

          lewismc Lewis John McGibbney added a comment -

          thanks Julien. To be honest it would be nice for the latter of your comments to materialise. I'll keep this issue open to track the progress.

          hudson Hudson added a comment -

          Integrated in nutch-trunk-maven #69 (See https://builds.apache.org/job/nutch-trunk-maven/69/)
          NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/log4j.properties
          hudson Hudson added a comment -

          Integrated in Nutch-trunk #1699 (See https://builds.apache.org/job/Nutch-trunk/1699/)
          NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221185
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/log4j.properties
          markus17 Markus Jelsma added a comment -

          Hi guys, anything new on this one?

          lewismc Lewis John McGibbney added a comment -

          Hi Markus. I'm really gutted about this one, I've not had time to sort it out. I want to say the following things though.

          • Any23 is now available on repository.apache.org [1], however I think we need to change our ivy resolver to fetch these 0.7.0-snapshots. Should be pretty trivial though.
          • Any23 already has a crawler plugin implementation (nothing like the stuff we offer in Nutch ;0)) I'm not aware of the code, but it might be worth a swatch? [2] Unfortunately the documentation is not great at all as I'm sure you'll agree.

          [1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
          [2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/

          lewismc Lewis John McGibbney added a comment -

          This is a first ditch attempt at the parse-any23 plugin. In all honesty the patch is a monster due to a hugely excessive test suite. This will be cut down once I get the code implementation written properly.

          markus17 Markus Jelsma added a comment -

          This is a parser plugin, right? How will this work if, for example, we would like to parse microdata with Any23 and use Tika's BoilerpipeContentHandler for extraction? In the current BP patch we use multiple content handlers to parse everything in one go, so I wonder if this could be implemented the same way.

          Please correct me if I'm wrong.

          lewismc Lewis John McGibbney added a comment -

          Yeah, you're right Markus. The Any23 libraries are parsers for extracting stuff like microdata; we would rely upon Tika for content extraction. Currently in Any23 I think we're stuck way back at 0.6 or something, so there is obviously work to be done here. I've been looking at https://svn.apache.org/viewvc/nutch/trunk/src/plugin/microformats-reltag/
          I'll work towards reusing as much of the Tika stuff we have.

          lewismc Lewis John McGibbney added a comment -

          I missed the boat on this one as we were focusing too much on actually getting Any23 moving... which did not happen.
          We are however moving Any23 over to Tika so the goodies will be coming once the transition is finished.

          lewismc Lewis John McGibbney added a comment -

          There has been a change of heart recently down in Any23land.
          I feel that the project has taken a turn for the better and things are looking much brighter for Any23.

          lewismc Lewis John McGibbney added a comment -

          First pass at this for 2.x HEAD.
          Some tests covering RDFa and Microdata extraction.
          I've documented the patch everywhere I could to make the Any23 functionality as clear as possible.

          For those wanting to test out this patch, please turn logging to debug and you will see a nice extractor report in with your logs. This is great for seeing which Any23 extractors were activated and used as well as how many triples were extracted and how long it took to do the job!

          Some cons which I would like to address: right now, by default, we (the Any23 code base) print out a rather bulky configuration message which is really unappealing as far as logging goes. I need to find a way of turning this off. It can maybe be done through configuration, but I may also need to add a switch down in Any23 for it.

          So anyway, here is a first pass. If you are able to comment it would be great.
          Thanks

          lewismc Lewis John McGibbney added a comment -

          During ApacheCon I'll port this to trunk, unless someone else wishes to do so.

          lewismc Lewis John McGibbney added a comment -

          Did anyone get an opportunity to try this out on 2.x?

          wastl-nagel Sebastian Nagel added a comment -

          Hi Lewis John McGibbney, not yet. But I had a look at the patch. Looks good, in general! Some comments:

          • dep to any23 jar is also in ivy/ivy.xml. Is a global dependency required? We recently had a discussion about that topic @user.
          • all extracted triples are finally stored in one multi-valued field, each triple represented as string. That's not an optimal representation, regarding two (are there more?) possible use cases: extract and index key-value pairs as structured content (cf. @dev), index into some triple store (as new indexer back-end)
          • similar: isn't there a more efficient way to pass triples from parse to indexing filter than tab-separated in a huge string (there may be many triples in one document!)

          The latter two points aren't blockers by any means. But we should think about evolving the plugin and making it really usable.
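          To make the second and third points concrete, here is a minimal sketch of carrying each triple as a structured value instead of one opaque string. It assumes well-formed N-Triples lines (subject and predicate contain no whitespace, and each statement ends with a dot). The `TripleSketch` class and its `Triple` type are hypothetical names for illustration only; they are not part of the patch, Nutch, or Any23.

```java
import java.util.ArrayList;
import java.util.List;

public class TripleSketch {

  /** Hypothetical value type holding the three terms of one statement. */
  public static final class Triple {
    public final String subject;
    public final String predicate;
    public final String object;

    Triple(String subject, String predicate, String object) {
      this.subject = subject;
      this.predicate = predicate;
      this.object = object;
    }
  }

  /** Returns the index of the first whitespace character, or -1 if none. */
  private static int indexOfWhitespace(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (Character.isWhitespace(s.charAt(i))) {
        return i;
      }
    }
    return -1;
  }

  /**
   * Parses one N-Triples line. Returns null for blank lines, comments,
   * and lines that do not look like "subject predicate object ."
   */
  public static Triple parseLine(String line) {
    String t = line.trim();
    if (t.isEmpty() || t.startsWith("#") || !t.endsWith(".")) {
      return null;
    }
    // Drop the terminating dot; the object may itself contain dots and spaces.
    t = t.substring(0, t.length() - 1).trim();
    int first = indexOfWhitespace(t);
    if (first < 0) {
      return null;
    }
    String subject = t.substring(0, first);
    String rest = t.substring(first).trim();
    int second = indexOfWhitespace(rest);
    if (second < 0) {
      return null;
    }
    String predicate = rest.substring(0, second);
    String object = rest.substring(second).trim();
    return object.isEmpty() ? null : new Triple(subject, predicate, object);
  }

  /** Parses a whole N-Triples document, skipping lines that do not parse. */
  public static List<Triple> parseAll(String nTriples) {
    List<Triple> result = new ArrayList<>();
    for (String line : nTriples.split("\\r?\\n")) {
      Triple triple = parseLine(line);
      if (triple != null) {
        result.add(triple);
      }
    }
    return result;
  }
}
```

          With the terms separated like this, an indexing filter could map predicates to field names for key-value indexing, or hand whole triples to a triple-store back-end, without re-splitting a large string.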

          githubbot ASF GitHub Bot added a comment -

          thilohaas opened a new pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-318087705

          Hi @thilohaas this patch is too large for us to merge into Nutch master branch...
          Can you please separate out your code to implement Microdata support? We can then review that patch alone.

          githubbot ASF GitHub Bot added a comment -

          simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-318189750

          Will try this patch while waiting for it to be merged into the official repo... thanks, man!

          lewismc Lewis John McGibbney added a comment -

          We need some sort of reasonable response here...
          Currently, this issue is too large.
          Sebastian's comments are valid. Can you please consider addressing them so that we can then work with this patch?

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-318291768

          Hi @simoncpu , there is no way we can merge this code into master branch of Nutch... it is simply too much of a change.
          This patch needs to be reduced in size to be considered.
          Thank you for all contributions to Nutch, we welcome all, we need to make sure that the software is high quality and *stable*.

          githubbot ASF GitHub Bot added a comment -

          thilohaas commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-323806868

          Sorry, I accidentally added changes from another local test branch. It should be cleaned up now and only contain changes relevant to the any23 plugin.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#discussion_r134292923

          ##########
          File path: src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
          ##########
          @@ -0,0 +1,165 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +package org.apache.nutch.any23;
          +
          +import java.io.ByteArrayOutputStream;
          +import java.io.IOException;
          +import java.net.URISyntaxException;
          +import java.nio.charset.Charset;
          +import java.util.*;
          +
          +import org.apache.any23.Any23;
          +import org.apache.any23.writer.BenchmarkTripleHandler;
          +import org.apache.any23.writer.NTriplesWriter;
          +import org.apache.any23.writer.TripleHandler;
          +import org.apache.any23.writer.TripleHandlerException;
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.nutch.metadata.Metadata;
          +import org.apache.nutch.parse.*;
          +import org.apache.nutch.protocol.Content;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +import org.w3c.dom.DocumentFragment;
          +
          +/**
          + * <p>This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
          + * uses the <a href="http://any23.apache.org">Apache Any23</a> library
          + * for parsing and extracting structured data in RDF format from a
          + * variety of Web documents. Currently it supports the following
          + * input formats:</p>
          + * <ol><li>RDF/XML, Turtle, Notation 3</li>
          + * <li>RDFa with RDFa1.1 prefix mechanism</li>
          + * <li>Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview,
          + * License, XFN and Species</li>
          + * <li>HTML5 Microdata: (such as Schema.org)</li>
          + * <li>CSV: Comma Separated Values with separator autodetection.</li></ol>.
          + * <p>In this implementation triples are written as Notation3 e.g.
          + * <code><http://www.bbc.co.uk/news/scotland/> <http://iptc.org/std/rNews/2011-10-07#datePublished> "2014/03/31 13:53:03"@en-gb .</code>
          + * and triples are identified within output triple streams by the presence of '\n'.
          + * The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
          + * In order to use another/other writers implementing the
          + * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a>
          + * interface, we will most likely need to identify an alternative data characteristic
          + * which we can use to split triples streams.</p>
          + * <p>
          + *
          + */
          +public class Any23ParseFilter implements HtmlParseFilter {
          +
          + /** Logging instance */
          + public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class);
          +
          + private Configuration conf = null;
          +
          + /** Constant identifier used as a Key for writing and reading
          + * triples to and from the metadata Map field.
          + */
          + public final static String ANY23_TRIPLES = "Any23-Triples";
          +
          + private static class Any23Parser {
          +
          + Set<String> triples = null;
          +
          + Any23Parser(String url, String htmlContent) throws TripleHandlerException {
          + triples = new TreeSet<String>();
          + try {
          + parse(url, htmlContent);
          + } catch (URISyntaxException e) {
          + throw new RuntimeException(e.getReason());
          + } catch (IOException e) {
          + e.printStackTrace();
          + }
          + }
          +
          + /**
          + * Maintains a {@link java.util.Set} containing the triples
          + * @return a {@link java.util.Set} of triples.
          + */
          + public Set<String> getTriples() {
          + return triples;
          + }
          +
          + private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
          + Any23 any23 = new Any23();
          + ByteArrayOutputStream baos = new ByteArrayOutputStream();
          + TripleHandler tHandler = new NTriplesWriter(baos);
          + BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler);
          + try {
          + any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler);
          + } catch (Exception e) {
          + e.printStackTrace();
          + } finally {
          + tHandler.close();
          + bHandler.close();
          + }
          + //This merely prints out a report of the Any23 extraction.
          + LOG.info("Any23 report: " + bHandler.report());
          + String n3 = baos.toString("UTF-8");
          + // we split the triples stream by the occurrence of
          + // '\n' as this is a distinguishing feature of NTriples
          + // output serialization in Any23.
          + String[] triplesStrings = n3.split("\n");
          + Collections.addAll(triples, triplesStrings);
          + }
          + }
          +
          + /**
          + * @see org.apache.hadoop.conf.Configurable#getConf()
          + */
          + @Override
          + public Configuration getConf() {
          + return this.conf;
          + }
          +
          + /**
          + * @see org.apache.hadoop.conf.Configurable#setConf(org.apache.hadoop.conf.Configuration)
          + */
          + @Override
          + public void setConf(Configuration conf) {
          + this.conf = conf;
          + }
          +
          + /**
          + * @see org.apache.nutch.parse.HtmlParseFilter#filter(Content, ParseResult, HTMLMetaTags, DocumentFragment)
          + */
          + @Override
          + public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
          +
          + Any23Parser parser = null;
          + try {
          + String htmlContent = new String(content.getContent(), Charset.forName("UTF-8"));
          + parser = new Any23Parser(content.getUrl(), htmlContent);
          + } catch (TripleHandlerException e) {
          + throw new RuntimeException("Error running Any23 parser: " + e.getMessage());
          + }
          + Set<String> triples = parser.getTriples();
          + // can't store multiple values in page metadata -> separate by tabs
          + StringBuilder sb = new StringBuilder();
          +
          + Parse parse = parseResult.get(content.getUrl());
          + Metadata metadata = parse.getData().getParseMeta();
          +
          + for (String triple : triples) {
          + sb.append(triple);

          Review comment:
          Previously we discussed and agreed that this was not an optimal solution for associating triples with the Metadata. I still agree with that. We need to think of a more efficient manner for persisting triples.
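          One possible direction, sketched under assumptions: store one value per triple under the `Any23-Triples` key rather than joining everything into a single tab-separated string. The `MultiValuedTriples` class below is a hypothetical, stdlib-only stand-in for a multi-valued metadata container. It is not the Nutch `Metadata` class, and whether the real container can hold multiple values per key should be checked against the target Nutch branch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedTriples {

  /** Same key as in the patch under review. */
  public static final String ANY23_TRIPLES = "Any23-Triples";

  /** Hypothetical stand-in for a multi-valued metadata map. */
  private final Map<String, List<String>> values = new LinkedHashMap<>();

  /** Appends a value under the given name, keeping earlier values. */
  public void add(String name, String value) {
    values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Returns a copy of all values stored under the name (possibly empty). */
  public List<String> getValues(String name) {
    List<String> stored = values.get(name);
    return stored == null ? new ArrayList<>() : new ArrayList<>(stored);
  }

  /**
   * Splits serialized N-Triples output on line breaks and stores each
   * non-empty line as its own value, so no join/split round trip is needed
   * and an empty extraction yields no values at all.
   */
  public static MultiValuedTriples fromNTriples(String nTriples) {
    MultiValuedTriples meta = new MultiValuedTriples();
    for (String line : nTriples.split("\\r?\\n")) {
      String triple = line.trim();
      if (!triple.isEmpty()) {
        meta.add(ANY23_TRIPLES, triple);
      }
    }
    return meta;
  }
}
```

          A consumer such as an indexing filter would then iterate over `getValues(ANY23_TRIPLES)` directly. Skipping blank lines also avoids storing an empty-string "triple" when the extractor produces no output.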

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          Show
          githubbot ASF GitHub Bot added a comment - lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x URL: https://github.com/apache/nutch/pull/205#discussion_r134292923 ########## File path: src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java ########## @@ -0,0 +1,165 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.nutch.any23; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.net.URISyntaxException; +import java.nio.charset.Charset; +import java.util.*; + +import org.apache.any23.Any23; +import org.apache.any23.writer.BenchmarkTripleHandler; +import org.apache.any23.writer.NTriplesWriter; +import org.apache.any23.writer.TripleHandler; +import org.apache.any23.writer.TripleHandlerException; +import org.apache.hadoop.conf.Configuration; +import org.apache.nutch.metadata.Metadata; +import org.apache.nutch.parse.*; +import org.apache.nutch.protocol.Content; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.w3c.dom.DocumentFragment; + +/** + * <p>This implementation of {@link org.apache.nutch.parse.HtmlParseFilter} + * uses the <a href="http://any23.apache.org">Apache Any23</a> library + * for parsing and extracting structured data in RDF format from a + * variety of Web documents. Currently it supports the following + * input formats:</p> + * <ol><li>RDF/XML, Turtle, Notation 3</li> + * <li>RDFa with RDFa1.1 prefix mechanism</li> + * <li>Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview, + * License, XFN and Species</li> + * <li>HTML5 Microdata: (such as Schema.org)</li> + * <li>CSV: Comma Separated Values with separator autodetection.</li></ol>. + * <p>In this implementation triples are written as Notation3 e.g. + * <code>< http://www.bbc.co.uk/news/scotland/ > < http://iptc.org/std/rNews/2011-10-07#datePublished > "2014/03/31 13:53:03"@en-gb .</code> + * and triples are identified within output triple streams by the presence of '\n'. + * The presence of the '\n' is a characteristic specific to N3 serialization in Any23. 
+ * In order to use another/other writers implementing the + * <a href="http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TripleHandler.html">TripleHandler</a> + * interface, we will most likely need to identify an alternative data characteristic + * which we can use to split triples streams.</p> + * <p> + * + */ +public class Any23ParseFilter implements HtmlParseFilter { + + /** Logging instance */ + public static final Logger LOG = LoggerFactory.getLogger(Any23ParseFilter.class); + + private Configuration conf = null; + + /** Constant identifier used as a Key for writing and reading + * triples to and from the metadata Map field. + */ + public final static String ANY23_TRIPLES = "Any23-Triples"; + + private static class Any23Parser { + + Set<String> triples = null; + + Any23Parser(String url, String htmlContent) throws TripleHandlerException { + triples = new TreeSet<String>(); + try { + parse(url, htmlContent); + } catch (URISyntaxException e) { + throw new RuntimeException(e.getReason()); + } catch (IOException e) { + e.printStackTrace(); + } + } + + /** + * Maintains a {@link java.util.Set} containing the triples + * @return a {@link java.util.Set} of triples. + */ + public Set<String> getTriples() { + return triples; + } + + private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException { + Any23 any23 = new Any23(); + ByteArrayOutputStream baos = new ByteArrayOutputStream(); + TripleHandler tHandler = new NTriplesWriter(baos); + BenchmarkTripleHandler bHandler = new BenchmarkTripleHandler(tHandler); + try { + any23.extract(htmlContent, url, "text/HTML", "UTF-8", bHandler); + } catch (Exception e) {+ e.printStackTrace();+ } finally { + tHandler.close(); + bHandler.close(); + } + //This merely prints out a report of the Any23 extraction. 
+ LOG.info("Any23 report: " + bHandler.report()); + String n3 = baos.toString("UTF-8"); + // we split the triples stream by the occurrence of + // '\n' as this is a distinguishing feature of NTriples + // output serialization in Any23. + String[] triplesStrings = n3.split("\n"); + Collections.addAll(triples, triplesStrings); + } + } + + /** + * @see org.apache.hadoop.conf.Configurable#getConf() + */ + @Override + public Configuration getConf() { + return this.conf; + } + + /** + * @see org.apache.hadoop.conf.Configurable#setConf(org.apache.hadoop.conf.Configuration) + */ + @Override + public void setConf(Configuration conf) { + this.conf = conf; + } + + /** + * @see org.apache.nutch.parse.HtmlParseFilter#filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) + */ + @Override + public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { + + Any23Parser parser = null; + try { + String htmlContent = new String(content.getContent(), Charset.forName("UTF-8")); + parser = new Any23Parser(content.getUrl(), htmlContent); + } catch (TripleHandlerException e) { + throw new RuntimeException("Error running Any23 parser: " + e.getMessage()); + } + Set<String> triples = parser.getTriples(); + // can't store multiple values in page metadata -> separate by tabs + StringBuilder sb = new StringBuilder(); + + Parse parse = parseResult.get(content.getUrl()); + Metadata metadata = parse.getData().getParseMeta(); + + for (String triple : triples) { + sb.append(triple); Review comment: Previously we discussed and agreed that this was not an optional solution for associating triples with the Metadata. I still agree with that. We need to think of a more efficient manner for persisting triples. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: users@infra.apache.org
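The Javadoc in the patch above relies on the fact that the N-Triples output of Any23 puts one triple per line, so the stream can be split on '\n'. A minimal sketch of that step, independent of Any23 and Nutch: the class name `TripleSplitter` is hypothetical, and skipping blank lines and `#` comment lines is an assumption that the patch itself does not make.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the NUTCH-1129 patch.
final class TripleSplitter {
  private TripleSplitter() {}

  // Splits an N-Triples document into a sorted, de-duplicated set of lines.
  static Set<String> split(String nTriples) {
    Set<String> triples = new TreeSet<>();
    for (String line : nTriples.split("\n")) {
      String trimmed = line.trim();
      // Assumption: blank lines and '#' comment lines carry no triple.
      if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
        triples.add(trimmed);
      }
    }
    return triples;
  }
}
```

A serialization that spreads one statement over several lines (Turtle, for example) would need a different splitting rule, which is the limitation the Javadoc already points out.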
          githubbot ASF GitHub Bot added a comment -

          lewismc commented on a change in pull request #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#discussion_r134293186

          ##########
          File path: src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
          ##########
          @@ -0,0 +1,165 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +package org.apache.nutch.any23;
          +
          +import java.io.ByteArrayOutputStream;
          +import java.io.IOException;
          +import java.net.URISyntaxException;
          +import java.nio.charset.Charset;
          +import java.util.*;
          +
          +import org.apache.any23.Any23;
          +import org.apache.any23.writer.BenchmarkTripleHandler;
          +import org.apache.any23.writer.NTriplesWriter;
          +import org.apache.any23.writer.TripleHandler;
          +import org.apache.any23.writer.TripleHandlerException;
          +import org.apache.hadoop.conf.Configuration;
          +import org.apache.nutch.metadata.Metadata;
          +import org.apache.nutch.parse.*;
          +import org.apache.nutch.protocol.Content;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +import org.w3c.dom.DocumentFragment;
          +
          +/**
          + * <p>This implementation of {@link org.apache.nutch.parse.HtmlParseFilter}
          + * uses the <a href="http://any23.apache.org">Apache Any23</a> library
          + * for parsing and extracting structured data in RDF format from a
          + * variety of Web documents. Currently it supports the following
          + * input formats:</p>

          Review comment:
          To be honest, the comment listing the supported formats is not really necessary. You can just link back to the any23.apache.org homepage, which lists them.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-326104881

          I tried building using the updated patch but got this:

          ```
          [ivy:resolve] WARN: ::::::::::::::::::::::::::::::::::::::::::::::
          [ivy:resolve] WARN: :: UNRESOLVED DEPENDENCIES ::
          [ivy:resolve] WARN: ::::::::::::::::::::::::::::::::::::::::::::::
          [ivy:resolve] WARN: :: org.apache.commons#commons-csv;1.0-SNAPSHOT-rev1148315: not found
          [ivy:resolve] WARN: ::::::::::::::::::::::::::::::::::::::::::::::
          ```

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-326107912

          @simoncpu this may be intermittent... please report back here if it does not resolve itself. I am aware that this SNAPSHOT dependency has given us problems in the past. We may need to push a fix somewhere in Any23 e.g. upgrade the commons-csv library.

          githubbot ASF GitHub Bot added a comment -

          simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-326307268

          @lewismc It still didn't work, so I just grabbed the jar file at: http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-326309710

          OK this is an issue. The solution is to address https://issues.apache.org/jira/browse/ANY23-264

          githubbot ASF GitHub Bot added a comment -

          simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-327062894

          @thilohaas I tested this on a website with Microdata, but it can't index anything...

          EDIT: The error is:
          `Error parsing: http://example.org/website-with-microdata: org.apache.nutch.parse.ParseException: Unable to successfully parse content`

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-327229664

          @thilohaas can you consider the comments above please?

          @simoncpu thank you for trying out the patch... please keep providing feedback. Did you manage to debug the source of the ParseException? The URL you provide is not actually available... have you tried it on anything else? An example would be https://www.w3.org
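One thing that may make the source of the ParseException easier to find: the patch quoted earlier in this thread rethrows with only `e.getMessage()` and calls `printStackTrace()`, so the original stack trace does not travel with the exception. A minimal sketch of wrapping with the cause preserved; the class and method names here are hypothetical and not part of the patch.

```java
// Hypothetical helper, not part of the NUTCH-1129 patch.
final class ParseFailures {
  private ParseFailures() {}

  // Wraps a failure and keeps the original exception as the cause,
  // so the full stack trace is still available when it is logged.
  static RuntimeException wrap(String url, Exception cause) {
    return new RuntimeException("Error running Any23 parser on " + url, cause);
  }
}
```

Inside a catch block this would be used as `throw ParseFailures.wrap(content.getUrl(), e);`, assuming `content` is the Nutch `Content` object already in scope in `filter(...)`.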

          githubbot ASF GitHub Bot added a comment -

          simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-327245880

          @lewismc Here's one of the URLs that I've tried:

          http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/

          BTW, the previous patch was able to parse the Microdata without problems.

          EDIT, here's the full output:
          ```
          Thread FetcherThread has no more work available
          Using queue mode : byHost
          -finishing thread FetcherThread, activeThreads=1
          Fetcher: throughput threshold: -1
          Thread FetcherThread has no more work available
          Fetcher: throughput threshold retries: 5
          -finishing thread FetcherThread, activeThreads=1
          fetcher.maxNum.threads can't be < than 50 : using 50 instead
          -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
          -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
          Thread FetcherThread has no more work available
          -finishing thread FetcherThread, activeThreads=0
          -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
          -activeThreads=0
          Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
          Parsing : 20170905172529
          /home/simoncpu/nutch/runtime/local/bin/nutch parse -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
          ParseSegment: starting at 2017-09-05 17:25:45
          ParseSegment: segment: crawl-dir/segments/20170905172529
          Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
          Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
          ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
          CrawlDB update
          /home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl-dir/crawldb crawl-dir/segments/20170905172529
          CrawlDb update: starting at 2017-09-05 17:25:53
          CrawlDb update: db: crawl-dir/crawldb
          CrawlDb update: segments: [crawl-dir/segments/20170905172529]
          CrawlDb update: additions allowed: true
          CrawlDb update: URL normalizing: false
          CrawlDb update: URL filtering: false
          CrawlDb update: 404 purging: false
          CrawlDb update: Merging segment data into db.
          CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
          Link inversion
          /home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb crawl-dir/segments/20170905172529
          LinkDb: starting at 2017-09-05 17:26:01
          LinkDb: linkdb: crawl-dir/linkdb
          LinkDb: URL normalize: true
          LinkDb: URL filter: true
          LinkDb: internal links will be ignored.
          LinkDb: adding segment: crawl-dir/segments/20170905172529
          LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
          Dedup on crawldb
          /home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
          DeduplicationJob: starting at 2017-09-05 17:26:07
          Deduplication: 0 documents marked as duplicates
          Deduplication: Updating status of duplicate urls into crawl db.
          Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
          Indexing 20170905172529 to index
          /home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb crawl-dir/linkdb crawl-dir/segments/20170905172529
          Segment dir is complete: crawl-dir/segments/20170905172529.
          Indexer: starting at 2017-09-05 17:26:17
          Indexer: deleting gone documents: false
          Indexer: URL filtering: false
          Indexer: URL normalizing: false
          Active IndexWriters :
          ElasticRestIndexWriter
          elastic.rest.host : hostname
          elastic.rest.port : port
          elastic.rest.index : elastic index command
          elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
          elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

          Indexer: number of documents indexed, deleted, or skipped:
          Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
          Cleaning up index if possible
          /home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
          Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations
          ```

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-327258936

          I get a parser error using the [Any23 Webservice](http://any23.org/any23/?format=best&uri=http%3A%2F%2Fmcdonalds.jobs%2Fsalt-lake-city-ut%2Fgeneral-manager%2F2947B6E7B04147FFBEE1445E66D7EA67%2Fjob%2F&validation-mode=validate-fix&report=on&annotate=on)


          githubbot ASF GitHub Bot added a comment -

          thilohaas commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
          URL: https://github.com/apache/nutch/pull/205#issuecomment-327262863

          Sadly, I'm too busy at the moment, but I will definitely look into it as soon as possible.
          Do you have an idea of how to pass an array or hash of strings to the filter (see my comment on the PR)? That would let me simplify the process and come up with an alternative way of storing triples on the documents.

          By the way, the Any23 web service seems to be broken: it fails on every website I've tried, including Google: http://any23.org/any23/?format=best&uri=https%3A%2F%2Fgoogle.com&validation-mode=none



            People

            • Assignee:
              lewismc Lewis John McGibbney
              Reporter:
              lewismc Lewis John McGibbney
            • Votes:
              2
              Watchers:
              4

              Dates

              • Created:
                Updated:

                Development