Nutch / NUTCH-2231

Jexl support in generator job

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: None
    • Labels: None

    Description

      The generator should support Jexl expressions. This would make it much easier to implement focused crawlers that rely on information stored in the CrawlDB. With the HostDB it is already possible to restrict the generator to select only interesting records, but that approach is very cumbersome and involves domain-blacklist URL filtering.

      With Jexl support, it is no hassle!

      Crawl only English records:

      bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
      

      Crawl only HTML records:

      bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
      

      Keep in mind:

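      • Jexl does not allow a hyphen/minus in field identifiers, so hyphens in field names are transformed to underscores (e.g. Content-Type becomes Content_Type)
      • String literals must be in quotes; only the surrounding quote character needs to be escaped with a backslash when it occurs inside the literal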
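
The two rules above can be illustrated with a small shell sketch. It is not part of the patch: `jexl_field` is a hypothetical helper that only mimics the hyphen-to-underscore rewrite, and the `title` field in the second expression is an assumed metadata key used purely to show quote escaping. The commands are printed rather than executed, so the quoting that reaches the generator is visible.

```shell
#!/bin/sh
# Sketch only: jexl_field is a hypothetical helper, not part of Nutch.
# It rewrites a metadata key into the identifier form described above,
# replacing every hyphen with an underscore.
jexl_field() {
  printf '%s' "$1" | sed 's/-/_/g'
}

# Build the HTML-only expression starting from the original metadata key.
field=$(jexl_field "Content-Type")
html_expr="(${field} == 'text/html' || ${field} == 'application/xhtml+xml')"

# A string literal that contains its own surrounding quote character.
# Inside shell double quotes, \' is passed through unchanged, so the
# expression receives the backslash-escaped quote.
# "title" is an assumed field name used only for illustration.
quote_expr="(title == 'Don\'t panic')"

# Print the commands instead of running them.
printf 'bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "%s"\n' "$html_expr"
printf 'bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "%s"\n' "$quote_expr"
```

Wrapping the whole expression in shell double quotes and using single quotes for the Jexl literals, as the examples in this description do, keeps shell quoting and Jexl quoting from colliding.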

      Attachments

        1. NUTCH-2231.patch
          13 kB
          Markus Jelsma
        2. NUTCH-2231.patch
          12 kB
          Markus Jelsma

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Markus Jelsma (markus17)
              Votes: 0
              Watchers: 3
