SOLR-3691

SimplePostTool: Mode for indexing a web page

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: scripts and tools
    • Labels: None

      Description

      The simple post.jar tool should both show some sample code and help users test Solr from the command line. What's missing is an easy way to index a web page.

      1. SOLR-3691.patch
        55 kB
        Jan Høydahl
      2. SOLR-3691.patch
        55 kB
        Jan Høydahl
      3. SOLR-3691.patch
        26 kB
        Jan Høydahl
      4. SOLR-3691.patch
        25 kB
        Jan Høydahl
      5. SOLR-3691.patch
        25 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        First patch. Implements a new mode -Ddata=web which fetches and posts a web page to Solr, and optionally pulls out links from it (using SolrCell extractOnly=true) and crawls to N levels.

        This patch also implements recursion level support for files, plus an optional delay.

        This is not - as with post.jar in general - intended as a production feature, but as a nice way for newbies to test posting web pages to Solr without an external crawler, to increase the OOTB experience™
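For readers who want to picture "crawls to N levels", here is a minimal sketch of a level-by-level crawl with a depth limit. It is not taken from the patch: the class name `DepthLimitedCrawl`, the method `crawl` and the `linksOf` callback (standing in for "fetch the page, post it to Solr, return its links") are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a depth-limited crawl; not the code from the patch.
class DepthLimitedCrawl {

  /**
   * Visits start, then follows links level by level, up to maxDepth levels
   * below the start page. Returns the pages in the order they were first seen.
   */
  static List<String> crawl(String start, int maxDepth,
                            Function<String, List<String>> linksOf) {
    Set<String> visited = new LinkedHashSet<>();
    visited.add(start);
    List<String> level = new ArrayList<>();
    level.add(start);

    for (int depth = 0; !level.isEmpty(); depth++) {
      List<String> next = new ArrayList<>();
      for (String page : level) {
        List<String> links = linksOf.apply(page);  // fetch + post would happen here
        if (depth == maxDepth) {
          continue;                                // at the limit: do not follow links
        }
        for (String link : links) {
          if (visited.add(link)) {                 // add() returns false for seen URLs
            next.add(link);
          }
        }
      }
      level = next;                                // next.size() is "new links per level"
    }
    return new ArrayList<>(visited);
  }
}
```

The visited set is what keeps link cycles from looping forever; the depth limit only bounds how far from the start page the crawl goes.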

        Jan Høydahl added a comment -

        New patch:

        • Adds URL to literal.url
        • Prints number of docs in each folder for recursive
        • Enforced max 10 levels depth for web to shortcut any loops, to avoid stupid users pissing off webmasters
        • Added User-Agent string to the GET requests

        I think this is about what's needed in a simplistic example - and it is quite useful for quick prototyping too

        Any general feedback?
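As an illustration of the first and last items in that list, here is a small sketch of passing the page URL as a `literal.url` request parameter and setting a User-Agent header using only JDK classes. The names `LiteralUrlParam`, `withLiteralUrl`, `applyUserAgent` and the User-Agent value are hypothetical; the actual string used by the patch is not shown in this issue.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLConnection;
import java.net.URLEncoder;

// Hypothetical helper; names and the User-Agent value are assumptions.
class LiteralUrlParam {
  static final String USER_AGENT = "ExamplePostTool/0.1";

  /** Appends literal.url=<encoded page URL> to a Solr update URL. */
  static String withLiteralUrl(String solrUrl, String pageUrl) {
    try {
      String sep = solrUrl.contains("?") ? "&" : "?";
      return solrUrl + sep + "literal.url=" + URLEncoder.encode(pageUrl, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);  // UTF-8 is always supported by the JDK
    }
  }

  /** Sets the User-Agent header on a connection before it is opened. */
  static void applyUserAgent(URLConnection conn) {
    conn.setRequestProperty("User-Agent", USER_AGENT);
  }
}
```

The value has to be URL-encoded because the page URL itself contains characters such as ":" and "/" that would otherwise be misread as part of the Solr request URL.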

        Jan Høydahl added a comment -

        New patch:

        • Fetches pages with GZIP/deflate
        • Warns if user uses delay < 10s
        • Prints how many new links per level
        • Normalizes URLs by stripping everything after "#"
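Two of the items above are easy to sketch with plain JDK classes: stripping the "#" fragment from a URL, and unwrapping a compressed response body based on its Content-Encoding header. This is an illustration under assumptions, not the patch code; `FetchHelpers`, `stripFragment` and `decode` are hypothetical names. A client would typically also send `Accept-Encoding: gzip, deflate` on the request.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helpers; not the code from the patch.
class FetchHelpers {

  /** Drops "#" and everything after it, so a.html#top and a.html compare equal. */
  static String stripFragment(String url) {
    int hash = url.indexOf('#');
    return hash < 0 ? url : url.substring(0, hash);
  }

  /** Wraps the response body according to the Content-Encoding header value. */
  static InputStream decode(InputStream body, String contentEncoding) throws IOException {
    if (contentEncoding == null) {
      return body;                              // header absent: body is not compressed
    }
    String enc = contentEncoding.trim().toLowerCase(Locale.ROOT);
    if (enc.equals("gzip")) {
      return new GZIPInputStream(body);
    }
    if (enc.equals("deflate")) {
      return new InflaterInputStream(body);     // assumes zlib-wrapped deflate data
    }
    return body;                                // unknown encoding: pass through as-is
  }
}
```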
        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Lance Norskog added a comment -

        robots.txt. I would not commit this without honoring robots.txt.

        Jan Høydahl added a comment -

        New patch. This totally reorganizes the code to make it testable and adds a bunch of unit tests.

        Also added basic robots.txt support, so that we don't offend anyone.

        Lance Norskog, can you take it for a test ride?
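To give an idea of what "basic robots.txt support" can look like with no external dependencies, here is a rough sketch that collects the Disallow prefixes from the "User-agent: *" group and checks a path against them. It is a deliberate simplification and not the patch code: `BasicRobots` and its methods are hypothetical, Allow rules and wildcards are ignored, and several consecutive User-agent lines sharing one group are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt handling; not the code from the patch.
class BasicRobots {

  /** Collects Disallow path prefixes that apply to "User-agent: *". */
  static List<String> disallowedFor(String robotsTxt) {
    List<String> rules = new ArrayList<>();
    boolean inStarGroup = false;
    for (String raw : robotsTxt.split("\\r?\\n")) {
      int hash = raw.indexOf('#');                       // strip trailing comments
      String line = (hash < 0 ? raw : raw.substring(0, hash)).trim();
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue;                                        // blank or malformed line
      }
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        inStarGroup = value.equals("*");
      } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
        rules.add(value);                                // empty Disallow means "allow all"
      }
    }
    return rules;
  }

  /** A path is allowed unless it starts with one of the disallowed prefixes. */
  static boolean isAllowed(String path, List<String> disallowed) {
    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) {
        return false;
      }
    }
    return true;
  }
}
```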

        Jan Høydahl added a comment -

        Here's the new help screen including "web" mode, "depth" and "delay" support:

        SimplePostTool version 1.5
        Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
        
        Supported System Properties and their defaults:
          -Ddata=files|web|args|stdin (default=files)
          -Dtype=<content-type> (default=application/xml)
          -Durl=<solr-update-url> (default=http://localhost:8983/solr/update)
          -Dauto=yes|no (default=no)
          -Drecursive=yes|no|<depth> (default=0)
          -Ddelay=<seconds> (default=0 for files, 10 for web)
          -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
          -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
          -Dcommit=yes|no (default=yes)
          -Doptimize=yes|no (default=no)
          -Dout=yes|no (default=no)
        
        This is a simple command line tool for POSTing raw data to a Solr
        port.  Data can be read from files specified as commandline args,
        URLs specified as args, as raw commandline arg strings or via STDIN.
        Examples:
          java -jar post.jar *.xml
          java -Ddata=args  -jar post.jar '<delete><id>42</id></delete>'
          java -Ddata=stdin -jar post.jar < hd.xml
          java -Ddata=web -jar post.jar http://example.com/
          java -Dtype=text/csv -jar post.jar *.csv
          java -Dtype=application/json -jar post.jar *.json
          java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=a -Dtype=application/pdf -jar post.jar a.pdf
          java -Dauto -jar post.jar *
          java -Dauto -Drecursive -jar post.jar afolder
          java -Dauto -Dfiletypes=ppt,html -jar post.jar afolder
        The options controlled by System Properties include the Solr
        URL to POST to, the Content-Type of the data, whether a commit
        or optimize should be executed, and whether the response should
        be written to STDOUT. If auto=yes the tool will try to set type
        and url automatically from file name. When posting rich documents
        the file name will be propagated as "resource.name" and also used
        as "literal.id". You may override these or any other request parameter
        through the -Dparams property. To do a commit only, use "-" as argument.
        The web mode is a simple crawler following links within domain, default delay=10s.
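
Since all options are passed as -D system properties, a tool like this can read them with the standard `Properties` API and fall back to the documented defaults. The sketch below only covers `data` and `delay`, and the class `PostOptions` is a hypothetical illustration rather than the tool's real option handling.

```java
import java.util.Properties;

// Hypothetical option holder; mirrors the defaults listed in the help text above.
class PostOptions {
  final String mode;
  final int delaySeconds;

  PostOptions(Properties props) {
    mode = props.getProperty("data", "files");            // -Ddata, default=files
    String defaultDelay = mode.equals("web") ? "10" : "0"; // 0 for files, 10 for web
    delaySeconds = Integer.parseInt(props.getProperty("delay", defaultDelay));
  }

  /** True when a web crawl is configured with less than the default 10 s delay. */
  boolean delayTooShortForWeb() {
    return mode.equals("web") && delaySeconds < 10;
  }
}
```

In a real main() the options would come from `new PostOptions(System.getProperties())`, which is where the JVM stores every -Dkey=value given on the command line.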
        
        Erik Hatcher added a comment -

        Jan - this is great stuff!

        Maybe this deserves a rename of *Simple*PostTool to just PostTool now that it's not so simple any more?

        Jan Høydahl added a comment -

        Maybe this deserves a rename of *Simple*PostTool to just PostTool now that it's not so simple any more?

        Sure, I know it's more code, but I hope the logic is actually simpler to follow now than before, since it's better structured. Besides, we only use standard SDK functions, so it is still self-contained without extra deps, which is a major part of the Simple name. Also, since much of the logic has moved out of main() and into the class, it is easier for folks to use it from their own code should they wish.

        Jan Høydahl added a comment -

        Last update:

        • Fixed typo in usage
        • Fixed ArrayIndexOutOfBounds when robots.txt contains only a # on one line
        • No longer prints redirect warnings for every page on a site, just the first
        • No longer throws exception when robots.txt does not exist for a domain

        I'll commit this to trunk and we can iterate from there.
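For context on the robots.txt fix in that list: one plausible way to hit an ArrayIndexOutOfBounds on a line that is just "#" is stripping comments with `split("#")[0]`, because Java's `String.split` drops trailing empty strings and returns an empty array for that input. Whether this was the actual cause in the patch is an assumption; the class and method names below are hypothetical.

```java
// Hypothetical illustration of the pitfall; not the code from the patch.
class RobotsComment {

  /** Fragile: "#".split("#") is an empty array, so [0] throws for a bare "#" line. */
  static String stripCommentFragile(String line) {
    return line.split("#")[0].trim();
  }

  /** Safe: uses indexOf, so a bare "#" line simply becomes the empty string. */
  static String stripComment(String line) {
    int hash = line.indexOf('#');
    return (hash < 0 ? line : line.substring(0, hash)).trim();
  }
}
```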

        Jan Høydahl added a comment -

        Committed to trunk in r1374497

        Will backport to 4.x soon

        Jan Høydahl added a comment -

        Fixed javadocs-lint errors in r1374549

        Jan Høydahl added a comment -

        Committed to branch_4x in r1383190

        Commit Tag Bot added a comment -

        [branch_4x commit] Jan Høydahl
        http://svn.apache.org/viewvc?view=revision&revision=1383190

        SOLR-3691: SimplePostTool: Mode for indexing a web page (merge from trunk)

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Jan Høydahl
            Reporter:
            Jan Høydahl
          • Votes:
            0
            Watchers:
            3
