Nutch / NUTCH-427

protocol-smb: protocol plugin implementing the CIFS/SMB protocol. This plugin allows Nutch to crawl Microsoft Windows shares remotely using a CIFS/SMB protocol implementation.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.8.1, 0.9.0, 1.0.0
    • Fix Version/s: 1.7, 2.2
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      JAVA - OS independent

    • Patch Info:
      Patch Available

      Description

      Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
      Author: Armel T. Nene
      Update: Vadim Bauer
      Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r <AT> g m x . d e

      A. Introduction

      The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements
      the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the
      behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCIFS library and
      supports all of that library's properties.
      You can find more information at: http://jcifs.samba.org/
      The SMB URL syntax for crawling is as follows: smb://xxxxx (e.g. smb://server/share).
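To make the smb://server/share syntax concrete, here is a minimal sketch that splits such a URL into its server, share, and remaining path using only java.net.URI. The class name SmbUrlParts and the example URL are hypothetical and are not part of the plugin; the plugin itself relies on JCIFS for URL handling.

```java
import java.net.URI;

/** Hypothetical helper: splits an smb:// URL into server, share, and path. */
final class SmbUrlParts {
    final String server;
    final String share;
    final String path;

    private SmbUrlParts(String server, String share, String path) {
        this.server = server;
        this.share = share;
        this.path = path;
    }

    static SmbUrlParts parse(String url) {
        URI uri = URI.create(url);
        if (!"smb".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not an smb URL: " + url);
        }
        // getHost() returns null when the authority is not a valid host name.
        String server = uri.getHost();
        if (server == null) {
            throw new IllegalArgumentException("missing or invalid server: " + url);
        }
        String rawPath = uri.getPath() == null ? "" : uri.getPath();
        // Drop the leading slash; the first segment is the share name.
        String trimmed = rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;
        int slash = trimmed.indexOf('/');
        String share = slash < 0 ? trimmed : trimmed.substring(0, slash);
        String path = slash < 0 ? "" : trimmed.substring(slash + 1);
        return new SmbUrlParts(server, share, path);
    }

    public static void main(String[] args) {
        SmbUrlParts p = parse("smb://server/share/docs/report.txt");
        System.out.println(p.server + " | " + p.share + " | " + p.path);
    }
}
```

Seed URLs for a crawl would normally point at a share or a directory inside it, for example smb://server/share/.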

      B. Installation

      1) Binaries only: The protocol-smb files can be found in the ../plugins directory.
         Copy the "protocol-smb" folder to the NUTCHHOME/build/plugins directory.
         Put the "smb.properties" file in the NUTCHHOME/conf directory.
         Configure the properties in the "smb.properties" file.
         Enable the plugin by updating the "nutch-site.xml" file found in the NUTCHHOME/conf directory, e.g.:
         <property>
           <name>plugin.includes</name>
           <value>protocol-smb| other plugins...</value>
           <description>
           </description>
         </property>

      2) Source code: The protocol-smb sources can be found in the ../src directory.
         Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
         Copy the "protocol-smb" folder to NUTCHHOME/src/plugin.
         Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
         Update the NUTCHHOME/default.properties file to include the plugin.
         Run ant to build.
         Copy the "smb.properties" file to NUTCHHOME/conf, and configure the properties.
         Enable the plugin by updating the nutch-site.xml file.
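Both installation routes end by adding protocol-smb to plugin.includes, whose value is a regular expression over plugin names. The sketch below shows how such a value could be checked against individual plugin names. It assumes whole-name matching, and the other plugin names in the pattern are illustrative placeholders only; PluginIncludesCheck is a hypothetical class, not Nutch code.

```java
import java.util.regex.Pattern;

/** Hypothetical check of plugin names against a plugin.includes value. */
final class PluginIncludesCheck {

    /** Returns true if the whole plugin name matches the includes regex. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative value; replace the other alternatives with your own plugin list.
        String includes = "protocol-smb|urlfilter-regex|parse-(text|html)|index-basic";
        for (String id : new String[] {"protocol-smb", "protocol-http", "parse-html"}) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

Because the alternatives are separated by "|", stray spaces around a plugin name become part of the pattern, so it is safer to write the value without spaces.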

      C. Known Issues

      1) MalformedURLException: unknown protocol: smb

      The SMB URL protocol handler has not been installed successfully.
      In short, the JCIFS jar must be loaded by the system class loader.

      Workaround: a) A short-term solution is to install the JCIFS jar
      found in the protocol-smb folder into
      JDKHOME/jre/lib/ext and/or JREHOME/lib/ext.

      b) If the exception is still thrown after completing step a),
      set the system property by passing the following argument
      to the JVM:

      -Djava.protocol.handler.pkgs=jcifs

      c) You can also set the property in your code. For example, if
      you start crawling with org.apache.nutch.crawl.Crawl,
      add the following two lines. This has the same effect as b).
      public static void main(String args[]) throws Exception {
        // Register the jcifs package prefix so that smb:// URLs can be resolved.
        System.setProperty("java.protocol.handler.pkgs", "jcifs");
        new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
        // and so on
      }

      You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
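The failure in issue 1 can be reproduced without JCIFS. The sketch below first parses an smb:// URL with no handler registered, which is expected to fail on a stock JVM, and then parses the same URL by passing a handler explicitly to the java.net.URL constructor. StubSmbHandler is a hypothetical stand-in for the handler class shipped with JCIFS and does not speak SMB. Note also that the lib/ext directory used in workaround a) was removed in Java 9, so on newer JVMs only b), c), or an explicit handler remain.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

final class SmbHandlerDemo {

    /** Hypothetical stand-in for a real SMB handler; it cannot open connections. */
    static final class StubSmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: SMB is not implemented here");
        }
    }

    /** Parses the URL using only the handlers known to the JVM. */
    static String tryParse(String spec) {
        try {
            return "ok: " + new URL(spec).getProtocol();
        } catch (MalformedURLException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** Parses the URL with an explicitly supplied handler, bypassing the lookup. */
    static URL parseWithHandler(String spec) {
        try {
            return new URL(null, spec, new StubSmbHandler());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("smb://server/share/"));
        URL url = parseWithHandler("smb://server/share/");
        System.out.println(url.getProtocol() + " host=" + url.getHost());
    }
}
```

Setting java.protocol.handler.pkgs, as in b) and c), achieves the same thing globally: the JVM then looks for a handler class under the listed package prefix whenever it meets an unknown protocol.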

      2) FATAL smb.SMB - Could not read content of protocol: smb://xxxxxx

      This problem usually occurs if the following properties are not set correctly in
      the "smb.properties" file:

      • username
      • password
      • domain

      Refer to the following resources for the list of
      available properties and how to set them:

      http://jcifs.samba.org/src/docs/api/overview-summary.html#scp
      You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html

      N.B. All properties should be set in the "smb.properties" file. You can set
      all supported JCIFS properties in that file.

      3) Only tested on Windows XP and Windows Server 2003. Please report
      test results on other operating systems.
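For issue 2, a quick sanity check of smb.properties before starting a crawl can save a failed run. The following sketch reports which of the three properties are missing or empty. It assumes the keys are literally named username, password, and domain, as listed above; the exact key names expected by the plugin should be confirmed against the shipped smb.properties file. SmbPropertiesCheck is a hypothetical helper, not part of the plugin.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check for the credentials in smb.properties. */
final class SmbPropertiesCheck {
    // Assumed key names, taken from the list in the known issue above.
    static final String[] REQUIRED = {"username", "password", "domain"};

    /** Returns the required keys that are absent or blank in the given text. */
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Example content with an empty password and no domain.
        String text = "username=crawler\npassword=\n";
        System.out.println("missing: " + missingKeys(text));
    }
}
```

In practice the text would be read from NUTCHHOME/conf/smb.properties rather than from an inline string.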

      1. protocol-smb.zip
        649 kB
        Vadimo
      2. protocol-smb.zip
        636 kB
        Armel Nene
      3. protocol-smb-diff.txt
        16 kB
        Ilguiz Latypov
      4. protocol-smb-dist.zip
        737 kB
        Ilguiz Latypov

        Activity

        Tejas Patil made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Won't Fix [ 2 ]
        Lewis John McGibbney made changes -
        Fix Version/s 1.7 [ 12323281 ]
        Fix Version/s 2.2 [ 12323285 ]
        Patch Info Patch Available [ 10042 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-dist.zip [ 12442365 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393556 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-diff.txt [ 12393557 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393556 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-diff.txt [ 12393551 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393552 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393552 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-diff.txt [ 12393551 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393546 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-diff.txt [ 12393547 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb-diff.txt [ 12393547 ]
        Ilguiz Latypov made changes -
        Attachment protocol-smb.zip [ 12393546 ]
        Andrzej Bialecki made changes -
        Priority Major [ 3 ] Minor [ 4 ]
        Vadimo made changes -
        Affects Version/s 0.9.0 [ 12312013 ]
        Affects Version/s 1.0.0 [ 12312443 ]
        Description
        Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
        Author: Armel T. Nene
        Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com

        A. Introduction

            The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements
            the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the
            behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCIFS library and
            supports all of that library's properties.
            You can find more information at: http://jcifs.samba.org/
            The SMB URL syntax is as follows: smb://xxxxx (e.g. smb://server/share).

        B. Installation

            1) Binaries only: Copy the "protocol-smb" folder to the NUTCHHOME/build/plugins directory.
                                Put the "smb.properties" file in the NUTCHHOME/conf directory.
                                Configure the properties in the "smb.properties" file.
                                Enable the plugin by updating the "nutch-site.xml" file found in the NUTCHHOME/conf directory.

            2) Source code: Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
                                Copy the "protocol-smb" folder to NUTCHHOME/src/plugin.
                                Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
                                Update the NUTCHHOME/default.properties file to include the plugin.
                                Run ant to build.
                                Copy the "smb.properties" file to NUTCHHOME/conf, and configure the properties.
                                Enable the plugin by updating the nutch-site.xml file.

        C. Known Issues

            1) MalformedURLException: unknown protocol: smb

               The SMB URL protocol handler has not been installed successfully.
               In short, the JCIFS jar must be loaded by the system class loader.

               Workaround: a) A short-term solution is to install the JCIFS jar
                              found in the protocol-smb folder into
                              JDKHOME/jre/lib/ext and/or JREHOME/lib/ext.

                           b) If the exception is still thrown after completing step a),
                              set the system property by passing the following argument
                              to the JVM:

                              -Djava.protocol.handler.pkgs=jcifs

               You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html

            2) FATAL smb.SMB - Could not read content of protocol: smb://xxxxxx

               This problem usually occurs if the following properties are not set correctly in
               the "smb.properties" file:

               - username
               - password
               - domain

               Refer to the following resources for the list of
               available properties and how to set them:

               http://jcifs.samba.org/src/docs/api/overview-summary.html#scp
               You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html

               N.B. All properties should be set in the "smb.properties" file. You can set
                    all supported JCIFS properties in that file.

            3) Only tested on Windows XP and Windows Server 2003. Please report
               test results on other operating systems. It should also run on any other OS without any change.
        New value: same text as the Description section above.
        Vadimo made changes -
        Attachment protocol-smb.zip [ 12358278 ]
        Andrzej Bialecki made changes -
        Priority Critical [ 2 ] Major [ 3 ]
        Armel Nene made changes -
        Field Original Value New Value
        Attachment protocol-smb.zip [ 12348364 ]
        Armel Nene created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Armel Nene
          • Votes:
            0
            Watchers:
            5
