Key: NUTCH-427
Type: New Feature
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Armel Nene
Votes: 0
Watchers: 4
Nutch

protocol-smb: protocol plugin implementing the CIFS/SMB protocol. This plugin allows Nutch to crawl Microsoft Windows shares remotely over CIFS/SMB.

Created: 05/Jan/07 02:42 PM   Updated: 08/Nov/08 04:50 AM
Component/s: fetcher
Affects Version/s: 0.8.1, 0.9.0, 1.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  protocol-smb-diff.txt (Text File, licensed for inclusion in ASF works), 2008-11-08 04:49 AM, Ilguiz Latypov, 16 kB
  protocol-smb.zip (Zip Archive, licensed for inclusion in ASF works), 2008-11-08 04:49 AM, Ilguiz Latypov, 649 kB
  protocol-smb.zip (Zip Archive), 2007-05-25 09:04 PM, Vadim Bauer, 649 kB
  protocol-smb.zip (Zip Archive, licensed for inclusion in ASF works), 2007-01-05 03:11 PM, Armel Nene, 636 kB
Environment: JAVA - OS independent


Description
Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
Author: Armel T. Nene
Update: Vadim Bauer
Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r <AT> g m x . d e

A. Introduction

The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements
the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the
behaviour of protocol-file over CIFS/SMB. It uses the jCIFS library and
supports all of that library's properties.
You can find more information at http://jcifs.samba.org/
The SMB URL syntax for crawling is smb://xxxxx (i.e. smb://server/share).
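
To make the smb://server/share form concrete, here is a minimal sketch that splits such a URL into its parts using only java.net.URI. The class name SmbUrlParts and the file path in the example are hypothetical and are not part of the plugin; the plugin itself relies on jCIFS for URL handling.

```java
import java.net.URI;

/** Illustrative helper (not part of the plugin): splits an smb:// URL into its parts. */
final class SmbUrlParts {
    final String server;
    final String share;
    final String path; // path inside the share, "" when the URL names the share itself

    private SmbUrlParts(String server, String share, String path) {
        this.server = server;
        this.share = share;
        this.path = path;
    }

    static SmbUrlParts parse(String url) {
        URI uri = URI.create(url);
        if (!"smb".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not an smb URL: " + url);
        }
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("missing server: " + url);
        }
        String raw = uri.getPath() == null ? "" : uri.getPath();
        // Drop the leading slash, then split "share/rest/of/path" at the first slash.
        String trimmed = raw.startsWith("/") ? raw.substring(1) : raw;
        int slash = trimmed.indexOf('/');
        String share = slash < 0 ? trimmed : trimmed.substring(0, slash);
        String path = slash < 0 ? "" : trimmed.substring(slash + 1);
        return new SmbUrlParts(uri.getHost(), share, path);
    }

    public static void main(String[] args) {
        SmbUrlParts p = parse("smb://server/share/docs/report.txt");
        System.out.println(p.server + " | " + p.share + " | " + p.path);
    }
}
```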

B. Installation

1) Binaries only: The protocol-smb files can be found in the ../plugins directory.
   Copy the "protocol-smb" folder to the NUTCHHOME/build/plugins directory.
   Put the "smb.properties" file in the NUTCHHOME/conf directory.
   Configure the properties in the "smb.properties" file.
   Enable the plugin by updating the "nutch-site.xml" file found in the NUTCHHOME/conf directory, e.g.
   <property>
     <name>plugin.includes</name>
     <value>protocol-smb| other plugins...</value>
     <description>
     </description>
   </property>
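
The plugin.includes value is a regular expression over plugin ids. The sketch below assumes that Nutch accepts a plugin only when its id fully matches that expression; the class name and the sample expression are hypothetical and only illustrate how "protocol-smb" has to appear as one of the alternatives.

```java
import java.util.regex.Pattern;

/** Illustrative check (not Nutch code): does a plugin id match a plugin.includes value? */
final class PluginIncludesCheck {

    /** Returns true when the whole plugin id matches the plugin.includes expression. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical value; use the plugin list from your own nutch-site.xml.
        String includes = "protocol-smb|protocol-http|parse-(text|html)|index-basic";
        System.out.println("protocol-smb included:  " + isIncluded(includes, "protocol-smb"));
        System.out.println("protocol-file included: " + isIncluded(includes, "protocol-file"));
    }
}
```

Because the value is a regular expression, stray whitespace around a "|" becomes part of the neighbouring alternative, so it is worth keeping the list free of spaces.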

2) Source code: The protocol-smb sources can be found in the ../src directory.
   Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
   Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
   Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
   Update the NUTCHHOME/default.properties file to include the plugin.
   Run ant to build.
   Copy the 'smb.properties' file to NUTCHHOME/conf and configure the properties.
   Enable the plugin by updating the nutch-site.xml file.

C. Known Issues

1) MalformedURLException: unknown protocol: smb

The SMB URL protocol handler has not been installed successfully.
In short, the jCIFS jar must be loaded by the system class loader.

Workaround: a) A short-term solution is to install the jCIFS jar
library found in the protocol-smb folder into
JDKHOME/jre/lib/ext and/or JREHOME/lib/ext.

b) If the exception is still thrown after completing step a),
set the system property by passing the following argument
to the JVM:

-Djava.protocol.handler.pkgs=jcifs

c) You can also set the property in your code. For example, if
you start crawling with org.apache.nutch.crawl.Crawl,
add the following line at the start of main. This has the same effect as b).
public static void main(String args[]) throws Exception {
    // Register jCIFS as a URL protocol handler package.
    // This must run before the first smb:// URL is created.
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    // and so on
}

See also the jCIFS FAQ page: http://jcifs.samba.org/src/docs/faq.html
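
To check whether the JVM can actually resolve the smb scheme after applying one of the workarounds, a small probe like the following may help. It is a sketch using only the JDK: the class name ProtocolHandlerCheck is hypothetical, and the URL is never opened, only constructed, which is where the "unknown protocol" error comes from.

```java
import java.net.MalformedURLException;
import java.net.URI;

/** Illustrative probe (not part of the plugin): is a URL handler registered for a scheme? */
final class ProtocolHandlerCheck {

    /** Returns true when the JVM can find a URL stream handler for the given scheme. */
    static boolean hasHandler(String scheme) {
        try {
            // Building the URL triggers the handler lookup; no connection is made.
            URI.create(scheme + "://server/share/").toURL();
            return true;
        } catch (MalformedURLException e) {
            // Thrown with "unknown protocol: <scheme>" when no handler is found.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("http handler present: " + hasHandler("http"));
        System.out.println("smb handler present:  " + hasHandler("smb"));
    }
}
```

Run it with the same classpath and JVM arguments as the crawl (for example with -Djava.protocol.handler.pkgs=jcifs). If the smb line still reports false, the jCIFS jar is most likely not visible to the class loader that performs the lookup.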

2) FATAL smb.SMB - Could not read content of protocol: smb://xxxxxx

This problem usually occurs if the following properties are not set correctly in
the "smb.properties" file:

  • username
  • password
  • domain

For the list of available properties and how to set them, see:

http://jcifs.samba.org/src/docs/api/overview-summary.html#scp
See also the FAQ page: http://jcifs.samba.org/src/docs/faq.html

N.B. All properties should be set in the "smb.properties" file, which accepts
all supported jCIFS properties.
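
A quick way to rule out this cause is to load the properties file and report which credentials are missing or empty. The sketch below uses java.util.Properties and assumes the jCIFS client property names (jcifs.smb.client.username, jcifs.smb.client.password, jcifs.smb.client.domain); the keys the plugin actually expects in smb.properties may differ, so adjust the list to match your file. The class name is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative check (not part of the plugin): which credential properties are missing? */
final class SmbPropertiesCheck {

    // Assumption: jCIFS client property names; the plugin's smb.properties may use other keys.
    static final String[] REQUIRED = {
        "jcifs.smb.client.username",
        "jcifs.smb.client.password",
        "jcifs.smb.client.domain"
    };

    /** Returns the required keys that are absent or blank in the given properties text. */
    static List<String> missingKeys(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample content; in practice pass a FileReader for NUTCHHOME/conf/smb.properties.
        String sample = "jcifs.smb.client.username=crawler\njcifs.smb.client.domain=EXAMPLE\n";
        System.out.println("missing: " + missingKeys(new StringReader(sample)));
    }
}
```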

3) Only tested on Windows XP and Windows Server 2003. Please report test
results on other operating systems.



Armel Nene made changes - 05/Jan/07 03:11 PM
Field Original Value New Value
Attachment protocol-smb.zip [ 12348364 ]
Andrzej Bialecki added a comment - 05/Jan/07 03:55 PM
JCIFS is licensed under LGPL, so it cannot be included in Nutch distribution. As a consequence, we could add this plugin but it wouldn't be a part of the regular build ...

Armel Nene added a comment - 05/Jan/07 04:00 PM
The best way is to make the plugin available on plugin central, so that
people who need the plugin can download it from there.

Andrzej Bialecki added a comment - 07/Mar/07 10:27 PM
New features are not critical. This plugin uses an LGPL library, which cannot be included in Nutch repository.

Andrzej Bialecki made changes - 07/Mar/07 10:27 PM
Priority Critical [ 2 ] Major [ 3 ]
Vadim Bauer added a comment - 22/May/07 12:35 PM
There is an error in the plugin.xml file:

the plugin id should be protocol-smb and not protocol-file!

<?xml version="1.0" encoding="UTF-8" ?>
<!--
    Document   : plugin.xml
    Created on : 03 January 2007, 10:41
    Author     : Armel T. Nene
    Description:
        This file is used by Nutch to configure the SMB protocol
-->
<plugin id="protocol-smb" name="SMB Protocol Plug-in" version="1.0.0" provider-name="iDNA Solutions LTD">
    <runtime>
        <library name="protocol-smb.jar">
            <export name="*" />
        </library>
        <library name="jcifs-1.2.12.jar" />
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints" />
    </requires>
    <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol" point="org.apache.nutch.protocol.Protocol">
        <implementation id="org.apache.nutch.protocol.smb.SMB" class="org.apache.nutch.protocol.smb.SMB">
            <parameter name="protocolName" value="SMB" />
        </implementation>
    </extension>
</plugin>
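
A mismatched plugin id of this kind can be caught before deployment by reading the id attribute from plugin.xml and comparing it with the expected name. The following is a minimal sketch using the JDK's DOM parser; the class name PluginIdCheck is hypothetical, and the inline XML stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Illustrative check (not Nutch code): read the id attribute of the root <plugin> element. */
final class PluginIdCheck {

    /** Returns the id attribute of the root element, or "" when it is absent. */
    static String pluginId(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("id");
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the faulty descriptor described above.
        String xml = "<plugin id=\"protocol-file\" name=\"SMB Protocol Plug-in\" version=\"1.0.0\"/>";
        String id = pluginId(xml);
        System.out.println("protocol-smb".equals(id) ? "id ok" : "id mismatch: " + id);
    }
}
```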

Vadim Bauer added a comment - 25/May/07 09:04 PM
This is an update to the previous version. Check the included readme.txt.

Vadim Bauer made changes - 25/May/07 09:04 PM
Attachment protocol-smb.zip [ 12358278 ]
Vadim Bauer added a comment - 25/May/07 09:08 PM
The update fixes some issues I had when trying to use the old version with Nutch 1.0-dev.

Vadim Bauer made changes - 25/May/07 09:08 PM
Description (text revised; the new value is the Description shown above)
Affects Version/s 1.0.0 [ 12312443 ]
Affects Version/s 0.9.0 [ 12312013 ]
Joe Hurley added a comment - 14/Aug/08 07:41 AM
Is there a reason why this plugin only handles directories? I had to make the following changes to enable file crawling:

in SMBResponse.java:
replace `byte[] byte` with `this.content` on line 163
remove lines 206 and 209

It also got stuck in the file-not-found case. After examining the protocol-file code, I moved the else statement in SMB.java (lines 76 and 77) outside of the curly bracket on line 78. After this change, the code continues after encountering a file not found rather than looping forever.

Since then, it seems to work nicely on Windows Vista. Thanks for the plugin!
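
The control-flow problem described above can be sketched as follows. This is not the plugin's code: it assumes HTTP-style status codes in the manner of protocol-file, and the class, the two function parameters, and the redirect limit are all hypothetical. The point is only that every branch of the loop either returns, throws, or makes progress, so a 404 cannot spin forever.

```java
import java.util.function.ToIntFunction;
import java.util.function.UnaryOperator;

/** Illustrative fetch loop (not the plugin's code): every status code leads out of the loop. */
final class FetchLoopSketch {

    static final int MAX_REDIRECTS = 5; // hypothetical limit

    /**
     * Fetches a URL, following redirects, and returns the final status code.
     * status maps a URL to a status code; redirect maps a URL to its redirect target.
     */
    static int fetch(ToIntFunction<String> status, UnaryOperator<String> redirect, String url) {
        int redirects = 0;
        while (true) {
            int code = status.applyAsInt(url);
            if (code == 200) {
                return code; // success: leave the loop
            } else if (code >= 300 && code < 400) {
                if (++redirects > MAX_REDIRECTS) {
                    throw new IllegalStateException("too many redirects: " + url);
                }
                url = redirect.apply(url); // follow the redirect and try again
            } else {
                return code; // 404 and other errors: leave the loop as well
            }
        }
    }

    public static void main(String[] args) {
        // A resource that is never found must end the loop with 404.
        System.out.println(fetch(u -> 404, u -> u, "smb://server/share/missing.txt"));
    }
}
```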


Andrzej Bialecki made changes - 22/Sep/08 04:22 PM
Priority Major [ 3 ] Minor [ 4 ]
Ilguiz Latypov added a comment - 08/Nov/08 01:45 AM - edited
Fixed reading of SMB files, updated to jcifs 1.3.0, enhanced the smoke
test app. Protected special characters such as apostrophe and hash
mark with URL encoding.

Fixed the infinite retry loop in SMB.java.

Tried but could not activate the Apache logging.
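
For readers who want to see what protecting an apostrophe or hash mark with URL encoding looks like, here is a minimal sketch. It is not taken from the attached patch: the class name and the choice to keep only RFC 3986 unreserved characters are assumptions made for illustration. Each path segment is percent-encoded separately so that the "/" separators survive.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative encoder (not from the patch): percent-encodes the path part of an smb:// URL. */
final class SmbPathEncoder {

    private static final String UNRESERVED_PUNCT = "-._~";

    /** Percent-encodes one path segment, keeping only ASCII letters, digits, and -._~ */
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || UNRESERVED_PUNCT.indexOf(c) >= 0;
            if (safe) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    /** Encodes every segment of a slash-separated path and keeps the slashes. */
    static String encodePath(String path) {
        String[] segments = path.split("/", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            out.append(encodeSegment(segments[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints smb://server/share/Bob%27s%20files/notes%20%231.txt
        System.out.println("smb://server/" + encodePath("share/Bob's files/notes #1.txt"));
    }
}
```

Without the encoding, a "#" in a file name would be read as the start of a URL fragment and the rest of the name would be cut off.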


Ilguiz Latypov made changes - 08/Nov/08 01:45 AM
Attachment protocol-smb.zip [ 12393546 ]
Ilguiz Latypov made changes - 08/Nov/08 01:47 AM
Attachment protocol-smb-diff.txt [ 12393547 ]
Ilguiz Latypov made changes - 08/Nov/08 03:18 AM
Attachment protocol-smb-diff.txt [ 12393547 ]
Ilguiz Latypov made changes - 08/Nov/08 03:18 AM
Attachment protocol-smb.zip [ 12393546 ]
Ilguiz Latypov made changes - 08/Nov/08 03:19 AM
Attachment protocol-smb-diff.txt [ 12393551 ]
Ilguiz Latypov made changes - 08/Nov/08 03:19 AM
Attachment protocol-smb.zip [ 12393552 ]
Ilguiz Latypov made changes - 08/Nov/08 04:48 AM
Attachment protocol-smb.zip [ 12393552 ]
Ilguiz Latypov made changes - 08/Nov/08 04:48 AM
Attachment protocol-smb-diff.txt [ 12393551 ]
Ilguiz Latypov made changes - 08/Nov/08 04:49 AM
Attachment protocol-smb.zip [ 12393556 ]
Ilguiz Latypov made changes - 08/Nov/08 04:49 AM
Attachment protocol-smb-diff.txt [ 12393557 ]