Issue Details (XML | Word | Printable)

Key: HADOOP-3246
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Ankur
Reporter: Ankur
Votes: 1
Watchers: 8
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

FTP client over HDFS

Created: 14/Apr/08 10:50 AM   Updated: 22/Aug/08 07:50 PM
Component/s: util
Affects Version/s: 0.16.3
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
GZip Archive Licensed for inclusion in ASF works FTPJars.tar.gz 2008-05-30 10:46 AM Ankur 936 kB
Text File Licensed for inclusion in ASF works Hadoop-3246-latest.patch 2008-05-29 07:52 AM Ankur 41 kB
Text File Licensed for inclusion in ASF works HADOOP-3246.patch 2008-05-19 11:21 PM Doug Cutting 38 kB

Release Note: Introduced an FTPFileSystem backed by Apache Commons FTPClient to directly store data into HDFS.
Resolution Date: 29/May/08 08:39 PM


 Description  « Hide
An FTP client that stores content directly into HDFS allows data from FTP serves to be stored directly into HDFS instead of first copying the data locally and then uploading it into HDFS. The benefits are apparent from an administrative perspective as large datasets can be pulled from FTP servers with minimal human intervention.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Enis Soztutar added a comment - 14/Apr/08 02:46 PM
duplicate of HADOOP-3199

Ankur added a comment - 15/Apr/08 05:33 AM
Even though the two issues look same, they are different.

HADOOP-3199 is about an FTP server that provides FTP access to data in HDFS. Any FTP client would then be able to access HDFS data through FTP.

This issue is about an FTP client talks to remote FTP server(s), pull data from them and store directly into HDFS.

At present we are faced with the issue of our data lying in different remote FTP server locations. Pulling a lot of data from different locations is a lot of manual work including fetching data over FTP, storing it locally and then putting it into HDFS. This is cumbersome especially if the data is too large to fit into local storage.

This utility essentially provides following benefits
1. The steps of 'pull data from FTP server', 'store locally', 'tranfer to HDFS' and 'delete local copy' are converted into 1 step - 'Pull data and store into HDFS' .
2. No need to worry about lack of local storage as data goes directly into HDFS.
3. Can be used to run a batch of commands that include pulling data from different FTP servers.

All of this greatly simplifies administrative tasks.

+1 for marking this as 'Not Duplicate'


Enis Soztutar added a comment - 15/Apr/08 08:00 AM
oops, I've missed that this issue will track and FTP client. Reopening the issue.

Doug Cutting added a comment - 15/Apr/08 05:20 PM
> At present we are faced with the issue of our data lying in different remote FTP server locations.

To be clear: these are existing FTP servers, not remote HDFS systems that you wish to access over FTP? Because, if you want to transfer data from a remote HDFS system, then the HFTP scheme permits this over HTTP or HTTPS.


Ankur added a comment - 16/Apr/08 05:48 AM
> To be clear...

Yes these are existing FTP servers not HDFS systems


Ankur added a comment - 21/Apr/08 01:37 PM
I added a couple of JAR files in my patch (commons-net-1.4.1.jar and oro-2.0.8.jar) but svn diff would'nt add the jar contents to the patch correctly. all it says in the patch is 'Cannot display the contents of the binary file'. Consequently when I try to test the patch on my local machine it fails as it is unable to add the required binary files.

I am using Collabnet's SVN command line client - 1.4.6 for Red hat linux.

Can someone suggest a workaround as this is preventing me from submitting the patch

Thanks


Ankur added a comment - 21/Apr/08 02:39 PM
For the patch to work, following additional JARs are required
  • commons-net-1.4.1.jar
  • oro-2.0.8.jar

These can be downloaded from
http://download.nextag.com/apache/commons/net/binaries/commons-net-1.4.1.tar.gz


Ankur added a comment - 21/Apr/08 02:46 PM
The patch implements the ftpclient as a standalone shell. Most basic commands are implemented
get, put, delete, ls, cd, pwd , lls (DFS), lpwd (DFS), lcd (DFS).

Important thing to note is that get, put and delete are implemented so that they work on multiple files so no separate mget or mput is required.

Future work:
==========
Suggestions are welcome on

1. More commands that we would like to include in ftpclient
2. Multi-threaded I/O for faster get, put
3. Help improvement
4. Any other feature/modification


Hadoop QA added a comment - 22/Apr/08 08:30 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380604/ftpClient.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 9 new or modified tests.

javadoc -1. The javadoc tool appears to have generated 1 warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to cause Findbugs to fail.

core tests -1. The patch failed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2294/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2294/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2294/console

This message is automatically generated.


Ankur added a comment - 23/Apr/08 06:04 AM
Javadoc, findbugs, unit-test errors will go once the addtional required JARs (commons-net-1.4.1.jar and oro-2.0.8.jar) are included in the hadoop lib directory.
I am assuming that this does not happen in automated patch testing as the same errors surfaced in my local patch testing following the patch-test instructions mentioned on wiki.

I tried adding the JARS manually to the clean environment but then the testing would complain saying 'Cannot test patch in a modified environment'

Any suggestions on workaround for local patch testing ?


Ankur added a comment - 24/Apr/08 08:31 AM
So I tested the patch locally after adding the JARS to the test env manually and disabling the modification check in script 'test-patch.sh'.
Here's the report on my machine

-1 overall.
[exec]
[exec] @author +1. The patch does not contain any @author tags.
[exec]
[exec] tests included +1. The patch appears to include 9 new or modified tests.
[exec]
[exec] javadoc +1. The javadoc tool did not generate any warning messages.
[exec]
[exec] javac +1. The applied patch does not generate any new javac compiler warnings.
[exec]
[exec] findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

The find bugs warning is due to a couple of System.exit() in the code which even though is bad practice but still acceptable in my opinion for a shell application like this. Still I can go ahead and change this once I get some review comments.


Owen O'Malley added a comment - 29/Apr/08 09:30 PM
Wouldn't this be done better as a FileSystem? Then you could use distcp to copy to or from a ftp server.

Doug Cutting added a comment - 29/Apr/08 09:43 PM
> Wouldn't this be done better as a FileSystem?

+1. That makes sense to me. If the goal is to load data from FTP servers into HDFS, or vice-versa, then a FTP FileSystem implementation would permit both using existing tools like 'bin/hadoop fs' and distcp.


Chris Douglas added a comment - 29/Apr/08 10:28 PM
A few points:
  • +1 for this being a FileSystem
  • Changing javac.version in build.xml from 1.5 to 1.6 should not be part of this patch
  • ftp.server.host, ftp.server.username, and ftp.server.password should be o.a.h.conf.Configuration properties, not system properties. Conditionally running the unit tests based on a switch in build.xml is also not consistent with the existing unit tests; all the changes to build.xml should probably be reverted.
  • ftpClient.sh doesn't seem necessary; it certainly won't be if this is a FileSystem
  • It looks like some testing code accidentally made it into the patch, in FtpShell:
    +    conf.set("fs.default.name", "hdfs://agoel-pc:9000");
    +    conf.set("fs.hdfs.impl", "org.apache.hadoop.dfs.DistributedFileSystem");
    +    conf.set("hadoop.tmp.dir", "/tmp/hadoop-aankurgoel");
    
  • Process trivia: CHANGES.txt is added by the committer when the patch goes in. The "Release Notes" field in JIRA is filled out by the contributor
  • FtpClient::EraserThread is a heavyweight way to hide the password. Unfortunately, java.io.Console::readPassword is only in Java 1.6, and it disables echo via a (platform-dependent) native call, so it's not clear how one would do this in Java 1.5. Leaving this as a configuration property is probably sufficient until we can move to Java 1.6; as a FileSystem, it needs to be a Configuration property, anyway.
  • If you wanted to keep some of the code for a FTP client separate from existing tools, take a look at o.a.h.util.ToolBase to pick up some of the generic option parsing common to most hadoop utilities

Ankur added a comment - 01/May/08 10:57 AM
Chris, Thanks for the comments, i'll submit the new patch soon with suggested modifications

> It looks like some testing code accidentally
Sorry! will take better care next time.

> If you wanted to keep some of the code ...
I guess that won't be necessary as existing tools would be sufficient once this is implemented as an FTP Filesystem.


Ankur added a comment - 08/May/08 02:35 PM
The latest patch that implements the FTP client as an FTPFileSystem as suggested. Here are the results of testing the patch on my local machine (Since automated testing is going to fail anyway )

+1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 7 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.


Doug Cutting added a comment - 08/May/08 05:29 PM
Overall this looks very good! A few minor suggestions:
  • We should accept the username and password in the uri, as ftp://user:pass@host/, with the password optional.
  • It would be nice if folks could specify different usernames and passwords for different hosts in their configuration, perhaps with properties like ftp.user.host.example.com and ftp.password.host.example.com.
  • Rather than keeping a connection open in the FileSystem instance, perhaps we should open and close new connections for each file read, written, renamed, etc? FileSystem.java caches FileSystem implementations forever, and an FTP connection might time out. Also, the working-directory state of the connection makes this not thread-safe, which a connection per request would fix.
  • It would be best if the unit tests ran standalone, without requiring an external FTP server. We might include the Mina FTP server just for testing? We could put the jars somewhere in src/test.

Hadoop QA added a comment - 09/May/08 01:35 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381694/ftpFileSystem.patch
against trunk revision 654315.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 7 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to cause Findbugs to fail.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2431/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2431/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2431/console

This message is automatically generated.


Ankur added a comment - 10/May/08 09:09 AM
Ok, here the latest patch that implements the suggested changes from Doug. Here are the results of testing on my machine.

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 7 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.


Ankur added a comment - 10/May/08 09:11 AM
The new patch now has test case that runs standalone using Apache mina FTP server embedded in its code. As a result additional Jars from Apache Mina FTP server are required. Name of all the required Jars present in the patch.

Doug Cutting added a comment - 12/May/08 11:23 PM
Thanks! This looks great!

Can you please attach the required jar files to this issue too?

Ideally the test service should specify zero as the port, so that the OS allocates a free port, then we can ask which port was allocated and use that. But it doesn't look as though the Mina FTP server permits one to get the port that it's actually launched on. So perhaps we should try a few times in a loop, using a randomly generated port? Otherwise we'll get random test failures when that ports already in use, e.g., by another test run.


Ankur added a comment - 13/May/08 05:50 AM
Jar files required for this patch

Ankur added a comment - 13/May/08 06:01 AM
Is there an issue trying to run the server on 2021 ? I understand its less than ideal but still better to have a port defined for our test service instead of random connection attempts trying to figure out which port actually Mina launched on.

Ankur added a comment - 13/May/08 08:39 AM
Trying to connect to a randomly generated port might result in getting connected to a a different service running on the random port causing confusion in the commons FTPClient code and in turn FTPFileSystem code.

As a simple fix I added the following line after bind() call in the MinaListener.start() method.

setPort(acceptor.getLocalAddress().getPort());

This sets the port correctly to the actual port the FTP server ended up listening on. As a result we could do simple thing in our Test case like

MinaListener listener = (MinaListener) server.getServerContext().getListener("default");
int serverPort = listener.getPort();
ftpFs = FileSystem.get(URI.create("ftp://admin:admin@localhost:" + serverPort), conf);

This works and I have tested it on my local machine.
I have created a JIRA issue for Mina - https://issues.apache.org/jira/browse/FTPSERVER-134.

If acceptable then I can provide the updated ftpserver-core.jar with this fix till the time it gets pushed into their code line.


Ankur added a comment - 13/May/08 11:53 AM
Looks like I got the fix committed to the Mina code line faster than I expected . So I have modified the Test case to set port as 0 so that OS can assign a free port. Also the testcase has been modified to fetch the port correctly from listener.

I am attaching the latest Jars and updated patch.


Ankur added a comment - 14/May/08 10:51 AM
Doug, is there anything else in this patch that can be improved upon ?

Ankur added a comment - 19/May/08 01:15 PM
This is rather important piece of functionality allowing data-mobility between hadoop cluster and different FTP servers.
Thus increasing the priority hoping it gets scheduled for commit sooner.

Doug Cutting added a comment - 19/May/08 11:21 PM
This is looking great!

Stuff only used for testing should only be on the test classpath. I moved the ftp and mina jar files to src/test/lib and the test username & password to src/test/conf. Do these changes look reasonable?

It would be simpler for folks to evaluate this patch if the jars were all in a single .tar.gz that, when unpacked in a Hadoop tree, put them in the right places. And any .jar that's not an Apache product should be accompanied by a .LICENSE.txt file containing the license that .jar is under.


Ankur added a comment - 20/May/08 07:34 AM
  • Added License for SLF4J in the latest patch.
  • Packed the required jars in Hadoop-3246-Required-Jars.tar.gz. For testing this patch the file needs to be uncompressed in $HADOOP_HOME.

Note that moving Mina and ftp jars to src/test/lib is fine for this issue. But when Hadoop-3199 becomes available, these jars would be required under lib/. So until then we can keep the Mina and ftp jars under src/test/lib


Doug Cutting added a comment - 20/May/08 04:38 PM
The problem with the ftpserver jars is that they're not yet a released Apache product. Providing them as a part of a Hadoop release means that the Hadoop project would be releasing them, and we must perform the diligence required of Apache releases. This is not something we should do lightly.

So it will be much easier to provide HADOOP-3199 in Hadoop once there has been a release of ftpserver.


Ankur added a comment - 21/May/08 06:15 AM
I definitely agree on this. Has this patch been scheduled for commit? I wonder what release is going to have this because otherwise we'll have to use an existing release and apply this patch exclusively for our deployment.

Doug Cutting added a comment - 21/May/08 04:28 PM
I think we can include this in 0.18. We won't distributed the ftpserver jars in the release package.

It would be nice is to make it so that, if you unpack a release and run unit tests, they pass. So we should probably skip the ftp unit tests when the ftpserver jars are not present.


Ankur added a comment - 22/May/08 07:01 AM
I assume that we would be compiling ftp unit tests with Mina jars and omitting them while release bundling.

In that case we would have to load the ftpServer classes explicitly ourselves in ftp unit test and skip the test if classes are not found. However it is a little cumbersome since there are around 7 ftpServer and mina classes referenced.

Does this fix sound reasonable ?


Doug Cutting added a comment - 22/May/08 09:29 PM
I think it might be simpler to break the test into two classes: TestFtpFileSystem, which will be run by our ant unit tests, and FtpTests, which actually performs the tests. The former can have a single method, testFtp(), that tries to load the FtpServer class. If it can, it constructs an instance of FtpTests and runs it. If it can't it just prints a warning. Might that work?

Ankur added a comment - 23/May/08 07:18 AM
That will work. I am just thinking if whether in the testFtp() method we should check for just one or all of the classes from FTPServer used in the test since they come from different FTPServer jars ?

Doug Cutting added a comment - 23/May/08 05:08 PM
> I am just thinking if whether in the testFtp() method we should check for just one or all of the classes

I think one is enough. If all jars are present, as from subversion, tests will run correctly. If no jars are present, as in a release, tests will be skipped. If only some jars are present, an unsupported configuration, tests will crash. That seems fine to me.


Ankur added a comment - 29/May/08 07:52 AM
  • Added FtpTests() that has the code for running FTPFileSystem test.
  • Modified TestFTPFileSystem to check for FTPServer class and conditionally execute the test.

Hope the patch is now in good shape to get committed


Doug Cutting added a comment - 29/May/08 08:39 PM
I just committed this. Thanks, Ankur!

Ankur added a comment - 30/May/08 10:46 AM
Attaching new required Jar files created after compiling Mina FTP server jars with Java 1.5.

Tsz Wo (Nicholas), SZE added a comment - 30/May/08 06:07 PM
This issue introduces a few jar files for testing. The location of them is src/test/lib, i.e under src. In order to compile the tests in my eclipse project, I have to add them to the build path manually. Usually, jar files can be found in lib. So it is not obvious to find these jar files. I cannot find them until I read the patch in this issue. Could we move these jar files to lib?

Doug Cutting added a comment - 30/May/08 10:51 PM
We want to keep these out of lib/, since they're not a released binary that we want Hadoop users to use. We include them in svn for testing only.

Hudson added a comment - 01/Jun/08 01:49 PM