Enis Soztutar added a comment - 14/Apr/08 02:46 PM
duplicate of HADOOP-3199
Even though the two issues look the same, they are different.
HADOOP-3199 is about an FTP server that provides FTP access to data in HDFS. Any FTP client would then be able to access HDFS data through FTP. This issue is about an FTP client that talks to remote FTP server(s), pulls data from them, and stores it directly in HDFS. At present we are faced with the issue of our data lying in different remote FTP server locations. Pulling a lot of data from different locations is a lot of manual work, including fetching data over FTP, storing it locally, and then putting it into HDFS. This is cumbersome, especially if the data is too large to fit into local storage. This utility essentially provides the following benefits. All of this greatly simplifies administrative tasks.
+1 for marking this as 'Not Duplicate'.
Oops, I missed that this issue will track an FTP client. Reopening the issue.
> At present we are faced with the issue of our data lying in different remote FTP server locations.
To be clear: these are existing FTP servers, not remote HDFS systems that you wish to access over FTP? Because, if you want to transfer data from a remote HDFS system, then the HFTP scheme permits this over HTTP or HTTPS.
I added a couple of JAR files to my patch (commons-net-1.4.1.jar and oro-2.0.8.jar), but svn diff wouldn't add the JAR contents to the patch correctly. All it says in the patch is 'Cannot display the contents of the binary file'. Consequently, when I try to test the patch on my local machine, it fails because it is unable to add the required binary files.
I am using CollabNet's SVN command-line client 1.4.6 for Red Hat Linux. Can someone suggest a workaround, as this is preventing me from submitting the patch? Thanks.
For the patch to work, the following additional JARs are required
These can be downloaded from
The patch implements the ftpclient as a standalone shell. Most basic commands are implemented:
get, put, delete, ls, cd, pwd, lls (DFS), lpwd (DFS), lcd (DFS). An important thing to note is that get, put, and delete are implemented so that they work on multiple files, so no separate mget or mput is required.
Future work: 1. More commands that we would like to include in ftpclient
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380604/ftpClient.patch against trunk revision 645773.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 9 new or modified tests.
javadoc -1. The javadoc tool appears to have generated 1 warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs -1. The patch appears to cause Findbugs to fail.
core tests -1. The patch failed core unit tests.
contrib tests -1. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2294/testReport/
This message is automatically generated.
The Javadoc, findbugs, and unit-test errors will go away once the additional required JARs (commons-net-1.4.1.jar and oro-2.0.8.jar) are included in the hadoop lib directory.
I am assuming that this does not happen in automated patch testing, as the same errors surfaced in my local patch testing when I followed the patch-test instructions on the wiki. I tried adding the JARs manually to the clean environment, but then the testing complained, 'Cannot test patch in a modified environment'. Any suggestions on a workaround for local patch testing?
So I tested the patch locally after adding the JARs to the test environment manually and disabling the modification check in the script 'test-patch.sh'.
Here's the report on my machine
-1 overall.
The findbugs warning is due to a couple of System.exit() calls in the code, which, even though bad practice, are still acceptable in my opinion for a shell application like this. Still, I can go ahead and change this once I get some review comments.
Wouldn't this be done better as a FileSystem? Then you could use distcp to copy to or from an FTP server.
> Wouldn't this be done better as a FileSystem?
+1. That makes sense to me. If the goal is to load data from FTP servers into HDFS, or vice versa, then an FTP FileSystem implementation would permit both, using existing tools like 'bin/hadoop fs' and distcp.
A few points:
Chris, thanks for the comments. I'll submit the new patch soon with the suggested modifications.
> It looks like some testing code accidentally
> If you wanted to keep some of the code ...
The latest patch implements the FTP client as an FTPFileSystem, as suggested. Here are the results of testing the patch on my local machine (Since automated testing is going to fail anyway
+1 overall.
Overall this looks very good! A few minor suggestions:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381694/ftpFileSystem.patch against trunk revision 654315.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 7 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2431/testReport/
This message is automatically generated.
OK, here is the latest patch that implements the changes Doug suggested. Here are the results of testing on my machine.
[exec] +1 overall.
Thanks! This looks great!
Can you please attach the required jar files to this issue too? Ideally the test service should specify zero as the port, so that the OS allocates a free port; then we can ask which port was allocated and use that. But it doesn't look as though the Mina FTP server permits one to get the port that it's actually launched on. So perhaps we should try a few times in a loop, using a randomly generated port? Otherwise we'll get random test failures when that port's already in use, e.g., by another test run.
Is there an issue with trying to run the server on 2021? I understand it's less than ideal, but it is still better to have a port defined for our test service than to make random connection attempts to figure out which port Mina actually launched on.
Trying to connect to a randomly generated port might result in connecting to a different service running on that port, causing confusion in the Commons FTPClient code and, in turn, the FTPFileSystem code.
As a simple fix, I added the following line after the bind() call in the MinaListener.start() method: setPort(acceptor.getLocalAddress().getPort()); This sets the port to the actual port the FTP server ended up listening on. As a result, we can do a simple thing in our test case like: MinaListener listener = (MinaListener) server.getServerContext().getListener("default"); This works, and I have tested it on my local machine. If acceptable, I can provide the updated ftpserver-core.jar with this fix until it gets pushed into their code line.
Looks like I got the fix committed to the Mina code line faster than I expected.
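The "bind to port zero, then ask which port was allocated" idea discussed here can be illustrated with plain JDK sockets. This is only a sketch of the general technique, not the Mina or ftpserver code: the class name EphemeralPortSketch and the helper bindToFreePort() are hypothetical, and java.net.ServerSocket stands in for the FTP listener.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical illustration; ServerSocket stands in for the FTP test server.
public class EphemeralPortSketch {

    /** Binds to port 0 so the OS picks a free port on the loopback interface. */
    static ServerSocket bindToFreePort() throws IOException {
        // Port 0 means "any free port"; 50 is the connection backlog.
        return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = bindToFreePort()) {
            // After bind, the socket reports the port the OS actually assigned.
            int port = server.getLocalPort();
            System.out.println("test server listening on port " + port);
        }
    }
}
```

A test that reads the port back this way never has to guess, so it cannot collide with another test run or connect to an unrelated service by accident.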
I am attaching the latest JARs and updated patch.
This is looking great!
Stuff only used for testing should only be on the test classpath. I moved the ftp and mina jar files to src/test/lib and the test username & password to src/test/conf. Do these changes look reasonable? It would be simpler for folks to evaluate this patch if the jars were all in a single .tar.gz that, when unpacked in a Hadoop tree, put them in the right places. And any .jar that's not an Apache product should be accompanied by a .LICENSE.txt file containing the license that .jar is under.
Note that moving the Mina and ftp jars to src/test/lib is fine for this issue. But when HADOOP-3199 becomes available, these jars would be required under lib/. So until then we can keep the Mina and ftp jars under src/test/lib.
The problem with the ftpserver jars is that they're not yet a released Apache product. Providing them as a part of a Hadoop release means that the Hadoop project would be releasing them, and we must perform the diligence required of Apache releases. This is not something we should do lightly.
So it will be much easier to provide HADOOP-3199 in Hadoop once there has been a release of ftpserver. I think we can include this in 0.18. We won't distribute the ftpserver jars in the release package.
It would be nice to make it so that, if you unpack a release and run unit tests, they pass. So we should probably skip the ftp unit tests when the ftpserver jars are not present.
I assume that we would be compiling the ftp unit tests with the Mina jars and omitting them during release bundling.
In that case we would have to load the ftpServer classes explicitly ourselves in the ftp unit test and skip the test if the classes are not found. However, it is a little cumbersome, since there are around 7 ftpServer and Mina classes referenced. Does this fix sound reasonable?
I think it might be simpler to break the test into two classes: TestFtpFileSystem, which will be run by our ant unit tests, and FtpTests, which actually performs the tests. The former can have a single method, testFtp(), that tries to load the FtpServer class. If it can, it constructs an instance of FtpTests and runs it. If it can't, it just prints a warning. Might that work?
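The "try to load one class, otherwise skip with a warning" approach suggested above might look roughly like this. It is a sketch under assumptions, not the committed test: the class OptionalDependencySketch and its helpers are hypothetical, and the fully qualified name org.apache.ftpserver.FtpServer is assumed, since the thread only says "the FtpServer class".

```java
// Hypothetical illustration of skipping tests when an optional jar is absent.
public class OptionalDependencySketch {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar (or a jar missing its own dependencies): treat as absent.
            return false;
        }
    }

    /** Runs the tests only if the marker class is present; returns whether they ran. */
    static boolean runIfAvailable(String markerClass, Runnable tests) {
        if (!isClassAvailable(markerClass)) {
            System.err.println("WARNING: " + markerClass + " not found; skipping tests");
            return false;
        }
        tests.run();
        return true;
    }

    public static void main(String[] args) {
        // Assumed class name; only one marker class is checked, as discussed in the thread.
        runIfAvailable("org.apache.ftpserver.FtpServer",
                () -> System.out.println("running FTP tests"));
    }
}
```

In a real test, the code that references the ftpserver and Mina classes would live in a separate class (FtpTests in the suggestion above), so that the wrapper itself still loads when the jars are missing.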
> I am just thinking if whether in the testFtp() method we should check for just one or all of the classes
I think one is enough. If all jars are present, as from subversion, tests will run correctly. If no jars are present, as in a release, tests will be skipped. If only some jars are present, an unsupported configuration, tests will crash. That seems fine to me.
I just committed this. Thanks, Ankur!
This issue introduces a few jar files for testing. They are located in src/test/lib, i.e., under src. In order to compile the tests in my Eclipse project, I have to add them to the build path manually. Jar files are usually found in lib, so it is not obvious where to find these. I could not find them until I read the patch in this issue. Could we move these jar files to lib?
We want to keep these out of lib/, since they're not a released binary that we want Hadoop users to use. We include them in svn for testing only.
Integrated in Hadoop-trunk #509 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/509/)