Hadoop Common: HADOOP-3754

Support a Thrift Interface to access files/directories in HDFS

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Thrift is a cross-language RPC framework. It supports automatic code generation for a variety of languages (Java, C++, Python, PHP, etc.). It would be nice if the HDFS APIs were exposed through Thrift. That would allow applications written in any programming language to access HDFS.

      1. hadoopthrift.tar.gz
        251 kB
        dhruba borthakur
      2. hadoopthrift9.patch
        53 kB
        dhruba borthakur
      3. hadoopthrift8.patch
        54 kB
        dhruba borthakur
      4. hadoopthrift7.patch
        54 kB
        dhruba borthakur
      5. hadoopthrift6.patch
        51 kB
        dhruba borthakur
      6. hadoopthrift5.patch
        51 kB
        dhruba borthakur
      7. hadoopthrift4.patch
        47 kB
        dhruba borthakur
      8. hadoopthrift3.patch
        40 kB
        dhruba borthakur
      9. hadoopthrift2.patch
        36 kB
        dhruba borthakur
      10. thrift1.patch
        30 kB
        dhruba borthakur

        Activity

        dhruba borthakur added a comment -

        This patch exposes a Thrift API from the FileSystem client. This is not the optimal solution; a better and more performant solution would be to expose Thrift APIs from the Datanode and NameNode directly.

        This patch requires that a FileSystemClientProxy server be started to expose the Thrift API. The code is, at this point, more a demonstration and proof-of-concept, rather than a finished piece of development. The Thrift API is a subset of the FileSystem API at present.

        The Hadoop Thrift API is in src/contrib/thrift/if. Running "ant package" should compile all that is required. The FileSystemProxyClient can be started by running src/contrib/thrift/start_thrift_server.sh. There is a python command line interface hdfs.py that is used to demonstrate that a python client can use the Thrift FileSystem API.

        More information about thrift can be found at http://incubator.apache.org/thrift/index.html
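        For readers unfamiliar with Thrift, the interface definition in src/contrib/thrift/if might look roughly like the sketch below. The names, fields, and method signatures here are illustrative guesses, not the actual IDL from the patch:

```thrift
/*
 * Hypothetical sketch of a Thrift IDL exposing a subset of the
 * FileSystem API. The real definition lives in src/contrib/thrift/if
 * and may differ in names and types.
 */
struct ThriftHandle {
  1: i64 id,
}

exception ThriftIOException {
  1: string message,
}

service ThriftHadoopFileSystem {
  ThriftHandle create(1: string path) throws (1: ThriftIOException ouch),
  ThriftHandle open(1: string path) throws (1: ThriftIOException ouch),
  string read(1: ThriftHandle handle, 2: i64 offset, 3: i32 size) throws (1: ThriftIOException ouch),
  bool write(1: ThriftHandle handle, 2: string data) throws (1: ThriftIOException ouch),
  bool close(1: ThriftHandle handle) throws (1: ThriftIOException ouch),
}
```

        Running the Thrift compiler over such a file emits client stubs for each target language, which is what "ant package" drives.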

        dhruba borthakur added a comment -

        Attached build.xml

        dhruba borthakur added a comment -

        Implemented more FileSystem APIs, e.g. ls, setpermissions, setowner, setReplication, etc.

        Pete Wyckoff added a comment -

        A few quick comments - (1) I don't see the generated code from thrift in the patch and (2) I think including bindings for perl, java, python, ruby, C (if that exists in thrift now) and C++ would be good; (3) your name is in the header for the python script; (4) I think you definitely need a comment about opening a JIRA to change the com.facebook.thrift to org.apache.thrift with a dependency on the thrift JIRA to that effect (I can give you the url for it); (5) the name I don't think is so good?? Should it be contrib/thrift or contrib/thrift_dfsclient or something.

        btw this could be a great building block for fuse-dfs - getting it out of the Java business could mean making it a real kernel module.

        Pete Wyckoff added a comment -

        Thrift namespacing stuff - https://issues.apache.org/jira/browse/THRIFT-34
        Craig Macdonald added a comment -

        Some questions/comments:

        • Would it be desirable that the thrift API reflects in some way the libhdfs API?
          • See also HADOOP-3264 (libhdfs does not provide permissions in API)
        • How is open for write, open for append signalled in the API?
        dhruba borthakur added a comment -

        Incorporated most of Pete's and Craig's comments. Thanks folks!

        I have renamed the package to thriftfs. I have also incorporated the "append" API into this package. The package now generates HDFS interfaces for Ruby, Cocoa, Perl, PHP, Python, etc.

        The generated files are not part of the source code. An "ant package" generates the API in all of the above languages. Thus, a binary package should contain these APIs. Does it sound ok to you?

        Pete Wyckoff added a comment -

        +1

        but, a few more nits:

        1. I do think that requiring people to download and compile thrift will be too much of a hassle, given that the compiler is in C++, so checking in the generated code really is the way to go, I think. And of course, this requires checking in the needed libraries in various languages - libthrift.jar, thrift.so, thrift.py, ... We can require it; it just makes it more of a hassle for the user, in which case I think we need to have a README that tells people how to do that. Also, why do we need to check in the limited_reflection header if the user has to download thrift??

        2. The exceptions thrown by the library are very general and do not match the client lib - e.g., IOException, ... although this could be a later add on.

        3. A note saying the chown is not atomic - i.e., the group in theory could change between the get and the set

        4. I think copy from local would be more robust if one could optionally add a checksum so the server could ensure it's looking at the right file and if not and/or the path does not exist, a meaningful exception is thrown but again could be a later add on

        5. Not needed now, but the command line isn't very robust to errors or friendly about printing them out in a meaningful, user-friendly way.

        6. Generally a README that explains what this is and/or a bigger release note.

        7. Not now, but I would be super, super interested in knowing the performance of read/writes from this server.

        8. As we saw with the metastore, it would be cool to have an optional minimum number of threads in the worker pool.

        9. I don't quite understand why src/contrib/build-contrib.xml needs to change for adding this??

        10. would be better to inherit from thrift/src/contrib/fb303 but could be done later and then include counts for each operation.

        But, this is a killer application since no Java or Hadoop is needed on the client whatsoever! Congratulations! Would be cool even to use the Java bindings from a thin client to show no need for all of hadoop.

        I would really, really love to see:

        List<BlockAddresses> readBlocks(string filename) throws IOException ;
        List<BlockAddresses> writeBlocks(string filename, i64 length) throws IOException;

        which give you access to reading/writing directly from the data node over TCP

        Overall looks very good on the first cut.

        pete
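        Pete's proposed calls could be expressed in the IDL roughly as follows. The struct fields below are guesses for illustration only; nothing like this exists in the patch:

```thrift
// Hypothetical: where a datanode holds each block of a file, so a
// client could read/write block data directly over TCP.
struct BlockAddresses {
  1: i64 blockId,
  2: list<string> hosts,   // datanodes holding a replica
  3: i64 offset,           // offset of this block within the file
  4: i64 length,
}

exception ThriftIOException {
  1: string message,
}

service ThriftHadoopFileSystem {
  list<BlockAddresses> readBlocks(1: string filename) throws (1: ThriftIOException ouch),
  list<BlockAddresses> writeBlocks(1: string filename, 2: i64 length) throws (1: ThriftIOException ouch),
}
```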

        Pete Wyckoff added a comment -

        oh one other thing is I would hope this generates lots of small/medium utilities (because they can be self contained - e.g., perl which only needs the address of the server and not Java and a config dir and hadoop itself), so do you have any thoughts on where these things would be contributed? e.g., src/contrib/thriftfs/scripts or something ??

        Pete Wyckoff added a comment -

        One other thought on the naming - I could very easily see adding thrift APIs for the NameNode and JobTracker for managing them and getting stats and such - e.g., for Nagios or proprietary monitoring systems and stats collection. In which case, naming this src/contrib/thriftapis/dfsclient would be more extensible.

        Nitay Joffe added a comment -

        This is great work Dhruba!

        I worked on a very similar system at Powerset and had a few questions/comments:

        1) I wrote a C++ stream using boost::iostreams which used the thrift API underneath so that you can work with a standard stream yet read/write to hadoop. In order to make this work, I had to add a seek() method to the IDL. On input streams, I would allow arbitrary seeking. On output streams I would only allow it to get called with an offset of 0 (no actual seeking) which boost streams uses to find the current location.

        2) I had situations where the client would not close the files appropriately. This meant that other users would not see the file even though the writing was done because the data did not appear until the file is closed. To fix this situation, I put a TimerTask on each fd which would timeout (and close the file) after some period. Whenever an operation was called on a Handle its TimerTask would reset.

        I would attach code but we are currently being merged into Microsoft and in the process of figuring out how open source contributions will work.

        Cheers,
        -n
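        The per-handle timeout Nitay describes can be sketched in Python with threading.Timer. This is a minimal illustration of the idea (reset-on-activity, close-on-expiry), not his actual Powerset code; all names are made up:

```python
import threading

class TimedHandle:
    """Close a wrapped file-like handle after `timeout` seconds of
    inactivity; every operation resets the timer. A sketch of the idea
    Nitay describes, not his actual implementation."""

    def __init__(self, handle, timeout):
        self._handle = handle
        self._timeout = timeout
        self._lock = threading.Lock()
        self._gen = 0          # invalidates stale timers after a reset
        self._timer = None
        self.closed = False
        self._arm()

    def _arm(self):
        self._gen += 1
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self._timeout, self._expire,
                                      args=(self._gen,))
        self._timer.daemon = True
        self._timer.start()

    def _expire(self, gen):
        with self._lock:
            # A reset bumps _gen, so a stale expiry is ignored.
            if gen == self._gen and not self.closed:
                self._handle.close()
                self.closed = True

    def write(self, data):
        with self._lock:
            if self.closed:
                raise IOError("handle closed by inactivity timer")
            self._arm()           # activity resets the timer
            return self._handle.write(data)

    def close(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            if not self.closed:
                self._handle.close()
                self.closed = True
```

        A writer that dies silently stops resetting the timer, so the file eventually gets closed and its data becomes visible to other users.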

        dhruba borthakur added a comment -

        Thanks Pete & Nitay for the detailed comments. Thanks a bunch.

        1. The patch includes the thrift binary for Linux. See lib/thrift/thrift and lib/thrift/libthrift.jar. Thus, a Linux compile does not have to download any external libraries or utilities.

        2. The proxy server uses the message from the hadoop.IOException to create its own exception. This is the best we can do for now. If we want to improve it later, we can do that. The application would see the real exception string, so it should be enough for debugging purposes, won't it?

        3. Added a note to chown to say that it is not-atomic. This is true for hdfs.py only and does not apply to the chown thrift interface.

        4. I like your idea of using the checksum all the way from the client, but maybe we can postpone it to a later date.

        5. The python command line needs more work. However, I am not targeting the python wrapper as a piece that an application will use as-is. It is there to demonstrate how to access HDFS from a python script.

        6. Added README that describes the approach, build and deployment process. I plan on writing a Wiki page once this patch gets committed.

        7. Performance measurement will come at a later date.

        8. Added default minimum number of threads to be 10.

        9. The change to build-contrib.xml ensures that the generated jar file(s) are in the CLASSPATH while compiling HadoopThriftServer.java.

        10. I would wait to include fb303. This is mostly for statistics management and process management and can be added at a later date. It might be useful to use HadoopMetrics or via HADOOP-3772.

        11. I added a new call setInactiveTimeoutPeriod() that allows an application to specify how long the proxy server should remain active starting from the last call to it. If this timer expires, then the proxy server closes all open files and shuts down. The default inactivity timeout is 1 hour. This does not completely address Nitay's problems, but maybe solves it to a certain extent. If Nitay could merge in his code for per-handle timer once this patch is committed, that will be great.

        12. If, at a future time, we add Thrift APIs to Namenode, Datanode, etc, they would have to be located in src/hdfs and not in contrib. Even if we decide to keep them in contrib, they could be src/contrib/thriftfs/namenode, src/contrib/thriftfs/datanode, etc. I think the API in this patch should try to resemble existing API in fs.FileSystem.

        13. I added a getFileBlockLocations API to allow fetching the block locations of a file.

        steve_l added a comment -

        1. Until something comes out of incubation, Apache can't release it. So checking in the relevant artifacts and releasing downstream versions is going to be on the forbidden list.

        2. It may/should be possible to do something about sticking the artifacts up on the Maven repository, though CPU versions and suchlike complicate binding.

        3. Regarding a client that doesn't need java/hadoop.jar: well, you've just moved the dependencies, haven't you, and switched to a different kind of RPC. What would appeal to me more would be a RESTy interface that works long-haul, so you could submit work to a remote cluster, then monitor that work through any Atom feed reader.

        Pete Wyckoff added a comment -

        Hi Steve,

        Regarding #3, thrift is a thin library and as such is a lot lighter than requiring Hadoop RPC, which uses reflection and thus must basically be kept in perfect sync with the server side's code (yes, 19 has separate jars, but the client jar isn't very full featured, I bet). Thrift gives you the advantage of tight integration (i.e., well-defined APIs) whilst still providing pretty loose coupling, although, of course, not as loose as REST. With thrift, one can work on data structures, and thrift does provide some REST-like interfaces too. I think REST may be good for some things like just asking for job status and such, but doesn't provide APIs as robust as thrift's, which could provide a better alternative RPC to remote reflection in addition to providing something for thin clients. And thrift has bindings in literally about 20 languages.

        Although I guess protocol buffers are an alternative too now that they are open sourced but, I think they only have C, Java and Python bindings.

        pete

        Pete Wyckoff added a comment -

        Code review comments:

        1. I didn't notice before, but if the server is restarted, it is highly likely that clients will have handles to the wrong files sitting around, i.e. the nextId counter will just start over at 100.

        2. what about Nitay's #2 comment about closing files if the client dies? Is it possible to implement the same semantics that the DFSClient would?

        3. Steve's comments - esp about including thrift ...

        Doug Cutting added a comment -

        > until something comes out of incubation, apache can't release it.

        Right, we can use the version at http://developers.facebook.com/thrift/, under the Thrift Software License, but we should not use the incubating software until it is released by Apache.

        Pete Wyckoff added a comment -

        Does use mean we can include it in svn with its runtime libraries?

        Doug Cutting added a comment -

        > Does use mean we can include it in svn with its runtime libraries?

        Do you mean software distributed under the Thrift Software License? Yes. This looks like a standard BSD-style license, which is permitted under Apache's 3rd party licensing policy (http://www.apache.org/legal/3party.html).

        Chad Walters added a comment -

        Hmm, that release is months old. There's been a lot of changes/progress on Thrift since that release was made, including improvements to the Java portion of the system. Doug, what are the relevant policies around using code that is in the Incubator?

        Doug Cutting added a comment -

        > Hmm, that release is months old. There's been a lot of changes/progress on Thrift since that release was made, including improvements to the Java portion of the system.

        We should only include released code from other Apache projects. So until Thrift makes a release from the incubator, we should not include it, since it's not yet legally Apache code. Apache's fundamental law is that it's not licensed by Apache until a PMC has voted it so with 3 or more +1 votes and no -1 votes. If Facebook wanted to produce and distribute a new release of "Facebook Thrift" under an acceptable license (e.g., the Apache license) then Hadoop could re-distribute it, but it could not be anywhere called "Apache Thrift".

        dhruba borthakur added a comment -

        I included the thrift from http://developers.facebook.com/thrift/.

        Also, I made the shutdown from the python script more elegant. The python script creates a proxy server if the port is not specified. Also, the proxy server kills itself if the inactivity period is more than 1 hour (configurable). I would address Nitay's comments about particular files going inactive in a later release.

        If the proxy server is restarted, all existing clients' connections would get an error. They would have to re-establish a new thrift connection to the proxy server. Thus, the proxy server does not have to persist the object ids; they get restarted from 100.

        Pete Wyckoff added a comment -

        RE: new objectids, there's nothing in the API that prevents a client from connecting, getting a thrift object id, closing the connection, and re-connecting to do the reads or writes. So, they wouldn't get any errors in that case, assuming the server did a restart between the close and the next open.

        dhruba borthakur added a comment -

        I agree. Would it help if I start the object Id from a Random.nextLong() instead of 100?

        dhruba borthakur added a comment -

        I changed the number 100 to start from a random number. Submitting this patch to get HadoopQA tests.

        Pete Wyckoff added a comment -

        +1

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12387895/hadoopthrift6.patch
        against trunk revision 685406.

        -1 @author. The patch appears to contain 2 @author tags which the Hadoop community has agreed to not allow in code contributions.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to cause Findbugs to fail.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3049/testReport/
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3049/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3049/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        Fix HadoopQA failures.

        dhruba borthakur added a comment -

        1. Removed author tags.
        2. Added a unit test (needs more work to be comprehensive)
        3. Changed lots of LOG messages from INFO to DEBUG level.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388306/hadoopthrift7.patch
        against trunk revision 686362.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3061/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        Canceling patch because it went stale.

        dhruba borthakur added a comment -

        Merged patch with latest trunk

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388504/hadoopthrift8.patch
        against trunk revision 687868.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to cause Findbugs to fail.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3082/testReport/
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3082/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3082/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        The problem is that the patch includes two binaries: a jar file for libthrift and an executable for "thrift". When the patch is extracted using "patch -i <filename> -p 0", the two binary files do not get extracted. Is there a special "svn diff" command that I need to use to package binary files? The two binary files show up as "M" in my local workspace.

        dhruba borthakur added a comment -

        Fixed warnings detected by HadoopQA.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388784/hadoopthrift9.patch
        against trunk revision 688960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 275 release audit warnings (more than the trunk's current 271 warnings).

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3104/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3104/artifact/trunk/current/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3104/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3104/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3104/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        The javadoc warning does not seem to be introduced by this patch. The warnings are:

        [exec] [javadoc] Loading source files for package org.apache.hadoop.contrib.failmon...
        [exec] [javadoc] Constructing Javadoc information...
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/core/org/apache/hadoop/fs/FileSystem.java:34: package org.apache.hadoop.hdfs does not exist
        [exec] [javadoc] import org.apache.hadoop.hdfs.DistributedFileSystem;
        [exec] [javadoc] ^
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/core/org/apache/hadoop/fs/FsShell.java:46: package org.apache.hadoop.hdfs does not exist
        [exec] [javadoc] import org.apache.hadoop.hdfs.DistributedFileSystem;
        [exec] [javadoc] ^
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/mapred/org/apache/hadoop/mapred/Task.java:36: package org.apache.hadoop.hdfs does not exist
        [exec] [javadoc] import org.apache.hadoop.hdfs.DistributedFileSystem;
        [exec] [javadoc] ^
        [exec] [javadoc] Standard Doclet version 1.6.0_06
        [exec] [javadoc] Building tree for all the packages and classes...
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/core/org/apache/hadoop/fs/FileSystem.java:56: warning - Tag @link: reference not found: DistributedFileSystem
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/core/org/apache/hadoop/metrics/MetricsUtil.java:34: warning - Tag @link: reference not found: org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics
        [exec] [javadoc] Building index for all the packages and classes...
        [exec] [javadoc] Building index for all classes...
        [exec] [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/core/overview.html: warning - Tag @link: reference not found: org.apache.hadoop.hdfs.server.namenode.NameNode
        [exec] [javadoc] Generating /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/docs/api/stylesheet.css...
        [exec] [javadoc] 6 warnings
        [exec] There appear to be 1 javadoc warnings generated by the patched build.
        [exec]

        None of these files are touched by this patch. I think this patch can be safely committed. Please comment if you think otherwise.

        dhruba borthakur added a comment -

        Also, the release-audit warnings are from thrift binary files and jars. However, the patch has a LICENSE file that describes the license for thrift.

        dhruba borthakur added a comment -

        I just committed this.

        Owen O'Malley added a comment -

        I think we should revert this change.
        1. No one +1'ed it.
        2. It checks a binary executable into lib. I think it would be better to check in the generated files, but not the generator.
        3. It uses a different platform name than the rest of the build.
        4. There is no README in lib/thrift that states where the included binaries and jar came from.

        Owen O'Malley added a comment -

        Actually, other contrib modules have a lib directory. We should probably move the thrift library stuff into src/contrib/thriftfs/lib/. (Although I still think the executable for thrift should be removed; we do not want to get into the business of distributing compilation tools.)

        Pete Wyckoff added a comment -

        +1 on Owen's comment about including the generated code. The compiled thrift will only work on compatible platforms, so people on Windows or other "weird" platforms will not be able to use this without compiling thrift themselves, which is too much of a hassle, I would think.

        Owen O'Malley added a comment -

        I rolled this back in Subversion until the binary is removed, the library is moved down into the contrib lib, and the README is added to the lib.

        dhruba borthakur added a comment -

        This includes the changes proposed by Owen. This patch has binary files (jars), so a gzipped tar file is included as a patch.

        dhruba borthakur added a comment -

        Owen, can you please review it? Thanks.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12389059/hadoopthrift.tar.gz
        against trunk revision 689733.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3133/console

        This message is automatically generated.

        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #586 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/586/ )
        Owen O'Malley added a comment -

        I just recommitted this. Thanks, Dhruba!

        steve_l added a comment -

        What may make sense for future releases of Apache Thrift is to make sure that the Thrift team releases versions for the platforms relevant to Hadoop (Linux, Windows, 32-bit, 64-bit, etc.) and provides source and binary RPMs. If there were a Thrift RPM, it would be that much easier for a Hadoop RPM to depend on it, and so get the right libraries onto the target OS.

        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #589 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/589/ )

          People

          • Assignee: dhruba borthakur
          • Reporter: dhruba borthakur
          • Votes: 0
          • Watchers: 7

            Dates

            • Created:
              Updated:
              Resolved:
