Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      HDFS is inefficient with large numbers of small files, so one might pack many small files into large, compressed archives. But for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more of the original small files per split. The zip format, unlike tar, permits enumeration of the files in an archive without scanning the entire archive. Thus a zip InputFormat could efficiently split large archives into splits that contain one or more archived files.
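The premise above, that the central directory lets us list a zip's entries without scanning the archive body, can be checked with plain java.util.zip. This is an illustrative sketch; the class and file names are mine, not part of any patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Enumerate a zip's entries from its central directory without
// decompressing or scanning the archive body.
public class ZipEnumerationDemo {

    // Pack the given name -> content pairs into a zip file at 'path'.
    static void createZip(Path path, Map<String, String> files) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(path))) {
            for (Map.Entry<String, String> e : files.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue().getBytes("UTF-8"));
                out.closeEntry();
            }
        }
    }

    // List entry names straight from the central directory.
    static List<String> listEntries(Path path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path.toFile())) {
            for (Enumeration<? extends ZipEntry> en = zip.entries(); en.hasMoreElements();) {
                names.add(en.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".zip");
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "alpha");
        files.put("b.txt", "beta");
        createZip(tmp, files);
        System.out.println(listEntries(tmp)); // [a.txt, b.txt]
        Files.delete(tmp);
    }
}
```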

        Activity

        Ankur added a comment -

        Proposed Implementation Approach
        --------------------------------------------------

        1. Implement class ZipInputFormat to extend FileInputFormat.

        2. Override the getSplits() method to read each file's
        InputStream and construct a ZipInputStream out of it.

        3. Create FileSplits in a way that each file split has the following
        properties

        • FileSplit.start = start index of a zip entry.
        • FileSplit.length = end index of a zip entry.
        • fileSplit.file = Zip file.
        • Sum of compressed size of zip entries <= splitSize

        For example, start = 3, length = 6 signifies that zip entries 3 to 6
        will be read from the zip file of this split.

        4. Implement class ZipRecordReader to read each zip entry in its split
        using LineRecordReader.

        5. Each zip entry will be treated as a text file.

        6. Implement the necessary unit test case classes.
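The packing rule in step 3 (sum of compressed sizes of the entries in a split <= splitSize) can be sketched outside Hadoop. SplitGrouper is a hypothetical name, and the long[] of compressed sizes stands in for what getSplits() would obtain from ZipEntry#getCompressedSize():

```java
import java.util.ArrayList;
import java.util.List;

// Group consecutive zip entries into splits while the sum of their
// compressed sizes stays within splitSize.
public class SplitGrouper {

    // Returns {start, length} index pairs: each split covers entries
    // [start, start + length).
    static List<long[]> group(long[] compressedSizes, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        int start = 0;
        long acc = 0;
        for (int i = 0; i < compressedSizes.length; i++) {
            // Close the current split if adding this entry would overflow
            // it (an oversized single entry still gets its own split).
            if (i > start && acc + compressedSizes[i] > splitSize) {
                splits.add(new long[] {start, i - start});
                start = i;
                acc = 0;
            }
            acc += compressedSizes[i];
        }
        if (start < compressedSizes.length) {
            splits.add(new long[] {start, compressedSizes.length - start});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Entries of 40, 40, 40, 100 bytes with splitSize 100
        // yield splits starting at entries 0, 2, and 3.
        for (long[] s : group(new long[] {40, 40, 40, 100}, 100)) {
            System.out.println(s[0] + "+" + s[1]);
        }
    }
}
```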

        Questions:
        =========
        1. Is there a need to implement a ZipCodec (like GzipCodec and DefaultCodec) ?
        2. Should the ZipRecordReader be flexible enough to treat the individual zip entries in a
        FileSplit as being a text file or a sequence file ?

        Please feel free to comment on anything I missed that might be required.
        Also, any suggestions or recommendations to make the implementation better will be greatly
        appreciated.

        -Ankur

        Doug Cutting added a comment -

        > 2. Override the getSplits() method to read each file's InputStream

        I think getSplits() should construct a split for each element of java.util.zip.ZipFile#entries().

        > 3. Create FileSplits [ ... ]

        We should probably extend FileSplit or InputSplit specifically for zip files. The fields needed per split are the archive file's path and the path of the file within the archive. I don't think there's much point in supporting splits smaller than a file within the zip archive, so start and end offsets are not required here.

        > 4. Implement class ZipRecordReader to read each zip entry in its split
        Using LineRecordReader.

        We should be able to use LineRecordReader directly, passing its constructor the result of ZipFile#getInputStream().
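A plain-Java approximation of this suggestion, with BufferedReader standing in for LineRecordReader and an illustrative class name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// One entry per split, read line by line: BufferedReader plays the
// role that LineRecordReader would play inside Hadoop.
public class EntryLineReaderDemo {

    // Reads all lines of one named entry via ZipFile#getInputStream.
    static List<String> readLines(Path zipPath, String entryName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName); // direct lookup, no scan
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry), "UTF-8"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(tmp))) {
            out.putNextEntry(new ZipEntry("data.txt"));
            out.write("one\ntwo\nthree".getBytes("UTF-8"));
            out.closeEntry();
        }
        System.out.println(readLines(tmp, "data.txt")); // [one, two, three]
        Files.delete(tmp);
    }
}
```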

        Ankur added a comment -
        • This patch does not modify any existing source file and adds 3 new files
          1. ZipInputFormat.java
          2. ZipSplit.java
          3. TestZipInputFormat.java
        • The ZipInputFormat simply creates one split for each zip entry in an input zip file.
        • Each split is of type ZipSplit and is read using a LineRecordReader.
        • TestZipInputFormat is the unit test code that tests the ZipInputFormat with different zip files
          having different number of entries.
        • More information is available in the javadoc
        Ankur added a comment -

        Attaching the patch file

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12373681/ZipInputFormat.patch
        against trunk revision r614192.

        @author +1. The patch does not contain any @author tags.

        javadoc -1. The javadoc tool appears to have generated messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs -1. The patch appears to introduce 2 new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1673/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1673/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1673/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1673/console

        This message is automatically generated.

        Ankur added a comment -

        The following issues reported by QA were fixed:

        1. Findbugs errors in ZipInputFormat.java were fixed. The streams are now closed properly in isSplitable() and getSplits() methods.
        2. Javadoc comments fixed and verified that no new javadoc warnings are generated after applying the patch.
        3. Fixed formatting in the code.
        4. core-tests and contrib-tests are now passing after the above changes.

        Kindly verify.

        Doug Cutting added a comment -

        Some comments:

        • isSplittable throws an exception when an empty zip archive is passed. Instead, an empty zip file should just provide no keys and values, but not throw exceptions.
        • in getSplits, there's no need to explicitly test that each file exists. Instead, we can rely on open() throwing an exception if a file does not exist.
        • getRecordReader should not loop calling getNextEntry(), but instead just call getEntry(String).

        Oh, wait. On that last point, it looks like getEntry() is only available on ZipFile, and we cannot create a ZipFile except from a File. With an InputStream we must use ZipInputStream, which does not support getEntry(), since InputStream doesn't support random access. Sigh. This considerably reduces the utility of this InputFormat. GNU Classpath's implementation of java.util.zip.ZipFile uses a RandomAccessFile, which we could implement, but, alas, we can't use GNU's code at Apache because it is under the GPL.

        Zlib includes a zip file parser (minizip) that's under a BSD-like license and that permits random access to zip file entries from a user-supplied input stream. So we could do it in C. Sigh.
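The sequential-access limitation above can be made concrete: with ZipInputStream the only way to reach an entry is to walk past every entry before it. A small illustration (class and entry names are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// ZipInputStream only supports a forward scan with getNextEntry(),
// unlike ZipFile#getEntry, which jumps straight to an entry via the
// central directory.
public class SequentialScanDemo {

    // Walks the stream until 'target' is reached; returns how many
    // entries were visited (or -1 if absent).
    static int entriesVisited(InputStream in, String target) throws IOException {
        int visited = 0;
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                visited++;
                if (e.getName().equals(target)) {
                    return visited;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                out.putNextEntry(new ZipEntry(name));
                out.closeEntry();
            }
        }
        // Reaching the last of 3 entries costs a walk over all 3.
        System.out.println(entriesVisited(
                new ByteArrayInputStream(bytes.toByteArray()), "c.txt")); // 3
    }
}
```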

        Doug Cutting added a comment -

        So, while implementing a zip InputFormat based on native code would be a lot more work, it would also have some distinct advantages:

        • it could handle archives greater than 2GB;
        • it would know where within the archive each split resides, so that splits could be properly localized;
        • once HDFS implements append, it could provide appendable archives.

        None of these are possible with java.util.zip.

        Ankur added a comment -

        Some questions.
        1. How is a java.io.InputStream passed and used in native code? The header file represents it as a jobject, which I tried casting to FILE * and reading; it did not work as expected.

        2. Can a native method call return structures that can be converted to Java objects? If so, how?
        Basically I want to be able to return an array of C structures where each element holds the following information:

        • The path of the entry
        • The number of the entry
        • Offset of the entry in the zip file
          So that this info can be converted to an array of ZipSplit.

        I am new to JNI, so things are less than obvious for me; a little help on JNI will be greatly appreciated.

        Arun C Murthy added a comment -

        > How is a java.io.InputStream passed and used in native code. The header file represents it as a jobject which I tried casting to FILE * and reading, it did not work as expected.

        I'm not sure what exactly you are trying, but the way I implemented the native codecs was to read data from the InputStream in the Java layer, put the data into a direct-buffer and then pass it to the native zlib library.

        The stream you are talking about is the handle to the zlib stream, which is zlib specific. That just represents the state of the zlib stream.

        > Can a native method call return structures that can be converted to java objects ? If so how ?

        I'm sure that can be done via some hoops, but would be quite involved (I think).

        Some details here: http://java.sun.com/docs/books/jni/html/other.html#30942

        JNI Documentation from Sun: http://java.sun.com/docs/books/jni/html/jniTOC.html

        Hope that helps.
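The Java half of the direct-buffer approach described above might look like this sketch; the JNI handoff itself is omitted, and the class and method names are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Sketch of the Java side of the direct-buffer bridge: read from an
// InputStream in Java, stage the bytes in a direct ByteBuffer, then
// (in the real code) pass that buffer to the native zlib layer, which
// would access it via GetDirectBufferAddress.
public class DirectBufferBridge {

    // Fills 'direct' from 'in'; returns the byte count read (-1 on EOF).
    static int fillDirectBuffer(InputStream in, ByteBuffer direct) throws IOException {
        byte[] chunk = new byte[direct.remaining()];
        int n = in.read(chunk, 0, chunk.length);
        if (n > 0) {
            direct.put(chunk, 0, n);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        InputStream in = new ByteArrayInputStream("payload".getBytes("UTF-8"));
        System.out.println(fillDirectBuffer(in, buf)); // 7
    }
}
```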

        Doug Cutting added a comment -

        > I'm not sure what exactly you are trying [ ... ]

        The need here is to read from an FSInputStream, returned from an arbitrary FileSystem implementation, from C. In particular, we need to be able to make callbacks from C to Java for read() and seek(). (I think open() and close() can be handled entirely in Java, and tell() can be implemented entirely in C.)

        Ankur added a comment -

        > The need here is to ...
        Callback from C to Java is fine for read(). But seek() might be an issue since for true random access we need to be able to seek forward and backwards from
        1. start of the stream
        2. current pos of the stream
        3. end of the stream

        After taking a deep dive into the minizip code and implementing some POC code, I am not sure how a seek() callback from C to Java might be implemented in a way that can be leveraged from the existing minizip parser code. Any suggestions?
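For reference, one way the origin-relative seek that minizip expects (SEEK_SET/SEEK_CUR/SEEK_END) could be translated onto an absolute-position seek, assuming the stream length is known up front. This is a sketch, not tested against minizip:

```java
// Bridges minizip-style (offset, whence) seeks onto a stream that only
// understands absolute positions, like FSDataInputStream#seek.
public class WhenceSeek {
    // minizip-style origins
    static final int SEEK_SET = 0; // from start of stream
    static final int SEEK_CUR = 1; // from current position
    static final int SEEK_END = 2; // from end of stream

    // Maps (offset, whence) onto an absolute position. 'current' would
    // come from a tell() callback; 'length' must be supplied at open
    // time, since the stream itself cannot seek past EOF.
    static long resolve(long offset, int whence, long current, long length) {
        switch (whence) {
            case SEEK_SET: return offset;
            case SEEK_CUR: return current + offset;
            case SEEK_END: return length + offset; // offset is usually <= 0
            default: throw new IllegalArgumentException("bad whence: " + whence);
        }
    }

    public static void main(String[] args) {
        // e.g. minizip asking for "22 bytes before EOF" in a 1000-byte file:
        System.out.println(resolve(-22, SEEK_END, 0, 1000)); // 978
    }
}
```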

        Just to give an idea, here is some sample code for read() that I implemented.

        // including zlib & minizip libraries
        #include "unzip.h"

        // including java library
        #include <jni.h>
        #include "ZipInputFormat.h"

        // read() callback handed to minizip: calls back into the Java
        // stream's read(byte[], int, int) and copies the bytes into buf.
        uLong ZCALLBACK fread_file_func
        ( voidpf opaque, voidpf stream, void* buf, uLong size)
        {
            jlong bytesRead = -1;
            JNIEnv *env = (JNIEnv *) opaque;
            jobject javaStream = (jobject) stream;
            jclass dataInputStream = (*env)->GetObjectClass(env, javaStream);
            jmethodID MID_read = (*env)->GetMethodID(env, dataInputStream, "read", "([BII)I");

            if (MID_read == NULL) {
                printf("\nfread_file_func(): read() method not found");
            } else {
                jbyteArray byteArray = (*env)->NewByteArray(env, size);
                bytesRead = (*env)->CallIntMethod(env, javaStream, MID_read,
                                                  byteArray, 0, size);
                (*env)->GetByteArrayRegion(env, byteArray, 0, bytesRead,
                                           (jbyte *) buf);
                printf("\nNumber of bytes read: %ld\n", (long) bytesRead);
            }
            return bytesRead;
        }

        // the native function exposed to Java, declared as a static method
        // dataStream is of type java.io.DataInputStream.
        // zipClass is of type ZipInputFormat
        JNIEXPORT void JNICALL Java_ZipInputFormat_display
        (JNIEnv *env, jclass zipClass, jobject dataStream)
        {
            unsigned char *buf = (unsigned char *) malloc(sizeof(unsigned char) * 1024 * 64);
            fread_file_func(env, dataStream, buf, 64 * 1024);
            free(buf);
        }
        Doug Cutting added a comment -

        Right, you can't seek a DataInputStream. Instead use FSDataInputStream, which is seekable.

        Ankur added a comment -

        OK, but I should be able to move the offset to the end of the stream, since the central directory structure of a zip file is at the end.
        Presently, FSDataInputStream.seek() throws an IOException and does not change the stream position if I try to position it past the
        end of the stream, unlike fseek(), which positions the offset at the end of the stream.

        Is there a workaround for this, or is it functionality that needs to be added?

        Doug Cutting added a comment -

        Since the file is being accessed read-only, we can call FileSystem#getStatus(Path).getLen() and pass the file length from Java to C with the FSInputStream when we open the archive. Would that work?

        Arguably we should add a method to Seekable that returns the length, or perhaps adopt the convention that attempts to seek past EOF leave the pointer at EOF, but I don't think that's required for this issue.
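A pure-Java illustration of what passing the length buys: knowing the file length, a reader can walk back from the end to the End Of Central Directory record (signature PK\x05\x06) without ever seeking past EOF. RandomAccessFile stands in for the C-side reader here, and the class name is mine:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Locate the End Of Central Directory record by scanning backward from
// the known file length, then read the total entry count out of it.
public class EocdLocator {

    // The total-entry-count field sits at offset 10 of the EOCD record.
    static int entryCountFromEocd(File zip) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(zip, "r")) {
            long len = raf.length();
            // EOCD is 22 bytes plus an optional comment of up to 64 KB.
            long stop = Math.max(0, len - 22 - 0xFFFF);
            for (long pos = len - 22; pos >= stop; pos--) {
                raf.seek(pos);
                if (raf.read() == 0x50 && raf.read() == 0x4b
                        && raf.read() == 0x05 && raf.read() == 0x06) {
                    raf.seek(pos + 10);            // total entry count field
                    int lo = raf.read(), hi = raf.read();
                    return lo | (hi << 8);         // little-endian short
                }
            }
        }
        return -1; // no EOCD found: not a zip archive
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("eocd", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(tmp))) {
            for (String name : new String[] {"a", "b", "c"}) {
                out.putNextEntry(new ZipEntry(name));
                out.closeEntry();
            }
        }
        System.out.println(entryCountFromEocd(tmp.toFile())); // 3
        Files.delete(tmp);
    }
}
```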

        Ankur added a comment -

        Here is what I did
        1. Implemented JNI callbacks in C that callback Java for open, read, close, seek and tell on a FSDataInputStream.
        2. Implemented some JAVA test code to verify that callbacks work correctly.
        3. Made changes to existing Makefile of minizip to compile and build my C code as shared object.
        4. Placed the ".so" file in $LD_LIBRARY_PATH directory.

        The integration was successful and worked beautifully. The callbacks worked perfectly to ensure that the zip file opened
        as an FSDataInputStream was opened and read correctly.

        However, sigh. I found that the minizip parsing code didn't work correctly for zip files > 2 GB.

        The code uses uLong (unsigned long, 4 bytes) instead of jlong (signed long long, 8 bytes).
        Replacing uLong with jlong wouldn't work, as the code performs a lot of bit-shifting operations. (I tried this.)

        Also, the parsing code relies on directory-structure entries being in 32-bit format and will require RE-WORK
        based upon knowledge of 64-bit entries, keeping in mind backward compatibility with 32-bit entries.

        Note: the I/O callback APIs implemented by me make use of jlong in read(), seek() and tell().

        QUESTION: Is the RE-WORK really required, or is there a workaround that I am missing?

        Ankur added a comment -

        Small correction:
        =============
        The parsing code works for zip archives up to 4 GB (not 2 GB).
        It fails to process zip files larger than 4 GB correctly.

        After a little more research I figured out that Zip64 format support (for files > 4 GB)
        is not presently implemented in the minizip code.

        So it looks like, if we need support for files > 4 GB, the minizip parsing and reading code
        would definitely require re-work. In other words, the minizip code would need to be "extended"
        to support the Zip64 format.

        This in turn further increases the scope of work.

        Any suggestions or recommendations?

        Doug Cutting added a comment -

        It looks like minizip is out, then. The unzip code is based on a file descriptor, but there are only 35 lines that touch that file descriptor, so it might not be too hard to modify it to read from something else. But then we have to maintain a branched version of that. Sigh.

        Ankur added a comment -

        > ...so it might not be too hard to modify it to read from something else.
        Actually, I already spent a fair amount of time setting things up and adding new code (I/O APIs that can be plugged into the unzip code) so that a zip file name is passed from Java to a native C call, which then uses unzip APIs to do open/read/seek operations on it.

        What's different is that my custom I/O APIs are used to construct the I/O function-pointer structure that is passed to the unzip APIs. The custom I/O APIs are responsible for making Java callbacks whenever the unzip APIs request an I/O operation through them.

        My concern is not that part, but the APIs of unzip.c that do all the low-level bit shifting, directory parsing, reading and uncompressing, since it is that part which fails for files > 4 GB. Modifying that part would mean two things:

        1. We would be extending the unzip code in minizip to support the ZIP64 format for our needs, and we would be required to maintain it.
        2. Any modification would require decent knowledge of the format and would need to ensure backward compatibility with the older ZIP format.

        So the question here is: do we go ahead and extend the minizip code for the ZIP64 format? (This would be quite involved, I think.)
        Or do we stick with the present limitation of 4 GB and schedule it for later?

        Doug Cutting added a comment -

        Sorry, I wasn't clear. I was thinking we might try using, instead of minizip, the source code for the unzip command line executable, http://www.info-zip.org/UnZip.html, which uses file-io directly, but not in too many places.

        Ankur added a comment -

        Thanks for clarifying . But even Unzip in its present release 5.52 does not serve our purpose of supporting large files ( > 4GB) since it does not take care of extra headers in Zip64 format that are used specifically for supporting large archives.

        This is well documented and clearly stated in the FAQ, http://www.info-zip.org/FAQ.html#limits.
        An excerpt from that page:

        "Also note that in August 2001, PKWARE released PKZIP 4.50 with support for large files and archives via a pair of new header types, "PK\x06\x06" and "PK\x06\x07". So far these headers are undocumented, but most of their fields are fairly obvious. We don't yet know when Zip and UnZip will support this extension to the format. In the short term, it is possible to improve Zip and UnZip's capabilities slightly on certain Linux systems (and probably other Unix-like systems) by recompiling with the -DLARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 options. This will allow the utilities to handle uncompressed data files greater than 2 GB in size, as long as the total size of the archive containing them is less than 2 GB."

        This leaves us with few options: either we find something else that implements the Zip64 extension, under a license that permits including it in our code, or we implement these extensions ourselves in the minizip code, which we would then have to test extensively and maintain. Sigh.
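        As a side note, the two header signatures mentioned in the excerpt are easy to recognize programmatically. The following is a rough, self-contained Java sketch (illustrative only, not part of any patch; class and method names are my own) that distinguishes the classic end-of-central-directory record ("PK\x05\x06") from the Zip64 one ("PK\x06\x06") by scanning a buffer such as the tail of an archive:

```java
public class Zip64Probe {
    // Little-endian 32-bit record signatures from the ZIP specification.
    static final long EOCD_SIG       = 0x06054b50L; // "PK\x05\x06" (classic)
    static final long ZIP64_EOCD_SIG = 0x06064b50L; // "PK\x06\x06" (Zip64)
    static final long ZIP64_LOC_SIG  = 0x07064b50L; // "PK\x06\x07" (Zip64 locator)

    // Reads an unsigned 32-bit little-endian value.
    static long readLE32(byte[] b, int off) {
        return (b[off] & 0xffL)
             | ((b[off + 1] & 0xffL) << 8)
             | ((b[off + 2] & 0xffL) << 16)
             | ((b[off + 3] & 0xffL) << 24);
    }

    /** Scans a buffer for a Zip64 end-of-central-directory signature. */
    static boolean hasZip64Eocd(byte[] tail) {
        for (int i = 0; i + 4 <= tail.length; i++) {
            if (readLE32(tail, i) == ZIP64_EOCD_SIG) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] classic = {0x50, 0x4b, 0x05, 0x06};
        byte[] zip64   = {0x50, 0x4b, 0x06, 0x06};
        System.out.println(hasZip64Eocd(classic)); // false
        System.out.println(hasZip64Eocd(zip64));   // true
    }
}
```

        A parser that finds only the classic signature on a > 4GB archive would then know its offsets may be truncated 32-bit values.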

        Ankur added a comment -

        Or, as another option, we can have our own implementation of ZipInputStream purely in Java (no native code), based upon Sun's java.util.zip.ZipInputStream, with additions and modifications to:

        1. Work with a Seekable stream (like FSDataInputStream).
        2. Read only the central directory structure to obtain file information, instead of sequentially
        reading the whole archive (as Sun's implementation does).
        3. Make sure Zip64 headers are processed correctly.

        This approach has the following advantages:

        1. A pure Java Zip stream parser supporting Zip64 format (No native code).
        2. Support for Random as well as Sequential access.
        3. No dependency on any external components.
        4. Ease of modification for adding append when HDFS provides this facility.
        5. Possibility of donating our parser as a Zip64 compliant java zip parser to open source in future.

        The above of course requires a lot of work, but given the advantages I feel it's worth it.

        Your opinion ?
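
        To make point 2 concrete, here is a rough, self-contained sketch (not the proposed implementation; names and layout are illustrative) of reading the classic end-of-central-directory record to learn the entry count and central-directory offset without streaming through the archive. It uses java.util.zip only to build a small test archive in memory:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class EocdDemo {
    // Classic end-of-central-directory signature "PK\x05\x06", little-endian.
    static final long EOCD_SIG = 0x06054b50L;

    static int readLE16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }

    static long readLE32(byte[] b, int off) {
        return readLE16(b, off) | ((long) readLE16(b, off + 2) << 16);
    }

    /** Scans backwards for the EOCD record; returns its offset, or -1. */
    static int findEocd(byte[] zip) {
        // The EOCD record is at least 22 bytes and may be followed by a comment.
        for (int i = zip.length - 22; i >= 0; i--) {
            if (readLE32(zip, i) == EOCD_SIG) return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny two-entry archive in memory.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("hello " + name).getBytes("UTF-8"));
                zos.closeEntry();
            }
        }
        byte[] zip = bos.toByteArray();

        int eocd = findEocd(zip);
        int totalEntries = readLE16(zip, eocd + 10); // total entry count field
        long cdOffset = readLE32(zip, eocd + 16);    // central directory offset field
        System.out.println(totalEntries + " entries, central dir at " + cdOffset);
    }
}
```

        With a seekable stream, the same backwards scan could be done over just the last few kilobytes of a multi-gigabyte archive, which is what makes split construction cheap.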

        Doug Cutting added a comment -

        One of the major attractions of the zip format for Hadoop is that it provides interoperability with standard tools. But if we generate >4GB archives that shell tools cannot access, interoperability is broken. Folks might as well then use SequenceFile or some other Hadoop-specific format. So, until standard shell tools support access to >4GB zip archives, I see little motivation for Hadoop to support this.

        Ankur added a comment -

        So do we wait for standard tools to support files > 4 GB before making a Zip InputFormat available in Hadoop?

        Ankur added a comment -

        Also, I would be thankful if you could recommend other bugs/issues that I can fix to make useful contributions.

        Devaraj Das added a comment -

        Cancelling the patch since this work is not complete yet and may require further discussion.

        Grant Mackey added a comment -

        I currently have a use for this cancelled patch. Are there bits of the code that need to be modified for it to run properly on Hadoop 0.17, or should I be able to drop them into the mapred directory and go?

        Ankur added a comment -

        There are 2 problems with this patch.

        1. It does not split the zip files efficiently. This is because there is no way in Java to construct a zip input stream that permits random seeks given a zip entry name.
        2. Java's handling of large zip files is not robust.

        The plan was to modify the code to make use of an external zip parsing library that is compatible with the Apache license. It was decided to use the zip/unzip (standard shell tools) code via JNI, but support for large zip files is still missing from unzip (Zip 3.0 is out with large zip file support).

        So at the moment, we are just waiting for UnZip 6.0 to come out, and will modify the code accordingly.
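
        For what it's worth on problem 1: java.util.zip.ZipFile (unlike ZipInputStream) does do a central-directory lookup by entry name, but its constructor requires a local File, which is exactly why it cannot be driven by an HDFS-backed seekable stream. A minimal sketch of that local-file behaviour (class and method names are illustrative):

```java
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipFileLookupDemo {
    /** Writes a small three-entry archive to a temp file. */
    static File makeSampleZip() throws Exception {
        File f = File.createTempFile("demo", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(f.toPath()))) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes("UTF-8")); // entry content == entry name
                zos.closeEntry();
            }
        }
        return f;
    }

    /** Reads one entry by name via the central directory, no sequential scan. */
    static String readEntry(File zipFile, String name) throws Exception {
        try (ZipFile zf = new ZipFile(zipFile)) {   // requires a local File
            ZipEntry e = zf.getEntry(name);          // central-directory lookup
            byte[] buf = new byte[(int) e.getSize()];
            try (InputStream in = zf.getInputStream(e)) {
                int n = 0;
                while (n < buf.length) n += in.read(buf, n, buf.length - n);
            }
            return new String(buf, "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        File f = makeSampleZip();
        System.out.println(readEntry(f, "c.txt")); // prints "c.txt"
    }
}
```

        A pure-Java parser over a seekable stream would essentially replicate what ZipFile does here, minus the local-File requirement.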

        steve_l added a comment -

        The most tested/stable Apache-licensed Java unzip code is in Ant's codebase; you can either take/fork that or try and get the changes back in, which, with suitable tests, I am sure will be happily accepted.

        Patrick Angeles added a comment -

        Any updates on this issue? What's the current thinking on shell tools + JNI versus Ant's unzip code? Anything I can do to contribute? Regards...


          People

          • Assignee:
            indrajit
          • Reporter:
            Doug Cutting
          • Votes:
            9
          • Watchers:
            23
