Details
Improvement
Status: Open
Minor
Resolution: Unresolved
2.0.14
None
Description
As we processed large PDF files, many of which contain large image streams, we wanted to avoid loading the entire streams into memory. Instead, we implemented a mechanism that merely references their location on disk.
We eventually did this by subclassing COSStream and then overriding COSParser.parseCOSStream(COSDictionary) to conditionally create our stream. Here is the code. It is still a work in progress; I have just refactored the entire mechanism.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.util.Map;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSInputStream;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.filter.DecodeOptions;
import org.apache.pdfbox.io.ScratchFile;

public class ReferencedCOSStream extends COSStream {

    //~ Instance members
    boolean isReference = false;
    File reference = null;
    long offset = -1;
    long length = -1;

    //~ Constructors
    private ReferencedCOSStream(final ScratchFile scratchFile) {
        super(scratchFile);
    }

    //~ Methods

    // Copies all dictionary entries of an existing stream into a new referenced stream.
    public static ReferencedCOSStream createFromCOSStream(final COSStream stream) {
        final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
        for (final Map.Entry<COSName, COSBase> entry : stream.entrySet()) {
            out.setItem(entry.getKey(), entry.getValue());
        }
        return out;
    }

    @Override
    public COSInputStream createInputStream(final DecodeOptions options) throws IOException {
        if (this.isReference) {
            final InputStream in = new SlicedFileInputStream(this.reference, this.offset, this.length);
            return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(), options);
        } else {
            return super.createInputStream(options);
        }
    }

    @Override
    public InputStream createRawInputStream() throws IOException {
        if (this.isReference) {
            return new SlicedFileInputStream(this.reference, this.offset, this.length);
        } else {
            return super.createRawInputStream();
        }
    }

    // Writing to the stream replaces the referenced content, so the reference is dropped.
    @Override
    public OutputStream createOutputStream(final COSBase filters) throws IOException {
        this.isReference = false;
        return super.createOutputStream(filters);
    }

    @Override
    public OutputStream createRawOutputStream() throws IOException {
        this.isReference = false;
        return super.createRawOutputStream();
    }

    public void setReference(final File file, final long offset, final long length) {
        this.isReference = true;
        this.reference = file;
        this.offset = offset;
        this.length = length;
        this.setLong(COSName.LENGTH, length);
    }

    //~ Inner Classes
    private class SlicedFileInputStream extends FileInputStream {

        //~ Instance members
        private long index;          // number of bytes of the slice consumed so far
        private final long length;   // total length of the slice

        //~ Constructors
        public SlicedFileInputStream(final File file, final long offset, final long length) throws FileNotFoundException, IOException {
            super(file);
            this.length = length;
            this.skip(offset);       // move to the start of the slice
            this.index = 0;          // the skip above must not count against the slice
        }

        //~ Methods
        @Override
        public int available() throws IOException {
            final long remaining = length - index;
            if (remaining < 0) {
                return 0;
            }
            // Clamp instead of casting so slices larger than 2 GiB do not overflow.
            return (int) Math.min(remaining, Integer.MAX_VALUE);
        }

        // Without this override, single-byte reads would run past the end of the slice.
        @Override
        public int read() throws IOException {
            if (this.available() <= 0) {
                return -1;
            }
            final int b = super.read();
            if (b >= 0) {
                index++;
            }
            return b;
        }

        @Override
        public int read(final byte[] b) throws IOException {
            return this.read(b, 0, b.length);
        }

        @Override
        public int read(final byte[] b, final int off, final int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            final int toRead = Math.min(this.available(), len);
            if (toRead <= 0) {
                return -1;
            }
            // Fixed: honour 'off' (previously always 0) and count only the bytes actually read.
            final int n = super.read(b, off, toRead);
            if (n > 0) {
                index += n;
            }
            return n;
        }

        @Override
        public long skip(final long n) throws IOException {
            index += n;
            return super.skip(n);
        }

        @Override
        public FileChannel getChannel() {
            throw new UnsupportedOperationException("Obtaining a FileChannel is not supported because a correct offset cannot be ensured.");
        }
    }
}
@Override
protected COSStream parseCOSStream(final COSDictionary dic) throws IOException {
    /*
     * This needs to be dic.getItem because when we are parsing, the underlying object might still be null.
     */
    final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));

    COSStream stream = document.createCOSStream(dic);

    // read 'stream'; this was already tested in parseObjectsDynamically()
    readString();
    skipWhiteSpaces();

    if (streamLengthObj == null) {
        if (isLenient) {
            LOG.warn("The stream doesn't provide any stream length, using fallback readUntilEnd, at offset " + source.getPosition());
        } else {
            throw new IOException("Missing length for stream.");
        }
    }

    if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024)) {
        final long streamBegPos = source.getPosition();
        final ReferencedCOSStream refStream = ReferencedCOSStream.createFromCOSStream(stream);
        try {
            readValidStream(null, streamLengthObj);
        } finally {
            stream.setItem(COSName.LENGTH, streamLengthObj);
        }
        refStream.setReference(new File(reference), streamBegPos, source.getPosition() - streamBegPos);
        stream = refStream;
    } else {
        try (final OutputStream out = stream.createRawOutputStream()) {
            if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue())) {
                readValidStream(out, streamLengthObj);
            } else {
                readUntilEndStream(new EndstreamOutputStream(out));
            }
        } finally {
            stream.setItem(COSName.LENGTH, streamLengthObj);
        }
    }

    final String endStream = readString();
    if (endStream.equals("endobj") && isLenient) {
        LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset " + source.getPosition());
        // avoid follow-up warning about missing endobj
        source.rewind(ENDOBJ.length);
    } else if ((endStream.length() > 9) && isLenient && endStream.substring(0, 9).equals(ENDSTREAM_STRING)) {
        LOG.warn("stream ends with '" + endStream + "' instead of 'endstream' at offset " + source.getPosition());
        // unread the "extra" bytes
        source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
    } else if (!endStream.equals(ENDSTREAM_STRING)) {
        throw new IOException("Error reading stream, expected='endstream' actual='" + endStream + "' at offset " + source.getPosition());
    }
    return stream;
}
The class ReferencedCOSStream exposes the underlying data in exactly the same way as COSStream does, but instead of keeping the content in memory, it opens a FileInputStream whenever the content is requested. SlicedFileInputStream builds on FileInputStream and imitates the behaviour of an InputStream that is restricted to this specific chunk of the file.
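To illustrate the slicing idea in isolation, here is a minimal sketch that uses only the JDK. It is not the class from this ticket: the name BoundedFileSlice is hypothetical, and it wraps a FileInputStream through FilterInputStream instead of subclassing it, which is an assumption about one possible alternative design. It tracks the bytes remaining in the slice and reports end-of-stream once they are used up.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;

/** Hypothetical helper: exposes bytes [offset, offset + length) of a file as an InputStream. */
final class BoundedFileSlice extends FilterInputStream {

    private long remaining; // bytes of the slice that have not been consumed yet

    BoundedFileSlice(File file, long offset, long length) throws IOException {
        super(new FileInputStream(file));
        this.remaining = length;
        try {
            // skip() may skip fewer bytes than requested, so loop until the offset is reached.
            long toSkip = offset;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    throw new IOException("Could not seek to offset " + offset);
                }
                toSkip -= skipped;
            }
        } catch (IOException e) {
            in.close();
            throw e;
        }
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int b = in.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the file for more than what is left in the slice.
        int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(in.available(), remaining);
    }
}
```

Opening new BoundedFileSlice(file, 3, 4) on a file that contains the ASCII text 0123456789 yields the four bytes 3456 and then end-of-stream. Because the file handle is wrapped instead of inherited, no FileChannel is exposed, so the getChannel() override is not needed in this variant.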
I needed to expose some APIs for these classes. The method ReferencedCOSStream.createFromCOSStream(COSStream) would be better located in PDDocument, where it could create the stream directly; I just didn't want to modify PDDocument as well.
Right now, encrypted streams are loaded into memory by the SecurityHandler directly after creation. If you want to accept this proposal, it might make sense to move the decryption handling into COSStream and ReferencedCOSStream as well and perform it on request.
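As a rough sketch of what decrypting on request could look like, the following JDK-only example keeps the raw bytes encrypted and attaches a cipher only when a reader opens the stream. Everything in it is an assumption for illustration: RawSource and LazilyDecryptedSource are hypothetical names, and the fixed AES/CBC key and IV handling does not mirror how the SecurityHandler derives keys or reads initialization vectors.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical stand-in for a stream that can reopen its raw, still encrypted bytes. */
interface RawSource {
    InputStream open() throws IOException;
}

/** Hypothetical wrapper: data stays encrypted at rest and is decrypted only while being read. */
final class LazilyDecryptedSource {

    private final RawSource raw;
    private final byte[] key;
    private final byte[] iv;

    LazilyDecryptedSource(RawSource raw, byte[] key, byte[] iv) {
        this.raw = raw;
        this.key = key.clone();
        this.iv = iv.clone();
    }

    /** Opens the raw source and layers decryption on top; nothing is buffered up front. */
    InputStream open() throws IOException {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return new CipherInputStream(raw.open(), cipher);
        } catch (GeneralSecurityException e) {
            throw new IOException("Could not set up decryption", e);
        }
    }
}
```

With this shape, a referenced stream could hand out its file slice as the RawSource, so encrypted content would never need to be copied into memory as a whole.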