Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.3
    • Component/s: general, metadata
    • Labels:
    • Environment:

      The default ExternalEmbedder requires that sed be installed.

      Description

      This patch defines and implements the concept of embedding tika metadata into a file stream, the reverse of extraction.

      In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder implementation meant to be extended or configured are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.

      In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then verify the operation by parsing the resulting stream.

      1. embed_20121029.diff
        36 kB
        Ray Gauss II
      2. embed.diff
        33 kB
        Ray Gauss II
      3. tika-core-embed-patch.txt
        19 kB
        Ray Gauss II
      4. tika-parsers-embed-patch.txt
        8 kB
        Ray Gauss II

        Issue Links

          Activity

          Hide
          Jukka Zitting added a comment -

          I'd like to have a concrete use case for introducing a new concept like this. What exact need are you addressing? Also, are there other existing tools that could be used instead of coming up with a new API. This seems like a pretty significant new feature, so it would be best if we did it right from the beginning.

          Design-wise it would be better for the embed() method to write it's results to an OutputStream given as an argument (just like the Parser interface takes a ContentHandler argument). Returning an InputStream brings up all sorts of issues about timing, error reporting, etc.

          Show
          Jukka Zitting added a comment - I'd like to have a concrete use case for introducing a new concept like this. What exact need are you addressing? Also, are there other existing tools that could be used instead of coming up with a new API. This seems like a pretty significant new feature, so it would be best if we did it right from the beginning. Design-wise it would be better for the embed() method to write it's results to an OutputStream given as an argument (just like the Parser interface takes a ContentHandler argument). Returning an InputStream brings up all sorts of issues about timing, error reporting, etc.
          Hide
          Ray Gauss II added a comment -

          I think there are many use cases for embedding metadata in addition to extracting, but for us specifically: we're using extensions to Alfresco to enable users to modify or enter new metadata via its web interface which then triggers an Alfresco metadata embedder which will use these tika additions to do the work of actually writing the metadata to the file.

          We currently focus on images and embedding IPTC and XMP metadata but I'd envision people would have similar needs embedding things like ID3 in audio, MPEG-7 for video, etc. in all sorts of clients and apps.

          I'm sure there are other existing tools but Tika is, quite frankly, pretty sweet, and these 'write' capabilities seem like a perfect fit for tika's existing 'read' features, and the metadata tags and concepts are already present and well organized.

          I agree that great care should be taken in implementing this and I've tried to structure things so that they follow precedence set on the parsing side but I'm pretty new to the project.

          I'll have a look at refactoring for OutputStream as an argument.

          Thanks for taking a look!

          Show
          Ray Gauss II added a comment - I think there are many use cases for embedding metadata in addition to extracting, but for us specifically: we're using extensions to Alfresco to enable users to modify or enter new metadata via its web interface which then triggers an Alfresco metadata embedder which will use these tika additions to do the work of actually writing the metadata to the file. We currently focus on images and embedding IPTC and XMP metadata but I'd envision people would have similar needs embedding things like ID3 in audio, MPEG-7 for video, etc. in all sorts of clients and apps. I'm sure there are other existing tools but Tika is, quite frankly, pretty sweet, and these 'write' capabilities seem like a perfect fit for tika's existing 'read' features, and the metadata tags and concepts are already present and well organized. I agree that great care should be taken in implementing this and I've tried to structure things so that they follow precedence set on the parsing side but I'm pretty new to the project. I'll have a look at refactoring for OutputStream as an argument. Thanks for taking a look!
          Hide
          Chris A. Mattmann added a comment -
          • push out to 1.2
          Show
          Chris A. Mattmann added a comment - push out to 1.2
          Hide
          Chris A. Mattmann added a comment -
          • push to 1.3
          Show
          Chris A. Mattmann added a comment - push to 1.3
          Hide
          Chris A. Mattmann added a comment -
          • push to 1.3
          Show
          Chris A. Mattmann added a comment - push to 1.3
          Hide
          Ray Gauss II added a comment -

          Attached is a newer patch which:

          • Adds an Embedder interface, similar to Parser, which defines getSupportedEmbedTypes and an embed method
          • Adds a base ExternalEmbedder implementation of the Embedder interface, similar to ExternalParser, which can call a command line executable, the default being sed, to perform embedding
          • Adds a base ExternalEmbedderTest which 'embeds' lines in a text file then uses a TXTParser to verify the expected embedded metadata exists

          The embed methods have been refactored to take an output stream argument for writing to as suggested in the past here.

          Unless anyone sees an issue with the concepts or approach I'll commit in a few days.

          Show
          Ray Gauss II added a comment - Attached is a newer patch which: Adds an Embedder interface, similar to Parser, which defines getSupportedEmbedTypes and an embed method Adds a base ExternalEmbedder implementation of the Embedder interface, similar to ExternalParser, which can call a command line executable, the default being sed, to perform embedding Adds a base ExternalEmbedderTest which 'embeds' lines in a text file then uses a TXTParser to verify the expected embedded metadata exists The embed methods have been refactored to take an output stream argument for writing to as suggested in the past here. Unless anyone sees an issue with the concepts or approach I'll commit in a few days.
          Hide
          Jörg Ehrlich added a comment -

          Hi Ray,

          I think it would be great if Tika could also write Metadata back to files and it would be great to start on this rather sooner than later.
          But I have a couple of comments regarding your proposed implementation:

          1) Right now the Parsers do both content and metadata extraction. The proposed embedder does only Metadata embedding, which is fine because updating of content would be out of scope for Tika.
          But if we introduce separate APIs to embed just metadata I think it would make sense to also introduce APIs to only extract metadata. Actually at Adobe we had stop using Tika to retrieve Metadata from specific file formats because it always parses the whole content which is simply too heavy an operation to scale in a larger system.
          So I planned to get started on a new API and adjustments to parsers to just retrieve Metadata from files, but did not have time for this, yet. I guess it would make sense to synchronize these two new APIs, right?
          Being able to just parse Metadata from files is actually also very important for the embedding of it, which I will explain further down.

          2) Your documentation does not really specify in detail the behavior of the metadata update that should happen.
          Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object? Or does it only update those properties that are provided in the Metadata object? How do I delete properties then? Do I make the property empty? But empty properties are in most metadata containers a valid property value and should not delete the property.
          Where does the embedding take place? A lot of file formats have several metadata containers with similar properties. Does the embed method update all of them? Or just the ones, the parsers were looking at? What happens in case of inconsistencies? Do you read/write from specific fields or do you reconcile all of them together?
          What happens for properties where the file format specific fields have a fixed length or different encodings? Do you just write as much as possible and the rest is simply ignored?

          For all such questions, you have to think about whether it makes sense to provide the client with the ability to either configure the embedder or provide a callback API for the client to decide if specific scenarios arise or if the embedder should always just do a best guess for the client.

          In any such case, it is usually for the client important to get the original metadata from the file, before writing it back, so that no properties are wrongly deleted or changed. But even more so it is important for the Embedder as it would in most cases have to read the metadata anyway, in order to know how to update the file properly. It usually has to check if an in-place update of metadata can happen or if the whole file has to be restructured because the metadata chunks have grown too large to fit where they were before.
          That's why I think it would be important to have a get-only-metadata API and Parser capabilities available, before starting writing it back.

          3) This also leads me to the topic of error recovery and safe updating of files. I think the documentation should be more clear about what the Embedder will do in case of an error and what is expected by the client.
          There are all sorts of reasons the embedding could fail. If that happens, the original file usually ends up being corrupt and lost for the user. So it usually makes sense (for samller files) to do a safe update, which means writing the update in a new file and then swap it with the original one, after the update was successful.
          But what about scenarios where a partial update is possible? You often have files where just specific metadata sections are corrupt because some tool did not read the spec and wrote it wrongly. But the rest of the file is still ok, so other parts could still be updated. Do you want to provide a callback API for the client to be able to react to error scenarios and decide what he wants to do? The embedder could do a best guess action, but that is usually quite dangerous for the user's files.

          4) I take it that the expectation is that all parsers could also potentially implement the Embedder interface, so that both reading and writing is in one hand? Otherwise you probably end up with all sorts of inconsistencies between the two implementations regarding what metadata fields are read from where and what should be updated when, etc.

          5) Why do you pass in an InputStream? That would mean the Embedder has to open up an own OutputStream to be able to write. That would imply that Tika knows how to properly create OutputStreams in the client's environment. Wouldn't it be better to leave the client in control here? And why do you want to return the InputStream?

          6) I also agree with Jukka's comments that for such an important new feature we should spend some more thoughts on this. I think your proposal works ok for the external embedder scenario but I am not so sure for other scenarios.

          Sorry that I did not speak up earlier. This issue has been around for quite a while.
          Regards
          Jörg

          Show
          Jörg Ehrlich added a comment - Hi Ray, I think it would be great if Tika could also write Metadata back to files and it would be great to start on this rather sooner than later. But I have a couple of comments regarding your proposed implementation: 1) Right now the Parsers do both content and metadata extraction. The proposed embedder does only Metadata embedding, which is fine because updating of content would be out of scope for Tika. But if we introduce separate APIs to embed just metadata I think it would make sense to also introduce APIs to only extract metadata. Actually at Adobe we had stop using Tika to retrieve Metadata from specific file formats because it always parses the whole content which is simply too heavy an operation to scale in a larger system. So I planned to get started on a new API and adjustments to parsers to just retrieve Metadata from files, but did not have time for this, yet. I guess it would make sense to synchronize these two new APIs, right? Being able to just parse Metadata from files is actually also very important for the embedding of it, which I will explain further down. 2) Your documentation does not really specify in detail the behavior of the metadata update that should happen. Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object? Or does it only update those properties that are provided in the Metadata object? How do I delete properties then? Do I make the property empty? But empty properties are in most metadata containers a valid property value and should not delete the property. Where does the embedding take place? A lot of file formats have several metadata containers with similar properties. Does the embed method update all of them? Or just the ones, the parsers were looking at? What happens in case of inconsistencies? Do you read/write from specific fields or do you reconcile all of them together? What happens for properties where the file format specific fields have a fixed length or different encodings? Do you just write as much as possible and the rest is simply ignored? For all such questions, you have to think about whether it makes sense to provide the client with the ability to either configure the embedder or provide a callback API for the client to decide if specific scenarios arise or if the embedder should always just do a best guess for the client. In any such case, it is usually for the client important to get the original metadata from the file, before writing it back, so that no properties are wrongly deleted or changed. But even more so it is important for the Embedder as it would in most cases have to read the metadata anyway, in order to know how to update the file properly. It usually has to check if an in-place update of metadata can happen or if the whole file has to be restructured because the metadata chunks have grown too large to fit where they were before. That's why I think it would be important to have a get-only-metadata API and Parser capabilities available, before starting writing it back. 3) This also leads me to the topic of error recovery and safe updating of files. I think the documentation should be more clear about what the Embedder will do in case of an error and what is expected by the client. There are all sorts of reasons the embedding could fail. If that happens, the original file usually ends up being corrupt and lost for the user. So it usually makes sense (for samller files) to do a safe update, which means writing the update in a new file and then swap it with the original one, after the update was successful. But what about scenarios where a partial update is possible? You often have files where just specific metadata sections are corrupt because some tool did not read the spec and wrote it wrongly. But the rest of the file is still ok, so other parts could still be updated. Do you want to provide a callback API for the client to be able to react to error scenarios and decide what he wants to do? The embedder could do a best guess action, but that is usually quite dangerous for the user's files. 4) I take it that the expectation is that all parsers could also potentially implement the Embedder interface, so that both reading and writing is in one hand? Otherwise you probably end up with all sorts of inconsistencies between the two implementations regarding what metadata fields are read from where and what should be updated when, etc. 5) Why do you pass in an InputStream? That would mean the Embedder has to open up an own OutputStream to be able to write. That would imply that Tika knows how to properly create OutputStreams in the client's environment. Wouldn't it be better to leave the client in control here? And why do you want to return the InputStream? 6) I also agree with Jukka's comments that for such an important new feature we should spend some more thoughts on this. I think your proposal works ok for the external embedder scenario but I am not so sure for other scenarios. Sorry that I did not speak up earlier. This issue has been around for quite a while. Regards Jörg
          Hide
          Ray Gauss II added a comment -

          Hi Jörg,

          Note that the embed.diff file attached to the issue is more current and replaces the previous patch.txt files. I've also changed just a few things since posting embed.diff, primarily around error handling. I'll post another diff soon with Javadoc additions mentioned below.

          1) I'm not sure exactly what you mean here. The Parser interface only guarantees a parse method and supported types. It says nothing about requiring the entire content to be extracted by the implementation. The parser interface also makes no specification about how the given input stream must be read or processed, so each implementation can do that however it sees fit. Similarly the Embedder.embed method says nothing about requiring or preventing content from being updated, so if a particular embedder implementation wants to update the content itself I suppose there's no reason it couldn't.

          2) This is intentionally somewhat vague (but perhaps too much so) as each embedder may implement this slightly differently, though we should have a suggested approach, and in general I think that approach should favor preserving the source file's metadata unless explicitly specified. I will add some of this to the Javadoc but for your specific questions I think the answers would be:

          • Q: Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object?
          • A: Embedder implementations should only attempt to update metadata fields present in the given Metadata object
          • Q: How are empty properties set?
          • A: Embedder implementations should set properties as empty when the corresponding field in the Metadata object is an empty string, i.e. ""
          • Q: How do I delete properties?
          • A: Embedder implementations should nullify or delete properties corresponding to fields with a null value in the given Metadata object.
          • Q: Where does the embedding take place?
          • A: That's up to the embedder implementation and particular file format.
          • Q: Does the embed method update properties in all metadata containers?
          • A: Embedder implementations should set the property corresponding to a particular field in the given Metadata object in all metadata containers whenever possible and appropriate for the file format at the time. If a particular metadata container falls out of use and/or is superseded by another (such as IIC vs XMP for IPTC) it is up to the implementation to decide if and when to cease embedding in the alternate container.
          • Q: What happens for properties where the file format specific fields have a fixed length or different encodings?
          • A: Embedder implementations should attempt to embed as much of the metadata as accurately as possible. An implementation may choose a strict approach and throw an exception if a value to be embedded exceeds the length allowed or may choose to truncate the value.

          For that last one we could consider adding a second embed method to Embedder which also accepts a boolean isStrict parameter which would allow a single implementation to operate in a mode where it would throw exceptions on bad data vs. doing something like truncating. Implementations could always implement that themselves so I'm not sure we need it in the interface.

          3 and 5) The client is in control of the output stream as the client is responsible for creating it and passing it to the embed method. The Embedder needs the given input stream to read the source data and writes the final data with metadata embedded to the given output stream. As such, consumers of the embed method are dictating what that output stream is, which will probably be a temp file in most cases, and the client can refrain from an writing to the actual source file in the case of receiving an exception. See the ExternalEmbedderTest for an example of creating a temp file output stream for the embedder to write to.

          4) Yes, parser implementations could choose to implement the Embedder interface as well. That was the reason for naming getSupportedEmbedTypes differently than Parser's existing getSupportedTypes method.

          If the above doesn't answer your concerns I'm more than happy to flesh things out further.

          Regards,

          Ray

          Show
          Ray Gauss II added a comment - Hi Jörg, Note that the embed.diff file attached to the issue is more current and replaces the previous patch.txt files. I've also changed just a few things since posting embed.diff, primarily around error handling. I'll post another diff soon with Javadoc additions mentioned below. 1) I'm not sure exactly what you mean here. The Parser interface only guarantees a parse method and supported types. It says nothing about requiring the entire content to be extracted by the implementation. The parser interface also makes no specification about how the given input stream must be read or processed, so each implementation can do that however it sees fit. Similarly the Embedder.embed method says nothing about requiring or preventing content from being updated, so if a particular embedder implementation wants to update the content itself I suppose there's no reason it couldn't. 2) This is intentionally somewhat vague (but perhaps too much so) as each embedder may implement this slightly differently, though we should have a suggested approach, and in general I think that approach should favor preserving the source file's metadata unless explicitly specified. I will add some of this to the Javadoc but for your specific questions I think the answers would be: Q: Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object? A: Embedder implementations should only attempt to update metadata fields present in the given Metadata object Q: How are empty properties set? A: Embedder implementations should set properties as empty when the corresponding field in the Metadata object is an empty string, i.e. "" Q: How do I delete properties? A: Embedder implementations should nullify or delete properties corresponding to fields with a null value in the given Metadata object. Q: Where does the embedding take place? A: That's up to the embedder implementation and particular file format. Q: Does the embed method update properties in all metadata containers? A: Embedder implementations should set the property corresponding to a particular field in the given Metadata object in all metadata containers whenever possible and appropriate for the file format at the time. If a particular metadata container falls out of use and/or is superseded by another (such as IIC vs XMP for IPTC) it is up to the implementation to decide if and when to cease embedding in the alternate container. Q: What happens for properties where the file format specific fields have a fixed length or different encodings? A: Embedder implementations should attempt to embed as much of the metadata as accurately as possible. An implementation may choose a strict approach and throw an exception if a value to be embedded exceeds the length allowed or may choose to truncate the value. For that last one we could consider adding a second embed method to Embedder which also accepts a boolean isStrict parameter which would allow a single implementation to operate in a mode where it would throw exceptions on bad data vs. doing something like truncating. Implementations could always implement that themselves so I'm not sure we need it in the interface. 3 and 5) The client is in control of the output stream as the client is responsible for creating it and passing it to the embed method. The Embedder needs the given input stream to read the source data and writes the final data with metadata embedded to the given output stream. As such, consumers of the embed method are dictating what that output stream is, which will probably be a temp file in most cases, and the client can refrain from an writing to the actual source file in the case of receiving an exception. See the ExternalEmbedderTest for an example of creating a temp file output stream for the embedder to write to. 4) Yes, parser implementations could choose to implement the Embedder interface as well. That was the reason for naming getSupportedEmbedTypes differently than Parser's existing getSupportedTypes method. If the above doesn't answer your concerns I'm more than happy to flesh things out further. Regards, Ray
          Hide
          Ray Gauss II added a comment -

          Newer patch with slightly different error handling, better embed test values, and additional Javadoc discussed in the JIRA issue.

          Show
          Ray Gauss II added a comment - Newer patch with slightly different error handling, better embed test values, and additional Javadoc discussed in the JIRA issue.
          Hide
          Jörg Ehrlich added a comment -

          Hi Ray,

          thanks for the additional explanations and the new patch. I had indeed been looking at the wrong one before.
          At the moment I don't have any additional comments or concerns.

          Thanks
          Jörg

          Show
          Jörg Ehrlich added a comment - Hi Ray, thanks for the additional explanations and the new patch. I had indeed been looking at the wrong one before. At the moment I don't have any additional comments or concerns. Thanks Jörg
          Hide
          Ray Gauss II added a comment -

          I'll commit tomorrow unless there are objections then.

          Regards,

          Ray

          Show
          Ray Gauss II added a comment - I'll commit tomorrow unless there are objections then. Regards, Ray
          Hide
          Ray Gauss II added a comment -

          Committed in r1403941.

          Show
          Ray Gauss II added a comment - Committed in r1403941.
          Hide
          Jukka Zitting added a comment -

          There's a few problems with the implementation.

          • The ExternalEmbedderTest fails in a plain Windows environment since it can't find sed. I added a workaround in revision 1411238 that simply disables the test on Windows.
          • It would be better if ExternalEmbeddedTest was located in tika-core along with the ExternalEmbedder class itself. The use of TXTParser in the test case seems unnecessary.
          • More generally the test case is quite complicated. Is it being reused elsewhere, or can we simplify it? I'd just drop all the extra logging, error handling and flag variables.
          • The ExternalEmbedder class also seems quite complicated, though I notice much of it comes from ExternalParser. Can we for example refactor the common bits to a shared base class?
          • See the ExternalParser class for how you can (and should) use the TemporaryResources class to avoid all the complex cleanup logic. Used properly, the dispose() method takes care of all that.
          • It's usually a bad idea to capture InterruptedException and just ignore it. Throwing the exception (possibly wrapped into a TikaException) is probably a better approach.
          Show
          Jukka Zitting added a comment - There's a few problems with the implementation. The ExternalEmbedderTest fails in a plain Windows environment since it can't find sed . I added a workaround in revision 1411238 that simply disables the test on Windows. It would be better if ExternalEmbeddedTest was located in tika-core along with the ExternalEmbedder class itself. The use of TXTParser in the test case seems unnecessary. More generally the test case is quite complicated. Is it being reused elsewhere, or can we simplify it? I'd just drop all the extra logging, error handling and flag variables. The ExternalEmbedder class also seems quite complicated, though I notice much of it comes from ExternalParser. Can we for example refactor the common bits to a shared base class? See the ExternalParser class for how you can (and should) use the TemporaryResources class to avoid all the complex cleanup logic. Used properly, the dispose() method takes care of all that. It's usually a bad idea to capture InterruptedException and just ignore it. Throwing the exception (possibly wrapped into a TikaException) is probably a better approach.
          Hide
          Ray Gauss II added a comment -
          • The ExternalEmbedderTest fails in a plain Windows environment since it can't find sed. I added a workaround in revision 1411238 that simply disables the test on Windows.

          Great, thanks.

          • It would be better if ExternalEmbeddedTest was located in tika-core along with the ExternalEmbedder class itself. The use of TXTParser in the test case seems unnecessary.
          • More generally the test case is quite complicated. Is it being reused elsewhere, or can we simplify it? I'd just drop all the extra logging, error handling and flag variables.

          The ExternalEmbedderTest is indeed meant to be extended, an example can be found in the tika-exiftool project [1].

          The use of TXTParser in the test without that context does seem like overkill, but in general I think we'll want to encourage tests that verify through a relevant parser that the metadata was embedded properly.

          The test certainly could be simplified, I kept it on the verbose side since it's introducing a new concept.

          • The ExternalEmbedder class also seems quite complicated, though I notice much of it comes from ExternalParser. Can we for example refactor the common bits to a shared base class?

          Embedders aren't required to be parsers and vice versa, and since ExternalParser extends AbstractParser we can't have a common base class. Some methods could probably be made static and moved into utils classes though.

          • See the ExternalParser class for how you can (and should) use the TemporaryResources class to avoid all the complex cleanup logic. Used properly, the dispose() method takes care of all that.

          Yes, I followed the precedence there, but the timing of access and cleanup across the permutations of various streams for outputFromStdOut true or false, input file, output file, stdErr, etc. was a bit trickier than what the ExternalParser has to handle. I'm sure this could be optimized further though.

          • It's usually a bad idea to capture InterruptedException and just ignore it. Throwing the exception (possibly wrapped into a TikaException) is probably a better approach.

          Yes, copied and pasted from ExternalParser and did not give it proper diligence, my mistake. I'll refactor that and the ExternalParser to wrap in a TikaException which is my approach in most cases.

          I'm on holiday for a few weeks soon and not sure I'll be able to make the changes I mentioned before then, but I'll try.

          [1] https://github.com/Alfresco/tika-exiftool/blob/master/src/test/java/org/apache/tika/embedder/exiftool/ExiftoolExternalEmbedderTest.java

          Show
          Ray Gauss II added a comment - The ExternalEmbedderTest fails in a plain Windows environment since it can't find sed . I added a workaround in revision 1411238 that simply disables the test on Windows. Great, thanks. It would be better if ExternalEmbeddedTest was located in tika-core along with the ExternalEmbedder class itself. The use of TXTParser in the test case seems unnecessary. More generally the test case is quite complicated. Is it being reused elsewhere, or can we simplify it? I'd just drop all the extra logging, error handling and flag variables. The ExternalEmbedderTest is indeed meant to be extended, an example can be found in the tika-exiftool project [1] . The use of TXTParser in the test without that context does seem like overkill, but in general I think we'll want to encourage tests that verify through a relevant parser that the metadata was embedded properly. The test certainly could be simplified, I kept it on the verbose side since it's introducing a new concept. The ExternalEmbedder class also seems quite complicated, though I notice much of it comes from ExternalParser. Can we for example refactor the common bits to a shared base class? Embedders aren't required to be parsers and vice versa, and since ExternalParser extends AbstractParser we can't have a common base class. Some methods could probably be made static and moved into utils classes though. See the ExternalParser class for how you can (and should) use the TemporaryResources class to avoid all the complex cleanup logic. Used properly, the dispose() method takes care of all that. Yes, I followed the precedence there, but the timing of access and cleanup across the permutations of various streams for outputFromStdOut true or false, input file, output file, stdErr, etc. was a bit trickier than what the ExternalParser has to handle. I'm sure this could be optimized further though. It's usually a bad idea to capture InterruptedException and just ignore it. Throwing the exception (possibly wrapped into a TikaException) is probably a better approach. Yes, copied and pasted from ExternalParser and did not give it proper diligence, my mistake. I'll refactor that and the ExternalParser to wrap in a TikaException which is my approach in most cases. I'm on holiday for a few weeks soon and not sure I'll be able to make the changes I mentioned before then, but I'll try. [1] https://github.com/Alfresco/tika-exiftool/blob/master/src/test/java/org/apache/tika/embedder/exiftool/ExiftoolExternalEmbedderTest.java
          Hide
          Ray Gauss II added a comment -

          According to a few posts on the subject including one on developerWorks [1] it looks like it's more appropriate to reassert the thread's interrupt status with:

          ...
          } catch (InterruptedException ignore) {
              Thread.currentThread().interrupt();
          }
          ...
          

          rather than refactoring ExternalParser and ExternalEmbedder to re-throw it or wrap in a TikaException.

          I too would prefer ExternalEmbedderTest to be in core, but I do feel that we want to confirm the embedding with a known working parser. Would anyone have issue with moving TXTParser and its test into core? There don't seem to be any issues with dependencies when trying it.

          [1] http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html

          Show
          Ray Gauss II added a comment - According to a few posts on the subject including one on developerWorks [1] it looks like it's more appropriate to reassert the thread's interrupt status with: ... } catch (InterruptedException ignore) { Thread .currentThread().interrupt(); } ... rather than refactoring ExternalParser and ExternalEmbedder to re-throw it or wrap in a TikaException . I too would prefer ExternalEmbedderTest to be in core, but I do feel that we want to confirm the embedding with a known working parser. Would anyone have issue with moving TXTParser and its test into core? There don't seem to be any issues with dependencies when trying it. [1] http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html
          Hide
          Nick Burch added a comment -

          Could you maybe add a simple dummy parser for testing with?

          Another option is to have the test in the parsers package, even though the main code is in core. We have quite a few examples of that, eg some of the mime magic stuff is tested in parsers because that's where the test files live

          Show
          Nick Burch added a comment - Could you maybe add a simple dummy parser for testing with? Another option is to have the test in the parsers package, even though the main code is in core. We have quite a few examples of that, eg some of the mime magic stuff is tested in parsers because that's where the test files live
          Hide
          Jukka Zitting added a comment -

          {{ catch (InterruptedException ignore) { Thread.currentThread().interrupt(); } }}

          Resetting the interrupt status of the thread is a good idea when we can't just re-throw the InterruptedException, but IMHO we should in that case also throw a TikaException to indicate that the requested operation couldn't be completed normally:

          ...
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              throw new TikaException("Interrupted", e);
          }
          ...
          

          When the Process.waitFor() call is interrupted we can't just blindly assume that the process completed normally. Explicitly throwing an exception is a better approach in such cases.

          Show
          Jukka Zitting added a comment - {{ catch (InterruptedException ignore) { Thread.currentThread().interrupt(); } }} Resetting the interrupt status of the thread is a good idea when we can't just re-throw the InterruptedException, but IMHO we should in that case also throw a TikaException to indicate that the requested operation couldn't be completed normally: ... } catch (InterruptedException e) { Thread .currentThread().interrupt(); throw new TikaException( "Interrupted" , e); } ... When the Process.waitFor() call is interrupted we can't just blindly assume that the process completed normally. Explicitly throwing an exception is a better approach in such cases.
          Hide
          Ray Gauss II added a comment -

          @Jukka, I'm new to InterruptedException but from what I've read it seems like it's thrown when a request to stop the thread has been received, not necessarily when something went wrong. So if I understand it correctly, throwing a TikaException would result in the possibility of a behavior where some code asks Tika to stop and Tika blows up with an exception rather than throwing an expected InterruptedException or setting the Thread's interrupt status.

          @Nick, TXTParser seems pretty simple already, I'd hate to duplicate work. The code is currently spread out just as you suggest, with ExternalEmbedder in core and ExternalEmbedderTest in parsers.

          Show
          Ray Gauss II added a comment - @Jukka, I'm new to InterruptedException but from what I've read it seems like it's thrown when a request to stop the thread has been received, not necessarily when something went wrong. So if I understand it correctly, throwing a TikaException would result in the possibility of a behavior where some code asks Tika to stop and Tika blows up with an exception rather than throwing an expected InterruptedException or setting the Thread's interrupt status. @Nick, TXTParser seems pretty simple already, I'd hate to duplicate work. The code is currently spread out just as you suggest, with ExternalEmbedder in core and ExternalEmbedderTest in parsers.
          Hide
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Show
          Chris A. Mattmann added a comment - push out to 1.4
          Hide
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Show
          Chris A. Mattmann added a comment - push out to 1.4
          Hide
          Ray Gauss II added a comment -

          This code is already on trunk.

          Can we re-resolve for 1.3 and open new, 'smaller' issues for 1.4 if there are still specific concerns?

          Show
          Ray Gauss II added a comment - This code is already on trunk. Can we re-resolve for 1.3 and open new, 'smaller' issues for 1.4 if there are still specific concerns?
          Hide
          Chris A. Mattmann added a comment -

          +1 Ray! I just moved anything that wasn't resolved to 1.4 to make the release process go cleaner

          Let's close this one, resolve for 1.3 (go for it!) and then open smaller, focused issues for 1.4 +1

          Show
          Chris A. Mattmann added a comment - +1 Ray! I just moved anything that wasn't resolved to 1.4 to make the release process go cleaner Let's close this one, resolve for 1.3 (go for it!) and then open smaller, focused issues for 1.4 +1

            People

            • Assignee:
              Ray Gauss II
              Reporter:
              Ray Gauss II
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development