TIKA-1765: Some doc and docx files store multiple authors as a semicolon-delimited list

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      It looks like doc and docx files store multiple authors in a single author field, delimited by semicolons. We should parse this value and add each author as a separate metadata value where appropriate.

      Note: when I tried to add an author with a semicolon in the name, the result was two authors, so it doesn't look like there is any escaping going on.

      We should check to see what's going on in the other MS formats and with other metadata items that are allowed to be multivalued in Dublin Core.
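
      A minimal sketch of the parsing proposed above, for illustration only. It is not the code committed for this issue: the class name AuthorSplitter and the method splitAuthors are hypothetical, and trimming whitespace and dropping empty entries are assumptions about the desired behaviour. It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
class AuthorSplitter {

    // Splits a raw author field on ';', trims each piece, and drops empty pieces.
    static List<String> splitAuthors(String raw) {
        List<String> authors = new ArrayList<>();
        if (raw == null) {
            return authors;
        }
        for (String piece : raw.split(";")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                authors.add(trimmed);
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        // Two authors stored in one field
        System.out.println(splitAuthors("Jane Doe; John Smith"));
        // No escaping: a semicolon inside a single name still yields two values
        System.out.println(splitAuthors("Doe; Jane"));
    }
}
```

      Because the stored value carries no escaping, a splitter like this cannot tell a delimiter from a semicolon that is part of a name, which matches the behaviour described in the note above.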

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Would anyone mind if I changed OfficeOpenXMLExtended.MANAGER from Property.externalText to Property.externalTextBag?

        The reason that I'd want to make manager multi-valued is that we can store multiple managers in MSOffice Word, Excel and PPT just as we can store multiple authors (semicolon delimited).

        I tried to find any reference in ECMA to the standard for handling multiple authors, and all examples (that I found) show a single author. There's even less documentation for "manager".

        tallison@mitre.org Tim Allison added a comment -

        r1707427

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #865 (See https://builds.apache.org/job/tika-trunk-jdk1.7/865/)
        TIKA-1765 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1707427)

        • trunk/CHANGES.txt
        • trunk/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java
        • trunk/tika-core/src/main/java/org/apache/tika/metadata/OfficeOpenXMLExtended.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/SummaryExtractor.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • trunk/tika-parsers/src/test/resources/test-documents/testWORD_multi_authors.doc
        • trunk/tika-parsers/src/test/resources/test-documents/testWORD_multi_authors.docx

          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 0
          • Watchers: 2