Tika / TIKA-845

Check for Existing Value in Multi-Value Fields in XML Metadata Handler

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      The XML abstract metadata handler should check for an existing value in multi-valued fields as well as in simple text fields.

      Similar metadata may be stored in multiple fields in the source, and a developer may choose to map several source fields to the same Tika field. In that case, no check is made for duplicates among the existing delimited values.

      For example, a developer may want to dump any values contained in legacy IPTC keywords and dc:subject into Tika keywords. If IPTC keywords = ['rock','tree','dog'] and dc:subject = ['rock','tree','K9'], then Tika keywords currently ends up as ['rock','tree','dog','rock','tree','K9'] instead of the desired ['rock','tree','dog','K9'].
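
      The desired behaviour in the example above can be sketched roughly as follows. This is a minimal illustration using plain Java collections, not the patch itself: the class MultiValueMetadata and its methods addIfAbsent and getValues are hypothetical names, and the comparison is assumed to be an exact, case-sensitive match.

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      // Hypothetical stand-in for a multi-valued metadata store; not Tika's Metadata class.
      class MultiValueMetadata {
          private final Map<String, List<String>> values = new LinkedHashMap<>();

          // Adds the value only if the field does not already hold an equal value.
          // Returns true when the value was stored, false when it was skipped.
          boolean addIfAbsent(String field, String value) {
              if (value == null || value.trim().isEmpty()) {
                  return false; // nothing worth storing
              }
              String trimmed = value.trim();
              List<String> existing = values.computeIfAbsent(field, k -> new ArrayList<>());
              if (existing.contains(trimmed)) {
                  return false; // duplicate of an existing value
              }
              existing.add(trimmed);
              return true;
          }

          // Returns a copy of the values held for the field, in insertion order.
          List<String> getValues(String field) {
              return new ArrayList<>(values.getOrDefault(field, new ArrayList<>()));
          }

          public static void main(String[] args) {
              MultiValueMetadata metadata = new MultiValueMetadata();
              // Two source fields mapped onto the same target field.
              for (String keyword : Arrays.asList("rock", "tree", "dog")) {
                  metadata.addIfAbsent("keywords", keyword);
              }
              for (String subject : Arrays.asList("rock", "tree", "K9")) {
                  metadata.addIfAbsent("keywords", subject);
              }
              System.out.println(metadata.getValues("keywords")); // [rock, tree, dog, K9]
          }
      }
      ```

      Whether values that differ only in case or surrounding whitespace should count as duplicates is a design choice this sketch does not settle; it trims whitespace and compares exactly.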

        Activity

        Ray Gauss II added a comment -

        Patch to check for existing multi-value.

        Nick Burch added a comment -

        I think the current logic isn't quite correct. Rather than ending up with a proper multivalued metadata object, we end up with a single string of comma-separated values, which seems wrong to me.

        What I've done is fix up the logic, which allows for what looks to be a cleaner way to check for duplicates.

        I've also fixed up the single unit test that depended on the old comma concatenation, DcXMLParserTest, so that it now checks for the correct multivalued approach.

        I've committed this in r1234873.
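
        To illustrate the difference discussed in this comment, here is a rough sketch contrasting a single comma-joined string with separately stored values. It is not the committed change: the class and method names are hypothetical, and the ", " delimiter is an assumption made only for the example.

        ```java
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        // Hypothetical illustration; not code from the Tika repository.
        class CommaJoinedVersusMultiValued {

            // Concatenation approach: every value is appended to one delimited string.
            // The ", " delimiter is assumed for illustration.
            static String appendJoined(String existing, String value) {
                return (existing == null || existing.isEmpty()) ? value : existing + ", " + value;
            }

            // Multi-valued approach: each value is its own entry,
            // so the duplicate check is a simple list lookup.
            static void addDistinct(List<String> values, String value) {
                if (!values.contains(value)) {
                    values.add(value);
                }
            }

            public static void main(String[] args) {
                String joined = null;
                List<String> multi = new ArrayList<>();
                // Values that themselves contain a comma show why the joined form is lossy.
                for (String creator : Arrays.asList("Smith, John", "Doe, Jane", "Smith, John")) {
                    joined = appendJoined(joined, creator);
                    addDistinct(multi, creator);
                }
                System.out.println(joined.split(", ").length); // 6 fragments, boundaries lost
                System.out.println(multi.size());              // 2 distinct values
            }
        }
        ```

        With separately stored values, the original boundaries survive and a duplicate check does not need to re-split a string.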

        Ray Gauss II added a comment -

        I was following precedent there, and was not actually even calling that code, since ElementMetadataHandler correctly stores values as a multivalued object. But you're right, and your changes look spot on.


          People

          • Assignee: Unassigned
          • Reporter: Ray Gauss II