Tika / TIKA-858

Tika add parsing support for ANPA-1312 news wire feeds

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: mime, parser
    • Labels:
      None

      Description

      This submission adds support for ANPA-1312 news wire feeds.

      This is the format used by AP, AFP, NYT, and Reuters in their daily news wire broadcasts.

      This was a pretty significant development effort, so I am happy to share it back as a thank-you to the Tika community.

      1. tika-mimetypes_ANPA.patch
        0.7 kB
        Craig Stires
      2. org.apache.tika.parser.Parser_ANPA.patch
        0.5 kB
        Craig Stires
      3. IptcAnpaParser.java
        34 kB
        Craig Stires
      4. 7901V5.pdf
        535 kB
        Craig Stires

        Activity

        Craig Stires added a comment -

        This patch adds file-type recognition for ANPA files. It applies against apache-tika-0.10/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        Craig Stires added a comment (edited) -

        This is the change to the parser module that registers the ANPA parser.
        This patch goes against apache-tika-0.10/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
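
A rough illustration of how a META-INF/services registration like this gets picked up at runtime. This is a sketch only: `WireParser` is a hypothetical stand-in interface, not Tika's `Parser`, and it uses the JDK's `java.util.ServiceLoader`, which may differ from the loading code Tika itself uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ParserRegistryDemo {
    /** Hypothetical stand-in for a parser interface; not Tika's actual Parser API. */
    public interface WireParser {
        String name();
    }

    /** Collects the names of every WireParser registered under META-INF/services. */
    public static List<String> discoverParsers() {
        List<String> names = new ArrayList<>();
        // ServiceLoader looks for classpath resources named
        // META-INF/services/<binary name of the interface> and instantiates each
        // class listed there through its public no-argument constructor.
        for (WireParser parser : ServiceLoader.load(WireParser.class)) {
            names.add(parser.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no registration file on the classpath, nothing is discovered.
        System.out.println("Discovered parsers: " + discoverParsers());
    }
}
```

Adding one line with an implementation's fully qualified class name to the services file is what makes that implementation show up in the loop; no other code has to change.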

        Craig Stires added a comment -

        This is the file that parses and categorizes the ANPA wire feeds.
        It gets added as apache-tika-0.10/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java

        Nick Burch added a comment -

        Are you able to supply a sample file, and a unit test that uses it?

        (Without a unit test, it'll be hard to verify that it works properly, and doesn't accidentally get broken in the future)

        Nick Burch added a comment -

        Additionally, what reference did you find for the chosen mimetype for these files? (I couldn't spot one from a quick check, is all.)

        Craig Stires added a comment -

        Attaching the specification docs for the ANPA formats. [7901V5.pdf]
        This discusses the start of header used for mime-type recognition, as well as the spec for the rest of the document structure.
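
To make the idea of recognizing a feed by its start of header concrete, here is a minimal sketch. It assumes a message opens with the ASCII SOH control character (0x01), optionally preceded by SYN (0x16) padding bytes. That byte pattern is an assumption for illustration and should be checked against the attached spec and the submitted mimetypes patch. The class and method names are hypothetical.

```java
public class AnpaHeaderSniffer {
    // ASCII control characters. That a feed begins with them is an assumption
    // made for this sketch, not something confirmed by the thread.
    private static final byte SYN = 0x16; // synchronous idle, assumed optional padding
    private static final byte SOH = 0x01; // start of header

    /** Returns true if the buffer begins with optional SYN padding followed by SOH. */
    public static boolean looksLikeAnpa(byte[] prefix) {
        if (prefix == null) {
            return false;
        }
        int i = 0;
        // Skip any leading SYN bytes.
        while (i < prefix.length && prefix[i] == SYN) {
            i++;
        }
        // The first non-SYN byte must exist and must be SOH.
        return i < prefix.length && prefix[i] == SOH;
    }

    public static void main(String[] args) {
        byte[] synthetic = {0x16, 0x16, 0x01, 'a'}; // synthetic bytes, not a real feed
        System.out.println(looksLikeAnpa(synthetic));
    }
}
```

A check like this only inspects the first few bytes, which is the same kind of test a magic-number entry in a mimetypes file expresses declaratively.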

        Nick Burch added a comment -

        Thanks for the patch, I've applied it in r1331794.

        However, we do still need a unit test for this. Are you able to get a small, sample ANPA file for us to use in a unit test?


          People

          • Assignee:
            Unassigned
            Reporter:
            Craig Stires
          • Votes:
            0
            Watchers:
            0
