TIKA-2875: Support Google Takeout MBOX format for GChat Messages

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.20
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: java version "1.8.0_181"
      Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
      Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

    Description

      The Google Takeout tool lets a user export Gmail and GChat messages as an MBOX archive. Tika's content type detection correctly identifies the export as MBOX. However, the bundled MBOX parser does not appear to support the format of the `From` header used for GChat messages. I've attached an example chat to this ticket. In it, the `From` header includes a sender address followed by the sent timestamp. As I understand it, this is a valid `From` header format. I would expect the Tika MBOX parser to parse this header and set the message's sent time to the timestamp it contains, as in the attached example.
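
      To make the expectation concrete, here is a minimal sketch of parsing such a line with plain `java.time`, independent of Tika. It does not reflect how Tika's MBOX parser is implemented. The class name `FromLineSketch`, the sample line, and the two timestamp layouts (with and without a numeric UTC offset) are assumptions for illustration; the attached Sample.mbox is the authoritative example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class FromLineSketch {

    /** Sender token and sent time extracted from one "From " line. */
    static final class Parsed {
        final String sender;
        final Instant sent;

        Parsed(String sender, Instant sent) {
            this.sender = sender;
            this.sent = sent;
        }
    }

    // "From ", a sender token without whitespace, then the timestamp text.
    private static final Pattern FROM_LINE = Pattern.compile("^From (\\S+)\\s+(.+)$");

    // Assumed layout with a numeric offset, e.g. "Thu Sep 13 01:39:04 +0000 2018".
    private static final DateTimeFormatter WITH_OFFSET =
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Classic asctime-style layout without an offset, e.g. "Thu Sep 13 01:39:04 2018".
    private static final DateTimeFormatter ASCTIME =
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss yyyy", Locale.ENGLISH);

    static Optional<Parsed> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        Matcher m = FROM_LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String sender = m.group(1);
        // asctime pads single-digit days with a space, so collapse whitespace runs.
        String stamp = m.group(2).trim().replaceAll("\\s+", " ");
        try {
            return Optional.of(new Parsed(sender, OffsetDateTime.parse(stamp, WITH_OFFSET).toInstant()));
        } catch (DateTimeParseException ignored) {
            // Not the offset layout; try the asctime layout next.
        }
        try {
            // No zone information in this layout; UTC is assumed here.
            LocalDateTime local = LocalDateTime.parse(stamp, ASCTIME);
            return Optional.of(new Parsed(sender, local.toInstant(ZoneOffset.UTC)));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Made-up sample line, not copied from the attachment.
        String line = "From 1234567890123456789@xxx Thu Sep 13 01:39:04 +0000 2018";
        parse(line).ifPresent(p -> System.out.println(p.sender + " sent at " + p.sent));
    }
}
```

      Trying the offset layout first and falling back to the asctime layout keeps existing archives working while also accepting the extra offset field. Treating an offset-less timestamp as UTC is a simplification made for this sketch.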

      Attachments

        1. Sample.mbox (3 kB, Tucker Barbour)

          People

            Assignee: Unassigned
            Reporter: Tucker Barbour