Tika / TIKA-295

Rough cut of mbox parser

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      Attached is a patch for a first cut of a parser that handles mailbox (.mbox, application/mbox) files.

      • The headers of the first email are used to fill in metadata. Headers of subsequent emails are discarded.
      • Charset handling needs to be fixed up. It's unclear (not spec'd) whether each email uses the charset specified in its own headers, or whether the entire file should be re-encoded (with the encoding either sent in the response header or auto-detected).
      • Multi-part emails won't be handled properly, though it's unclear what, if anything, should be done in that case.
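
      To illustrate the first point, here is a minimal sketch of reading only the first message's headers from an mbox stream. It is not the code in the attached patch and does not use any Tika API. The class name MboxSketch and both method names are hypothetical, and it assumes messages are separated by lines starting with "From " that follow a blank line (or open the file). Charset and multi-part handling are left out, matching the open questions above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class MboxSketch {

    /** Returns the headers of the first message only; later messages are ignored. */
    static Map<String, String> firstMessageHeaders(BufferedReader reader) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        // Skip everything up to and including the first "From " separator line.
        while ((line = reader.readLine()) != null && !line.startsWith("From ")) {
            // nothing to do for lines before the first separator
        }
        if (line == null) {
            return headers; // no message found
        }
        String lastName = null;
        // Headers end at the first empty line.
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            boolean folded = line.startsWith(" ") || line.startsWith("\t");
            if (folded && lastName != null) {
                // Continuation of the previous header: append to its value.
                headers.put(lastName, headers.get(lastName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                lastName = line.substring(0, colon).trim();
                headers.put(lastName, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Counts messages by counting "From " lines that open the file or follow a blank line. */
    static int countMessages(BufferedReader reader) throws IOException {
        int count = 0;
        boolean previousBlank = true; // treat start of file like a blank line
        String line;
        while ((line = reader.readLine()) != null) {
            if (previousBlank && line.startsWith("From ")) {
                count++;
            }
            previousBlank = line.isEmpty();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String mbox = "From a@example.com Thu Jan  1 00:00:00 2009\n"
                + "Subject: first\n\nbody one\n\n"
                + "From b@example.com Thu Jan  1 00:00:01 2009\n"
                + "Subject: second\n\nbody two\n";
        System.out.println(firstMessageHeaders(new BufferedReader(new StringReader(mbox))));
        System.out.println(countMessages(new BufferedReader(new StringReader(mbox))));
    }
}
```

      The sketch reads line by line as already-decoded text, which sidesteps the charset question rather than answering it. A real parser would have to decide how bytes become characters before this point.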

      Attachments

        1. tika-295.patch
          33 kB
          Kenneth William Krugler

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Kenneth William Krugler (kkrugler)
