Tika / TIKA-3100

RFC822Parser ignores charset when extractAllAlternatives is set to true

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.24.1
    • Fix Version/s: None
    • Component/s: parser
    • Environment:

      Windows 10 x64

      OpenJDK 14

    Description

      In default mode, RFC822Parser seems to ignore the charset declared in the headers when it detects the content encoding. When I set "extractAllAlternatives" to false, the content seems fine.

      Test case:

          @Test
          public void testQuotedPrintableCharset() throws Exception {
              Metadata metadata = new Metadata();
              ContentHandler handler = new BodyContentHandler();
              ParseContext context = new ParseContext();

              // Ask the parser to extract every alternative body part.
              RFC822Parser emailparser = new RFC822Parser();
              emailparser.setExtractAllAlternatives(true);

              // Close the stream even if parsing fails; any exception fails the test.
              try (InputStream stream = getStream("test-documents/testRFC822_quoted_charset_iso_8859_2")) {
                  emailparser.parse(stream, handler, metadata, context);
              }

              // "Dzie\u0144 dobry." only appears if the declared charset is honoured.
              String bodyText = handler.toString();
              assertTrue(bodyText.contains("Dzie\u0144 dobry."));
          }
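
To illustrate the kind of failure the assertion above guards against, here is a minimal standalone sketch in plain Java (no Tika involved). It assumes the attached message encodes the greeting as the quoted-printable text "Dzie=F1 dobry." and that a parser ignoring the header falls back to something like ISO-8859-1. Both are assumptions for illustration; the actual attachment content and Tika's fallback charset are not stated in this report. The class name CharsetDemo and the helper decodeQuotedPrintable are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: a minimal quoted-printable decoder that turns
    // "=XX" escapes into raw bytes. Soft line breaks are not handled;
    // this is an illustration, not a full RFC 2045 implementation.
    static byte[] decodeQuotedPrintable(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '=' && i + 2 < encoded.length()) {
                // Two hex digits follow the '=' sign.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // Input is assumed to be 7-bit ASCII outside the escapes.
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Assumed sample body; 0xF1 is the ISO-8859-2 byte for U+0144.
        byte[] raw = decodeQuotedPrintable("Dzie=F1 dobry.");

        // Decoding with the charset declared in the header.
        String declared = new String(raw, Charset.forName("ISO-8859-2"));
        // Decoding with an assumed fallback charset that ignores the header.
        String fallback = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(declared.equals("Dzie\u0144 dobry.")); // true
        System.out.println(fallback.equals("Dzie\u00F1 dobry.")); // true
    }
}
```

With the declared charset the byte 0xF1 becomes U+0144, which is what the test expects. With the assumed fallback it becomes U+00F1, so a `contains("Dzie\u0144 dobry.")` check would fail.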
      

      Attachments

        1. testRFC822_quoted_charset_iso_8859_2
          0.4 kB
          Mariusz Cieślukowski

          People

            Assignee: Unassigned
            Reporter: Mariusz Cieślukowski (ciesla05)
            Votes: 0
            Watchers: 1
