Tika / TIKA-2759

ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.18
    • Fix Version/s: 2.0.0, 1.20
    • Component/s: parser
    • Labels: None

    Description

      Tika extracts JavaScript as text content when the source is a script tag whose src attribute carries the code inline as a base64 data: URI. The inline code is decoded and reported to the characters() method of our custom ContentHandler, so it ends up in the extracted text, but the script start tag itself never seems to be reported to startElement(). The JavaScript is reported to characters() after the parser has left the head and entered the body.

      HTML file is attached
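
One way to observe the behaviour described above is a handler that records the element path at which each characters() event arrives. The sketch below is only an illustration: the class name PathTrackingHandler and the helper trace() are hypothetical and are not part of Tika, and the handler is driven by the JDK's own SAX parser on well-formed XHTML so that it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler (not a Tika class): remembers which
// elements are currently open and records the path for every non-blank
// characters() event.
class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> textEvents = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // Path runs from the outermost open element to the innermost one.
            textEvents.add(String.join("/", open) + " -> " + text);
        }
    }

    // Convenience helper: parse well-formed XHTML with the JDK SAX parser
    // and return the recorded text events.
    static List<String> trace(String xhtml) {
        PathTrackingHandler handler = new PathTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.textEvents;
    }
}
```

For a correctly reported inline script, the recorded path ends in script. In the case reported here, the decoded JavaScript would instead show up under a body-level path such as html/body/p, with no script element on the stack. The same handler could presumably be passed to a Tika parser in place of the JDK parser, in which case localName may be the more useful name to record.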

      The following script tag:

        <script src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs=" type="text/javascript"></script>
      

      gets reported outside the head (in html.p) as:

      ;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
      id: 157,
      isGuest: 1,
      oauth: {"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
      bgOpacity: 0.4,
      returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
      border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
      padding: 4,
      useAJAX: 0,
      openEvent: 'onclick',
      wndCenter: 0,
      regPopup: 1,
      dur: 300,
      timeout: 0,
      base: '/',
      theme: 'petrolicious',
      socialProfile: '',
      socialType: 'btnIco',
      cssPath: '/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
      regPage: 'joomla',
      captcha: '',
      showHint: 0,
      geolocation: false,
      windowAnim: ''
      })});
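
The relationship between the src attribute and the reported text can be checked by decoding the data: URI by hand. The following is a minimal sketch, not Tika's implementation: the class and method names are hypothetical, only the ";base64" form is handled, and the payload is assumed to be UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not a Tika class): decodes "data:<type>;base64,<payload>".
class DataUriDecoder {
    static String decodeBase64DataUri(String src) {
        if (!src.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = src.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' separator");
        }
        // Everything between "data:" and the comma is the media type plus flags.
        String header = src.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("only base64 payloads are handled");
        }
        byte[] raw = Base64.getDecoder().decode(src.substring(comma + 1));
        // Assumption: the script is UTF-8 encoded.
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

Decoding just the first 48 characters of the payload above, "Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7", yields ";(window.jq183||jQuery)(function($){", which matches the start of the text reported to characters().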
      

      Attachments

        1. petrolicious.html
          90 kB
          Markus Jelsma


          People

            Assignee: Tim Allison (tallison)
            Reporter: Markus Jelsma (markus17)
