Tika / TIKA-1224

Adding Source code (Java, Groovy, C) parser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      We can parse some source code file formats:
      text/x-java-source
      text/x-groovy
      text/x-c

      For HTML rendering of the code, we can use JHighlight: http://www.ohloh.net/p/jhighlight

        Activity

        Ken Krugler added a comment -

        For many languages, parsing needs to be fuzzy (e.g. for C code, without knowing the values for conditional compilation, it's impossible to accurately parse many source files). One quick & dirty approach is to use syntax highlighters, though the deeper question is what exactly to extract as the text - i.e. what would Tika return that's different from the (original) text?

        Hong-Thai Nguyen added a comment -

        I agree that deeply parsing each language is not simple. This work (already done) just provides an HTML rendering of the source code, plus some metadata where possible (such as author, version...) extracted from Javadoc comments, and possibly other interesting values such as LoC. When we need more detailed results for a language, we must implement a dedicated parser.
        This parser is useful in search applications.

        Hong-Thai Nguyen added a comment -

        Committed in 1563902

        Markus Jelsma added a comment -

        A patch seems to be missing here.

        Benoit Moreau added a comment -

        I'm disappointed because it does not work!

        For example:

        > java -jar tika-app-1.5.jar -t Test.java
        Output is empty.

        > java -jar tika-app-1.5.jar -h Test.java
        Output is strange.

        > java -jar tika-app-1.5.jar -T Test.java
        Output is what I would expect for -h?

        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
             "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="htt
        p://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head>     <meta http-equiv=
        "content-type" content="text/html; charset=ISO-8859-1" />     <meta name="genera
        tor" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" />     <title>Te
        st.java</title>     <link rel="Help" href="http://jhighlight.dev.java.net" />
          <style type="text/css"> .java_type { color: rgb(0,44,221); } .java_keyword { c
        olor: rgb(0,0,0); font-weight: bold; } .java_javadoc_comment { color: rgb(147,14
        7,147); background-color: rgb(247,247,247); font-style: italic; } .java_comment
        { color: rgb(147,147,147); background-color: rgb(247,247,247); } .java_operator
        { color: rgb(0,124,31); } .java_plain { color: rgb(0,0,0); } .java_literal { col
        or: rgb(188,0,0); } code { color: rgb(0,0,0); font-family: monospace; font-size:
         12px; white-space: nowrap; } .java_javadoc_tag { color: rgb(147,147,147); backg
        round-color: rgb(247,247,247); font-style: italic; font-weight: bold; } .java_se
        parator { color: rgb(0,33,255); } h1 { font-family: sans-serif; font-size: 16pt;
         font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: sol
        id 1px black; padding: 5px; text-align: center; }     </style> </head> <body> <h
        1>Test.java</h1><code><span class="java_javadoc_comment">/**&nbsp;*&nbsp;Class&n
        bsp;Test.&nbsp;*&nbsp;*&nbsp;</span><span class="java_javadoc_tag">@author</span
        ><span class="java_javadoc_comment">&nbsp;ben.12&nbsp;*/</span><span class="java
        _keyword">public</span><span class="java_plain">&nbsp;</span><span class="java_k
        eyword">class</span><span class="java_plain">&nbsp;</span><span class="java_type
        ">Test</span><span class="java_plain">&nbsp;</span><span class="java_separator">
        {</span><span class="java_plain">&nbsp;&nbsp;</span><span class="java_comment">/
        /&nbsp;Class&nbsp;Test}</span><br /> </code> </body> </html>
        

        But everything is on a single line, indentation is lost, and the file name appears at the beginning.
        The author is not in the head meta tags.
        The last "}" is highlighted as part of a comment.


        My input java file:

        Test.java
        /**
         * Class Test.
         *
         * @author ben.12
         */
        public class Test {
        	// Class Test
        }
        
        Nick Burch added a comment -

        Benoit - Does Tika correctly detect your files? The right parser won't kick in if Tika is confused about the mime type

        Benoit Moreau added a comment -

        In the debugger, Tika uses org.apache.tika.SourceCodeParser with the "x-java-source" mime type. It removes all line endings (why? a mistake? readLine() doesn't return \n and/or \r), then passes the result to JHighlight. The JHighlight result (the entire HTML) is passed as the argument to the characters() method of the ContentHandler.

        I have only just started with Tika, but I don't think that is right.
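
The line-ending behaviour described in this comment can be reproduced with the JDK alone. The sketch below is not Tika's SourceCodeParser code; the class and method names are hypothetical and only illustrate the suspected cause: BufferedReader.readLine() returns each line without its terminator, so appending the results back to back glues the lines together and leaves the closing "}" inside the preceding "//" comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; it does not reproduce Tika's actual code.
public class ReadLineDemo {

    // Suspected bug: readLine() strips the line terminator, and nothing
    // puts it back, so every line ends up on a single line.
    public static String joinDroppingNewlines(String source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // One possible fix: re-append a newline after each line that was read.
    public static String joinKeepingNewlines(String source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "public class Test {\n\t// Class Test\n}\n";
        System.out.println(joinDroppingNewlines(source));
        System.out.println(joinKeepingNewlines(source));
    }
}
```

With the first method, the "}" lands directly after "// Class Test", which would explain why a highlighter treats it as part of the comment.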

        Hong-Thai Nguyen added a comment -

        Thanks, Benoit Moreau, for the feedback.
        For the line-ending problem in the output, I created a new issue: TIKA-1279
        For the -t option in TikaCLI, the mime type of a Java file is ambiguous. It could be text/plain (in which case TxtParser is used and returns the original text as is) or x-java-source (in which case SourceCodeParser is used).

        For the -h option, the output normally looks something like this:

        Author: Hong-Thai.Nguyen
        Content-Encoding: windows-1252
        Content-Length: 4899
        Content-Type: text/x-java-source
        LoC: 133
        creator: Hong-Thai.Nguyen
        dc:creator: Hong-Thai.Nguyen
        meta:author: Hong-Thai.Nguyen
        resourceName: SourceCodeParser.java
        

        The creator comes from the 'author' tag in the Javadoc comment.

        This parser is quite generic (quick and dirty, as Ken Krugler mentioned) and simplistic. We could write a more dedicated Java source parser and extract more metadata (members, attributes...). If you are interested in this kind of parser, please create a new issue; any investigation into this work is warmly welcome.

        Regards,
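
As a rough illustration of the kind of metadata discussed in this comment (an author taken from a Javadoc tag, and a line count), here is a minimal sketch using only the JDK. It is not how SourceCodeParser is implemented: the class name, the regular expression, and the definition of LoC as the number of physical lines are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and the LoC definition are assumptions.
public class SourceMetadataSketch {

    // Captures the rest of the line after an "@author" tag.
    private static final Pattern AUTHOR = Pattern.compile("@author\\s+(\\S.*)");

    // Returns the first @author value found, or null if there is none.
    public static String findAuthor(String source) {
        Matcher m = AUTHOR.matcher(source);
        return m.find() ? m.group(1).trim() : null;
    }

    // Counts physical lines; trailing empty lines are not counted.
    public static int countLines(String source) {
        if (source.isEmpty()) {
            return 0;
        }
        return source.split("\r\n|\r|\n").length;
    }

    public static void main(String[] args) {
        String source = "/**\n * Class Test.\n *\n * @author ben.12\n */\n"
                + "public class Test {\n\t// Class Test\n}\n";
        System.out.println("Author: " + findAuthor(source));
        System.out.println("LoC: " + countLines(source));
    }
}
```

A dedicated parser, as suggested above, would work from a real syntax tree rather than a regular expression, which is what makes it possible to extract members and attributes reliably.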


          People

          • Assignee: Unassigned
          • Reporter: Hong-Thai Nguyen
          • Votes: 0
          • Watchers: 5
