Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      ==Background on our open source project==

      The Apache ODF Toolkit is a set of Java modules that allow programmatic creation, scanning and manipulation of Open Document Format (ISO/IEC 26300 == ODF) documents. Unlike other approaches which rely on runtime manipulation of heavy-weight editors via an automation interface, the ODF Toolkit is lightweight and ideal for server use.

      http://incubator.apache.org/odftoolkit/index.html

      ==The Idea==

      GNU/Linux, and UNIX before it, has shown the great power of text processing via simple command line tools, combined with operating system facilities for piping and redirection. This filter-based text processing is what makes shell programming so powerful. But it only works well for pure text documents. What about more complex WYSIWYG documents, spreadsheets, word processor files, with more complex formats? There the existing tool set becomes far weaker.

      The Apache ODF Toolkit is a Java API that gives a high-level view of a document and enables programmatic manipulation of it. We have functions for doing things like search & replace, adding paragraphs, accessing cells in a spreadsheet, etc., all from a Java application. No traditional editor is involved. Pure Java, stuff you could even run on a server.

      You can look at our "cookbook" for examples of our "Simple API" in action:

      http://incubator.apache.org/odftoolkit/simple/document/cookbook/index.html

      There is a lot you can do using this API. But it still requires Java programming, and that limits its reach to professional programmers.

      What if we could write, using the ODF Toolkit, a set of command line utilities that made it easy to do both simple and complex text manipulation tasks from a command line, things like:

      1) Concatenate documents
      2) Replace slide 3 in presentation A with slide 3 from presentation B
      3) Apply the styles of document A to all documents in the current directory
      4) Find all occurrences of "sausages" in the given document and add a hyperlink to sausages.com

      and so on.

      The audience for such a tool could be:

      1) Data wranglers, who want to extract information from a large number of ODF documents.

      2) Power users who want to automate repetitive document tasks, like filling in a template, or an off-line mail merge

      3) QA testers of office editors, who use simple scripts to generate test cases as well as to test editor-generated documents for correctness

      4) Web developers who want to generate a data-driven document on-the-fly

      So think generally in that space. Not system programmers. Not application developers. But command line gurus, with a little scripting ability at most. That is the "sweet spot".

      Some technical aspects you might want to consider:

      1) The real value of the Unix text utilities is that they can easily be combined. For example, I recently did this to search for all openoffice.org email addresses on a downloaded copy of the openoffice website, deduping and sorting by how many times each address appeared:

      grep -o -r -i --no-filename --include="*.html" "[[:alnum:]+\._-]*@openoffice.org" . | sort | uniq -c | sort -n -r

      So, powerful command line tools that each do one thing well. And then a way to pipe the outputs of one to become the inputs of another. Can we define a similar set of basic operations on ODF documents, as well as the glue to combine these commands into more powerful pipelines?

      2) Useful example tools are cat, grep, diff, sed, etc. Maybe even something awk-like that works with spreadsheets? No need to be slavish to the original tools, but create something of similar power that operates on ODF documents.

      3) The trick will be that an ODF document is a ZIP file containing multiple XML files, and possibly other resources, like JPG images. If we pipe the binary ZIP, then we're forcing each tool in the chain to do the uncompress/compress, which is bad for performance. There is also the issue of repeated parsing/serialization of the XML. So how can we do this all efficiently?
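The ZIP-of-XML structure described above can be seen with nothing but the JDK. The sketch below (class and method names are my own, and it uses only `java.util.zip`, not the ODF Toolkit) reads one XML stream, such as `content.xml`, straight out of an ODF package without any external tool:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class OdfPeek {
    // Read one XML stream (e.g. "content.xml") out of an ODF package.
    // An ODF document is just a ZIP archive, so the JDK's ZipFile suffices.
    static String readEntry(String odfPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(odfPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException(entryName + " not found in " + odfPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // List every stream in the package, then print the main content.
        try (ZipFile zip = new ZipFile(args[0])) {
            for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
                System.out.println(e.nextElement().getName());
            }
        }
        System.out.println(readEntry(args[0], "content.xml"));
    }
}
```

This also makes the efficiency problem concrete: if each tool in a pipeline does this open/read and then a matching re-zip on write, the cost repeats at every stage.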

      Note: These are just ideas to get you thinking in this general area. I would be pleased to review any GSoC proposals related to the ODF Toolkit.

        Activity

        Charitha Madusanka added a comment -

        Exposing ODF Toolkit functions through a command line interface is a very interesting idea. Here is an abstraction of my approach.

        First we need to identify the functions we provide through the CLI and implement those functions (e.g. search, merge, replace, format, ...).

        Then we need to figure out the command structure (e.g. odf <function> <options...> <files...>).

        We can use a Java CLI parsing library [1][2] to process the command line options.

        Finally we need to invoke the back-end function appropriate to the command.

        [1] - http://commons.apache.org/cli/
        [2] - http://pholser.github.com/jopt-simple/
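The `odf <function> <options...> <files...>` structure proposed above can be sketched with a few lines of plain JDK code (a hand-rolled illustration only; a real tool would more likely use the Commons CLI or jopt-simple libraries cited in [1][2], and the option names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OdfCli {
    // Parsed form of: odf <function> [--opt=value ...] <files...>
    public final String function;
    public final Map<String, String> options = new LinkedHashMap<>();
    public final List<String> files = new ArrayList<>();

    public OdfCli(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("usage: odf <function> [options] <files>");
        }
        function = args[0];  // first token selects the back-end function
        for (int i = 1; i < args.length; i++) {
            if (args[i].startsWith("--")) {
                int eq = args[i].indexOf('=');
                String key = eq < 0 ? args[i].substring(2) : args[i].substring(2, eq);
                String val = eq < 0 ? "true" : args[i].substring(eq + 1);
                options.put(key, val);   // e.g. --from=sausages
            } else {
                files.add(args[i]);      // everything else is a document path
            }
        }
    }
}
```

For example, `odf replace --from=sausages --to=bacon a.odt b.odt` would parse into function `replace`, two options, and two files, which the dispatcher then hands to the matching back-end function.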

        Rob Weir added a comment -

        Good thoughts. The other part is the glue between the command line tools. That was always the real power of the Unix tools, that they could easily be combined. For example, I recently did this to search for all openoffice.org email addresses on a downloaded copy of the openoffice website, deduping and sorting by how many times each address appeared:

        grep -o -r -i --no-filename --include="*.html" "[[:alnum:]+\._-]*@openoffice.org" . | sort | uniq -c | sort -n -r

        So, powerful command line tools that each do one thing well. And then a way to pipe the outputs of one to become the inputs of another. The trick will be that an ODF document is a ZIP file containing multiple XML files, and possibly other resources, like JPG images. If we pipe the binary ZIP, then we're forcing each tool in the chain to do the uncompress/compress, which is bad for performance. There is also the issue of repeated parsing/serialization of the XML. So perhaps we don't use the OS's command line but create our own command line processor, entirely in a single JVM instance. Or there might be other clever ways of making this efficient.
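The single-JVM idea suggested above can be made concrete: unzip and parse once at the front of the pipeline, run every "command" as a pure transformation on an in-memory model, and re-zip once at the end. The sketch below is an illustration under those assumptions (the `Package` model and `replace` stage are hypothetical names, and a real tool would transform parsed XML trees, not raw strings):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class OdfPipeline {
    // Hypothetical in-memory stand-in for an unpacked ODF package:
    // entry name -> XML text. Unzipped once, re-zipped once.
    public static class Package {
        public final Map<String, String> entries = new LinkedHashMap<>();
    }

    // Each "command" is a transformation on the in-memory package, so
    // chaining them never touches the ZIP layer between stages.
    public static Package run(Package doc, List<UnaryOperator<Package>> stages) {
        for (UnaryOperator<Package> stage : stages) {
            doc = stage.apply(doc);
        }
        return doc;
    }

    // Example stage: naive search & replace in content.xml (illustrative
    // only; real code would operate on the XML tree, not raw text).
    public static UnaryOperator<Package> replace(String from, String to) {
        return doc -> {
            doc.entries.computeIfPresent("content.xml", (k, v) -> v.replace(from, to));
            return doc;
        };
    }
}
```

The design choice mirrors the Unix model: each stage does one thing, but the "pipe" carries a parsed document rather than bytes, which avoids the repeated uncompress/compress and parse/serialize costs.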

        Charitha Madusanka added a comment - edited

        In the command-line parser we can identify when one command's output may be the input to another command. So implementing our own piping mechanism to combine commands (search, replace, merge) would help to minimize uncompress/compress attempts.


          People

          • Assignee:
            Rob Weir
            Reporter:
            Rob Weir
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
