[ODFTOOLKIT-308] GSoC: ODF Command Line Tools - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- gsoc2012
- mentor

Description

==Background on our open source project==

The Apache ODF Toolkit is a set of Java modules that allow programmatic creation, scanning and manipulation of Open Document Format (ISO/IEC 26300 == ODF) documents. Unlike other approaches which rely on runtime manipulation of heavy-weight editors via an automation interface, the ODF Toolkit is lightweight and ideal for server use.

http://incubator.apache.org/odftoolkit/index.html

==The Idea==

GNU/Linux, and UNIX before it, have shown the great power of text processing via simple command line tools, combined with operating system facilities for piping and redirection. This filter-based text processing is what makes shell programming so powerful. But it only works well for pure text documents. What about more complex WYSIWYG documents, such as spreadsheets and word processor files, with more complex formats? There the existing tool set becomes far weaker.

The Apache ODF Toolkit is a Java API that gives a high-level view of a document and enables programmatic manipulation of it. We have functions for doing things like search & replace, adding paragraphs, accessing cells in a spreadsheet, etc., all from a Java application. No traditional editor is involved. It is pure Java, stuff you could even run on a server.

You can look at our "cookbook" for examples of our "Simple API" in action:

http://incubator.apache.org/odftoolkit/simple/document/cookbook/index.html

There is a lot you can do using this API. But it still requires Java programming, and that limits its reach to professional programmers.

What if we could write, using the ODF Toolkit, a set of command line utilities that made it easy to do both simple and complex text manipulation tasks from a command line, things like:

1) Concatenate documents
2) Replace slide 3 in presentation A with slide 3 from presentation B
3) Apply the styles of document A to all documents in the current directory
4) Find all occurrences of "sausages" in the given document and add a hyperlink to sausages.com

and so on.
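
As a rough illustration of the "find" half of task 4, here is a minimal sketch in Python. It uses only Python's standard library, not the ODF Toolkit (a real tool would build on the toolkit's Java API), and it relies on the fact that the text of an ODF document lives in the content.xml member of the ZIP package. The names odf_paragraphs, odf_grep and make_sample_package are hypothetical, and the sample package holds only content.xml, so it is not a complete ODF file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Paragraphs (text:p) and headings (text:h) carry the visible text.
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def odf_paragraphs(source):
    """Yield the plain text of each paragraph or heading in an ODF package.

    `source` is a path or a binary file-like object holding the ZIP package.
    Simplification: elements such as text:s (spaces) and text:tab contribute
    no characters here, and a paragraph nested inside another paragraph
    (for example in a text box) is reported twice.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("content.xml"))
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            yield "".join(element.itertext())


def odf_grep(source, pattern):
    """Return (paragraph_number, text) pairs whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [
        (number, text)
        for number, text in enumerate(odf_paragraphs(source), start=1)
        if regex.search(text)
    ]


def make_sample_package(paragraphs):
    """Build a tiny in-memory package for demonstration (content.xml only)."""
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in paragraphs)
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_package(["I like sausages", "Nothing to see here"])
    for number, text in odf_grep(sample, "sausages"):
        print(f"{number}:{text}")
```

Because the matches are plain text lines, the output of such a tool could be piped straight into the existing Unix utilities.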

The audience for such a tool could be:

1) Data wranglers, who want to extract information from a large number of ODF documents.

2) Power users who want to automate repetitive document tasks, like filling in a template or doing an off-line mail merge

3) QA testers of office editors, who use simple scripts to generate test cases as well as to test editor-generated documents for correctness

4) Web developers who want to generate a data-driven document on-the-fly

So think generally in that space. Not system programmers. Not application developers. But command line gurus, with a little scripting ability at most. That is the "sweet spot".

Some technical aspects you might want to consider:

1) The real value of the Unix text utilities is that they can easily be combined. For example, I recently did this to search for all openoffice.org email addresses in a downloaded copy of the openoffice website, deduplicating and sorting by how many times each address appeared:

grep -o -r -i --no-filename --include="*.html" "[[:alnum:]+\._-]*@openoffice.org" . | sort | uniq -c | sort -n -r

So, powerful command line tools that each do one thing well. And then a way to pipe the outputs of one to become the inputs of another. Can we define a similar set of basic operations on ODF documents, as well as the glue to combine these commands into more powerful pipelines?

2) Useful models are tools like cat, grep, diff, and sed. Maybe even something awk-like that works with spreadsheets? There is no need to copy the original tools slavishly; the goal is something of similar power that operates on ODF documents.

3) The trick will be that an ODF document is a ZIP file containing multiple XML files, and possibly other resources, like JPG images. If we pipe the binary ZIP, then we're forcing each tool in the chain to do the uncompress/compress, which is bad for performance. There is also the issue of repeated parsing/serialization of the XML. So how can we do this all efficiently?
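
One possible answer to the efficiency question in item 3, sketched in Python: run all the stages inside one process, so the package is unzipped and content.xml is parsed once, every stage edits the same in-memory tree, and the result is serialized and zipped once. This is only an illustration under stated assumptions. It uses Python's standard library, not the ODF Toolkit, and the names replace_text, run_pipeline and build_demo_package are hypothetical. ElementTree is used for brevity; it only keeps the prefixes of namespaces registered below, so a real tool would want a DOM that preserves the document exactly.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Keep the usual prefixes when serializing; unregistered ones get renamed.
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)


def replace_text(old, new):
    """Return a stage that replaces `old` with `new` within each text node."""
    def stage(root):
        for element in root.iter():
            if element.text:
                element.text = element.text.replace(old, new)
            if element.tail:
                element.tail = element.tail.replace(old, new)
    return stage


def run_pipeline(package_bytes, stages):
    """Unzip and parse content.xml once, run all stages, then zip once."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        members = [(info, package.read(info.filename))
                   for info in package.infolist()]
    contents = {info.filename: data for info, data in members}
    root = ET.fromstring(contents["content.xml"])
    for stage in stages:
        stage(root)  # each stage mutates the shared tree in place
    new_content = ET.tostring(root, encoding="utf-8")
    output = io.BytesIO()
    with zipfile.ZipFile(output, "w") as result:
        for info, data in members:
            if info.filename == "content.xml":
                data = new_content
            # Keep member order and compression, so `mimetype` stays first
            # and uncompressed as the ODF package format expects.
            result.writestr(info.filename, data,
                            compress_type=info.compress_type)
    return output.getvalue()


def build_demo_package(paragraph):
    """Build a tiny demo package; `paragraph` must not need XML escaping."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        f"<text:p>{paragraph}</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                         compress_type=zipfile.ZIP_STORED)
        package.writestr("content.xml", content)
    return buffer.getvalue()
```

For example, run_pipeline(build_demo_package("sausages and eggs"), [replace_text("sausages", "bacon"), replace_text("eggs", "toast")]) applies two sed-like stages with a single unzip, parse, serialize and zip. The trade-off is that the stages are no longer separate processes joined by shell pipes, so the "glue" would have to live inside the tool itself.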

Note: These are just ideas to get you thinking in this general area. I would be pleased to review any GSoC proposals related to the ODF Toolkit.
