
[TIKA-1699] Integrate the GROBID PDF extractor in Tika

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: parser
    • Labels:

      Description

      GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDFs into structured TEI-encoded documents, with a particular focus on technical and scientific publications.
      It has a Java API that can be used to augment PDF parsing for journal articles and to extract additional metadata about a paper, such as its authors, publication venue, and citations.

      It would be nice to have this integrated into Tika. I have tried it locally and will issue a pull request soon.
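
      For illustration, here is a minimal, hypothetical sketch of how the new parser might be driven through the standard Tika parse API once integrated. The class name org.apache.tika.parser.journal.JournalParser matches the parser registered in the pull request, but the no-arg constructor, the GROBID setup it needs (models and properties on the classpath), and the exact metadata keys it populates are assumptions rather than the final API.

      import java.io.InputStream;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.journal.JournalParser;
      import org.apache.tika.sax.BodyContentHandler;

      public class JournalParserExample {
          public static void main(String[] args) throws Exception {
              JournalParser parser = new JournalParser(); // assumed no-arg constructor
              Metadata metadata = new Metadata();
              try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
                  // Standard Tika parse call; the GROBID-backed parser is expected to
                  // augment the metadata for journal-style PDFs.
                  parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
              }
              for (String name : metadata.names()) {
                  System.out.println(name + " = " + metadata.get(name));
              }
          }
      }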

        Issue Links

          Activity

          sujenshah Sujen Shah added a comment -

          Working towards publishing GROBID to Maven Central through Sonatype.

          Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837
          Grobid issue - https://github.com/kermitt2/grobid/issues/59

          githubbot ASF GitHub Bot added a comment -

          GitHub user sujen1412 opened a pull request:

          https://github.com/apache/tika/pull/55

          Fix for TIKA-1699 contributed by Sujen Shah

          Waiting for GROBID to get published to maven central.
          Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/sujen1412/tika TIKA-1699

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/55.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #55


          commit 4f067107d01e99bd81a66c78163f2a4baf3f817f
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:49:00Z

          Added grobid dependencies

          commit 323ba33816a9beabe22d351c8eac4350fa010be0
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:49:36Z

          Registering journal parser

          commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:54:07Z

          Code for integrating GROBID Parser in to Tika

          commit b6e9f8724b308e0c830f73702994cbe1c5932cd2
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:58:08Z

          Grobid properties files

          commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:58:58Z

          Added unit test for journal parser

          Corrected formatting

          Corrected formatting

          Corrected formatting


          chrismattmann Chris A. Mattmann added a comment -

          Sujen, please update the PR with my 2 comments/updates, and also please let me know when the rest of the JAR files are on Maven Central; then I think we can integrate this. We should also make a custom tika-config to override the default PDF parser, or better yet somehow combine it with this. That's one thing I thought too - would it make sense to combine these, or are they really separate parsers? It seems like they should be separate because they potentially have overlapping keys, right?

          We also need to make a page on the Tika wiki that describes how to install Grobid: http://wiki.apache.org/tika/GrobidParser maybe?
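
          As a rough sketch of the custom tika-config idea above (these are assumptions, not the configuration that will ship with the PR), a config file could exclude application/pdf from the default parser and route it to the JournalParser instead; that config can then be passed to tika-server with --config or loaded programmatically:

          import java.io.File;
          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import org.apache.tika.config.TikaConfig;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.AutoDetectParser;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.sax.BodyContentHandler;

          public class GrobidConfigExample {
              public static void main(String[] args) throws Exception {
                  // Hypothetical tika-config.xml routing PDFs to the GROBID-backed parser:
                  //
                  //   <properties>
                  //     <parsers>
                  //       <parser class="org.apache.tika.parser.DefaultParser">
                  //         <mime-exclude>application/pdf</mime-exclude>
                  //       </parser>
                  //       <parser class="org.apache.tika.parser.journal.JournalParser">
                  //         <mime>application/pdf</mime>
                  //       </parser>
                  //     </parsers>
                  //   </properties>
                  TikaConfig config = new TikaConfig(new File("tika-config.xml"));
                  AutoDetectParser parser = new AutoDetectParser(config);

                  Metadata metadata = new Metadata();
                  try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
                      parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
                  }
                  System.out.println(metadata);
              }
          }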

          chrismattmann Chris A. Mattmann added a comment -

          I got this working!

          Starting Tika Server

          java -Dorg.apache.tika.service.error.warn=true -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.server.TikaServerCli --config tika-config.xml
          

          cURL command to test

          curl -T $HOME/git/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta | python -mjson.tool
          

          Output

          [
              {
                  "Author": "End User Computing Services",
                  "Company": "ACM",
                  "Content-Type": "application/pdf",
                  "Creation-Date": "2006-02-15T21:13:58Z",
                  "Last-Modified": "2006-02-15T21:16:01Z",
                  "Last-Save-Date": "2006-02-15T21:16:01Z",
                  "SourceModified": "D:20060215211344",
                  "X-Parsed-By": [
                      "org.apache.tika.parser.CompositeParser",
                      "org.apache.tika.parser.journal.JournalParser"
                  ],
                  "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2        Daniel J. Crichton1        Nenad Medvidovic2        Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature.  In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment.  \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile.  There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on.   \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration.  Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software.  They also actively and regularly collaborate with \n\ncolleagues around the world, something that has become possible \nonly relatively recently, again ultimately thanks to software. They \nare collecting, producing, sharing, and disseminating large \namounts of data, which are growing by orders of magnitude in \nvolume in remarkably short time periods. \n\nIt is this latter problem that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years ago.  Until recently, \nJPL would disseminate data collected by various instruments \n(Earth-based, orbiting, and in outer space) to the interested \nscientists around the United States by \u201cburning\u201d CD-ROMs and \nmailing them via the U.S. Postal Service.  In addition to being \nslow, sequential, unidirectional, and lacking interactivity, this \nmethod was expensive, costing hundreds of thousands of dollars. \nFurthermore, the method was prone to security breaches, and the \nexact data distribution (determining which data goes to which \ndestinations) had to be calculated for each individual shipment. 
It \nhad become increasingly difficult to manage this process as the \nnumber of projects and missions, as well as involved scientists, \ngrew.  An even more critical limiting factor became the sheer \nvolume of data that the current (e.g., Planetary Data System, or \nPDS), pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), to hundreds of terabytes \n(MRO), to petabytes or more (LRO).  Clearly, spending millions \nof dollars just to distribute the data to scientists is impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore \nconstruction of an end-to-end software framework that would \nlower the cost of distributing and managing scientific data, from \nthe inception of data at a science processing center to its ultimate \narrival on the desks of interested users. Because of increasing data \nvolumes, the framework had to be scalable and have native \nsupport for evolution to hundreds of sites and thousands of data \ntypes. Additionally, the framework had to enable the \nvirtualization of heterogeneous data (and processing) sources, and \nto address wide-scale (national and international) distribution of \ndata. The framework needed to be flexible: it needed to support \nfully automated processing of data throughout its lifecycle, while \nstill allowing interactivity and intervention from an operator when \nneeded. Furthermore because data is itself distributed across \nNASA agencies, any software framework that distributes NASA\u2019s \ndata would require the capability for tailorable levels of security \nand for varying types of users belonging to multiple \norganizations. \n\nThere were also miscellaneous issues of data ownership that \nneeded to be overcome. Ultimately, because NASA\u2019s science data \nis so distributed, the owners of data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard copies of all or part of this work for \npersonal or classroom use is granted without fee provided that copies are \nnot made or distributed for profit or commercial advantage and that \ncopies bear this notice and the full citation on the first page. To copy \notherwise, or republish, to post on servers or to redistribute to lists, \nrequires prior specific permission and/or a fee. \nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 1-58113-000-0/00/0004\u2026$5.00. \n \n\n\n\nScience Principal Investigator) feel hard pressed to control their \ndata, as the successful operation and maintenance of their data \nsystems are essential services that they provide. As such, any \nframework that virtualizes science data sources across NASA \nshould be transparent and unobtrusive: it should enable \ndissemination and retrieval of data across data systems, each of \nwhich may have their own external interfaces and services; at the \nsame time, it should enable scientists to maintain and operate their \ndata systems independently. Finally, to lower costs, once the \nframework was built and installed, it needed to be reusable, free, \nand distributable to other NASA sites and centers for use. \n\nOver the past seven years we have designed, implemented \nand deployed a framework called OODT (Object Oriented Data \nTechnology) that has met these rigorous demands. In this paper \nwe discuss the significant software engineering challenges we \nfaced in developing OODT.  
The primary objective of the paper is \nto demonstrate how OODT\u2019s explicit software architectural basis \nenabled us to effectively address these challenges.  In particular, \nwe will detail the architectural decisions we found most difficult \nand/or critical to OODT\u2019s ultimate success. We highlight several \nrepresentative examples of OODT\u2019s use to date both at NASA \nand externally. We contrast our solution with related approaches, \nand argue that a major differentiator of this work, in addition to its \nexplicit architectural foundation, is its native support for \narchitecture-based development of distributed scientific \napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, we needed to address several significant \nsoftware engineering challenges, the bulk of which surfaced in \nlight of the complex data management and distribution issues \nregularly faced within a distributed, large-scale government \norganization such as NASA. In this paper we will focus on nine \nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, Dynamism, Scalability, Distribution, \nDecentralization, and Performance. \n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, \nmulti-user, complex system. At the software level, complexity \nranged from understanding how to install, integrate, and manage \nthe software remotely deployed at participating organizations, to \nunderstanding how to manage information such as access \nprivileges and security credentials across both NASA and non-\nNASA sites. There were also complexities at the software \nnetworking layer, including varying firewall capabilities at each \ninstitution, and data repositories that would periodically go offline \nand needed to be remotely restarted. Just understanding the \nvarying types of data held at sites linked together via OODT was \na significant task. Even sites within the same science domain \n(e.g., planetary science) describe similar data sets in decidedly \ndifferent ways. Discerning in what ways these different data \nmodels were common and what attributes of data could be shared, \ndone away with, or amended, was a huge challenge. Finally, the \ndifferent interfaces to data, ranging from third-party, well-\nengineered database management systems, to in-house data \nsystems, ultimately to flat text file-based data was a particularly \ndifficult challenge that we had to hurdle. \n\nHeterogeneity \u2013 In order to drive down the data management \ncosts for science missions, the same OODT framework needed to \n\nspan multiple science domains. The domains initially targeted \nwere earth and planetary; this has subsequently been expanded to \nspace, biomedical sciences, and the modeling and simulation \ncommunities. As such, the same core set of OODT software \ncomponents, system designs, and implementation-level facilities \nhad to work across widely varying science domains.  \n\nThe data management processes within the organizations that \nuse OODT also added to its heterogeneity. For instance, OODT \ncomponents needed to have interfaces with end users and support \ninteractive sessions, but also with scientific instruments, which \nmost likely were automatic and non-interactive. Scientific \ninstruments could push data to certain components in OODT, \nwhile other OODT components would need to distribute data to \nusers outside of OODT. 
End-users in some cases wanted to \nperform transformations on the data sent to them by OODT, and \nthen to return the data back into OODT. The framework needed to \nsupport scenarios such as these seamlessly. \n\nMany other constraints also imposed the heterogeneity \nrequirement on OODT. We can group these constraints into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly alluded above, discipline \n\nexperts who wanted to disseminate their data via OODT \nreally wanted the data to reside at their respective \ninstitutions. This constraint non-negotiable, and significantly \nimpacted the space of technical solutions that we could \ninvestigate for OODT.  \n\n\u2022 Technical \u2013 Since OODT had to federate many different data \nholdings and catalogs, we faced the constraints of linking \nthem together and federating very different schemas and \nvarying levels of sophistication in the data system interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, SOAP) proved non-trivial to integrate. \nThe constraints enjoined by heterogeneity alone led us to \n\nrealize that the OODT framework would need to draw heavily \nfrom multiple areas. Database systems, although used \nsuccessfully for many years to manage large amounts of data at \nmany sites, lacked the flexibility and interface capability to \nintegrate data from other more crude APIs and storage systems \n(such as a PI-led web site). Databases also did not address the \ndistribution of data and \u201cownership\u201d issues. The advent of the \nweb, although a promising means for providing openness and \nflexible interfaces to data, would not alone address the issues such \nas multi-institutional security and access. Furthermore, its \nrequest/reply nature would not easily handle other distribution \nscenarios, e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has defined \u201cout of the box\u201d services for managing \ndata systems (e.g., GridFTP), but which utilized alone would not \naddress our other challenges (e.g., complexity). \n\nLocation Transparency \u2013 Even though data could potentially \nbe input into and output from the system from many \ngeographically disparate and distributed sites, it should appear to \nthe end-users as if the data flow occurred from a single location. \nThis requirement was reinforced by the need to dynamically add \ndata producers and consumers to a system supported by OODT, \nas will be further discussed below. \n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate how data providers should store, process, find, evolve, \nor retire their data. Instead, the framework needed to be \n\n\n\ntransparent, allowing data providers to continue with their regular \nbusiness processes, while managing and disseminating their \ninformation unobtrusively.  \n\nDynamism \u2013 It is expected that data providers for the most part \nwill be stable organizations. However, there are cases in which \nnew data producing (occasionally) and consuming (frequently) \nnodes will need to be brought on-line. Back-end data sources need \nto be pluggable, with little or no direct impact on the end-user of \nthe OODT system, or on the organization that owns the data \nsource. New end-users (or client hosts) should also be able to \n\u201ccome and go\u201d without any disruption to the rest of the system. 
In \nthe end, we realized this meant the whole infrastructure must be \ncapable of some level of dynamism in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to manage large volumes of data, \nfrom at least hundreds of gigabytes at its inception to the current \nmissions which will produce hundreds of terabytes. The \nframework needed to support at least dozens of institutional data \nproviders (which themselves may have subordinate data system \nproviders), dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), thousands of users, hundreds of \ngeographic sites, and thousands of different data types to manage \nand disseminate. \n\nDistribution \u2013 The framework should be able to handle the \nphysical distribution of data across sites nationally and \ninternationally, and ultimately the physical distribution of the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 Each site may have its own data \nmanagement processes, interfaces and data types, which were \noperating independently for some time. We needed to devise a \nway of coordinating and managing data between these data sites \nand providers without centralizing control of their systems, or \ninformation. In other words, the requirement was that the different \nsites retain their full autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its scale and interaction with many \norganizations, data systems, and providers, OODT still needed to \nperform under stringent demands. Queries for information needed \nto be serviced quickly: in many cases response time under five \nseconds was used as a baseline. Additionally, OODT needed to be \noperational whenever any of the participating scientists wanted to \nlocate, access, or process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale software technologies that distribute, \n\nmanage, and process information have been constructed over the \npast decade. Each of these technologies falls into one or more of \nfour distinct areas: grid-computing, information integration, \ndatabases, and middleware. In this section, we briefly survey \nrelated projects in each of these areas and compare their foci and \naccomplishments to those of OODT. Additionally, since a major \nfocal point of OODT is software architecture, we start out by \nproviding some brief software architecture background and \nterminology to set the context. \n\nTraditionally, software architecture has referred to the \nabstraction of a software system into its fundamental building \nblocks: software components, their methods of interaction (or \nsoftware connectors), and the governing rules that guide the \n\ncomposition of software components and software connectors \n(configurations) [2, 3]. Software architecture has been recognized \nin many ways to be the linchpin of the software development \nprocess. Ideally, the software requirements are reflected within \nthe software system\u2019s components and interactions; the \ncomponents and interactions are captured within the system\u2019s \narchitecture; and the architecture is used to guide the design, \nimplementation, and evolution of the system. Design guidelines \nthat have been proven effective are often codified into \narchitectural styles, while specific architectural solutions (e.g., \nconcrete system structures, component types and interfaces, and \ninteraction facilities) within specific domains are captured as \nreusable reference architectures. 
\n\nGrid computing deals with highly complex and distributed \ncomputational problems and large volume data management \ntasks. Massive parallel computation, distributed workflow, and \npetabyte scale data distribution are only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are usually broken down into \ntwo areas. Computational grid systems are concerned with \nsolving complex scientific problems involving supercomputing \nscale resources dispersed across various organizational \nboundaries. The representative computational grid system is the \nGlobus Toolkit [4]. Globus is built on top of a web-services [5] \nsubstrate and provides resource management components, \ndistributed workflow and security infrastructure. Other \ncomputational grid systems provide similar capabilities. For \nexample, Alchemi [6] is a .NET-based grid technology that \nsupports distributed job scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is a light weight, Java-\nbased open source computational grid project whose goal is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive tasks across multiple machines.  \n\nThe other class of grid systems, Data grids, is involved in the \nmanagement, processing, and distribution of large data volumes to \ndisbursed and heterogeneous users, user types, and geographic \nlocations. There are several major data grid projects. The LHC \nComputing Grid [8] is a system whose main goal is to provide a \ndata management and processing infrastructure for the high \nenergy physics community. The Earth System Grid [9] is geared \ntowards supporting climate modeling research and distribution of \nclimate data sets and metadata to the climate and weather \nscientific community.  \n\nTwo independently conducted studies [10, 11] have \nidentified three key areas that the current grid implementations \nmust address more effectively in order to promote data and \nsoftware interoperability: (1) formality in grid requirements \nspecification, (2) rigorous architectural description, and (3) \ninteroperability between grid solutions. As we will discuss in this \npaper, our work to date on OODT has the potential to be a \nstepping stone in each of these areas: its explicit focus on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally \naddresses the three concerns.  \n\nThere have been several well-known efforts within the AI \nand database communities that have delved into the topic of \ninformation integration, or the shared access, search, and retrieval \nof distributed, heterogeneous information resources. Within the \npast decade, there has been significant interest in building \ninformation mediators that can integrate information from \nmultiple data sources. Mediators federate information by querying \nmultiple data sources, and fusing back the gathered results. The \nrepresentative systems using this approach include TSIMMS [12], \n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth \n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne  [19]. \nEach of these approaches focuses on fundamental algorithmic \ncomponents of information integration: (1) formulating \nexpressive, efficient query languages (such as Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and reliably \ndescribing both global, and source data models (e.g. 
the Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a \nmeans for global-to-source data model integration; and (4) \nimproving queries and deciding which data sources to query (e.g. \nquery reformulation [22] and query rewriting [22, 23]).  \n\nHowever, these algorithmic techniques fail to address the \nsoftware engineering side of information integration. For instance, \nexisting literature fails to answer questions such as, which of the \ncomponents in the different systems\u2019 architectures are common; \nhow can they be reused; which portions of their implementations \nare tied to (which) software components; which software \nconnectors are the components using to interact; are the \ninteraction mechanisms replaceable (e.g., can a client-server \ninteraction in Ariadne become a peer-to-peer interaction); and so \non. Additionally, none of the above related mediator systems have \nformalized a process for designing, implementing, deploying, and \nmaintaining the software components belonging to each system.  \n\nSeveral middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI [25], and more recently \nSOAP and Web services [5] have been suggested as \u201csilver \nbullets\u201d that address the problem of integrating and utilizing \nheterogeneous software computing and data resources. Each of \nthese technologies provides three basic services: (1) an \n\nimplementation and composition framework for software \ncomponents, possibly written in different languages but \nconforming to a specific middleware interface; (2) a naming \nregistry used to locate components; and (3) a set of basic services \nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity.  \n\nAlthough middleware is very useful \u201cglue\u201d that can connect \nsoftware components written in different languages or deployed \nin heterogeneous environments, middleware technologies do not \nprovide any \u201cout of the box\u201d services that deal with computing \nand data resource management across organizational boundaries \nand across computing environments at a national scale. These \nkinds of services usually have to be engineered into the \nmiddleware itself. We should note that in grid computing such \nservices are explicitly called out and provided at a higher layer of \nabstraction. In fact, the combination of these higher-level grid \nservices and an underlying middleware platform is typically \nreferred to as a \u201cgrid technology\u201d [11].  \n\n4. OODT ARCHITECTURE \nOODT\u2019s architecture is a reference architecture that is \n\nintended to be instantiated and tailored for use across science \ndomains and projects. The reference architecture comprises \nseveral components and connectors.  A particular instance of this \nreference architecture, that of NASA\u2019s planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts only with \nthe designated external data sources outside its sandbox. OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n \nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation \n\n\n\ncomponents are responsible for delivering data from \nheterogeneous data stores, identifying and locating data within the \nsystem, and ingesting and processing data into underlying data \nstores. 
The connectors are responsible for integrating OODT with \nheterogeneous data sources; providing reliable messaging to the \nsoftware components; marshalling resource descriptions and \ntransferring data between components; transactional \ncommunication between components; and security related issues \nsuch as identification, authorization, and authentication. In this \nsection, we describe the guiding principles behind the reference \narchitecture. We then describe each of the OODT reference \ncomponents and connectors in detail. In Section 5, we describe \nspecific instantiations of the reference architecture in the context \nof several projects that are using OODT. \n\n4.1 Guiding Principles \nThe software engineering challenges discussed in Section 2 \n\nmotivated and framed the development of OODT. Conquering \nthese challenges led us to a set of four guiding principles behind \nthe OODT reference architecture.  \n\nThe first guiding principle is division of labor. Each \ncapability provided by OODT (e.g., processing, ingestion, search, \nand retrieval of data, access to heterogeneous data, and so on) is \ncarefully divided among separate, independent architectural \ncomponents and connectors. As will be further detailed below, the \nprinciple is upheld through OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, dynamism, and \ndecentralization challenges. \n\nClosely related to the preceding principle is technology \nindependence. This principle involves keeping up-to-date with the \nevolution of software technology (both in-house and third-party), \nwhile avoiding tying the OODT architecture to any specific \nimplementation. By allowing us to select the technology most \nappropriate to a given task or specific need, this principle helps us \nto address the challenges of complexity, scalability, security, \ndistribution, location transparency, performance, and dynamism.  \nFor instance, OODT\u2019s initial reference implementation used \nCORBA as the substrate for its messaging layer connector. When \nthe CORBA vendor decided to begin charging JPL significant \nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution that would be free to its users), the principle of \ntechnology independence came into play. Because the OODT \nmessaging layer connector supports a wrapper interface around \nthe lower-level distribution technology, we were able to replace \nour initial CORBA-based connector with one using Java\u2019s open \nsource RMI middleware, and redeploy the new connector to the \nOODT user sites, within three person days.  \n\nAnother guiding principle of OODT is the distinguishing of \nmetadata as a first-class citizen in the reference architecture, and \nseparating metadata from data. The job of metadata (i.e., \u201cdata \nabout data\u201d) is to describe the data universe in which the system \nis operating. Since OODT is meant to be a technology that \nintegrates diverse data sources, this data universe is highly \nheterogeneous and possibly dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to locate and \ndescribe the actual data in which she is interested. On the other \nhand, the job of data in OODT is to describe physical or scientific \nphenomena; it is the ultimate end user product that an OODT \nsystem should deliver. 
This principle helps to address the \n\nchallenges of heterogeneity, autonomy of data providers, and \ndecentralization. \n\nSeparating the data model from the software is another key \nprinciple behind the reference architecture. Akin to ontology/data-\ndriven systems, OODT components should not be tied to the data \nand metadata that they manipulate. Instead, the components \nshould be flexible enough to understand many (meta-)data models \nused across different scientific domains, without reengineering or \ntailoring of the component implementations. This principle helps \nto address the challenges of complexity and heterogeneity. \n\nThese four guiding principles are reified in a reference \narchitecture comprising four pairs of component types and two \nclasses of connectors organized in a canonical structure. One \ninstantiation of the reference architecture reflecting the canonical \nstructure is depicted in Figure 1.  Each OODT architectural \nelement (component and connector) serves a specific purpose, \nwith its functionality exported through a well-defined interface.  \nThis supports OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, if necessary dynamically (i.e., at runtime), \nelements of a given type. It also allows us to introduce flexibility \nin the individual instances of the reference architecture while, at \nthe same time, controlling the legal system configurations.  \nFinally, the explicit connectors and well-defined component \ninterfaces allow OODT in principle to integrate with a wide \nvariety of third-party systems (e.g., [26]).  The outcome of the \nguiding principles (described above) and design decisions \n(detailed below) is an architecture that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components \n4.2.1 Product Server and Product Client \n\nThe Product Server is used to retrieve data from \nheterogeneous data stores. The product server accepts a query \nstructure that identifies a set of zero or more products which \nshould be returned the issuer of the query. A product is a unit of \ndata in OODT and represents anything that a user of the system is \ninterested in retrieving: a JPEG image of Mars, an MS Word \ndocument, a zip file containing text file results of a cancer study, \nand so on. Product servers can be located at remote data sites, \ngeographically and/or institutionally disparate from other OODT \ncomponents. Alternatively, product servers can be centralized, \nlocated at a single site. The objective of the product server is to \ndeliver data from otherwise heterogeneous data stores and \nsystems. As long as a data store (or system) provides some kind \nof access interface to get its data, a product server can \u201cwrap\u201d \nthose interfaces with the help of Handler connectors described in \nSection 4.3 below. \n\nThe Product Client component communicates with a product \nserver via the Messaging Layer connectors described in Section \n4.3. A product client resides at the end-user\u2019s (e.g., scientist\u2019s) \nsite.  It must know the location of at least one product server, and \nthe query structure that identifies the set of products that the user \nwants to retrieve. At the same time, it is completely insulated \nfrom any changes in the physical location or actual representation \nof the data; its only interface is to the product server(s).  Many \nproduct clients may communicate with the same product server, \nand many product servers can return data to the same product \nclient. 
This adds flexibility to the architecture without introducing \nunwanted long-term dependencies: a product client can be added, \n\n\n\nremoved, or replaced with another one that depends on different \nproduct servers, without any effect on the rest of the architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server manages resource description \n\ninformation, i.e., metadata, in a system built with OODT. \nResource description information is divided into three main \ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, Last \n\nModified Date, Last Revised By. This information is kept \nabout the resource descriptions themselves and is used by the \nprofile server to inventory and catalog resource descriptions. \nThis is a fixed set of metadata. \n\n\u2022 Resource Information \u2013 This includes metadata such as Title, \nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This information is kept for all the data in the \nsystem, and is an extended version of the Dublin Core \nMetadata for describing electronic resources [27]. This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information \u2013 This includes metadata \nspecific to a particular data domain. For instance, in a cancer \nresearch system this may include metadata such as Blood \nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of metadata is flexible and is expected to change. \n\nAs with product servers, profile servers can be decentralized at \nmultiple sites or centralized at a single site. The objective of the \nprofile server is to deliver metadata that gives a user enough \ninformation to locate the actual data within OODT regardless of \nthe underlying system\u2019s exact configuration, and degrees of \ncomplexity and heterogeneity; the user then retrieves the data via \none or more product servers. Because profile servers do not serve \nthe actual data, they need not have a direct interface to the data \nthat they describe. In addition to the complete separation of duties \nbetween profile and product servers, this ensures their location \nindependence, allows their separate evolution, and minimizes the \neffects of component and/or network failures in an OODT system. \n\nProfile Client components communicate with profile servers \nover the messaging layer connectors. The client must know the \nlocation of the profile server, and must provide a query that \nidentifies the metadata that a user is interested in retrieving. There \ncan be many profile clients speaking with a single profile server, \nand many profile servers speaking with a single profile client.  \nThe architectural effects are analogous to those in the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query Client \nThe Query Server component provides an integrated search \n\nand retrieval capability for the OODT reference architecture. \nQuery servers interact with profile and product servers to retrieve \nmetadata and data requested by system users. A query server is \nseeded with an initial set of references to profile servers. Upon \nreceiving a query from a user, the query server passes it along to \neach profile server from its list, and collects the metadata \nreturned. Part of this metadata is a resource location (recall \nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct server, to a web site with the actual data, or to some \nexternal data providing system. 
This directly supports \nheterogeneity, location transparency, and autonomy of data \nproviders in OODT.  \n\nAnother novel aspect of OODT\u2019s architecture is that if a \nprofile server is unable to service the query, or if it believes that \n\nother profile servers it is aware of may contain relevant metadata, \nit will return the URIs of those profile servers; the query server \nmay then forward the query to them. As a result, query servers are \ncompletely decoupled from product servers (and from any \n\u201cexposed\u201d external data sources), and are also decoupled from \nmost of the profile servers. In turn, this lessens the complexity of \nimplementing, integrating, and evolving query servers. Once the \nresource metadata is returned, the query server will either allow \nthe user herself to use the supplied URIs to find the data in which \nshe was interested (interactive mode), or it will retrieve, package, \nand deliver the data to the user (non-interactive mode). As with \nthe product and profile servers, query servers can be centrally \nlocated at a single site, or they can be decentralized across \nmultiple sites.   \n\nQuery Client components communicate with the query \nservers. The query client must provide a query server with a query \nthat identifies the data in which the user is interested, and it must \nset a mode for the query server (interactive or non-interactive \nmode). The query client may know the location of the query \nserver that it wants to contact, or it may rely on the messaging \nlayer connector to route its queries to one or more query servers.   \n\n4.2.4 Catalog and Archive Server and Client \nThe Catalog and Archive Server (CAS) component in OODT \n\nis responsible for providing a common mechanism for ingestion \nof data into a data store, including any processing required as a \nresult of ingestion. For instance, prior to the ingestion of a poor-\nresolution image of Mars, the image may need to be refined and \nthe resolution improved. CAS would handle this type of \nprocessing. Any data ingested into CAS must include associated \nmetadata information so that the data can be cataloged for search \nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore for preservation, and the corresponding metadata is sent to \nthe associated catalog. The data store and catalog need not be \nlocated on the same host; they may be located on remote sites \nprovided there is an access mechanism to store and retrieve data \nfrom each. The goal of CAS is to streamline and standardize the \nprocess of adding data to an OODT-aware system.  Note that a \nsystem whose data stores were populated prior to its integration \ninto OODT can still use CAS for its new data.  Since the CAS \ncomponent populates data stores and catalogs with both data and \nmetadata, specialized product and profile server components have \nbeen developed to serve data and metadata from the CAS backend \ndata stores and catalogs more efficiently. Any older data can still \nbe served with existing product and profile servers. \n\nThe Archive Client component communicates with CAS. The \narchive client must know the location of the CAS component, and \nmust provide it with data to ingest. Many archive clients can \ncommunicate with a single CAS component, and vice versa.  Both \nthe archive client and CAS components are completely \nindependent of the preceding three pairs of component types in \nthe OODT reference architecture. 
\n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler connectors are responsible for enabling the \ninteraction between OODT\u2019s components and third-party data \nstores.  A handler connector performs the transformation between \nan underlying (meta-)data store\u2019s internal API for retrieving data \nand its (meta-)data format on the one hand, and the OODT system \n\n\n\non the other. Each handler connector is typically developed for a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS such as Oracle, and a given internal representation \nschema for metadata, a generic Oracle handler connector is \ntypically developed and then reused. Similarly, for a given \nfilesystem scheme for storing data, a generic filesystem handler \nconnector is developed and reused across like filesystem data \nstores.  \n\nEach profile server and product server relies on one or more \nhandler connectors. Profile servers use profile handlers, and \nproduct servers use query handlers. Handler connectors thereby \ncompletely insulate product and profile servers from the third-\nparty data stores.  Handlers also allow for different types of \ntransformations on (meta-)data to be introduced dynamically \nwithout any effect on the rest of OODT components. For \nexample, a product server that distributes Mars image data might \nbe serviced by a query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image files of the latest summit \nclimbed by a Mars rover; if the system ends up experiencing \nperformance problems, another handler may be (temporarily) \nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same scenario. Likewise, a profile server may have two \nprofile handler connectors, one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and another that returns \ninstrument metadata about Mars rover images (e.g., instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector \nThe Messaging Layer connector is responsible for \n\nmarshalling data and metadata between components in an OODT \nsystem. The messaging layer must keep track of the locations of \nthe components, what types of components reside in which \nlocations, and if components are still running or not. Additionally, \nthe messaging layer is responsible for taking care of any needed \nsecurity mechanisms such as authentication against an LDAP \ndirectory service, or authorization of a user to perform certain \nrole-based actions. \n\nThe messaging layer in OODT provides synchronous \ninteraction among the components, and some delivery guarantees \non messages transferred between the software components. \nTypically in any large-scale data system, the asynchronous mode \nof interaction is not encouraged because partial data transfers are \nof no use to users such as scientists who need to make analysis on \nentire data sets. \n\nThe messaging layer supports communication between any \nnumber of connected OODT software components. In addition, \nthe messaging layer natively supports connections to other \nmessaging layer connectors as well.  This provides us with the \nability to extend and adapt an OODT system\u2019s architecture, as \nwell as easily tailor the architecture for any specific interaction \nneeds (e.g., by adding data encryption and/or compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. 
JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health (NIH), the National Cancer Institute \n(NCI), several research universities, and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all using \nOODT in some form or fashion. OODT is also available for \ndownload through a large open-source software distributor [29]. \n\nOODT components are found in planetary science, earth science, \nbiomedical, and clinical research projects. In this section, we \ndiscuss our experience with OODT in several representative \nprojects within these scientific areas. We compare and contrast \nhow the projects were handled before and after OODT. We sketch \nsome of the domain-specific technical challenges we encountered \nand identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs a deployment \narchitecture from one or more of the reference OODT \ncomponents (e.g., product and profile servers), and the reference \nOODT connectors. The user must determine if any existing \nhandler connectors can be reused, or if specialized handler \nconnectors need to be developed. Once all the components are \nready, the user has two options for deploying her architecture to \nthe target hosts: (1) the user may translate her design into a \nspecialized OODT deployment descriptor XML file, which can \nthen be used to start each program on the target host(s); or (2) the \nuser can deploy her OODT architecture using a remote server \ncontrol component, adding components, and connectors via a \ngraphical user interface. The GUI allows the user to send \ncomponent and connector code to the target hosts, to start, shut-\ndown, and restart the components and connectors, and to monitor \ntheir health during execution. \n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of \nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. Each node resides at a different U.S. university or \ngovernment agency, and is managed autonomously.  \n\nFor many years PDS distributed its data and metadata on \nphysical media, primarily CD-ROM. Each CD-ROM was \nformatted a according to a \u201chome-grown\u201d directory layout \nstructure called an archive volume, which later was turned into a \nPDS standard. PDS metadata was constructed using a common, \nwell-structured set of 1200 metadata elements, such as Target \nName and Instrument Type, that were identified from the onset of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the advent of the WWW and the increasing data volumes of \nmissions led NASA managers to impose a new paradigm for \ndistributing data to the users of the PDS: data and metadata were \nnow to be distributed electronically, via a single, unified web \nportal. The web portal and accompanying infrastructure to \ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven of the scientific discipline nodes across the country. \nAlthough the entire data volume across PDS at the time was \naround 7 terabytes, it was estimated that the volume would grow \nto 10 terabytes by 2004. Consequently, the system needed to be \nscalable and respond to large growth spurts caused by new data \nproducing missions. 
The flexibility and modularity of the OODT \nproduct and profile server components were particularly useful in \nthis regard. Using a product and/or profile server, each new data \nproducing system in the PDS could be dynamically \u201cplugged in\u201d \nto the existing PDS infrastructure that we constructed, without \ndisturbing existing components and processes.  \n\nWe also faced the problem of heterogeneity. Almost every \nnode within PDS had a different operating system, ranging from \nLinux, to Windows, to Solaris, to Mac OS X.  Each node \n\n\n\nEDRN \nQuery \nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS \n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS \n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile \nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT \u201cSandbox\u201d\n\nProduct \nServer\n\nProfile \nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive Server\n\nLung Images \n(Filesystem)\n\nOther \nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web \nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct \nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT \nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT \ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n \n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture Instantiation \n\nmaintained its own local catalog system. Although each node in \nPDS had different file system implementations dictated by their \nOS, each node stored their data and metadata according to the \narchive volume structure. Because of this, we were able to write a \nsingle, reusable PDS Query Handler which could serve back \nproducts from a PDS archive volume structure located on a file \nsystem. Plugging into each node\u2019s catalog system proved to be a \nsignificant challenge. For nearly all of the nodes, specialized \nprofile handler connectors were constructed to interface with the \nunderlying catalog systems, which ranged from static text files \ncalled PDS label files to dynamic web site inventory systems \nconstructed using Java Server Pages. Because each of the catalogs \ntagged PDS data using the common set of 1200 elements, we \nwere able to share much of the code base among the profile \nhandler connectors, ultimately only changing the portion of the \ncode that made the particular JSP page call, or read the selected \nset of metadata from the label file. The entire code base of the \nPDS including all the domain specific handler connectors is only \nslightly over 15 KSLOC, illustrating the high degree of \nreusability provided by the OODT framework. \n\n5.2 Early Detection Research Network \nOODT is also supporting the National Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN is a \ndistributed research program that unites researchers from over \nthirty institutions across the United States. Tens of thousands of \nscientists participate in the EDRN. Each institution is focused on \nthe discovery of cancer biomarkers as indicators for disease [31]. 
\n\nA critical need for the EDRN is an electronic infrastructure to \nsupport discovery and validation of these markers.  \n\nIn 2001 we worked with the EDRN program to develop the \nfirst component of their electronic biomarker infrastructure called \nthe EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to provide real-time access to bio-\nspecimen information across the institutions of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of \nspecimen images, and location and contact metadata for obtaining \nthe specimen from its origin study institution. The previous \nmethod of obtaining bio-specimen information was very human-\nintensive: it involved phone calls and some forms of electronic \ncommunication such as email. Specimen information was not \nsearchable across institutions participating in the EDRN. The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with \ncurrent holdings at each participating institution.  \n\nOne of the initial technical challenges we faced with EDRN \nwas scale. The EDRN was over three times as large as the PDS. \nBecause of this we chose to target ten institutions initially, rather \nthan the entire set of thirty one. Again, OODT\u2019s modularity and \nscalability came into play as we could phase deployment at each \ndeployment institution. As we instantiated new product, profile, \nquery, and archive servers at each institution, we could do so \nwithout interrupting any existing OODT infrastructure already \ndeployed.  \n\nAnother challenge that we encountered was dealing with \neach participating site\u2019s Institutional Review Board (IRB). An \nIRB is required to review and ensure compliance of projects with \n\n\n\nfederal laws related to working with data from research projects \ninvolving human subjects. To satisfy the IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate \nsecurity capability in order to get approval to share the data \nexternally from an institution. OODT\u2019s separation of data and \nmetadata explicitly allowed us to satisfy this requirement. We \ndesigned ERNE so that each institution could remain in control of \ntheir specimen holding data by instantiating product server \ncomponents at each site, rather than distributing the information \nacross ERNE which would have violated the IRB agreements.  \n\nAnother significant challenge we faced in developing ERNE \nwas lack of a consistent metadata model for each ERNE site. We \nwere forced to develop a common specimen metadata model and \nthen to create specific mappings to link each local site to the \ncommon model. OODT aided us once again in this endeavor as \nthe common mappings we developed were easily codified into a \nquery handler connector, and reused across each ERNE site.  \n\nThe entire code base of ERNE, including all its specialized \nhandler connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of reusability of the shared framework code base \nand the handler code base. \n\n \n\n5.3 Science Processing Systems \nOODT has also been deployed in several science processing \n\nsystem missions both, operational and under development. Due to \nspace limitations, we can only briefly summarize each of the \nOODT deployments in these systems.  
\n\nSeaWinds, a NASA-funded earth science instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS \ncomponent as a workflow and processing component for its \nProcessing and Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during its six year mission. CAS was \nused to control the execution and data flow of mission-specific \ndata processor components, which calibrated and created derived \ndata products from raw instrument data, and archived those \nproducts for distribution into the data store managed by CAS. A \nmajor challenge we faced during the development of SeaPAC was \nthat  the processor components were developed by a group \noutside of the SeaWinds project. We had to provide a mechanism \nfor integrating their source code into the OODT SeaPAC \nframework. OODT\u2019s separation of concerns allowed us to address \nthis issue with relative ease: once the data processors were \nfinished, we were able wrap and tailor them internally within \nCAS, without disturbing the existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led to its reuse on \nseveral different missions. Another earth science mission called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing components with CAS, using the SeaWinds experience \nas an example. The Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is currently under development, is \nalso utilizing CAS to ingest and process existing FTS CO2 \nspectrometer data from earth-based instruments. The James Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand processing capabilities for astrophysics data and metadata. \nEach of these science processing systems will face similar \ntechnical challenges, including separation of concerns between \nthe actual processing framework and the developers writing the \nprocessor code, the volume of data that must be handled by the \nprocessing system (OCO is projected to produce over 150 \nterabytes), and the flexibility and tailorability of the workflow \n\nneeded to process the data. We believe that OODT is uniquely \npositioned to address these difficult challenges. \n\n5.4 Computer Modeling Simulation and \nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling Simulation and Visualization (CMSV) community at \nJPL, by linking together several institutional model repositories \nacross the organizations within the lab, and creating a web portal \ninterface to query the integrated model repositories. We \ndeveloped specialized profile server components that locate and \nlink to different model resources across JPL, such as power \nsubsystem models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different spacecraft assembly parts, and \nsystems architecture models for engineering and design of \nspacecraft. Each of these different model types lived in separate \nindependent repositories across JPL. For instance, the CAD \nmodels were stored in a commercial product called TeamCenter \nEnterprise [32], while the power and systems architecture models \nwere stored in a commercial product called Xerox Docushare \n[33].  \n\nTo integrate these model repositories for CMSV, we had to \nderive a common set of metadata across the wide spectrum of \ndifferent model types that existed at JPL. 
OODT\u2019s separation of \ndata from metadata allowed us to rapidly instantiate our common \nmetadata model once we developed it, by constructing specialized \nprofile handler connectors that mapped each repository\u2019s local \nmodel to the common model. Reusability levels were high across \nthe connectors, resulting in an extremely small code base of 2.57 \nKSLOC.  \n\nAnother challenge in light of this mapping activity was \ninterfacing with the APIs of the underlying model repositories. In \nthe above two cases, the APIs were commercial products, and \npoorly documented. In some cases, such as the Docushare \nrepository, the APIs did not fully conform to their stated \nspecifications. The division of labor amongst OODT components \ncame into play on this task. It allowed us to focus on deploying \nthe rest of the OODT supporting infrastructure, such as the web \nportal, and the profile handler connectors, and not getting stalled \nwaiting for the support teams from each of the commercial \nvendors to debug our API problems. Once the OODT CMSV \ninfrastructure was deployed, the modeling and simulation \ncommunity at JPL immediately began adopting it and sharing \ntheir models across the lab. During the past year, the system has \nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a data \n\ndistribution and management solution that satisfied the formidable \nrequirements outlined in this paper, it was not clear to us initially \nhow to approach the problem.  On the surface, several applicable \nsolutions already existed (middleware, information integration \nsystems, and the emerging grid technologies).  Adopting one of \nthem seemed to be a preferable path because it would have saved \nus precious time.  However, upon closer inspection we realized \nthat each of these options could be instructive, but that none of \nthem solved the problem we were facing (and that even some of \nthese technologies themselves were facing). \n\nThe observation that directly inspired OODT was that we \nwere dealing with software engineering challenges, and that those \n\n\n\nchallenges naturally required a software engineering solution.  \nOODT is a large, complex, dynamic system, distributed across \nmany sites, servicing many different users, and classes of users, \nwith large amounts of heterogeneous data, possibly spanning \nmultiple domains. Software engineering research and practice \nboth suggest that success in developing such a system will be \ndetermined to a large extent by the system\u2019s software \narchitecture.  It therefore became imperative that we rely on our \nexperience within the domain of data-intensive systems (e.g., \nJPL\u2019s PDS project), as well as our study of related research and \npractice, in order to develop an architecture for OODT that will \naddress the challenges we discussed in Section 2.  Once the \narchitecture was designed and evaluated, OODT\u2019s initial \nimplementation and its subsequent adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, but as software \nengineering researchers and practitioners disappointed, that \nOODT still appears to be the only system of its kind. The \nintersection of middleware, information management, and grid \ncomputing is rapidly growing, yet it is still characterized by one-\noff solutions targeted at very specific problems in specific \ndomains. 
Unfortunately, these solutions are sometimes clever by \naccident and more frequently little more than \u201chacks\u201d.  We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, more broadly applicable, and certainly more helpful to \ndevelopers of future systems in this area.  We consider OODT\u2019s \ndemonstrated ability to evolve and its applicability in a growing \nnumber of science domains to be a testament to its explicit, \ncarefully crafted software architecture. \n\n7. ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet \n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. Effort also supported by the National Science \nFoundation under Grant Numbers CCR-9985441 and ITR-\n0312780.  \n\n8. REFERENCES \n[1] A. Chervenak, I. Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of \nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. \n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site (http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid (http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., \"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling \nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. \n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of the AAAI Spring Symposium on Information \nGathering in Heterogeneous, Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. \n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the \nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. \n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. Raschid, et al., \"A data model and query \nprocessing techniques for scaling access to distributed \nheterogeneous databases in disco,\" IEEE Transactions on \nComputers, 1997. \n\n[18] Y. Arens, C. A. 
Knoblock, et al., \"Query Reformulation for \nDynamic Information Integration,\" Journal of Intelligent \nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, et al., \"Ariadne: A system for \nconstructing mediators for internet sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, \"An Expressive and Efficient \nLanguage for Information Gathering on the Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling \n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, \"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. Lambrecht and S. Kambhampati, \"Planning for Information \nGathering:  A Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] \"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et al., \"GLIDE:  A Grid-based \n\nLightweight Infrastructure for Data-intensive Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the \nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource \n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel Foundation: Request Object Oriented Data \n\nTechnology (OODT) - \n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. \n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case Study in the Development and Management of Meta-\nData for a Scientific Digital Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. \n\n[32] \"UGS Products: TeamCenter \n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document Management | Xerox Docushre \n(http://docushare.xerox.com/ds/),\" 2005. \n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and Archive Server and Client\n\n\tOODT Connectors\n\tHandler Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience Processing Systems\n\tComputer Modeling Simulation and Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n",
                  "X-TIKA:parse_time_millis": "11123",
                  "access_permission:assemble_document": "true",
                  "access_permission:can_modify": "true",
                  "access_permission:can_print": "true",
                  "access_permission:can_print_degraded": "true",
                  "access_permission:extract_content": "true",
                  "access_permission:extract_for_accessibility": "true",
                  "access_permission:fill_in_form": "true",
                  "access_permission:modify_annotations": "true",
                  "created": "Wed Feb 15 13:13:58 PST 2006",
                  "creator": "End User Computing Services",
                  "date": "2006-02-15T21:16:01Z",
                  "dc:creator": "End User Computing Services",
                  "dc:format": "application/pdf; version=1.4",
                  "dc:title": "Proceedings Template - WORD",
                  "dcterms:created": "2006-02-15T21:13:58Z",
                  "dcterms:modified": "2006-02-15T21:16:01Z",
                  "grobid:header_Abstract": "Modern scientific research is increasingly conducted by virtual communities of scientists distributed around the world. The data volumes created by these communities are extremely large, and growing rapidly. The management of the resulting highly distributed, virtual data systems is a complex task, characterized by a number of formidable technical challenges, many of which are of a software engineering nature. In this paper we describe our experience over the past seven years in constructing and deploying OODT, a software framework that supports large, distributed, virtual scientific communities. We outline the key software engineering challenges that we faced, and addressed, along the way. We argue that a major contributor to the success of OODT was its explicit focus on software architecture. We describe several large-scale, real-world deployments of OODT, and the manner in which OODT helped us to address the domain-specific challenges induced by each deployment.",
                  "grobid:header_AbstractHeader": "ABSTRACT",
                  "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA",
                  "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology ; 2 Computer Science Department University of Southern California",
                  "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
                  "grobid:header_BeginPage": "-1",
                  "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
                  "grobid:header_Email": "{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
                  "grobid:header_EndPage": "-1",
                  "grobid:header_Error": "true",
                  "grobid:header_FirstAuthorSurname": "Mattmann",
                  "grobid:header_FullAffiliations": "[Affiliation{name='null', url='null', institutions=[California Institute of Technology], departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', postCode='91109', postBox='null', region='CA', settlement='Pasadena', addrLine='null', marker='1', addressString='null', affiliationString='null', failAffiliation=false}, Affiliation{name='null', url='null', institutions=[University of Southern California], departments=[Computer Science Department], laboratories=null, country='USA', postCode='90089', postBox='null', region='CA', settlement='Los Angeles', addrLine='null', marker='2', addressString='null', affiliationString='null', failAffiliation=false}]",
                  "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, Nenad Medvidovic, Steve Hughes]",
                  "grobid:header_Item": "-1",
                  "grobid:header_Keyword": "Categories and Subject Descriptors D2 Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data Management, Software Architecture",
                  "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain Specific Architectures  (type:subject-headers), Keywords  (type:subject-headers), OODT, Data Management, Software Architecture  (type:subject-headers)]",
                  "grobid:header_Language": "en",
                  "grobid:header_NbPages": "-1",
                  "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
                  "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications",
                  "meta:author": "End User Computing Services",
                  "meta:creation-date": "2006-02-15T21:13:58Z",
                  "meta:save-date": "2006-02-15T21:16:01Z",
                  "modified": "2006-02-15T21:16:01Z",
                  "pdf:PDFVersion": "1.4",
                  "pdf:encrypted": "false",
                  "producer": "Acrobat Distiller 6.0 (Windows)",
                  "resourceName": "ICSE06.pdf",
                  "title": "Proceedings Template - WORD",
                  "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
                  "xmpTPg:NPages": "10"
              }
          ]
          

          Great work, Sujen Shah. I'm going to commit this now and start work on the Wiki page!
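
          For anyone who wants to script this check rather than eyeball the cURL output, here is a minimal Python sketch. It assumes the tika-server started with the command above is still listening on localhost:9998, that the GROBID config/classpath is in place, and that the third-party requests library is installed; the grobid:* key names are the ones visible in the JSON output above.

          import requests  # third-party HTTP client; assumed to be installed

          # PUT a PDF to the running tika-server /rmeta endpoint (the same endpoint
          # the cURL command above hits) and print only the GROBID-derived keys
          # from the returned JSON array of metadata maps.
          url = "http://localhost:9998/rmeta"
          pdf_path = "ICSE06.pdf"  # any journal-style PDF

          with open(pdf_path, "rb") as f:
              resp = requests.put(
                  url,
                  data=f,
                  headers={"Content-Disposition": "attachment;filename=ICSE06.pdf"},
              )
          resp.raise_for_status()

          for doc in resp.json():  # /rmeta returns a list, one entry per parsed document
              for key, value in sorted(doc.items()):
                  if key.startswith("grobid:"):
                      print(key, "=", value)

          Run against the server above, this should print entries such as grobid:header_Title, grobid:header_FullAuthors, and grobid:header_Abstract, matching the metadata shown in the output.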

          Show
          chrismattmann Chris A. Mattmann added a comment - I got this working! Starting Tika Server java -Dorg.apache.tika.service.error.warn=true -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.server.TikaServerCli --config tika-config.xml cURL command to test curl -T $HOME/git/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta | python -mjson.tool Output [ { "Author": "End User Computing Services", "Company": "ACM", "Content-Type": "application/pdf", "Creation-Date": "2006-02-15T21:13:58Z", "Last-Modified": "2006-02-15T21:16:01Z", "Last-Save-Date": "2006-02-15T21:16:01Z", "SourceModified": "D:20060215211344", "X-Parsed-By": [ "org.apache.tika.parser.CompositeParser", "org.apache.tika.parser.journal.JournalParser" ], "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2 Daniel J. Crichton1 Nenad Medvidovic2 Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature. In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment. \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile. There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on. \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration. Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software. They also actively and regularly collaborate with \n\ncolleagues around the world, something that has become possible \nonly relatively recently, again ultimately thanks to software. 
They \nare collecting, producing, sharing, and disseminating large \namounts of data, which are growing by orders of magnitude in \nvolume in remarkably short time periods. \n\nIt is this latter problem that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years ago. Until recently, \nJPL would disseminate data collected by various instruments \n(Earth-based, orbiting, and in outer space) to the interested \nscientists around the United States by \u201cburning\u201d CD-ROMs and \nmailing them via the U.S. Postal Service. In addition to being \nslow, sequential, unidirectional, and lacking interactivity, this \nmethod was expensive, costing hundreds of thousands of dollars. \nFurthermore, the method was prone to security breaches, and the \nexact data distribution (determining which data goes to which \ndestinations) had to be calculated for each individual shipment. It \nhad become increasingly difficult to manage this process as the \nnumber of projects and missions, as well as involved scientists, \ngrew. An even more critical limiting factor became the sheer \nvolume of data that the current (e.g., Planetary Data System, or \nPDS), pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), to hundreds of terabytes \n(MRO), to petabytes or more (LRO). Clearly, spending millions \nof dollars just to distribute the data to scientists is impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore \nconstruction of an end-to-end software framework that would \nlower the cost of distributing and managing scientific data, from \nthe inception of data at a science processing center to its ultimate \narrival on the desks of interested users. Because of increasing data \nvolumes, the framework had to be scalable and have native \nsupport for evolution to hundreds of sites and thousands of data \ntypes. Additionally, the framework had to enable the \nvirtualization of heterogeneous data (and processing) sources, and \nto address wide-scale (national and international) distribution of \ndata. The framework needed to be flexible: it needed to support \nfully automated processing of data throughout its lifecycle, while \nstill allowing interactivity and intervention from an operator when \nneeded. Furthermore because data is itself distributed across \nNASA agencies, any software framework that distributes NASA\u2019s \ndata would require the capability for tailorable levels of security \nand for varying types of users belonging to multiple \norganizations. \n\nThere were also miscellaneous issues of data ownership that \nneeded to be overcome. Ultimately, because NASA\u2019s science data \nis so distributed, the owners of data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard copies of all or part of this work for \npersonal or classroom use is granted without fee provided that copies are \nnot made or distributed for profit or commercial advantage and that \ncopies bear this notice and the full citation on the first page. To copy \notherwise, or republish, to post on servers or to redistribute to lists, \nrequires prior specific permission and/or a fee. \nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 1-58113-000-0/00/0004\u2026$5.00. 
\n \n\n\n\nScience Principal Investigator) feel hard pressed to control their \ndata, as the successful operation and maintenance of their data \nsystems are essential services that they provide. As such, any \nframework that virtualizes science data sources across NASA \nshould be transparent and unobtrusive: it should enable \ndissemination and retrieval of data across data systems, each of \nwhich may have their own external interfaces and services; at the \nsame time, it should enable scientists to maintain and operate their \ndata systems independently. Finally, to lower costs, once the \nframework was built and installed, it needed to be reusable, free, \nand distributable to other NASA sites and centers for use. \n\nOver the past seven years we have designed, implemented \nand deployed a framework called OODT (Object Oriented Data \nTechnology) that has met these rigorous demands. In this paper \nwe discuss the significant software engineering challenges we \nfaced in developing OODT. The primary objective of the paper is \nto demonstrate how OODT\u2019s explicit software architectural basis \nenabled us to effectively address these challenges. In particular, \nwe will detail the architectural decisions we found most difficult \nand/or critical to OODT\u2019s ultimate success. We highlight several \nrepresentative examples of OODT\u2019s use to date both at NASA \nand externally. We contrast our solution with related approaches, \nand argue that a major differentiator of this work, in addition to its \nexplicit architectural foundation, is its native support for \narchitecture-based development of distributed scientific \napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, we needed to address several significant \nsoftware engineering challenges, the bulk of which surfaced in \nlight of the complex data management and distribution issues \nregularly faced within a distributed, large-scale government \norganization such as NASA. In this paper we will focus on nine \nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, Dynamism, Scalability, Distribution, \nDecentralization, and Performance. \n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, \nmulti-user, complex system. At the software level, complexity \nranged from understanding how to install, integrate, and manage \nthe software remotely deployed at participating organizations, to \nunderstanding how to manage information such as access \nprivileges and security credentials across both NASA and non-\nNASA sites. There were also complexities at the software \nnetworking layer, including varying firewall capabilities at each \ninstitution, and data repositories that would periodically go offline \nand needed to be remotely restarted. Just understanding the \nvarying types of data held at sites linked together via OODT was \na significant task. Even sites within the same science domain \n(e.g., planetary science) describe similar data sets in decidedly \ndifferent ways. Discerning in what ways these different data \nmodels were common and what attributes of data could be shared, \ndone away with, or amended, was a huge challenge. Finally, the \ndifferent interfaces to data, ranging from third-party, well-\nengineered database management systems, to in-house data \nsystems, ultimately to flat text file-based data was a particularly \ndifficult challenge that we had to hurdle. 
\n\nHeterogeneity \u2013 In order to drive down the data management \ncosts for science missions, the same OODT framework needed to \n\nspan multiple science domains. The domains initially targeted \nwere earth and planetary; this has subsequently been expanded to \nspace, biomedical sciences, and the modeling and simulation \ncommunities. As such, the same core set of OODT software \ncomponents, system designs, and implementation-level facilities \nhad to work across widely varying science domains. \n\nThe data management processes within the organizations that \nuse OODT also added to its heterogeneity. For instance, OODT \ncomponents needed to have interfaces with end users and support \ninteractive sessions, but also with scientific instruments, which \nmost likely were automatic and non-interactive. Scientific \ninstruments could push data to certain components in OODT, \nwhile other OODT components would need to distribute data to \nusers outside of OODT. End-users in some cases wanted to \nperform transformations on the data sent to them by OODT, and \nthen to return the data back into OODT. The framework needed to \nsupport scenarios such as these seamlessly. \n\nMany other constraints also imposed the heterogeneity \nrequirement on OODT. We can group these constraints into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly alluded above, discipline \n\nexperts who wanted to disseminate their data via OODT \nreally wanted the data to reside at their respective \ninstitutions. This constraint non-negotiable, and significantly \nimpacted the space of technical solutions that we could \ninvestigate for OODT. \n\n\u2022 Technical \u2013 Since OODT had to federate many different data \nholdings and catalogs, we faced the constraints of linking \nthem together and federating very different schemas and \nvarying levels of sophistication in the data system interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, SOAP) proved non-trivial to integrate. \nThe constraints enjoined by heterogeneity alone led us to \n\nrealize that the OODT framework would need to draw heavily \nfrom multiple areas. Database systems, although used \nsuccessfully for many years to manage large amounts of data at \nmany sites, lacked the flexibility and interface capability to \nintegrate data from other more crude APIs and storage systems \n(such as a PI-led web site). Databases also did not address the \ndistribution of data and \u201cownership\u201d issues. The advent of the \nweb, although a promising means for providing openness and \nflexible interfaces to data, would not alone address the issues such \nas multi-institutional security and access. Furthermore, its \nrequest/reply nature would not easily handle other distribution \nscenarios, e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has defined \u201cout of the box\u201d services for managing \ndata systems (e.g., GridFTP), but which utilized alone would not \naddress our other challenges (e.g., complexity). \n\nLocation Transparency \u2013 Even though data could potentially \nbe input into and output from the system from many \ngeographically disparate and distributed sites, it should appear to \nthe end-users as if the data flow occurred from a single location. 
\nThis requirement was reinforced by the need to dynamically add \ndata producers and consumers to a system supported by OODT, \nas will be further discussed below. \n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate how data providers should store, process, find, evolve, \nor retire their data. Instead, the framework needed to be \n\n\n\ntransparent, allowing data providers to continue with their regular \nbusiness processes, while managing and disseminating their \ninformation unobtrusively. \n\nDynamism \u2013 It is expected that data providers for the most part \nwill be stable organizations. However, there are cases in which \nnew data producing (occasionally) and consuming (frequently) \nnodes will need to be brought on-line. Back-end data sources need \nto be pluggable, with little or no direct impact on the end-user of \nthe OODT system, or on the organization that owns the data \nsource. New end-users (or client hosts) should also be able to \n\u201ccome and go\u201d without any disruption to the rest of the system. In \nthe end, we realized this meant the whole infrastructure must be \ncapable of some level of dynamism in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to manage large volumes of data, \nfrom at least hundreds of gigabytes at its inception to the current \nmissions which will produce hundreds of terabytes. The \nframework needed to support at least dozens of institutional data \nproviders (which themselves may have subordinate data system \nproviders), dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), thousands of users, hundreds of \ngeographic sites, and thousands of different data types to manage \nand disseminate. \n\nDistribution \u2013 The framework should be able to handle the \nphysical distribution of data across sites nationally and \ninternationally, and ultimately the physical distribution of the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 Each site may have its own data \nmanagement processes, interfaces and data types, which were \noperating independently for some time. We needed to devise a \nway of coordinating and managing data between these data sites \nand providers without centralizing control of their systems, or \ninformation. In other words, the requirement was that the different \nsites retain their full autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its scale and interaction with many \norganizations, data systems, and providers, OODT still needed to \nperform under stringent demands. Queries for information needed \nto be serviced quickly: in many cases response time under five \nseconds was used as a baseline. Additionally, OODT needed to be \noperational whenever any of the participating scientists wanted to \nlocate, access, or process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale software technologies that distribute, \n\nmanage, and process information have been constructed over the \npast decade. Each of these technologies falls into one or more of \nfour distinct areas: grid-computing, information integration, \ndatabases, and middleware. In this section, we briefly survey \nrelated projects in each of these areas and compare their foci and \naccomplishments to those of OODT. Additionally, since a major \nfocal point of OODT is software architecture, we start out by \nproviding some brief software architecture background and \nterminology to set the context. 
\n\nTraditionally, software architecture has referred to the \nabstraction of a software system into its fundamental building \nblocks: software components, their methods of interaction (or \nsoftware connectors), and the governing rules that guide the \n\ncomposition of software components and software connectors \n(configurations) [2, 3]. Software architecture has been recognized \nin many ways to be the linchpin of the software development \nprocess. Ideally, the software requirements are reflected within \nthe software system\u2019s components and interactions; the \ncomponents and interactions are captured within the system\u2019s \narchitecture; and the architecture is used to guide the design, \nimplementation, and evolution of the system. Design guidelines \nthat have been proven effective are often codified into \narchitectural styles, while specific architectural solutions (e.g., \nconcrete system structures, component types and interfaces, and \ninteraction facilities) within specific domains are captured as \nreusable reference architectures. \n\nGrid computing deals with highly complex and distributed \ncomputational problems and large volume data management \ntasks. Massive parallel computation, distributed workflow, and \npetabyte scale data distribution are only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are usually broken down into \ntwo areas. Computational grid systems are concerned with \nsolving complex scientific problems involving supercomputing \nscale resources dispersed across various organizational \nboundaries. The representative computational grid system is the \nGlobus Toolkit [4]. Globus is built on top of a web-services [5] \nsubstrate and provides resource management components, \ndistributed workflow and security infrastructure. Other \ncomputational grid systems provide similar capabilities. For \nexample, Alchemi [6] is a .NET-based grid technology that \nsupports distributed job scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is a light weight, Java-\nbased open source computational grid project whose goal is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive tasks across multiple machines. \n\nThe other class of grid systems, Data grids, is involved in the \nmanagement, processing, and distribution of large data volumes to \ndisbursed and heterogeneous users, user types, and geographic \nlocations. There are several major data grid projects. The LHC \nComputing Grid [8] is a system whose main goal is to provide a \ndata management and processing infrastructure for the high \nenergy physics community. The Earth System Grid [9] is geared \ntowards supporting climate modeling research and distribution of \nclimate data sets and metadata to the climate and weather \nscientific community. \n\nTwo independently conducted studies [10, 11] have \nidentified three key areas that the current grid implementations \nmust address more effectively in order to promote data and \nsoftware interoperability: (1) formality in grid requirements \nspecification, (2) rigorous architectural description, and (3) \ninteroperability between grid solutions. As we will discuss in this \npaper, our work to date on OODT has the potential to be a \nstepping stone in each of these areas: its explicit focus on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally \naddresses the three concerns. 
\n\nThere have been several well-known efforts within the AI \nand database communities that have delved into the topic of \ninformation integration, or the shared access, search, and retrieval \nof distributed, heterogeneous information resources. Within the \npast decade, there has been significant interest in building \ninformation mediators that can integrate information from \nmultiple data sources. Mediators federate information by querying \nmultiple data sources, and fusing back the gathered results. The \nrepresentative systems using this approach include TSIMMS [12], \n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth \n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne [19]. \nEach of these approaches focuses on fundamental algorithmic \ncomponents of information integration: (1) formulating \nexpressive, efficient query languages (such as Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and reliably \ndescribing both global, and source data models (e.g. the Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a \nmeans for global-to-source data model integration; and (4) \nimproving queries and deciding which data sources to query (e.g. \nquery reformulation [22] and query rewriting [22, 23]). \n\nHowever, these algorithmic techniques fail to address the \nsoftware engineering side of information integration. For instance, \nexisting literature fails to answer questions such as, which of the \ncomponents in the different systems\u2019 architectures are common; \nhow can they be reused; which portions of their implementations \nare tied to (which) software components; which software \nconnectors are the components using to interact; are the \ninteraction mechanisms replaceable (e.g., can a client-server \ninteraction in Ariadne become a peer-to-peer interaction); and so \non. Additionally, none of the above related mediator systems have \nformalized a process for designing, implementing, deploying, and \nmaintaining the software components belonging to each system. \n\nSeveral middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI [25], and more recently \nSOAP and Web services [5] have been suggested as \u201csilver \nbullets\u201d that address the problem of integrating and utilizing \nheterogeneous software computing and data resources. Each of \nthese technologies provides three basic services: (1) an \n\nimplementation and composition framework for software \ncomponents, possibly written in different languages but \nconforming to a specific middleware interface; (2) a naming \nregistry used to locate components; and (3) a set of basic services \nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity. \n\nAlthough middleware is very useful \u201cglue\u201d that can connect \nsoftware components written in different languages or deployed \nin heterogeneous environments, middleware technologies do not \nprovide any \u201cout of the box\u201d services that deal with computing \nand data resource management across organizational boundaries \nand across computing environments at a national scale. These \nkinds of services usually have to be engineered into the \nmiddleware itself. We should note that in grid computing such \nservices are explicitly called out and provided at a higher layer of \nabstraction. 
In fact, the combination of these higher-level grid \nservices and an underlying middleware platform is typically \nreferred to as a \u201cgrid technology\u201d [11]. \n\n4. OODT ARCHITECTURE \nOODT\u2019s architecture is a reference architecture that is \n\nintended to be instantiated and tailored for use across science \ndomains and projects. The reference architecture comprises \nseveral components and connectors. A particular instance of this \nreference architecture, that of NASA\u2019s planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts only with \nthe designated external data sources outside its sandbox. OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n \nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation \n\n\n\ncomponents are responsible for delivering data from \nheterogeneous data stores, identifying and locating data within the \nsystem, and ingesting and processing data into underlying data \nstores. The connectors are responsible for integrating OODT with \nheterogeneous data sources; providing reliable messaging to the \nsoftware components; marshalling resource descriptions and \ntransferring data between components; transactional \ncommunication between components; and security related issues \nsuch as identification, authorization, and authentication. In this \nsection, we describe the guiding principles behind the reference \narchitecture. We then describe each of the OODT reference \ncomponents and connectors in detail. In Section 5, we describe \nspecific instantiations of the reference architecture in the context \nof several projects that are using OODT. \n\n4.1 Guiding Principles \nThe software engineering challenges discussed in Section 2 \n\nmotivated and framed the development of OODT. Conquering \nthese challenges led us to a set of four guiding principles behind \nthe OODT reference architecture. \n\nThe first guiding principle is division of labor. Each \ncapability provided by OODT (e.g., processing, ingestion, search, \nand retrieval of data, access to heterogeneous data, and so on) is \ncarefully divided among separate, independent architectural \ncomponents and connectors. As will be further detailed below, the \nprinciple is upheld through OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, dynamism, and \ndecentralization challenges. \n\nClosely related to the preceding principle is technology \nindependence. This principle involves keeping up-to-date with the \nevolution of software technology (both in-house and third-party), \nwhile avoiding tying the OODT architecture to any specific \nimplementation. By allowing us to select the technology most \nappropriate to a given task or specific need, this principle helps us \nto address the challenges of complexity, scalability, security, \ndistribution, location transparency, performance, and dynamism. \nFor instance, OODT\u2019s initial reference implementation used \nCORBA as the substrate for its messaging layer connector. When \nthe CORBA vendor decided to begin charging JPL significant \nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution that would be free to its users), the principle of \ntechnology independence came into play. 
Because the OODT \nmessaging layer connector supports a wrapper interface around \nthe lower-level distribution technology, we were able to replace \nour initial CORBA-based connector with one using Java\u2019s open \nsource RMI middleware, and redeploy the new connector to the \nOODT user sites, within three person days. \n\nAnother guiding principle of OODT is the distinguishing of \nmetadata as a first-class citizen in the reference architecture, and \nseparating metadata from data. The job of metadata (i.e., \u201cdata \nabout data\u201d) is to describe the data universe in which the system \nis operating. Since OODT is meant to be a technology that \nintegrates diverse data sources, this data universe is highly \nheterogeneous and possibly dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to locate and \ndescribe the actual data in which she is interested. On the other \nhand, the job of data in OODT is to describe physical or scientific \nphenomena; it is the ultimate end user product that an OODT \nsystem should deliver. This principle helps to address the \n\nchallenges of heterogeneity, autonomy of data providers, and \ndecentralization. \n\nSeparating the data model from the software is another key \nprinciple behind the reference architecture. Akin to ontology/data-\ndriven systems, OODT components should not be tied to the data \nand metadata that they manipulate. Instead, the components \nshould be flexible enough to understand many (meta-)data models \nused across different scientific domains, without reengineering or \ntailoring of the component implementations. This principle helps \nto address the challenges of complexity and heterogeneity. \n\nThese four guiding principles are reified in a reference \narchitecture comprising four pairs of component types and two \nclasses of connectors organized in a canonical structure. One \ninstantiation of the reference architecture reflecting the canonical \nstructure is depicted in Figure 1. Each OODT architectural \nelement (component and connector) serves a specific purpose, \nwith its functionality exported through a well-defined interface. \nThis supports OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, if necessary dynamically (i.e., at runtime), \nelements of a given type. It also allows us to introduce flexibility \nin the individual instances of the reference architecture while, at \nthe same time, controlling the legal system configurations. \nFinally, the explicit connectors and well-defined component \ninterfaces allow OODT in principle to integrate with a wide \nvariety of third-party systems (e.g., [26]). The outcome of the \nguiding principles (described above) and design decisions \n(detailed below) is an architecture that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components \n4.2.1 Product Server and Product Client \n\nThe Product Server is used to retrieve data from \nheterogeneous data stores. The product server accepts a query \nstructure that identifies a set of zero or more products which \nshould be returned the issuer of the query. A product is a unit of \ndata in OODT and represents anything that a user of the system is \ninterested in retrieving: a JPEG image of Mars, an MS Word \ndocument, a zip file containing text file results of a cancer study, \nand so on. Product servers can be located at remote data sites, \ngeographically and/or institutionally disparate from other OODT \ncomponents. 
Alternatively, product servers can be centralized, \nlocated at a single site. The objective of the product server is to \ndeliver data from otherwise heterogeneous data stores and \nsystems. As long as a data store (or system) provides some kind \nof access interface to get its data, a product server can \u201cwrap\u201d \nthose interfaces with the help of Handler connectors described in \nSection 4.3 below. \n\nThe Product Client component communicates with a product \nserver via the Messaging Layer connectors described in Section \n4.3. A product client resides at the end-user\u2019s (e.g., scientist\u2019s) \nsite. It must know the location of at least one product server, and \nthe query structure that identifies the set of products that the user \nwants to retrieve. At the same time, it is completely insulated \nfrom any changes in the physical location or actual representation \nof the data; its only interface is to the product server(s). Many \nproduct clients may communicate with the same product server, \nand many product servers can return data to the same product \nclient. This adds flexibility to the architecture without introducing \nunwanted long-term dependencies: a product client can be added, \n\n\n\nremoved, or replaced with another one that depends on different \nproduct servers, without any effect on the rest of the architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server manages resource description \n\ninformation, i.e., metadata, in a system built with OODT. \nResource description information is divided into three main \ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, Last \n\nModified Date, Last Revised By. This information is kept \nabout the resource descriptions themselves and is used by the \nprofile server to inventory and catalog resource descriptions. \nThis is a fixed set of metadata. \n\n\u2022 Resource Information \u2013 This includes metadata such as Title, \nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This information is kept for all the data in the \nsystem, and is an extended version of the Dublin Core \nMetadata for describing electronic resources [27]. This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information \u2013 This includes metadata \nspecific to a particular data domain. For instance, in a cancer \nresearch system this may include metadata such as Blood \nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of metadata is flexible and is expected to change. \n\nAs with product servers, profile servers can be decentralized at \nmultiple sites or centralized at a single site. The objective of the \nprofile server is to deliver metadata that gives a user enough \ninformation to locate the actual data within OODT regardless of \nthe underlying system\u2019s exact configuration, and degrees of \ncomplexity and heterogeneity; the user then retrieves the data via \none or more product servers. Because profile servers do not serve \nthe actual data, they need not have a direct interface to the data \nthat they describe. In addition to the complete separation of duties \nbetween profile and product servers, this ensures their location \nindependence, allows their separate evolution, and minimizes the \neffects of component and/or network failures in an OODT system. \n\nProfile Client components communicate with profile servers \nover the messaging layer connectors. 
The client must know the \nlocation of the profile server, and must provide a query that \nidentifies the metadata that a user is interested in retrieving. There \ncan be many profile clients speaking with a single profile server, \nand many profile servers speaking with a single profile client. \nThe architectural effects are analogous to those in the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query Client \nThe Query Server component provides an integrated search \n\nand retrieval capability for the OODT reference architecture. \nQuery servers interact with profile and product servers to retrieve \nmetadata and data requested by system users. A query server is \nseeded with an initial set of references to profile servers. Upon \nreceiving a query from a user, the query server passes it along to \neach profile server from its list, and collects the metadata \nreturned. Part of this metadata is a resource location (recall \nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct server, to a web site with the actual data, or to some \nexternal data providing system. This directly supports \nheterogeneity, location transparency, and autonomy of data \nproviders in OODT. \n\nAnother novel aspect of OODT\u2019s architecture is that if a \nprofile server is unable to service the query, or if it believes that \n\nother profile servers it is aware of may contain relevant metadata, \nit will return the URIs of those profile servers; the query server \nmay then forward the query to them. As a result, query servers are \ncompletely decoupled from product servers (and from any \n\u201cexposed\u201d external data sources), and are also decoupled from \nmost of the profile servers. In turn, this lessens the complexity of \nimplementing, integrating, and evolving query servers. Once the \nresource metadata is returned, the query server will either allow \nthe user herself to use the supplied URIs to find the data in which \nshe was interested (interactive mode), or it will retrieve, package, \nand deliver the data to the user (non-interactive mode). As with \nthe product and profile servers, query servers can be centrally \nlocated at a single site, or they can be decentralized across \nmultiple sites. \n\nQuery Client components communicate with the query \nservers. The query client must provide a query server with a query \nthat identifies the data in which the user is interested, and it must \nset a mode for the query server (interactive or non-interactive \nmode). The query client may know the location of the query \nserver that it wants to contact, or it may rely on the messaging \nlayer connector to route its queries to one or more query servers. \n\n4.2.4 Catalog and Archive Server and Client \nThe Catalog and Archive Server (CAS) component in OODT \n\nis responsible for providing a common mechanism for ingestion \nof data into a data store, including any processing required as a \nresult of ingestion. For instance, prior to the ingestion of a poor-\nresolution image of Mars, the image may need to be refined and \nthe resolution improved. CAS would handle this type of \nprocessing. Any data ingested into CAS must include associated \nmetadata information so that the data can be cataloged for search \nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore for preservation, and the corresponding metadata is sent to \nthe associated catalog. 
The data store and catalog need not be \nlocated on the same host; they may be located on remote sites \nprovided there is an access mechanism to store and retrieve data \nfrom each. The goal of CAS is to streamline and standardize the \nprocess of adding data to an OODT-aware system. Note that a \nsystem whose data stores were populated prior to its integration \ninto OODT can still use CAS for its new data. Since the CAS \ncomponent populates data stores and catalogs with both data and \nmetadata, specialized product and profile server components have \nbeen developed to serve data and metadata from the CAS backend \ndata stores and catalogs more efficiently. Any older data can still \nbe served with existing product and profile servers. \n\nThe Archive Client component communicates with CAS. The \narchive client must know the location of the CAS component, and \nmust provide it with data to ingest. Many archive clients can \ncommunicate with a single CAS component, and vice versa. Both \nthe archive client and CAS components are completely \nindependent of the preceding three pairs of component types in \nthe OODT reference architecture. \n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler connectors are responsible for enabling the \ninteraction between OODT\u2019s components and third-party data \nstores. A handler connector performs the transformation between \nan underlying (meta-)data store\u2019s internal API for retrieving data \nand its (meta-)data format on the one hand, and the OODT system \n\n\n\non the other. Each handler connector is typically developed for a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS such as Oracle, and a given internal representation \nschema for metadata, a generic Oracle handler connector is \ntypically developed and then reused. Similarly, for a given \nfilesystem scheme for storing data, a generic filesystem handler \nconnector is developed and reused across like filesystem data \nstores. \n\nEach profile server and product server relies on one or more \nhandler connectors. Profile servers use profile handlers, and \nproduct servers use query handlers. Handler connectors thereby \ncompletely insulate product and profile servers from the third-\nparty data stores. Handlers also allow for different types of \ntransformations on (meta-)data to be introduced dynamically \nwithout any effect on the rest of OODT components. For \nexample, a product server that distributes Mars image data might \nbe serviced by a query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image files of the latest summit \nclimbed by a Mars rover; if the system ends up experiencing \nperformance problems, another handler may be (temporarily) \nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same scenario. Likewise, a profile server may have two \nprofile handler connectors, one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and another that returns \ninstrument metadata about Mars rover images (e.g., instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector \nThe Messaging Layer connector is responsible for \n\nmarshalling data and metadata between components in an OODT \nsystem. The messaging layer must keep track of the locations of \nthe components, what types of components reside in which \nlocations, and if components are still running or not. 
Additionally, \nthe messaging layer is responsible for taking care of any needed \nsecurity mechanisms such as authentication against an LDAP \ndirectory service, or authorization of a user to perform certain \nrole-based actions. \n\nThe messaging layer in OODT provides synchronous \ninteraction among the components, and some delivery guarantees \non messages transferred between the software components. \nTypically in any large-scale data system, the asynchronous mode \nof interaction is not encouraged because partial data transfers are \nof no use to users such as scientists who need to make analysis on \nentire data sets. \n\nThe messaging layer supports communication between any \nnumber of connected OODT software components. In addition, \nthe messaging layer natively supports connections to other \nmessaging layer connectors as well. This provides us with the \nability to extend and adapt an OODT system\u2019s architecture, as \nwell as easily tailor the architecture for any specific interaction \nneeds (e.g., by adding data encryption and/or compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health (NIH), the National Cancer Institute \n(NCI), several research universities, and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all using \nOODT in some form or fashion. OODT is also available for \ndownload through a large open-source software distributor [29]. \n\nOODT components are found in planetary science, earth science, \nbiomedical, and clinical research projects. In this section, we \ndiscuss our experience with OODT in several representative \nprojects within these scientific areas. We compare and contrast \nhow the projects were handled before and after OODT. We sketch \nsome of the domain-specific technical challenges we encountered \nand identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs a deployment \narchitecture from one or more of the reference OODT \ncomponents (e.g., product and profile servers), and the reference \nOODT connectors. The user must determine if any existing \nhandler connectors can be reused, or if specialized handler \nconnectors need to be developed. Once all the components are \nready, the user has two options for deploying her architecture to \nthe target hosts: (1) the user may translate her design into a \nspecialized OODT deployment descriptor XML file, which can \nthen be used to start each program on the target host(s); or (2) the \nuser can deploy her OODT architecture using a remote server \ncontrol component, adding components, and connectors via a \ngraphical user interface. The GUI allows the user to send \ncomponent and connector code to the target hosts, to start, shut-\ndown, and restart the components and connectors, and to monitor \ntheir health during execution. \n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of \nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. Each node resides at a different U.S. university or \ngovernment agency, and is managed autonomously. \n\nFor many years PDS distributed its data and metadata on \nphysical media, primarily CD-ROM. 
Each CD-ROM was \nformatted a according to a \u201chome-grown\u201d directory layout \nstructure called an archive volume, which later was turned into a \nPDS standard. PDS metadata was constructed using a common, \nwell-structured set of 1200 metadata elements, such as Target \nName and Instrument Type, that were identified from the onset of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the advent of the WWW and the increasing data volumes of \nmissions led NASA managers to impose a new paradigm for \ndistributing data to the users of the PDS: data and metadata were \nnow to be distributed electronically, via a single, unified web \nportal. The web portal and accompanying infrastructure to \ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven of the scientific discipline nodes across the country. \nAlthough the entire data volume across PDS at the time was \naround 7 terabytes, it was estimated that the volume would grow \nto 10 terabytes by 2004. Consequently, the system needed to be \nscalable and respond to large growth spurts caused by new data \nproducing missions. The flexibility and modularity of the OODT \nproduct and profile server components were particularly useful in \nthis regard. Using a product and/or profile server, each new data \nproducing system in the PDS could be dynamically \u201cplugged in\u201d \nto the existing PDS infrastructure that we constructed, without \ndisturbing existing components and processes. \n\nWe also faced the problem of heterogeneity. Almost every \nnode within PDS had a different operating system, ranging from \nLinux, to Windows, to Solaris, to Mac OS X. Each node \n\n\n\nEDRN \nQuery \nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS \n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS \n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile \nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT \u201cSandbox\u201d\n\nProduct \nServer\n\nProfile \nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive Server\n\nLung Images \n(Filesystem)\n\nOther \nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web \nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct \nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT \nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT \ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n \n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture Instantiation \n\nmaintained its own local catalog system. Although each node in \nPDS had different file system implementations dictated by their \nOS, each node stored their data and metadata according to the \narchive volume structure. Because of this, we were able to write a \nsingle, reusable PDS Query Handler which could serve back \nproducts from a PDS archive volume structure located on a file \nsystem. Plugging into each node\u2019s catalog system proved to be a \nsignificant challenge. 
For nearly all of the nodes, specialized \nprofile handler connectors were constructed to interface with the \nunderlying catalog systems, which ranged from static text files \ncalled PDS label files to dynamic web site inventory systems \nconstructed using Java Server Pages. Because each of the catalogs \ntagged PDS data using the common set of 1200 elements, we \nwere able to share much of the code base among the profile \nhandler connectors, ultimately only changing the portion of the \ncode that made the particular JSP page call, or read the selected \nset of metadata from the label file. The entire code base of the \nPDS including all the domain specific handler connectors is only \nslightly over 15 KSLOC, illustrating the high degree of \nreusability provided by the OODT framework. \n\n5.2 Early Detection Research Network \nOODT is also supporting the National Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN is a \ndistributed research program that unites researchers from over \nthirty institutions across the United States. Tens of thousands of \nscientists participate in the EDRN. Each institution is focused on \nthe discovery of cancer biomarkers as indicators for disease [31]. \n\nA critical need for the EDRN is an electronic infrastructure to \nsupport discovery and validation of these markers. \n\nIn 2001 we worked with the EDRN program to develop the \nfirst component of their electronic biomarker infrastructure called \nthe EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to provide real-time access to bio-\nspecimen information across the institutions of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of \nspecimen images, and location and contact metadata for obtaining \nthe specimen from its origin study institution. The previous \nmethod of obtaining bio-specimen information was very human-\nintensive: it involved phone calls and some forms of electronic \ncommunication such as email. Specimen information was not \nsearchable across institutions participating in the EDRN. The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with \ncurrent holdings at each participating institution. \n\nOne of the initial technical challenges we faced with EDRN \nwas scale. The EDRN was over three times as large as the PDS. \nBecause of this we chose to target ten institutions initially, rather \nthan the entire set of thirty one. Again, OODT\u2019s modularity and \nscalability came into play as we could phase deployment at each \ndeployment institution. As we instantiated new product, profile, \nquery, and archive servers at each institution, we could do so \nwithout interrupting any existing OODT infrastructure already \ndeployed. \n\nAnother challenge that we encountered was dealing with \neach participating site\u2019s Institutional Review Board (IRB). An \nIRB is required to review and ensure compliance of projects with \n\n\n\nfederal laws related to working with data from research projects \ninvolving human subjects. To satisfy the IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate \nsecurity capability in order to get approval to share the data \nexternally from an institution. OODT\u2019s separation of data and \nmetadata explicitly allowed us to satisfy this requirement. 
We \ndesigned ERNE so that each institution could remain in control of \ntheir specimen holding data by instantiating product server \ncomponents at each site, rather than distributing the information \nacross ERNE which would have violated the IRB agreements. \n\nAnother significant challenge we faced in developing ERNE \nwas lack of a consistent metadata model for each ERNE site. We \nwere forced to develop a common specimen metadata model and \nthen to create specific mappings to link each local site to the \ncommon model. OODT aided us once again in this endeavor as \nthe common mappings we developed were easily codified into a \nquery handler connector, and reused across each ERNE site. \n\nThe entire code base of ERNE, including all its specialized \nhandler connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of reusability of the shared framework code base \nand the handler code base. \n\n \n\n5.3 Science Processing Systems \nOODT has also been deployed in several science processing \n\nsystem missions both, operational and under development. Due to \nspace limitations, we can only briefly summarize each of the \nOODT deployments in these systems. \n\nSeaWinds, a NASA-funded earth science instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS \ncomponent as a workflow and processing component for its \nProcessing and Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during its six year mission. CAS was \nused to control the execution and data flow of mission-specific \ndata processor components, which calibrated and created derived \ndata products from raw instrument data, and archived those \nproducts for distribution into the data store managed by CAS. A \nmajor challenge we faced during the development of SeaPAC was \nthat the processor components were developed by a group \noutside of the SeaWinds project. We had to provide a mechanism \nfor integrating their source code into the OODT SeaPAC \nframework. OODT\u2019s separation of concerns allowed us to address \nthis issue with relative ease: once the data processors were \nfinished, we were able wrap and tailor them internally within \nCAS, without disturbing the existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led to its reuse on \nseveral different missions. Another earth science mission called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing components with CAS, using the SeaWinds experience \nas an example. The Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is currently under development, is \nalso utilizing CAS to ingest and process existing FTS CO2 \nspectrometer data from earth-based instruments. The James Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand processing capabilities for astrophysics data and metadata. \nEach of these science processing systems will face similar \ntechnical challenges, including separation of concerns between \nthe actual processing framework and the developers writing the \nprocessor code, the volume of data that must be handled by the \nprocessing system (OCO is projected to produce over 150 \nterabytes), and the flexibility and tailorability of the workflow \n\nneeded to process the data. We believe that OODT is uniquely \npositioned to address these difficult challenges. 
\n\n5.4 Computer Modeling Simulation and \nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling Simulation and Visualization (CMSV) community at \nJPL, by linking together several institutional model repositories \nacross the organizations within the lab, and creating a web portal \ninterface to query the integrated model repositories. We \ndeveloped specialized profile server components that locate and \nlink to different model resources across JPL, such as power \nsubsystem models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different spacecraft assembly parts, and \nsystems architecture models for engineering and design of \nspacecraft. Each of these different model types lived in separate \nindependent repositories across JPL. For instance, the CAD \nmodels were stored in a commercial product called TeamCenter \nEnterprise [32], while the power and systems architecture models \nwere stored in a commercial product called Xerox Docushare \n[33]. \n\nTo integrate these model repositories for CMSV, we had to \nderive a common set of metadata across the wide spectrum of \ndifferent model types that existed at JPL. OODT\u2019s separation of \ndata from metadata allowed us to rapidly instantiate our common \nmetadata model once we developed it, by constructing specialized \nprofile handler connectors that mapped each repository\u2019s local \nmodel to the common model. Reusability levels were high across \nthe connectors, resulting in an extremely small code base of 2.57 \nKSLOC. \n\nAnother challenge in light of this mapping activity was \ninterfacing with the APIs of the underlying model repositories. In \nthe above two cases, the APIs were commercial products, and \npoorly documented. In some cases, such as the Docushare \nrepository, the APIs did not fully conform to their stated \nspecifications. The division of labor amongst OODT components \ncame into play on this task. It allowed us to focus on deploying \nthe rest of the OODT supporting infrastructure, such as the web \nportal, and the profile handler connectors, and not getting stalled \nwaiting for the support teams from each of the commercial \nvendors to debug our API problems. Once the OODT CMSV \ninfrastructure was deployed, the modeling and simulation \ncommunity at JPL immediately began adopting it and sharing \ntheir models across the lab. During the past year, the system has \nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a data \n\ndistribution and management solution that satisfied the formidable \nrequirements outlined in this paper, it was not clear to us initially \nhow to approach the problem. On the surface, several applicable \nsolutions already existed (middleware, information integration \nsystems, and the emerging grid technologies). Adopting one of \nthem seemed to be a preferable path because it would have saved \nus precious time. However, upon closer inspection we realized \nthat each of these options could be instructive, but that none of \nthem solved the problem we were facing (and that even some of \nthese technologies themselves were facing). \n\nThe observation that directly inspired OODT was that we \nwere dealing with software engineering challenges, and that those \n\n\n\nchallenges naturally required a software engineering solution. 
\nOODT is a large, complex, dynamic system, distributed across \nmany sites, servicing many different users, and classes of users, \nwith large amounts of heterogeneous data, possibly spanning \nmultiple domains. Software engineering research and practice \nboth suggest that success in developing such a system will be \ndetermined to a large extent by the system\u2019s software \narchitecture. It therefore became imperative that we rely on our \nexperience within the domain of data-intensive systems (e.g., \nJPL\u2019s PDS project), as well as our study of related research and \npractice, in order to develop an architecture for OODT that will \naddress the challenges we discussed in Section 2. Once the \narchitecture was designed and evaluated, OODT\u2019s initial \nimplementation and its subsequent adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, but as software \nengineering researchers and practitioners disappointed, that \nOODT still appears to be the only system of its kind. The \nintersection of middleware, information management, and grid \ncomputing is rapidly growing, yet it is still characterized by one-\noff solutions targeted at very specific problems in specific \ndomains. Unfortunately, these solutions are sometimes clever by \naccident and more frequently little more than \u201chacks\u201d. We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, more broadly applicable, and certainly more helpful to \ndevelopers of future systems in this area. We consider OODT\u2019s \ndemonstrated ability to evolve and its applicability in a growing \nnumber of science domains to be a testament to its explicit, \ncarefully crafted software architecture. \n\n7. ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet \n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. Effort also supported by the National Science \nFoundation under Grant Numbers CCR-9985441 and ITR-\n0312780. \n\n8. REFERENCES \n[1] A. Chervenak, I. Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of \nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. \n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site (http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid (http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., \"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling \nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. \n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. 
Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of the AAAI Spring Symposium on Information \nGathering in Heterogeneous, Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. \n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the \nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. \n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. Raschid, et al., \"A data model and query \nprocessing techniques for scaling access to distributed \nheterogeneous databases in disco,\" IEEE Transactions on \nComputers, 1997. \n\n[18] Y. Arens, C. A. Knoblock, et al., \"Query Reformulation for \nDynamic Information Integration,\" Journal of Intelligent \nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, et al., \"Ariadne: A system for \nconstructing mediators for internet sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, \"An Expressive and Efficient \nLanguage for Information Gathering on the Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling \n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, \"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. Lambrecht and S. Kambhampati, \"Planning for Information \nGathering: A Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] \"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et al., \"GLIDE: A Grid-based \n\nLightweight Infrastructure for Data-intensive Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the \nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource \n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel Foundation: Request Object Oriented Data \n\nTechnology (OODT) - \n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. \n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case Study in the Development and Management of Meta-\nData for a Scientific Digital Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. \n\n[32] \"UGS Products: TeamCenter \n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document Management | Xerox Docushre \n(http://docushare.xerox.com/ds/),\" 2005. 
\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and Archive Server and Client\n\n\tOODT Connectors\n\tHandler Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience Processing Systems\n\tComputer Modeling Simulation and Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n", "X-TIKA:parse_time_millis": "11123", "access_permission:assemble_document": "true", "access_permission:can_modify": "true", "access_permission:can_print": "true", "access_permission:can_print_degraded": "true", "access_permission:extract_content": "true", "access_permission:extract_for_accessibility": "true", "access_permission:fill_in_form": "true", "access_permission:modify_annotations": "true", "created": "Wed Feb 15 13:13:58 PST 2006", "creator": "End User Computing Services", "date": "2006-02-15T21:16:01Z", "dc:creator": "End User Computing Services", "dc:format": "application/pdf; version=1.4", "dc:title": "Proceedings Template - WORD", "dcterms:created": "2006-02-15T21:13:58Z", "dcterms:modified": "2006-02-15T21:16:01Z", "grobid:header_Abstract": "Modern scientific research is increasingly conducted by virtual communities of scientists distributed around the world. The data volumes created by these communities are extremely large, and growing rapidly. The management of the resulting highly distributed, virtual data systems is a complex task, characterized by a number of formidable technical challenges, many of which are of a software engineering nature. In this paper we describe our experience over the past seven years in constructing and deploying OODT, a software framework that supports large, distributed, virtual scientific communities. We outline the key software engineering challenges that we faced, and addressed, along the way. We argue that a major contributor to the success of OODT was its explicit focus on software architecture. We describe several large-scale, real-world deployments of OODT, and the manner in which OODT helped us to address the domain-specific challenges induced by each deployment.", "grobid:header_AbstractHeader": "ABSTRACT", "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA", "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology ; 2 Computer Science Department University of Southern California", "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. 
Crichton 1 Nenad Medvidovic 2 Steve Hughes 1", "grobid:header_BeginPage": "-1", "grobid:header_Class": "class org.grobid.core.data.BiblioItem", "grobid:header_Email": "{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu", "grobid:header_EndPage": "-1", "grobid:header_Error": "true", "grobid:header_FirstAuthorSurname": "Mattmann", "grobid:header_FullAffiliations": "[Affiliation{name='null', url='null', institutions=[California Institute of Technology], departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', postCode='91109', postBox='null', region='CA', settlement='Pasadena', addrLine='null', marker='1', addressString='null', affiliationString='null', failAffiliation=false}, Affiliation{name='null', url='null', institutions=[University of Southern California], departments=[Computer Science Department], laboratories=null, country='USA', postCode='90089', postBox='null', region='CA', settlement='Los Angeles', addrLine='null', marker='2', addressString='null', affiliationString='null', failAffiliation=false}]", "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, Nenad Medvidovic, Steve Hughes]", "grobid:header_Item": "-1", "grobid:header_Keyword": "Categories and Subject Descriptors D2 Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data Management, Software Architecture", "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain Specific Architectures (type:subject-headers), Keywords (type:subject-headers), OODT, Data Management, Software Architecture (type:subject-headers)]", "grobid:header_Language": "en", "grobid:header_NbPages": "-1", "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1", "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications", "meta:author": "End User Computing Services", "meta:creation-date": "2006-02-15T21:13:58Z", "meta:save-date": "2006-02-15T21:16:01Z", "modified": "2006-02-15T21:16:01Z", "pdf:PDFVersion": "1.4", "pdf:encrypted": "false", "producer": "Acrobat Distiller 6.0 (Windows)", "resourceName": "ICSE06.pdf", "title": "Proceedings Template - WORD", "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word", "xmpTPg:NPages": "10" } ] Great work, Sujen Shah . I'm going to commit this now and start work on the Wiki page!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/55

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/55
          Hide
          sujenshah Sujen Shah added a comment -

          Awesome, Chris A. Mattmann!! Thank you! Will start work on the wiki.

          Show
          sujenshah Sujen Shah added a comment - Awesome, Chris A. Mattmann!! Thank you! Will start work on the wiki.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • fixed in r1695816. Great work Sujen Shah!
          Show
          chrismattmann Chris A. Mattmann added a comment - fixed in r1695816. Great work Sujen Shah!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See https://builds.apache.org/job/tika-trunk-jdk1.7/821/)
          Changes.txt for TIKA-1699. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817)

          • /tika/trunk/CHANGES.txt
            fix for TIKA-1699: Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah <sujen1412@gmail.com> this closes #55. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816)
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See https://builds.apache.org/job/tika-trunk-jdk1.7/821/ ) Changes.txt for TIKA-1699 . (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817 ) /tika/trunk/CHANGES.txt fix for TIKA-1699 : Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah <sujen1412@gmail.com> this closes #55. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          docs are here: https://wiki.apache.org/tika/GrobidJournalParser

          Show
          chrismattmann Chris A. Mattmann added a comment - docs are here: https://wiki.apache.org/tika/GrobidJournalParser
          Hide
          gagravarr Nick Burch added a comment - - edited

          A build from trunk is now failing for me:

          [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. -> [Help 1]
          

          With -X showing

          Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2
          Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2
          Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created.
          

          Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again?

          Show
          gagravarr Nick Burch added a comment - - edited A build from trunk is now failing for me: [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file: ///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. -> [Help 1] With -X showing Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file: ///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again?
          Hide
          gagravarr Nick Burch added a comment -

          I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

          One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?

          Show
          gagravarr Nick Burch added a comment - I've tried to exclude the grobid transient dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it! On other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

          Yeah, we're working with them to get this fixed.

          One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?

          Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

          Tika-app is ~48MB it seems, so it's actually closer to a 30% size increase. As for depending on a smaller core jar, I had an idea here. Grobid has a server - I wonder if we should just connect to its REST server? Sujen Shah, in that fashion we could avoid adding any dependencies beyond CXF and its WebClient. I'll investigate this.

          Show
          chrismattmann Chris A. Mattmann added a comment - I've tried to exclude the grobid transient dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it! yeah we're working with them to getting this fixed. On other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars? Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/ Tika-app is ~48MB it seems so closer to 30% actually size increase. As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server? Sujen Shah In that fashion we could omit adding really any dependencies beyond CXF and its WebClient. I'll investigate this.
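          For anyone following along, the REST idea above would boil down to something like the sketch below, using only the CXF WebClient. This is a hedged, minimal sketch rather than the code that eventually gets committed: the server address, the processHeaderDocument path, and the multipart field name "input" are assumptions about the GROBID service, so double-check them against the GROBID REST documentation.

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import javax.ws.rs.core.MediaType;
          import javax.ws.rs.core.Response;

          import org.apache.cxf.jaxrs.client.WebClient;
          import org.apache.cxf.jaxrs.ext.multipart.Attachment;
          import org.apache.cxf.jaxrs.ext.multipart.ContentDisposition;
          import org.apache.cxf.jaxrs.ext.multipart.MultipartBody;

          public class GrobidRestClientSketch {
            public static void main(String[] args) throws Exception {
              // Assumed GROBID service location and endpoint -- adjust to the actual install.
              String grobidUrl = "http://localhost:8080/processHeaderDocument";

              try (InputStream pdf = Files.newInputStream(Paths.get("testJournalParser.pdf"))) {
                // Assumed multipart field name for the uploaded PDF.
                ContentDisposition cd = new ContentDisposition(
                    "form-data; name=\"input\"; filename=\"testJournalParser.pdf\"");
                Attachment att = new Attachment("input", pdf, cd);

                Response response = WebClient.create(grobidUrl)
                    .accept(MediaType.APPLICATION_XML)
                    .type(MediaType.MULTIPART_FORM_DATA)
                    .post(new MultipartBody(att));

                // GROBID answers with TEI XML describing the paper's header metadata.
                System.out.println(response.readEntity(String.class));
              }
            }
          }

          The attraction of this shape is that the only new runtime dependency is the CXF JAX-RS client, and the GROBID installation (models and all) stays on whatever host runs the service.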
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See https://builds.apache.org/job/tika-trunk-jdk1.7/825/)
          Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)

          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See https://builds.apache.org/job/tika-trunk-jdk1.7/825/ ) Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Filed issues to publish all of the grobid-core deps:
          Wapiti jar fork:
          https://issues.sonatype.org/browse/OSSRH-17124
          EUGFC ImageIO plugin:
          https://issues.sonatype.org/browse/OSSRH-17126
          Language Detection:
          https://issues.sonatype.org/browse/OSSRH-17127
          Chasen CRFPP:
          https://issues.sonatype.org/browse/OSSRH-17128
          WIPO analysers:
          https://issues.sonatype.org/browse/OSSRH-17129

          That should be all of them. Will let everyone know once it's published.

          Show
          chrismattmann Chris A. Mattmann added a comment - All filed issues to publish all grobid-core deps: Wapiti jar fork: https://issues.sonatype.org/browse/OSSRH-17124 EUGFC ImageIO plugin: https://issues.sonatype.org/browse/OSSRH-17126 Language Detection: https://issues.sonatype.org/browse/OSSRH-17127 Chasen CRFPP: https://issues.sonatype.org/browse/OSSRH-17128 WIPO analysers: https://issues.sonatype.org/browse/OSSRH-17129 That should be all of them. Will let everyone know once it's published.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • here's the patch that Nick backed out in case folks want to use it while we get the Jars published to Central.
          Show
          chrismattmann Chris A. Mattmann added a comment - here's the patch that Nick backed out in case folks want to use it while we get the Jars published to Central.
          Hide
          gagravarr Nick Burch added a comment -

          Tika-app is ~48MB it seems so closer to 30% actually size increase.

          I added a bit on for the dependency jars that I can't get to!

          As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server?

          I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, with just the key parts in for size reasons. My first choice would be for something like that here.

          If not, could we follow the sqlite pattern: bundle the base Java code as standard, but require people to download the large, bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk being from the CRF native stuff?)

          Show
          gagravarr Nick Burch added a comment - Tika-app is ~48MB it seems so closer to 30% actually size increase. I added a bit on for the dependency jars that I can't get to! As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server? I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, with just the key parts in for size reasons. My first choice would be for something like that here. If not, could we follow the sqlite patterns, bundle the base java code as standard, but require people to download the large bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk being from the CRF native stuff?)
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          To use this patch, first follow the instructions here: https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then apply it.

          Show
          chrismattmann Chris A. Mattmann added a comment - To use this patch, follow the instructions first here: https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then apply this patch.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • here's a WIP patch to convert the Grobid parser to use its REST services. Tests are passing. I need to add the rest of the GROBID header XML metadata elements. Just got a bit tired. Sujen Shah, if you want to finish this off, it's all yours. Else, if you don't beat me to it, maybe I'll finish it tomorrow.
          Show
          chrismattmann Chris A. Mattmann added a comment - here's a WIP patch to convert the Grobid parser to use its REST services. Tests are passing. I need to add the rest of the GROBID header XML metadata elements. Just got a bit tired Sujen Shah if you want to finish this off, all you. Else if you don't beat me to it, maybe I'll finish it tomorrow.
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          OK, I got the fully REST-based version of the GROBID PDF parser implemented. Tests are passing and I'm going to commit it within the next few minutes. Basically it only adds the CXF REST client dependency and the org.json dependency. A lot better, and a lot smaller. Also, GROBID can now live on another machine. Will update the docs shortly.

          Show
          chrismattmann Chris A. Mattmann added a comment - OK I got the fully REST services version of the GROBID PDF parser implemented. Tests are passing and I'm going to commit it within the next few minutes. Basically it only adds the CXF rest client dependency and also the org.json dependency. Lot better, and lot smaller. Also GROBID can exist on another machine now. Will update the docs shortly.
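          Once a GROBID service is running, exercising the new parser from Java should look roughly like the sketch below. This is hedged guesswork around the committed code: the no-arg JournalParser constructor and the grobid:header_* key names are simply what the file list and the sample output pasted earlier in this issue suggest, so treat the wiki page as the authoritative usage.

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.journal.JournalParser;
          import org.apache.tika.sax.BodyContentHandler;
          import org.xml.sax.ContentHandler;

          public class JournalParserSketch {
            public static void main(String[] args) throws Exception {
              JournalParser parser = new JournalParser();
              Metadata metadata = new Metadata();
              ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit

              try (InputStream stream = Files.newInputStream(Paths.get("testJournalParser.pdf"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
              }

              // Key names taken from the sample output pasted earlier in this issue.
              System.out.println(metadata.get("grobid:header_Title"));
              System.out.println(metadata.get("grobid:header_FullAuthors"));
            }
          }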
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • committed and fixed in r1696191 and r1696192. Cheers!
          Show
          chrismattmann Chris A. Mattmann added a comment - committed and fixed in r1696191 and r1696192. Cheers!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See https://builds.apache.org/job/tika-trunk-jdk1.7/830/)
          fix typo: TIKA-1699 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192)

          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
            TIKA-1699: refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191)
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See https://builds.apache.org/job/tika-trunk-jdk1.7/830/ ) fix typo: TIKA-1699 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192 ) /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java TIKA-1699 : refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See https://builds.apache.org/job/tika-trunk-jdk1.7/832/)
          TIKA-1699: fix bundle for GROBID parser deps. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319)

          • /tika/trunk/tika-bundle/pom.xml
            TIKA-1699: statically load the rest URL properties inside of GROBIDRESTParser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286)
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See https://builds.apache.org/job/tika-trunk-jdk1.7/832/ ) TIKA-1699 : fix bundle for GROBID parser deps. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319 ) /tika/trunk/tika-bundle/pom.xml TIKA-1699 : statically load the rest URL properties inside of GROBIDRESTParser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286 ) /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
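          The "statically load the rest URL properties" change presumably amounts to reading GrobidExtractor.properties from the classpath once and falling back to a sensible default. A rough sketch of that idea follows; the property key and default URL here are made-up placeholders for illustration, not the actual contents of the committed properties file.

          import java.io.InputStream;
          import java.util.Properties;

          public class GrobidRestConfigSketch {
            // Hypothetical property key and default -- the real GrobidExtractor.properties may differ.
            private static final String SERVER_KEY = "grobid.server.url";
            private static final String DEFAULT_SERVER = "http://localhost:8080";

            // Loaded once, statically, when the class is first used.
            private static final String GROBID_SERVER = loadServerUrl();

            private static String loadServerUrl() {
              Properties props = new Properties();
              try (InputStream in = GrobidRestConfigSketch.class.getResourceAsStream(
                  "/org/apache/tika/parser/journal/GrobidExtractor.properties")) {
                if (in != null) {
                  props.load(in);
                }
              } catch (Exception e) {
                // Ignore and fall back to the default below.
              }
              return props.getProperty(SERVER_KEY, DEFAULT_SERVER);
            }

            public static void main(String[] args) {
              System.out.println("GROBID REST endpoint: " + GROBID_SERVER);
            }
          }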
          Hide
          gagravarr Nick Burch added a comment -

          Quick one - the wiki mentions needing to do a 600MB git checkout and then a build. Is it possible to just download a smaller pre-built package of GROBID to skip this step? And if not, could we maybe suggest it to them for their next release? (A 10s-of-MB download is probably easier and more beginner-friendly than a huge checkout + having to build!)

          Show
          gagravarr Nick Burch added a comment - Quick one - the wiki mentions needing to do a 600MB git checkout and then a build. Is it possible to just download a smaller pre-built package of GROBID to skip this step? And if not, could we maybe suggest it to them for their next release? (A 10s-of-MB download is probably easier and more beginner-friendly than a huge checkout + having to build!)
          Hide
          chrismattmann Chris A. Mattmann added a comment - - edited

          Agreed. We have suggested it in #59. Please feel free to join the convo there.

          Show
          chrismattmann Chris A. Mattmann added a comment - - edited Agreed. We have suggested it in #59 . Please feel free to join the convo there.

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              sujenshah Sujen Shah
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue
