Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      EPAM recently agreed to relicense parso under Apache 2.0 so that we can incorporate it into Tika for sas7bdat files: https://github.com/epam/parso/issues/19 !!!

        Activity

        gagravarr Nick Burch added a comment - - edited

        I've just had a quick try with the library, against a test SAS file with 5 columns each of different types. Looking at the properties on the file, and on the columns, Parso is able to return:

        u64 - false
        compressionMethod - null
        endianness - 1
        encoding - windows-1252
        sessionEncoding - null
        name - SHEET1
        fileType - DATA
        dateCreated - Fri Mar 06 19:10:19 GMT 2015
        dateModified - Fri Mar 06 19:10:19 GMT 2015
        sasRelease - 9.0101M3
        serverType - XP_PRO
        osName - 
        osType - 
        headerLength - 1024
        pageLength - 8192
        pageCount - 1
        rowLength - 96
        rowCount - 31
        mixPageRowCount - 69
        columnsCount - 5
        
        5 Columns defined:
         1 - A
          Label: A
          Format: $
          Size 58 of java.lang.String
         2 - B
          Label: B
          Format: 
          Size 8 of java.lang.Number
         3 - C
          Label: C
          Format: DATE
          Size 8 of java.lang.Number
         4 - D
          Label: D
          Format: DATETIME
          Size 8 of java.lang.Number
         5 - E
          Label: E
          Format: 
          Size 8 of java.lang.Number
        

        I guess we'd want to map some of the file properties onto standard metadata keys, and the rest onto custom ones? For the data, I guess we'd output SAX events for an HTML-like table. I'm not sure about the column metadata: are there any patterns we can copy from the database formats or other scientific dataset formats?
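
        As a rough illustration of the two ideas above, here is a minimal sketch that uses only the JDK's SAX classes and touches neither Tika nor Parso. Everything in it is an assumption made for illustration: the class name SasTableSketch, the choice of which properties count as "standard" (Dublin Core style names such as dc:title), and the "sas:" prefix for custom keys are all hypothetical, not an agreed mapping.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SasTableSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical mapping: a few properties go to Dublin Core style keys,
    // everything else goes under an assumed "sas:" prefix. Null or empty
    // values (e.g. compressionMethod, osName above) are skipped.
    static Map<String, String> mapProperties(Map<String, String> props) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue;
            }
            switch (e.getKey()) {
                case "name":         out.put("dc:title", value); break;
                case "dateCreated":  out.put("dcterms:created", value); break;
                case "dateModified": out.put("dcterms:modified", value); break;
                default:             out.put("sas:" + e.getKey(), value);
            }
        }
        return out;
    }

    // Emits one header row of <th> cells, then one <tr> of <td> cells per data row.
    static void emitTable(ContentHandler h, List<String> columns,
                          List<List<String>> rows) throws SAXException {
        h.startDocument();
        start(h, "table");
        start(h, "tr");
        for (String c : columns) {
            cell(h, "th", c);
        }
        end(h, "tr");
        for (List<String> row : rows) {
            start(h, "tr");
            for (String v : row) {
                cell(h, "td", v == null ? "" : v);
            }
            end(h, "tr");
        }
        end(h, "table");
        h.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void cell(ContentHandler h, String name, String text) throws SAXException {
        start(h, name);
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        end(h, name);
    }

    // Debug helper: records the events as a tag string (no XML escaping).
    static String render(List<String> columns, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String q, Attributes a) {
                sb.append('<').append(local).append('>');
            }
            @Override public void endElement(String uri, String local, String q) {
                sb.append("</").append(local).append('>');
            }
            @Override public void characters(char[] ch, int s, int len) {
                sb.append(ch, s, len);
            }
        };
        try {
            emitTable(recorder, columns, rows);
        } catch (SAXException ex) {
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }
}
```

        In a real parser the ContentHandler would be the one handed to the parser rather than the recording handler used here, and the column label/format/type information would still need a home, which this sketch deliberately leaves open.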

        Also, we only seem to have one fairly simple test sas7bdat file in the Tika Parsers test documents area. Do we have a standard "moderately complicated" tabular test file (e.g. XLS, CSV) that I could get a SAS version made of, so that we can have largely the same test data across formats?


          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            2
