Log4j 2 / LOG4J2-1305

Binary Layout


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Layouts

    Description

      Logging in a binary format instead of in text can give large performance improvements.

      Logging text means going from a LogEvent object to formatted text, and then converting this text to bytes. Performance investigations with text-based logging formats like PatternLayout (see LOG4J2-930), and encoding Strings to bytes (LOG4J2-935, LOG4J2-1151) suggest that formatting and encoding text is expensive and imposes limits on the performance that can be achieved.

      A different approach would be to convert the LogEvent to a binary representation directly, without creating a text representation first. This would result in compact log files that are fast to write. The trade-off is that a binary log cannot easily be read in a general-purpose editor like vi or Notepad. A specialized tool would be needed to either display the log or convert it to human-readable form.

      This ticket proposes a simple BinaryLayout, where each LogEvent is logged in a binary format.

      Example BinaryLayout log event record format

      Offset  Type    Log Event Record Field Description
      0       long    TimeMillis
      8       long    NanoTime
      16      int     Level
      20      int     Logger name index - string value in separate file
      24      int     Thread name index - string value in separate file
      28      long    Thread ID
      36      int     Thread priority
      40      int     Marker index - value & hierarchy in separate file
      44      int     ThreadContext key/value pair count
      48      int     ThreadContext key1 index - string value in separate file
      52      int     ThreadContext value1 index - string value in separate file
      56      int     ThreadContext key2 index
      60      int     ThreadContext value2 index
      64      int     Message length
      68      int     Message encoder FQCN index
      72      byte[]  Message data - below offset assumes 18 bytes of message data
      90      int     Throwable data length
      94      byte[]  Throwable data
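
The layout above can be sketched with a plain java.nio.ByteBuffer. This is only an illustration, not the proposed implementation: the class name BinaryRecordSketch and the encode signature are hypothetical, string-valued fields are assumed to have been resolved to indexes already, and big-endian byte order is assumed even though the ticket leaves that open.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: writes one log event record using the example layout. */
final class BinaryRecordSketch {

    /**
     * Encodes one record. String-valued fields are passed as indexes into the
     * string data; contextIndexes holds key1, value1, key2, value2, ...
     */
    static ByteBuffer encode(long timeMillis, long nanoTime, int level,
                             int loggerIdx, int threadNameIdx, long threadId,
                             int threadPriority, int markerIdx, int[] contextIndexes,
                             int encoderIdx, byte[] message, byte[] throwable) {
        int size = 44                               // fixed fields, offsets 0..43
                + 4 + 4 * contextIndexes.length     // pair count + key/value indexes
                + 4 + 4 + message.length            // message length, encoder index, data
                + 4 + throwable.length;             // throwable length, data
        // Assumption: big-endian; the ticket has not fixed the byte order.
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(timeMillis);                    // offset 0
        buf.putLong(nanoTime);                      // offset 8
        buf.putInt(level);                          // offset 16
        buf.putInt(loggerIdx);                      // offset 20
        buf.putInt(threadNameIdx);                  // offset 24
        buf.putLong(threadId);                      // offset 28
        buf.putInt(threadPriority);                 // offset 36
        buf.putInt(markerIdx);                      // offset 40
        buf.putInt(contextIndexes.length / 2);      // offset 44: number of pairs
        for (int idx : contextIndexes) {
            buf.putInt(idx);                        // offset 48 onwards
        }
        buf.putInt(message.length);
        buf.putInt(encoderIdx);
        buf.put(message);
        buf.putInt(throwable.length);
        buf.put(throwable);
        buf.flip();                                 // make the record readable from offset 0
        return buf;
    }
}
```

With two ThreadContext pairs and 18 bytes of message data, the throwable length lands at offset 90 and the throwable data at offset 94, matching the table.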

      Repeating String Data
      Repeating String data like thread names, logger names, marker names and ThreadContextMap keys and values should be logged only once, after which they can be referenced by their index.

      One way to do this is to save string data to a separate file. The main log file contains an index (the zero-based line number) into the string-data file instead of the full string. Index -1 means the String value was null. The format of the string-data file can simply be one unique string per line, with lines separated by a '\n' (0x0A) character. Any '\n' characters embedded in the string value are Unicode-escaped and written as "\u000A".
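
A minimal in-memory sketch of such a string index follows. The class StringIndexSketch and its methods are hypothetical names, and the sketch builds the file contents as a String rather than writing to disk. It does not handle strings that already contain the literal escape text, which a real format would need to disambiguate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the separate string-data file, kept in memory. */
final class StringIndexSketch {
    // The six-character escape for an embedded newline, built in two parts so
    // that the Java compiler does not treat it as a Unicode escape itself.
    private static final String ESCAPED_NEWLINE = "\\" + "u000A";

    private final Map<String, Integer> indexes = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    /** Returns the zero-based line number for value, or -1 for null. */
    int indexOf(String value) {
        if (value == null) {
            return -1;
        }
        Integer existing = indexes.get(value);
        if (existing != null) {
            return existing;            // already logged once: reuse its index
        }
        int index = lines.size();
        indexes.put(value, index);
        lines.add(value.replace("\n", ESCAPED_NEWLINE));
        return index;
    }

    /** Contents of the string-data file: one unique string per line. */
    String fileContents() {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```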

      An alternative to separate files is interspersing "string-data" records with "log event" records. Records could be prefixed with a single byte indicating their record type (e.g. '#' (0x23)=header, '\n' (0x0A)=log event, '$' (0x24)=string data).

      String-data record format:

      Offset  Type    String-Data Record Field Description
      0       int     index of the string (each unique String has a unique index)
      4       byte[]  the String value, encoded in the standard Java modified UTF-8 format used by DataOutput.writeUTF(String)
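
As a rough sketch of the interspersed variant, a string-data record could be written with java.io.DataOutputStream, which always writes big-endian and provides writeUTF. The class and method names here are hypothetical, and the sketch assumes the one-byte record type '$' (0x24) is written immediately before the record, so the table's offsets are relative to the byte after that prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a type-prefixed string-data record. */
final class StringDataRecordSketch {
    static final byte STRING_DATA = '$'; // 0x24, record type prefix

    /** Writes: type byte, int index, then the value via writeUTF. */
    static byte[] write(int index, String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(STRING_DATA);
            out.writeInt(index);
            out.writeUTF(value);  // 2-byte length, then modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the string value back from a record produced by write. */
    static String readValue(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            if (in.readByte() != STRING_DATA) {
                throw new IllegalArgumentException("not a string-data record");
            }
            in.readInt();         // skip the index
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that writeUTF stores the encoded length in two bytes, so a single value is limited to 65535 encoded bytes.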

      Custom Messages
      Note: custom Messages that implement the Encoder interface (introduced with LOG4J2-1274) can be written in binary form directly without first being converted to text (LOG4J2-506). Any specialized tool for reading binary log files should handle messages of type "text" out of the box, but could have some plugin mechanism for decoding custom messages.

      A more flexible and less intrusive variation of this is to have a registry that maps classes to their associated Encoders. That would allow not only custom Messages, but also the content of any ObjectMessage, to be encoded in a custom binary format. Domain classes would then no longer need to implement the Message interface.
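
One possible shape for such a registry is sketched below. The BinaryEncoder interface is a hypothetical stand-in defined only for this example; it is not Log4j's Encoder interface, and the lookup by exact class (no superclass or interface fallback) is a simplifying assumption.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a class-to-encoder registry. */
final class EncoderRegistrySketch {

    /** Stand-in for illustration only; not Log4j's Encoder interface. */
    interface BinaryEncoder<T> {
        void encode(T source, ByteBuffer destination);
    }

    private final Map<Class<?>, BinaryEncoder<?>> encoders = new ConcurrentHashMap<>();

    /** Associates an encoder with a domain class. */
    <T> void register(Class<T> type, BinaryEncoder<? super T> encoder) {
        encoders.put(type, encoder);
    }

    /**
     * Encodes value with the encoder registered for its exact class.
     * Returns false when no encoder is registered, so the caller can
     * fall back to a text representation.
     */
    @SuppressWarnings("unchecked")
    <T> boolean encode(T value, ByteBuffer destination) {
        BinaryEncoder<T> encoder = (BinaryEncoder<T>) encoders.get(value.getClass());
        if (encoder == null) {
            return false;
        }
        encoder.encode(value, destination);
        return true;
    }
}
```

A domain class would be registered once, for example registry.register(Long.class, (v, dest) -> dest.putLong(v)), after which any value of that class can be encoded without implementing Message.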

      Markers
      TBD: as Matt points out in the comments, Markers are special since they are hierarchical. One way to deal with this is to manage a separate file that stores the Marker hierarchy. Another way is to do something similar to PatternLayout: treat the Marker as a String value, where the string includes the hierarchy information. I like the simplicity of the latter approach.
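
To illustrate the second option, the sketch below flattens a marker and its parents into one string, which could then be indexed like any other repeating string. The class MarkerStringSketch and the "NAME[ PARENT, ... ]" notation are assumptions made for this example, not a format the ticket specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical marker type that flattens its hierarchy into one string. */
final class MarkerStringSketch {
    final String name;
    final List<MarkerStringSketch> parents = new ArrayList<>();

    MarkerStringSketch(String name, MarkerStringSketch... parents) {
        this.name = name;
        this.parents.addAll(Arrays.asList(parents));
    }

    /** Returns the name, followed by the flattened parents in brackets if any. */
    String flatten() {
        if (parents.isEmpty()) {
            return name;
        }
        StringBuilder sb = new StringBuilder(name).append("[ ");
        for (int i = 0; i < parents.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(parents.get(i).flatten());   // recurse into grandparents
        }
        return sb.append(" ]").toString();
    }
}
```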

      Versioning
      The binary file must start with a header containing version information and perhaps schema information that provides metadata on the log record. Schema information may make it possible to include or exclude fields. For version 1.0, the schema can either be fixed like the above example, or it could be a simple bitmask for the fields mentioned above.
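
A header with a field bitmask might look like the following sketch. Everything concrete here is an assumption for illustration: the 7-byte layout (record type '#', major version, minor version, int bitmask), the bit assignments, and the class name HeaderSketch.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a file header carrying a version and a field bitmask. */
final class HeaderSketch {
    // Assumed bit assignments, one bit per optional record field.
    static final int TIME_MILLIS = 1 << 0;
    static final int NANO_TIME   = 1 << 1;
    static final int LEVEL       = 1 << 2;
    static final int LOGGER_NAME = 1 << 3;
    static final int THREAD_NAME = 1 << 4;

    /** Writes: '#' record type, major version, minor version, field bitmask. */
    static ByteBuffer write(int major, int minor, int fieldMask) {
        ByteBuffer buf = ByteBuffer.allocate(7);   // big-endian by default
        buf.put((byte) '#');
        buf.put((byte) major);
        buf.put((byte) minor);
        buf.putInt(fieldMask);
        buf.flip();
        return buf;
    }

    /** Tells a reader whether records in this file contain the given field. */
    static boolean hasField(ByteBuffer header, int field) {
        return (header.getInt(3) & field) != 0;    // bitmask starts at offset 3
    }
}
```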

      Byte Order
      TBD: Are multi-byte values like ints and longs written in big-endian or little-endian order? This could be specified in the header, or we could fix it to one of the two. Exchange protocols like ITCH tend to select a fixed byte order (ITCH uses big-endian, i.e. network byte order). I like the simplicity of this approach.
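
The difference between the two options is easy to see with java.nio.ByteBuffer, whose default order is big-endian. The helper below is a hypothetical name used only to show the resulting bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper showing how one int is laid out in each byte order. */
final class ByteOrderSketch {
    static byte[] encodeInt(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }
}
```

For the value 0x01020304, big-endian yields the bytes 01 02 03 04 and little-endian yields 04 03 02 01. Fixing the order in the format spares every reader from branching on a header flag.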

          People

            Assignee: Unassigned
            Reporter: Remko Popma (rpopma)
