WhatWentWrongErrorHandling

From SPA Wiki

WS8 - What Went Wrong?
Error Handling in a Distributed System

Andy Longshaw and Eoin Woods

Acknowledgements

Many thanks to TonyBarrettPowell for making excellent notes during the session, which became this output with minimal editing!

Good Error Handling Practices

Create a logging hierarchy and classify each error within the hierarchy.
Allow error filtering based on the logging hierarchy, and don't perform message-related processing (e.g. creating debug strings) if the message isn't included in the current filter.
Detect and report errors at the "right" place (e.g. validating input).
Log details where the error occurs, then convert it to a more general error or concept for propagation back to the caller.
Create a repository of all known errors.
Rotate developers into support for a time.
Separate error detection from error logging from error handling.
Report the operational context with the error information.
Log a unique operation ID with error context (to allow pieces of error information to be strung together).
Only intercept and handle an error if you can do something useful with it.
Explore failure scenarios with users to find out how important different errors are to them.
Classify each error to allow errors to be treated in groups.
When errors occur, continue with reduced functionality if possible.
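
The filtering practice above (a logging hierarchy, with no message construction for filtered-out messages) can be sketched with the standard `java.util.logging` package. This is only an illustration: the logger names `orders` and `orders.dataaccess` and the `expensiveDump` helper are hypothetical and did not come from the session.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

class LazyLoggingSketch {
    // Hypothetical hierarchy: "orders" is the parent category of "orders.dataaccess".
    static final Logger PARENT = Logger.getLogger("orders");
    static final Logger DATA_ACCESS = Logger.getLogger("orders.dataaccess");

    // Counts how often the expensive debug string is actually built.
    static final AtomicInteger buildCount = new AtomicInteger();

    // Stand-in for costly message-related processing.
    static String expensiveDump() {
        buildCount.incrementAndGet();
        return "row dump: ...";
    }

    static void logQueryDetail() {
        // Supplier form: expensiveDump() runs only if FINE passes the current filter.
        DATA_ACCESS.fine(() -> expensiveDump());
    }

    public static void main(String[] args) {
        PARENT.setLevel(Level.WARNING); // filter applies to the whole "orders" subtree
        logQueryDetail();               // filtered out, so the string is never built
        PARENT.setLevel(Level.FINE);
        logQueryDetail();               // passes the filter, so the string is built
        System.out.println("debug strings built: " + buildCount.get());
    }
}
```

The child logger has no level of its own, so it inherits the filter set on its parent; changing one level in the hierarchy changes what the whole group reports.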

Bad Error Handling Practices

Shutting down the system on error.
No standards (for error handling).
Propagating errors that should be handled locally.
Exposing technical detail to end users.
Wrong use of unchecked exceptions.
Non-errors reported as errors.
Reporting "normal" business errors as exceptions.
Attempting to retro-fit an error handling strategy.
Vague errors that don't clearly define what happened and where.
Inflexible timeouts causing spurious errors.
Losing lead developers when production transition is complete.
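
As an illustration of the point about "normal" business errors, here is one possible alternative to raising an exception: return the expected business outcome as a value and keep exceptions for genuine faults. The `withdraw` operation and the `Outcome` names are hypothetical, chosen only for this sketch.

```java
class WithdrawalSketch {
    // Expected business outcomes are modelled as values, not exceptions.
    enum Outcome { ACCEPTED, INSUFFICIENT_FUNDS }

    static Outcome withdraw(long balancePence, long amountPence) {
        if (amountPence <= 0) {
            // A non-positive amount is a caller bug, so an exception is appropriate here.
            throw new IllegalArgumentException("amount must be positive");
        }
        // Running out of funds is ordinary business, so it is reported as a result.
        return amountPence > balancePence ? Outcome.INSUFFICIENT_FUNDS : Outcome.ACCEPTED;
    }

    public static void main(String[] args) {
        System.out.println(withdraw(10_000, 2_500));
        System.out.println(withdraw(10_000, 25_000));
    }
}
```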

Team 1 Framework

Context

Error handling for a large distributed system involving a large number of data sources in existing operational systems on different platforms.

Framework Structure

        Presentation  ---------------->  Central Log
             !                     !         ^
       Business Layer -------------+         !
             !                     !         !
      Data Access Layer -----------+         !
             !                            Log Message Interface (MQ)
     -------------------                     ^
     !     !     !     !                     !
    Sys1  Sys2  Sys3  Sys4                   !
   [Shared Legacy Systems]                   !
     ^     ^     ^     ^                     !
    LogD1 LogD2 LogD3 LogD4------------------+
      [Logging Daemons]

The system structure is on the left:
Presentation layer displays the user interface, accesses the Business Layer
Business Layer provides business transaction logic, accesses the Data Access Layer
The shared legacy systems are the sources of data for the business transactions, accessed via the Data Access Layer
The unified logging system is on the right:
Single central log "server"
Presentation, Business and Data Access layers access the log via a standard programming library.
A Logging Daemon is created for each legacy system, which runs on the remote machine, recognises error or status conditions of interest and sends log messages to the central log.
The Logging Daemons access the central log via a message based (IBM MQ) service interface.

Principles

Centralised logs (accessed via MQ messaging interface)
Provide tools to navigate errors
Add entries to the log at all of the levels of the system.
Session Id, Application Id would be used to identify errors
The legacy systems must be included in the error logging (using the MQ interface and a logging monitor on the legacy system to send errors to the Central Log).
For further analysis the legacy systems may require a service interface to enable search and filter of the error log.
Developers in this environment should be given the classification of what is logged, for example what to log in the central log.
Some form of information is required to identify what to do when an error occurs, for example what an operator should do in response.
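
The idea of using a Session Id and Application Id to navigate the central log can be sketched as follows. This is an in-memory stand-in only: the MQ interface is not modelled, and the `LogEntry` fields, the `forSession` method and the sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CentralLogSketch {
    // Hypothetical entry shape; the field names are assumptions, not from the session.
    static final class LogEntry {
        final String sessionId;
        final String applicationId;
        final String layer;
        final String message;

        LogEntry(String sessionId, String applicationId, String layer, String message) {
            this.sessionId = sessionId;
            this.applicationId = applicationId;
            this.layer = layer;
            this.message = message;
        }
    }

    private final List<LogEntry> entries = new ArrayList<>();

    // Every level of the system (and each logging daemon) appends to the same log.
    void append(LogEntry entry) {
        entries.add(entry);
    }

    // Navigation tool: collect everything recorded for one session, in arrival order.
    List<LogEntry> forSession(String sessionId) {
        List<LogEntry> matches = new ArrayList<>();
        for (LogEntry entry : entries) {
            if (entry.sessionId.equals(sessionId)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        CentralLogSketch log = new CentralLogSketch();
        log.append(new LogEntry("S-17", "web-ui", "Presentation", "Order page failed to load"));
        log.append(new LogEntry("S-42", "web-ui", "Presentation", "Login rejected"));
        log.append(new LogEntry("S-17", "sys3-daemon", "Legacy", "Sys3 batch queue full"));
        for (LogEntry e : log.forSession("S-17")) {
            System.out.println(e.applicationId + " [" + e.layer + "] " + e.message);
        }
    }
}
```

Filtering on the session identifier pulls the user-facing failure and the legacy-system condition into one view, which is the kind of navigation the principles ask for.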

Team 2 Framework

Context

There is a mix of technologies; the principle should be to make error handling easy for the developers, whatever the technology.

Principles

Centralised Error Definitions (codes, format, classification, required supplementary information)
Example class structure for exception:

			
                        Exception
                            !
                 ----------------------
                 !                    !
            Application            System
             Exception            Exception
                 !
          ---------------
          !      !      !
         ...    ...    ...
      [Types of app. exception]

Example Error class:

        // The error has a unique id obtained at construction.
        class Error 
        {
            public Error()
            public Error(Error e)
            toString()
            getID()
            setCode(ErrorCode code)
            setParameter(name, value)
            isRetryable()
            raise()  // throws the exception for the specific error
        }

        class ErrorReport
        { 
           // If the error is not known or does not contain appropriate 
           // information this is also an error condition.
           public ErrorReport(Error e)
           setCorrelationId(Id id)
           toLog()
           toSNMP()
        }

Abstract errors to the next level of system.
Retry information can be included in the error.
Layers can modify the retry information as required.
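
The last three points can be sketched in runnable form. The class below is a simplified, hypothetical variant of the Error outline above: it is named `AppError` to avoid clashing with `java.lang.Error`, and the `abstractTo` method, the error codes and the parameter names are assumptions for illustration, not something the team defined.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class RetrySketch {
    static final class AppError {
        private final String id;          // unique id obtained at construction
        private final String code;
        private final boolean retryable;
        private final Map<String, String> parameters = new LinkedHashMap<>();

        AppError(String code, boolean retryable) {
            this(UUID.randomUUID().toString(), code, retryable);
        }

        private AppError(String id, String code, boolean retryable) {
            this.id = id;
            this.code = code;
            this.retryable = retryable;
        }

        AppError setParameter(String name, String value) {
            parameters.put(name, value);
            return this;
        }

        // Abstract the error for the next level: new code, same id, and the
        // calling layer decides whether a retry still makes sense.
        AppError abstractTo(String newCode, boolean newRetryable) {
            AppError abstracted = new AppError(id, newCode, newRetryable);
            abstracted.parameters.putAll(parameters);
            return abstracted;
        }

        String getID() { return id; }
        String getCode() { return code; }
        boolean isRetryable() { return retryable; }
    }

    public static void main(String[] args) {
        // Data access level: a timeout that is worth retrying.
        AppError low = new AppError("DB_TIMEOUT", true).setParameter("system", "Sys2");
        // Business level: retries are exhausted, so the abstracted error is not retryable.
        AppError high = low.abstractTo("ORDER_SERVICE_UNAVAILABLE", false);
        System.out.println(high.getCode() + " retryable=" + high.isRetryable()
                + " sameId=" + high.getID().equals(low.getID()));
    }
}
```

Keeping the identifier across the abstraction lets the detailed and the general form of the error be tied together later.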

Team 3 Framework

Context

Definition of a framework of principles for error handling in large distributed information systems.

Principles

Create a set of central error definitions.
Handle errors at system boundaries.
Classify errors to allow them to be processed appropriately.
Log errors at boundaries of the system's levels.
Enrich error information as the error propagates between the different levels of the system.
Ensure that each instance of an error has a unique identifier.
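
One way to picture enrichment at boundaries with a unique identifier per error instance is an exception that collects context as it moves up. This is a minimal sketch under assumptions: `TracedException`, `enrich` and the two layer methods are hypothetical names, and logging at each boundary is indicated only by a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class EnrichmentSketch {
    static final class TracedException extends RuntimeException {
        // Each instance gets its own identifier when it is created.
        private final String errorId = UUID.randomUUID().toString();
        private final List<String> context = new ArrayList<>();

        TracedException(String message) {
            super(message);
        }

        // Add what the current level knows; returns this so it can be rethrown.
        TracedException enrich(String note) {
            context.add(note);
            return this;
        }

        String getErrorId() { return errorId; }
        List<String> getContext() { return context; }
    }

    static void dataAccessLayer() {
        throw new TracedException("connection refused")
                .enrich("data access: reading customer record from Sys2");
    }

    static void businessLayer() {
        try {
            dataAccessLayer();
        } catch (TracedException e) {
            // Level boundary: this is where the error would be logged, then enriched.
            throw e.enrich("business: pricing an order");
        }
    }

    public static void main(String[] args) {
        try {
            businessLayer();
        } catch (TracedException e) {
            System.out.println(e.getErrorId() + " " + e.getMessage() + " " + e.getContext());
        }
    }
}
```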

