WhatWentWrongErrorHandling

From SPA Wiki

WS8 - What Went Wrong?
Error Handling in a Distributed System

Andy Longshaw and Eoin Woods

Acknowledgements

Many thanks to TonyBarrettPowell for making excellent notes during the session, which became this output with minimal editing!

Good Error Handling Practices

  • Create a logging hierarchy and classify each error within the hierarchy.
  • Allow error filtering based on the logging hierarchy, and don't perform message-related processing (e.g. creating debug strings) if the message isn't included in the current filter.
  • Detect and report errors at the "right" place (e.g. validating input).
  • Log the details where the error occurs, then convert it to a more general error/concept for propagation back to the caller.
  • Create a repository of all known errors.
  • Rotate developers into support for a time.
  • Separate error detection from error logging from error handling.
  • Report the operational context with the error information.
  • Log a unique operation ID with error context (to allow pieces of error information to be strung together).
  • Only intercept and handle an error if you can do something useful with it.
  • Explore failure scenarios with users to find out how important different errors are to them.
  • Classify errors so that they can be treated in groups.
  • When errors occur, continue with reduced functionality if possible.
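
The first two practices (a logging hierarchy, and no message construction for filtered-out messages) can be sketched with the standard java.util.logging package. This is only an illustration, not something produced in the session: the logger names, the describeOrder helper and the build counter are assumptions made up for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LazyLoggingSketch {
    // Counts how often the (hypothetically expensive) debug string is built.
    static int buildCount = 0;

    static String describeOrder(int orderId) {
        buildCount++;
        return "order " + orderId + ": full state dump";
    }

    static void process(Logger log, int orderId) {
        // The Supplier runs only if FINE passes the logger's current level,
        // so no debug string is created when the message is filtered out.
        log.fine(() -> describeOrder(orderId));
    }

    public static void main(String[] args) {
        // Dotted names form the hierarchy: the child inherits the parent's level.
        Logger parent = Logger.getLogger("app.dataaccess");
        Logger child = Logger.getLogger("app.dataaccess.sys1");

        parent.setLevel(Level.WARNING);
        process(child, 42);
        System.out.println("built after filtered call: " + buildCount);

        parent.setLevel(Level.FINE);
        process(child, 42);
        System.out.println("built after enabled call: " + buildCount);
    }
}
```

Changing the level on the parent logger changes the filter for every logger beneath it, which is one way to get filtering "based on the logging hierarchy".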

Bad Error Handling Practices

  • Shutting down the system on error.
  • No standards (for error handling).
  • Propagating errors that should be handled locally.
  • Exposing technical detail to end users.
  • Wrong use of unchecked exceptions.
  • Non-errors reported as errors.
  • Reporting "normal" business errors as exceptions.
  • Attempting to retro-fit an error handling strategy.
  • Vague errors that don't clearly define what happened and where.
  • Inflexible timeouts causing spurious errors.
  • Losing lead developers when production transition is complete.
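
One possible reading of the point about "normal" business errors is sketched below: an expected business outcome comes back as an ordinary result value, and exceptions are kept for genuine faults. The Account and WithdrawalResult names and the insufficient-funds scenario are assumptions chosen for illustration; they were not part of the session.

```java
// Result of a withdrawal; an expected business outcome is a status, not an exception.
class WithdrawalResult {
    enum Status { OK, INSUFFICIENT_FUNDS }

    final Status status;
    final long balanceAfter;

    WithdrawalResult(Status status, long balanceAfter) {
        this.status = status;
        this.balanceAfter = balanceAfter;
    }
}

class Account {
    private long balance;

    Account(long balance) {
        this.balance = balance;
    }

    WithdrawalResult withdraw(long amount) {
        if (amount <= 0) {
            // A caller bug rather than a business outcome, so an exception is used.
            throw new IllegalArgumentException("amount must be positive");
        }
        if (amount > balance) {
            // "Normal" business condition: reported in the result, balance unchanged.
            return new WithdrawalResult(WithdrawalResult.Status.INSUFFICIENT_FUNDS, balance);
        }
        balance -= amount;
        return new WithdrawalResult(WithdrawalResult.Status.OK, balance);
    }
}
```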

Team 1 Framework

Context

Error handling for a large distributed system involving a large number of data sources in existing operational systems on different platforms.

Framework Structure

        Presentation  ---------------->  Central Log
             !                     !         ^
       Business Layer -------------+         !
             !                     !         !
      Data Access Layer -----------+         !
             !                            Log Message Interface (MQ)
     -----------------                       ^
     !     !     !    !                      !
    Sys1  Sys2  Sys3 Sys4                    !
   [Shared Legacy Systems]                   !
     ^     ^     ^     ^                     !
   LogD1  LogD2 LogD3  LogD4-----------------+
       [Logging Daemons]

  • The system structure is on the left:
      • The Presentation layer displays the user interface and accesses the Business Layer.
      • The Business Layer provides business transaction logic and accesses the Data Access Layer.
      • The shared legacy systems are the sources of data for the business transactions, accessed via the Data Access Layer.
  • The unified logging system is on the right:
      • A single central log "server".
      • The Presentation, Business and Data Access layers access the log via a standard programming library.
      • A Logging Daemon is created for each legacy system; it runs on the remote machine, recognises error or status conditions of interest, and sends log messages to the central log.
      • The Logging Daemons access the central log via a message-based (IBM MQ) service interface.

Principles

  • Centralised logs (accessed via MQ messaging interface)
  • Provide tools to navigate errors
  • Add entries to the log at all of the levels of the system.
  • Session Id, Application Id would be used to identify errors
  • The legacy systems must be included in the error logging (using the MQ interface and a logging monitor on the legacy system to send errors to the Central Log).
  • For further analysis the legacy systems may require a service interface to enable search and filter of the error log.
  • Developers in this environment should be given the classification of what is logged, for example what to log in the central log.
  • Some form of information is required describing what happens when an error occurs, for example what an operator should do.
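
A rough sketch of how a Logging Daemon might pass entries to the central log, tagged with a Session Id and Application Id, is given below. It rests on several assumptions: an in-memory BlockingQueue stands in for the MQ interface (this is not the IBM MQ API), and the LogMessage fields and the "ERROR"/"WARN" matching rule are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One entry for the central log. Field names are assumptions.
final class LogMessage {
    final String sessionId;
    final String applicationId;
    final String text;

    LogMessage(String sessionId, String applicationId, String text) {
        this.sessionId = sessionId;
        this.applicationId = applicationId;
        this.text = text;
    }
}

// Stand-in for the message-based log interface; a real system would
// put the message on an MQ queue instead of an in-memory one.
class CentralLogClient {
    private final BlockingQueue<LogMessage> queue;

    CentralLogClient(BlockingQueue<LogMessage> queue) {
        this.queue = queue;
    }

    void send(LogMessage message) {
        queue.add(message);
    }
}

class LoggingDaemonSketch {
    // Hypothetical rule for "conditions of interest" in legacy system output.
    static boolean isOfInterest(String line) {
        return line.startsWith("ERROR") || line.startsWith("WARN");
    }

    // Forwards only the interesting lines, tagged with the ids used to find them later.
    static void scan(String applicationId, String sessionId,
                     List<String> lines, CentralLogClient client) {
        for (String line : lines) {
            if (isOfInterest(line)) {
                client.send(new LogMessage(sessionId, applicationId, line));
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<LogMessage> centralLog = new LinkedBlockingQueue<>();
        scan("Sys1", "session-17",
             Arrays.asList("INFO started", "ERROR disk full"),
             new CentralLogClient(centralLog));
        for (LogMessage m : centralLog) {
            System.out.println(m.sessionId + " " + m.applicationId + " " + m.text);
        }
    }
}
```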

Team 2 Framework

Context

There is a mix of technologies; the principle should be to make error handling easy for the developers, whatever the technology is.

Principles

  • Centralised Error Definitions (codes, format, classification, required supplementary information)
  • Example class structure for exception:
			
                        Exception
                            !
                  ----------------------
                  !                    !
             Application             System
              Exception             Exception
                  !
           ---------------
           !      !      !
          ...    ...    ...
      [Types of app. exception]
  • Example Error class:
        // The error has a unique id obtained at construction.
        class Error
        {
            public Error()
            public Error(Error e)
            toString()
            getID()
            setCode(ErrorCode code)
            setParameter(name, value)
            isRetryable()
            raise()  // throws the exception for the specific error
        }

        class ErrorReport
        {
            // If the error is not known or does not contain appropriate
            // information this is also an error condition.
            public ErrorReport(Error e)
            setCorrelationId(Id id)
            toLog()
            toSNMP()
        }
  • Abstract errors to the next level of system.
  • Retry information can be included in the error.
  • Layers can modify the retry information as required.
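
The Error class outline and the last three points can be turned into a small compilable sketch, shown below under some loud assumptions. The class is called AppError here to avoid clashing with java.lang.Error; the ErrorCode values, the setRetryable method, the AppException type and the choice to keep the original id in the copy constructor are all made up for illustration and were not specified in the session.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical centrally defined codes.
enum ErrorCode { DB_TIMEOUT, SERVICE_UNAVAILABLE }

class AppError {
    private final String id;
    private ErrorCode code;
    private boolean retryable;
    private final Map<String, String> parameters = new LinkedHashMap<>();

    // The error gets its unique id at construction.
    AppError() {
        this.id = UUID.randomUUID().toString();
    }

    // Assumption: a copy keeps the id so both levels can be correlated.
    AppError(AppError other) {
        this.id = other.id;
        this.code = other.code;
        this.retryable = other.retryable;
        this.parameters.putAll(other.parameters);
    }

    String getID() { return id; }
    ErrorCode getCode() { return code; }
    boolean isRetryable() { return retryable; }

    AppError setCode(ErrorCode code) { this.code = code; return this; }
    AppError setRetryable(boolean retryable) { this.retryable = retryable; return this; }
    AppError setParameter(String name, String value) { parameters.put(name, value); return this; }

    // Throws the exception that carries this error.
    void raise() throws AppException {
        throw new AppException(this);
    }

    @Override
    public String toString() {
        return id + " " + code + " retryable=" + retryable + " " + parameters;
    }
}

class AppException extends Exception {
    final AppError error;

    AppException(AppError error) {
        super(error.toString());
        this.error = error;
    }
}

class BusinessLayer {
    // Abstracts a data-access error to the next level and adjusts the retry information.
    static AppError abstractError(AppError low, int attemptsLeft) {
        AppError high = new AppError(low).setCode(ErrorCode.SERVICE_UNAVAILABLE);
        if (attemptsLeft == 0) {
            high.setRetryable(false);
        }
        return high;
    }
}
```

For example, a data access layer could build new AppError().setCode(ErrorCode.DB_TIMEOUT).setRetryable(true), and the business layer could pass it through abstractError before calling raise().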


Team 3 Framework

Context

Definition of a framework of principles for error handling in large distributed information systems.

Principles

  • Create a set of central error definitions.
  • Handle errors at system boundaries.
  • Classify errors to allow them to be processed appropriately.
  • Log errors at boundaries of the system's levels.
  • Enrich error information as the error propagates between the different levels of the system.
  • Ensure that each instance of an error has a unique identifier.
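
A minimal sketch of logging at boundaries, enriching the error as it propagates and giving each instance a unique identifier might look like the following. The TrackedException, Boundary and DataAccess names, the in-memory LOG list and the "Sys3 unreachable" failure are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Supplier;

// Each instance of an error carries its own unique identifier.
class TrackedException extends RuntimeException {
    private final String errorId = UUID.randomUUID().toString();
    private final List<String> context = new ArrayList<>();

    TrackedException(String message) {
        super(message);
    }

    String getErrorId() { return errorId; }
    List<String> getContext() { return context; }

    // Adds information known at the current level of the system.
    TrackedException enrich(String note) {
        context.add(note);
        return this;
    }
}

class Boundary {
    // Stand-in for a real log; one line is added per boundary crossed.
    static final List<String> LOG = new ArrayList<>();

    // Runs the body; on failure, enriches and logs the error, then lets it propagate.
    static <T> T call(String layer, Supplier<T> body) {
        try {
            return body.get();
        } catch (TrackedException e) {
            e.enrich("passed " + layer + " boundary");
            LOG.add(e.getErrorId() + " " + layer + ": " + e.getMessage());
            throw e;
        }
    }
}

class DataAccess {
    // Hypothetical failing call to a back-end system.
    static String fetch() {
        throw new TrackedException("Sys3 unreachable");
    }
}
```

A caller in the business layer could wrap the call as Boundary.call("business", () -> Boundary.call("data-access", DataAccess::fetch)); the same error id then appears in the log line written at each boundary.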

Back to OtTwoThousandAndFourOutput.