9 min read

Golang: Discrete Event Simulation 101

Building a simulation model is a convenient way (and sometimes the only way) to stress test your system by taking it through scenarios that don't often occur in real life.
three rows of discrete cats sitting on white marble floor in perfect straight lines

Hey friend 👋

I've been doing a lot of simulation work lately trying to figure out how a system with various moving components behaves in different scenarios and I thought it would be great to share how I approached it.

Writing a simulation environment has been fun and I had a chance to dust off my university notes on modeling stochastic processes. 🤓

What and Why

In this post, I'll show how to leverage the two data structures I mentioned earlier, so you have a chance to read through those posts to recap on how those containers work internally:

  1. Golang: Circular Queue from Scratch
  2. Golang Priority Queue: Sorting at Scale

Building a simulation model is a convenient way (and sometimes the only way) to stress test your system by taking it through scenarios that don't often occur in real life.

This could be super useful when your system is complex and have various probabilistic processes and cross-component dependencies which are hard to reason about and you cannot afford to risk reliability and test your scenarios in production.

Hope you feel excited and ready to build some simulations 😉

Simulation Scenario: Feeding Pets

This is super exciting!
We'll be building an automated feeding system for our animal companions! 🐕 🐈

In reality, all systems are similar in some ways and you can apply the same approach to reasoning about backend microservices or any other applied domain.

Here's the assumptions:

  1. There are N pets we need to feed, all of them hungry at the start.
  2. We've got K automatic feeders (K <= N)
  3. Once a feeder is available, a random hungry pet would start using it.
  4. When a pet is at a feeder, it eats for time T and leaves satisfied releasing the feeder.
  5. Then the cycle repeats until there's no more hungry pets.
  6. Once in a while (with probability P) a feeder breaks and someone needs to go and fix it, which takes time R.

What we'd like to know:

  1. Time duration for 50%, 95%, 99% feeding completion.
  2. Can we reasonably feed all the pets twice a day and how many feeders we would need for N pets.
a few cats and dogs at the top waiting to be fed, automatic feeders in the middle, three occupied, one broken, and fed pets at the bottom.

Now, this might look artificial, but bear in mind that the goal in designing a simulation system is a balance act of conflicting priorities:

  1. Trying to make the simulation as simple as possible so that it's performant and easy to build and evolve.
  2. Getting to the level of detail that reflects the properties you're trying to model.

If we can get away with not modeling everything ideally and still get a reasonable approximation of reality, that's good enough 😉

Application of Queue

The key element in a simulation system is an event queue.

Unless you're simulation something without durations, you need to track when certain things are supposed to happen.

a queue of timestamps in an increasing order

What we want from an event queue:

  1. Push future events onto the queue
  2. Retrieve the earliest event from the queue

Now, this should look familiar 😉

This is exactly what a Heap container does. And in a few special cases you can use a Ring queue as well for more performance (constant instead of logarithmic time).

How to build a simulation

If you want to follow along, open your favourite editor/IDE and make a new go binary package. Alternatively, you can use Go Playground.

There are two key ways to simulate time:

  1. Incremental Time Progression – you simulate X time steps in a loop moving forward in fixed time intervals.
  2. Event-time Progression – jump to the earliest future event time at every simulation step.

In general, if you need to simulate a year of time and only 100 events happen in this year in your system, you can see how jumping to event-time (2) is significantly more efficient than uselessly looping over hourly or minute time increments (1).

We'll look at both appoaches.

Here's the start:

package main

// our event queue interface, this is anything that we can
// add elements to, remove minimal, peek minimal and check its size.
type EventQueue[T any] interface {
	Push(element T) error
	Pop() (T, error)
	Peek() (T, error)
	Size() int
}

func main() {
    // ...
}

A structure that would hold our simulation parameters (configuration that doesn't change from run to run in case we want to repeat this simulation).

type Simulation struct {
	// min-queue for tracking repair times
	eventQueue EventQueue[int]
	// total number of pets we need to process i.e. feed
	taskCount int
	// number of parallel processing gates i.e. feeders
	gateCount int
	// probability of a gate breaking
	breakChance float32
	// units of time required for repairs
	repairTime int
	// processing time
	processingTime int
}

// run the simulation and return 3 duration percentiles:
// time to 50%, 90% and 99% done
func (s *Simulation) run(seed int64) (percentileTime [3]int) {
    // this isn't doing anything useful now, we'll code this later
    return
}

And here's the main function stub, this is how we'll run our simulation program:

func main() {
	var r EventQueue[int] = nil // nil for now

    // these are example settings, feel free to play around with the simulation parameters!
	stepSim := Simulation{
		eventQueue:     r,
		taskCount:      1_000_000, // feed a millon pets!
		gateCount:      10,        // with 10 feeders
		breakChance:    0.05,      // each has a 5% chance of failure
		repairTime:     120,       // requiring 120 time units of repair time
		processingTime: 15,        // and with 15 time units of eating time
	}

    // give a seed for random generated probabilistic events
	res := stepSim.run(100)

    // print out results
	fmt.Printf("Time to X%%: 50%% = %d, 95%% = %d, 99%% = %d\n", res[0], res[1], res[2])
	fmt.Printf("Relative Time to X%%: 50%% = 1x, 95%% = %.2fx, 99%% = %.2fx\n", float32(res[1])/float32(res[0]), float32(res[2])/float32(res[0]))
}

At this point, although the program isn't doing anything useful yet, we have the main structure in place and can compile and run it 🙌

See this step on Go Playground

a cart rolling down a slope with a number of stacked cubes, it passes through a number of gates, each gate removing some of the cubes from the cart, it ends up with a single cube.

Fixed Time Increments

The idea here is that your events happen on (or approximately on) a time grid i.e. every hour, every day etc.

In this case, this is your natural time step and the granularity / precisions of your simulation.

The smaller the time step, the more precision you have, but at the cost of compute efficiency since there will be more simulation cycles.

Let's work on our run function:

func (s *Simulation) run(seed int64) (percentileTime [3]int) {
    // set the random generator seed
	rgen := rand.New(rand.NewSource(seed))

	var time int // current simulation time
	var done int // done count
	var broken int // broken gate count
    
    // run simulation until all done
	for done < s.taskCount {
		// "fix" previously broken gates that
        // have finished their repair time
		for s.eventQueue.Size() > 0 {
			ts, _ := s.eventQueue.Peek()
            // if the minimal even of the event queue is in
            // future, stop
			if ts > time {
				break
			}
			s.eventQueue.Pop() // remove the past event
			broken--           // reduce broken
		}

		// process all healthy gates
		time += s.processingTime      // fixed time step
		done += s.gateCount - broken  // complete all healthy gate tasks

		// roll a chance to of breakage for all healthy gates
        // if it's broken already, it cannot break again
		for idx := 0; idx < s.gateCount-broken; idx++ {
            // test breakage by generating a uniform random number [0,1)
            // [0-----[5%]------------------------1]
            //     ^                   ^
            //     | brakes            | survives
			if rgen.Float32() < s.breakChance {
				broken++
                // schedule a repair at current + repair time
				s.eventQueue.Push(time + s.repairTime)
			}
		}

		// Metrics: percentiles of done
        percentDone := 100 * done / s.taskCount
        percentiles := []int{50, 95, 99}
		for idx, p := range percentiles {
			if percentDone >= p && percentileTime[idx] == 0 {
				percentileTime[idx] = time
			}
		}
	}

	return
}

This should look pretty straight-forward.

Now, in order to run this, we need to provide an actual event queue container (remember, we have a nil in the main function now).

I'll chose a RingQueue in this case because fortunately, we track only breakage events in this scenario and they have a fixed repair time.

This means that out future event time will only go up with every single step and will always be larger than the oldest event on the queue.

So there is no need to re-order events, we just need to carefully add them to the end of the queue and pull from the start.

This is a perfect use case for a RingQueue (github repo).


import (
	"fmt"
	"math/rand"

    // import RingQueue from the github repo
	roundrobin "github.com/sombr/go-container-roundrobin"
)

// ...

func main() {
    // here's our RingQueue container
	r := roundrobin.NewRingQueue[int](10)

	stepSim := Simulation{
		eventQueue:     r,
		taskCount:      1_000_000,
		gateCount:      10,
		breakChance:    0.05,
		repairTime:     120,
		processingTime: 15,
	}

	res := stepSim.run(100)

	fmt.Printf("Time to X%%: 50%% = %d, 95%% = %d, 99%% = %d\n", res[0], res[1], res[2])
	fmt.Printf("Relative Time to X%%: 50%% = 1x, 95%% = %.2fx, 99%% = %.2fx\n", float32(res[1])/float32(res[0]), float32(res[2])/float32(res[0]))
}

See this code on Go Playground

Although we're using a fast event-queue container here, iteration itself can be slow on large timelines.

Mr Bean is repeatedly pressing a keyboard button and looks bored

So, let's see how we can speed this up by jumping through time.

Event-Time Jump

Time jumping is super easy! We already have core simulation code that steps through the increments and we have an event-queue that gives us the next earliest event.

Knowing that no other events happen while we're waiting for the next event (repairs), we can jump straight to it! 🚀⏳

func (s *Simulation) run(seed int64) (percentileTime [3]int) {
	rgen := rand.New(rand.NewSource(seed))

	var time int
	var done int
	var broken int
	for done < s.taskCount {
		// "fix" broken
		for s.eventQueue.Size() > 0 {
			ts, _ := s.eventQueue.Peek()
			if ts > time {
				break
			}
			s.eventQueue.Pop()
			broken--
		}

		// --------------- THIS CODE HAS CHANGED ---------------
		// default next time step to the next processing increment
        nextTime := time + s.processingTime
        // if we have events on the queue, get the earliest time
		if s.eventQueue.Size() > 0 {
			nextTime, _ = s.eventQueue.Peek()
		}

        // add however many we'll be able to do in the meantime
		done += (s.gateCount - broken) * (nextTime - time) / s.processingTime
		time = nextTime // jump!
        
        // ------------------- END OF CHANGE -------------------

		// roll a chance to of breakage for all healthy gates
		for idx := 0; idx < s.gateCount-broken; idx++ {
			if rgen.Float32() < s.breakChance {
				broken++
				s.eventQueue.Push(time + s.repairTime)
			}
		}

		// percentiles of done
		percentDone := 100 * done / s.taskCount
		percentiles := []int{50, 95, 99}
		for idx, p := range percentiles {
			if percentDone >= p && percentileTime[idx] == 0 {
				percentileTime[idx] = time
			}
		}
	}

	return
}

See this code on Go Playground

Now, this is much more performant.

However, you might be able to spot a logical regression. Because we're jumping through the repair events and tracking completion along the way, our percentiles might be skewed. Additionally, breakages also happen only at repair processing times, which degrades simulation quality.

This is a tradeoff and might be tolerable depending on your goals.

However, I will show a more general approach to handling simulated events that is still performant and capable of handling different types of events and probabilistic durations.

Event Priority Queue

What if we need to track multiple different event types? Or if the event durations aren't fixed, but generated randomly.

We cannot continue using the RingQueue in this case since we won't have ascending order of timestamps anymore.

Imagine that on step 1 we schedule an event for time 10 and on step 2 we need to schedule for time 5.

The best solution for this that holds the requirements of always giving us the earliest timestamp (minimal element) and still allowing to add arbitrary timestamps into the mix is a HeapQueue (also known as Priority Queue).

Take a look at this code and see if you can trace how this works:

// both repairs and processing are events now
type Event struct {
	time int   // event timestamp as before
	kind byte  // event type - repairs or processing
}

type Simulation struct {
	// note that this container holds items of Event type now
	eventQueue EventQueue[Event]
	taskCount int
	gateCount int
	breakChance float32
	repairTime int
	processingTime int
}


func (s *Simulation) run(seed int64) (percentileTime [3]int) {
	rgen := rand.New(rand.NewSource(seed))

	var time int
	var done int
    // track how make tasks are in flight (i.e. pets eating)
	var inflight int
	var broken int
	for done < s.taskCount {
		// since everything is events, we can start with
        // rolling a breakage chance and scheduling repairs
		for idx := 0; idx < s.gateCount && broken < s.gateCount; idx++ {
			if rgen.Float32() < s.breakChance {
				broken++
                
                // probabilistic repair time
                // mean = repairTime, stdev = repairTime / 4
				repairTime := s.repairTime + int(rgen.NormFloat64()*float64(s.repairTime)/4)
                // truncate a marginal chance that it goes negative which would not make sense
				if repairTime < 0 {
					repairTime = 0
				}

				repairedAt := time + repairTime
				s.eventQueue.Push(Event{
					kind: 'R', // Repair
					time: repairedAt,
				})
			}
		}

		// top up inflight processing
        // tasks (pets) fill the available healthy gates
		for inflight < s.gateCount-broken {
			inflight++
            
            // probabilistic processing time
            // mean = processingTime, stdev = processingTime / 4
			processingTime := s.processingTime + int(rgen.NormFloat64()*float64(s.processingTime)/4)
			if processingTime < 0 {
				processingTime = 0
			}

			doneAt := time + processingTime
			s.eventQueue.Push(Event{
				kind: 'D',
				time: doneAt,
			})
		}

        // jump to the next earliest event
		nextEvent, _ := s.eventQueue.Pop()
		if nextEvent.kind == 'R' { // a gate got repaired
			broken--
		} else { // a pet finished eating :)
			done++
			inflight--
		}
        // time travel
		time = nextEvent.time

		// percentiles of done
		percentDone := 100 * done / s.taskCount
		percentiles := []int{50, 95, 99}
		for idx, p := range percentiles {
			if percentDone >= p && percentileTime[idx] == 0 {
				percentileTime[idx] = time
			}
		}
	}

	return
}

See this code on Go Playground

This one was a little lengthy, but we got there 😉

a man shoing thumbs up and saying well done.

You've been awesome :)

Thank you for reading and let me know if this helped you solve your engineering problems! I'm super interested to know what use cases you have!

As always, please subscribe - it makes my day brighter ☀️