The challenge: simulate millions of particles in golang, multi-player enabled, CPU only, smart TV compatible.
Let's go, pun very much intended.
So, during my day job I had to write a ws server which merged multiple upstream ws servers into a single ws server. (Don't ask) I was lost and not even the power of claude, gemini, and cursor could save me. The vibes were simply not enough to get the project done. I had to learn the real real stuff.
To learn the real stuff I decided to write a particle simulation and see how many particles I could get golang to push. I had already done this in the lands of javascript, rust, and swift. But given Go doesn't support SIMD, I knew it wouldn't be a fair fight against the other languages. More importantly, it would be boring, let alone help my day job.
I figured that to give Go its best shake I needed to up the difficulty by adding multi-player to the simulation. After all, Go is best known as a snappy, productive backend language. Is it snappy and productive enough to simulate a million particles and sync them across hundreds of clients?
Only one way to find out.
There is a rule, and an important one: no client-side simulation allowed, only server. The client should be a simple web page. It should work anywhere a browser runs.
Determinism. It's a key word in computer science, and it's a key word here too. If you start with the same initial state and apply the same input, you will always get the same result. Predictable and reproducible.
Many multi-player games use determinism to effectively decouple the relationship of bigger game states from the data the server must sync across clients.
Real-time strategy games are a poster child for this. Rather than send the positions of thousands of units, projectiles, unit health, etc to clients, you would send only player input data which the client can use to derive the game state at a given point in time. Add a little bit of client side prediction, rollback, and the like and you get a smooth game experience, usually.
But this requires the client to be able to simulate the game and not all clients are fast. How can this run everywhere a browser does if I want to simulate millions of particles?
I know what the ol'gafferongames would say right now. Even if I didn't use determinism I'd still need to simulate SOMETHING on the client. If I want to send millions of particle positions to the client, surely I'd need to send at least the positions right? I could derive the velocities and quantize the crap out of everything too. I am sure there are other tricks I don't remember from his excellent blog series on game state networking code.
I still call that cheating too as the client has to simulate something and it may not scale to millions of particles.
I have another idea. I can decouple the simulation size much like determinism does by taking a hint from graphics programming.
Way back in the days of Doom 3, when Mr. Carmack was in his prime, games would calculate lighting by building shadow geometry from polygons. Doom 3 did this and it looked amazing. However, as you increased polygon count, the cost of shadows went up with it.
The industry then figured out how to decouple polygon count from lighting by using a geometry buffer and deferred shading. The details are heavy, but the important part is that the cost of lighting is no longer proportional to the number of polygons in the scene. Instead, it is based on a “fixed” g-buffer size. The buffer is proportional to the render resolution.
This is why 4K is so expensive to render and why the games industry is so stoked on AI upscaling and frame interpolation. Fewer pixels means faster rendering. Geometry is making a comeback with slick virtualization too, but I digress. The important part is that deferred shading decouples polygon count from lighting.
Well, what if I went full SSR and did all the rendering on the server, sending the simulation frames to the clients? All a client needs to do is play a video onto an HTML canvas element. I would still need to send the input to the server, but the size of the simulation wouldn't matter. The cost would be fixed to the resolution of the client.
I could have a billion particles and the data the client needs would remain the same. Now, if there were only a few thousand particles it wouldn't be worth the trade off.
How do I know it is worth it though?
I want to simulate 1 million particles at HD resolution, so 1920x1080 pixels. There are 4 bytes per pixel for RGBA but I only need RGB. Actually, I only need one byte per pixel since that is how the previous particle simulations in the other languages worked. That means I am sending 2,073,600 bytes per frame, or a little over 2 MB. Oof. That is roughly 124 MB/s at 60 fps.
If I were to compress the frames they could drop to a fraction of that. H264 or H265 (if I pay) could cut that down to around 260 KB per frame. I could also send only 24 frames/s, making it pretty close to streaming regular video content.
Note we are talking full bytes here, not bits.
If I were to send the particle data instead: each particle is made up of an x, y, dx, and dy. They are floating point numbers at 4 bytes each, so a particle takes up 16 bytes. I could derive the dx and dy, so call it 8 bytes per particle. That makes 8,000,000 bytes, or 8 MB, roughly 4x as big as the raw frame buffer. I am sure a bit of quantization could bring that down more, and I wouldn't need to send the data 60 times a second, but what if I want 10 million particles? The frame buffer starts to look more appealing.
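Here's the back-of-the-envelope as a tiny Go program, mostly to keep myself honest about the sizes (these are just the numbers from above, not real project code):

// Rough per-frame sizes: raw 1-byte-per-pixel frames vs raw particle data.
package main

import "fmt"

func main() {
	const width, height = 1920, 1080
	const particles = 1_000_000

	frameBytes := width * height       // 1 byte per pixel
	particleBytes := particles * 2 * 4 // x, y as float32 (dx/dy derived)

	fmt.Printf("frame:     %d bytes (~%.1f MB), %.0f MB/s at 60 fps\n",
		frameBytes, float64(frameBytes)/1e6, float64(frameBytes)*60/1e6)
	fmt.Printf("particles: %d bytes (~%.1f MB)\n",
		particleBytes, float64(particleBytes)/1e6)
}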
The other thing that is nice here is that by sending the frames, the complexity drops. I don't need to worry about prediction or interpolation on the client. This wouldn't be great for a twitch shooter but they also don't sync millions of data points either.
No lossy compression. A particle simulation like this gets destroyed by lossy compression because of how dense the information is. There is little repetition in the data, so HEVC or other lossy codecs will wreck it. Lossy compression is out of the question. It HAS to be lossless.
For example, look at this image, see all that noise? That doesn't compress well especially in motion.
Additionally, compression isn't free. If I want to scale to hundreds of clients, I cannot spend all my time compressing data. This is an important consideration.
I am going to use TCP via websockets. For low-latency realtime apps UDP would be better as it allows unordered messages, preventing head-of-line blocking. However, it is significantly more complicated and not well supported on the web. TCP's guaranteed ordering of messages will cause some lag spikes, but alas it is the best the web has to offer. QUIC is a thing, but it only avoids head-of-line blocking across multiple streams, and this app is a single ordered stream of frames.
TCP websockets it is. Onwards!
My end game is to have a configurable play area where each client has its own view into the world. The best starting point would be to have a single view that is sent to all the clients.
The simulation itself is pretty simple: a particle struct and updates distributed across many goroutines. I knew that I couldn't write to a buffer while sending it to a client, so I set up double buffering of the frames. A ticker is the way to get a game loop going, though I am not a fan of it.
Here is the gist.
type Particle struct {
	x  float32
	y  float32
	dx float32
	dy float32
}

type Input struct {
	X           float32
	Y           float32
	IsTouchDown bool
}

type SimState struct {
	dt     float32
	width  uint32
	height uint32
}
// setup variables
func startSim() {
	// setup code
	for range ticker.C {
		// metadata code
		wg.Add(numThreads)
		// no locking, that is fine.
		numClients := len(clients)
		const friction = 0.99
		for i := 0; i < numThreads; i++ {
			go func(threadID int) {
				defer wg.Done()
				startIndex := threadID * particlesPerThread
				endIndex := startIndex + particlesPerThread
				if threadID == numThreads-1 {
					endIndex = particleCount
				}
				for p := startIndex; p < endIndex; p++ {
					for c := 0; c < numClients; c++ {
						input := inputs[c]
						if input.IsTouchDown {
							// apply gravity
						}
					}
					particles[p].x += particles[p].dx
					particles[p].y += particles[p].dy
					particles[p].dx *= friction
					particles[p].dy *= friction
					// bounce if outside bounds
				}
			}(i)
		}
		// wait for them to complete
		wg.Wait()
		framebuffer := getWriteBuffer()
		copy(framebuffer, bytes.Repeat([]byte{0}, len(framebuffer)))
		for _, p := range particles {
			// build frame buffer
		}
		swapBuffers()
		go func(data []byte) {
			// Non-blocking send: if the channel is full, this frame is dropped
			select {
			case frameChannel <- data:
			default:
				// Channel is full, drop the frame
			}
		}(framebuffer)
	}
}
A few notes. The simulation is basic: particles can be pulled around by players and are slowed down by a friction value.
I want to avoid locking as much as possible. This means I am going to allocate a fixed amount of memory based on the max number of clients I can handle, then use a client's index as a fast way to access their respective data, in this case input data.
I also want to avoid writing a frame buffer per client as I know that writing to the frame buffer is not cache friendly and should only be done once. As a matter of fact, building the frame buffer is the most expensive part. I learned this from previous languages. Cache locality and all that.
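The elided “build frame buffer” step above is just one pass over the particles writing one byte per pixel. A rough sketch of what it looks like, assuming per-pixel counts that saturate at 255:

// A sketch of the "build frame buffer" step: one byte per pixel,
// incremented for every particle landing on it, saturating at 255.
for _, p := range particles {
	x := int(p.x)
	y := int(p.y)
	if x < 0 || x >= int(simState.width) || y < 0 || y >= int(simState.height) {
		continue
	}
	idx := y*int(simState.width) + x
	if framebuffer[idx] < 255 {
		framebuffer[idx]++
	}
}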
One thing I learned about Go is that it can be verbose. Not Java levels, but it is pretty lengthy.
I will skip the boilerplate code.
func main() {
	go startSim()
	http.HandleFunc("/ws", wsHandler)
	// for webpage
	http.Handle("/", http.FileServer(http.Dir("./public")))
	log.Println("Server started on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal("ListenAndServe:", err)
	}
}
Initially, each websocket spins up its own ticker writing the latest frame buffer to the client. This is ugly and has sync issues, but as a first version it is fine.
func wsHandler(w http.ResponseWriter, r *http.Request) {
	// boiler plate
	clientsMu.Lock()
	// safely add client
	clientsMu.Unlock()
	defer func() {
		// safely clean up
	}()
	go func() {
		// wait for messages
		var input Input
		err := binary.Read(bytes.NewReader(message), binary.LittleEndian, &input)
		// maybe later we lock or figure out a way to not need a lock in this hot path
		// this is fine as the only chance of bad data is if someone connects or drops whilst updating input
		idx := findClientIndex(conn)
		if idx != -1 && idx < maxClients {
			inputs[idx] = input
		}
	}()
	ticker := time.NewTicker(time.Second / 30)
	defer ticker.Stop() // don't leak the ticker when the client drops
	for range ticker.C {
		data := getReadBuffer()
		if err := conn.WriteMessage(websocket.BinaryMessage, data); err != nil {
			log.Println("Write failed:", err)
			break
		}
	}
}
The client index is pretty silly but I really don't want to lock the hot path here.
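findClientIndex isn't shown above; it is just a linear scan. A sketch, assuming clients is the slice of connections guarded elsewhere by clientsMu:

// A sketch of findClientIndex: a linear scan over the client slice.
// Assumes the client count is small enough that O(n) lookup is cheap.
func findClientIndex(conn *websocket.Conn) int {
	for i, c := range clients {
		if c == conn {
			return i
		}
	}
	return -1
}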
I'll skip over the client javascript/html. It isn't that interesting: set up a websocket, write data to a canvas element using setPixels, the usual.
So, how well does it work? Pretty well. But it is not fast, and is writing a shit load of data. It also sometimes flickers when the tickers touch the frame buffer at the same time the simulation thread does.
Still, a good starting point.
I used pprof to run some samples and see where things were slow. I noticed that time was being spent creating goroutines. While lightweight, they are not free. The idea is to have goroutines listening on channels, so I refactored the simulation to wait for SimJob requests to come in.
type SimJob struct {
	startIndex  int
	endIndex    int
	simState    SimState
	inputs      *[maxClients]Input
	numClients  int
	framebuffer []byte
}

// other code
jobs := make(chan SimJob, numThreads)
var wg sync.WaitGroup
for i := 0; i < numThreads; i++ {
	go worker(jobs, &wg)
}

// in sim loop
for i := 0; i < numThreads; i++ {
	startIndex := i * particlesPerThread
	endIndex := startIndex + particlesPerThread
	if i == numThreads-1 {
		endIndex = particleCount
	}
	jobs <- SimJob{startIndex, endIndex, simState, &inputs, numClients, framebuffer}
}
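The worker itself isn't shown here; roughly, it blocks on the jobs channel and runs the same per-particle update the ad-hoc goroutines did. A sketch:

// A sketch of the long-lived worker: block on the jobs channel and run
// the same per-particle update as the first version, then signal the WaitGroup.
func worker(jobs <-chan SimJob, wg *sync.WaitGroup) {
	const friction = 0.99
	for job := range jobs {
		for p := job.startIndex; p < job.endIndex; p++ {
			for c := 0; c < job.numClients; c++ {
				if job.inputs[c].IsTouchDown {
					// apply gravity toward the input position
				}
			}
			particles[p].x += particles[p].dx
			particles[p].y += particles[p].dy
			particles[p].dx *= friction
			particles[p].dy *= friction
			// bounce if outside bounds
		}
		wg.Done()
	}
}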
I also reused frame buffers with a pool and fixed the syncing by pushing frames to a channel that another goroutine listens on. That goroutine writes the new frame out to all the clients before putting it back in the pool. At first I used a simple for loop, but that meant writes went only as fast as the slowest client, so instead it preps the frame and pushes it to ANOTHER set of channels, one per client. A goroutine on each of those channels then actually writes the data to the client.
func broadcastFrames(ch <-chan *Frame, pool *sync.Pool) {
	for {
		frame := <-ch
		// setup data
		clientsMu.Lock()
		for i, conn := range clients {
			// send to other channel
			select {
			case clientSendChannelMap[conn] <- dataToSend:
			default:
				log.Printf("Client %d's channel is full, dropping frame. Requesting full frame.", i)
			}
		}
		clientsMu.Unlock()
		pool.Put(frame)
	}
}
// other code
func writePump(conn *websocket.Conn) {
	channel := clientSendChannelMap[conn]
	for {
		message, ok := <-channel
		if !ok {
			return
		}
		if err := conn.WriteMessage(websocket.BinaryMessage, message); err != nil {
			log.Printf("Write to client failed: %v", err)
			return
		}
	}
}
I start a pump when a client connects and stop it when they disconnect. This works pretty well and fixes the sync issue.
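Roughly, the wiring looks like this (a sketch; the helper names addClient/removeClient are made up for illustration, the rest uses the names from the snippets above):

// A sketch of starting/stopping the per-client write pump.
// Assumes clients and clientSendChannelMap are guarded by clientsMu.
func addClient(conn *websocket.Conn) {
	clientsMu.Lock()
	clients = append(clients, conn)
	clientSendChannelMap[conn] = make(chan []byte, 2) // small buffer so slow clients drop frames
	clientsMu.Unlock()
	go writePump(conn)
}

func removeClient(conn *websocket.Conn) {
	clientsMu.Lock()
	if ch, ok := clientSendChannelMap[conn]; ok {
		close(ch) // writePump returns when the channel closes
		delete(clientSendChannelMap, conn)
	}
	// remove conn from the clients slice (elided)
	clientsMu.Unlock()
}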
I experimented with compression a bit using the standard flate lib. It worked but was pretty slow. I then experimented with every kind of encoding I could think of to reduce the data transferred.
I tried a bit mask where I would send 0/1 flags per pixel followed by the in-order pixel data that changed. This was fast per client but ended up being larger than the raw frame data once particles started moving around.
I tried run length encoding, variable length encoding, and delta encoding all combined in different ways.
There are some gotchas I ran into. For RLE to work you need a sentinel byte to know whether there is a run of data, which drops the available space per pixel. It could be possible to encode multiple pixels at a time but that didn't work that well. VLE is just run-length encoding using varints to shrink small numbers; this worked and helped reduce the size by almost 8% on average.
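For the curious, the sentinel-based RLE idea looks roughly like this (a sketch; the sentinel value and run format here are illustrative, not the exact scheme I ended up with):

// A sketch of sentinel-based RLE: a run is written as [sentinel, length, value];
// any other byte is a literal pixel. The sentinel steals one value from the
// per-pixel range, which is the gotcha mentioned above.
const runSentinel = 0xFF

func rleEncode(src []byte) []byte {
	out := make([]byte, 0, len(src))
	for i := 0; i < len(src); {
		j := i
		for j < len(src) && src[j] == src[i] && j-i < 255 {
			j++
		}
		if runLen := j - i; runLen >= 4 || src[i] == runSentinel {
			out = append(out, runSentinel, byte(runLen), src[i])
		} else {
			out = append(out, src[i:j]...)
		}
		i = j
	}
	return out
}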
Delta encoding was the most fun. The way I tried it was to delta encode between frames; that way, passing the delta into RLE would massively compress the data. The issue is that particle counts can go up or down per pixel. This required zigzag encoding, but there was a wrinkle.

If each pixel is one byte, and that byte can go from 0 to 255 in a single frame, then the deltas need to cover -255 to +255, which doesn't fit in a byte. Zigzag encoding worked once I dropped down to 7 bits per pixel: with values of 0-127 the deltas stay within -127 to +127, and zigzag maps that range into a single unsigned byte.
I read about zigzag encoding in a Microsoft paper on optimized compression of depth data from multiple Kinect sensors. I don't remember the name. To save you some time, just ask AI about it.
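The zigZagEncode helper used below looks something like this (a sketch; the decode side mirrors it on the client):

// Zigzag maps signed deltas to unsigned bytes so small magnitudes stay small:
// 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
func zigZagEncode(n int8) byte {
	return byte((n << 1) ^ (n >> 7))
}

func zigZagDecode(b byte) int8 {
	return int8(b>>1) ^ -int8(b&1)
}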
func CreateDeltaBuffer(oldBuffer, newBuffer []byte) []byte {
	if len(oldBuffer) != len(newBuffer) {
		return newBuffer
	}
	deltaBuffer := make([]byte, len(newBuffer))
	for i := 0; i < len(newBuffer); i++ {
		// pixels are capped at 7 bits, so the diff always fits in an int8
		diff := int16(newBuffer[i]) - int16(oldBuffer[i])
		deltaBuffer[i] = zigZagEncode(int8(diff))
	}
	return deltaBuffer
}
And it works. Unless the whole frame was moving around, the size was ~30% smaller than the raw frame data. If nothing moved, it would send only a few bytes! But again, there were problems.
The code is complicated. I had to track if I was sending delta frames or full frames, and if a client dropped a frame, I'd need to send a full frame again. This would cause a brief flash on the screen. I could add frame counting logic so the client could drop out of sync deltas but the code was already a bit much.
func broadcastFrames(ch <-chan *Frame, pool *sync.Pool) {
	for {
		frame := <-ch

		fullFrameBuffer.Reset()
		fullFrameBuffer.WriteByte(OpCodeFullFrame)
		fullFrameBuffer.Write(frame.FullBuffer)
		fullFrameBytes := fullFrameBuffer.Bytes()

		deltaFrameBuffer.Reset()
		deltaFrameBuffer.WriteByte(OpCodeDeltaFrame)
		deltaFrameBuffer.Write(frame.Delta)
		deltaFrameBytes := deltaFrameBuffer.Bytes()

		clientsMu.Lock()
		for i, conn := range clients {
			var dataToSend []byte
			if clientFullFrameRequestFlags[i] {
				dataToSend = fullFrameBytes
				clientFullFrameRequestFlags[i] = false
			} else {
				dataToSend = deltaFrameBytes
			}
			select {
			case clientSendChannelMap[conn] <- dataToSend:
			default:
				log.Printf("Client %d's channel is full, dropping frame. Requesting full frame.", i)
				clientFullFrameRequestFlags[i] = true
			}
		}
		clientsMu.Unlock()
		pool.Put(frame)
	}
}
It does work though.
However, I want each client to have their own “view” into a bigger frame where they can pan around the world fighting for particles. For this to work with the current setup, I'd need to track delta frames per client and figure out how to handle dropped delta frames better.
Back to the drawing board.
Stupid simple, is simple stupid.
I am going to have 1 bit per pixel. That means only a single “luminance” value. In the previous video I was effectively only using a few bits on the client, with only about 8 “luminance” values. Bet you thought it was more, right?
This will simplify everything. The only hard part is packing and unpacking the data. I also figured I could implement client camera state too. I waffled a bit over how to store this state and update it.
Does the client send their camera position directly? What if the window resizes? What if the client sends a negative camera size? What if they send an infinite one? How do I prevent blocking?
I ended up copying what I did for the input data.
type ClientCam struct {
	X      float32
	Y      float32
	Width  int32
	Height int32
}

var (
	inputs  [maxClients]Input
	cameras [maxClients]ClientCam
)
I would have the client send the camera data with the existing input message.
for {
	mt, message, err := conn.ReadMessage()
	if err != nil {
		log.Println("Read failed:", err)
		return
	}
	if mt == websocket.BinaryMessage {
		reader := bytes.NewReader(message)
		var touchInput Input
		// interpret x and y as offsets.
		var camInput ClientCam
		// read input
		errInput := binary.Read(reader, binary.LittleEndian, &touchInput)
		errCam := binary.Read(reader, binary.LittleEndian, &camInput)
		// set client input/camera state
		// handle errors in the input, bounds, negatives etc.
	}
}
For sending frame data, at first, I naively packed the buffer and then unpacked it again based on the region a client was rendering.
// in simulation thread
for _, p := range particles {
	x := int(p.x)
	y := int(p.y)
	if x >= 0 && x < int(simState.width) && y >= 0 && y < int(simState.height) {
		idx := y*int(simState.width) + x
		if idx < int(simState.width*simState.height) {
			byteIndex := idx / 8
			bitOffset := idx % 8
			if byteIndex < len(framebuffer) {
				framebuffer[byteIndex] |= (1 << bitOffset)
			}
		}
	}
}

// in frame send channel before sending to client
// get client camera info
for row := int32(0); row < height; row++ {
	for col := int32(0); col < width; col++ {
		mainFrameIndex := (y+row)*int32(simState.width) + (x + col)
		dataToSendIndex := row*width + col
		if mainFrameIndex/8 < int32(len(frameBuffer)) && dataToSendIndex/8 < int32(len(dataToSend)) {
			isSet := (frameBuffer[mainFrameIndex/8] >> (mainFrameIndex % 8)) & 1
			if isSet == 1 {
				dataToSend[dataToSendIndex/8] |= (1 << (dataToSendIndex % 8))
			}
		}
	}
}
This makes HD only about 260 KB per frame.
pprof has a nifty way of downloading a sample of the running app which you can then load and inspect with its CLI tool. I checked a sample and noticed the sim was slow when building the frame buffer, so I moved the frame building into the simulation workers. This is possible now because each thread only marks a bit as on if there is a particle there, which means I don't need to lock anything and can let the last writer win.
Before
Showing top 10 nodes out of 35
flat flat% sum% cum cum%
16.75s 52.54% 52.54% 17.05s 53.48% main.worker
6.42s 20.14% 72.68% 6.49s 20.36% main.broadcastFrames
And after
Showing top 10 nodes out of 36
flat flat% sum% cum cum%
13.01s 46.46% 46.46% 13.28s 47.43% main.worker
6.98s 24.93% 71.39% 7.08s 25.29% main.broadcastFrames
The simulation is a bit faster, but notice that broadcastFrames got slower. You know that packing I am doing? Yeah, that ended up being very, very slow. 25% of the time is spent broadcasting frames to a few clients because of all the bit ops packing and unpacking.
A simple fix is to use full bytes, even if the count is only ever 0-1, and then use a lookup map to skip the bit ops, like so.
var uint64ToByteLUT = make(map[uint64]byte)

func init() {
	var byteSlice = make([]byte, 8)
	for i := 0; i < 256; i++ {
		for bit := 0; bit < 8; bit++ {
			if (i>>bit)&1 == 1 {
				byteSlice[bit] = 1
			} else {
				byteSlice[bit] = 0
			}
		}
		// trust me bro
		uint64ToByteLUT[BytesToUint64Unsafe(byteSlice)] = byte(i)
	}
}
for row := int32(0); row < height; row++ {
	yOffset := (y + row) * int32(simState.width)
	for col := int32(0); col < width; col += 8 {
		fullBufferIndex := yOffset + (x + col)
		chunk := frameBuffer[fullBufferIndex : fullBufferIndex+8]
		key := BytesToUint64Unsafe(chunk)
		packedByte := uint64ToByteLUT[key]
		if outputByteIndex < int32(len(dataToSend)) {
			dataToSend[outputByteIndex] = packedByte
			outputByteIndex++
		}
	}
}
Now slicing the frame region for a client is almost free. The idea is pretty simple: interpret 8 bytes at a time as a uint64, which maps to the corresponding single byte with the right bits set. I did something similar on the client too to make unpacking faster. The extra memory here is pretty small and worth it to me.
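BytesToUint64Unsafe isn't shown; here is a sketch of what “unsafe” means here, assuming the slice always has at least 8 bytes:

import "unsafe"

// Reinterprets the first 8 bytes of b as a uint64 without copying.
// Assumes len(b) >= 8; byte order doesn't matter because the LUT is
// built with the exact same function.
func BytesToUint64Unsafe(b []byte) uint64 {
	return *(*uint64)(unsafe.Pointer(&b[0]))
}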
Not bad. I am sure more can be done but the main sim is taking up the majority of the time.
Showing top 10 nodes out of 49
flat flat% sum% cum cum%
15.62s 58.85% 58.85% 15.83s 59.65% main.worker
4.66s 17.56% 76.41% 4.66s 17.56% runtime.pthread_cond_wait
2.96s 11.15% 87.57% 2.96s 11.15% runtime.pthread_cond_signal
0.90s 3.39% 90.96% 0.90s 3.39% runtime.kevent
0.47s 1.77% 92.73% 0.80s 3.01% runtime.mapaccess2_fast64
0.36s 1.36% 94.08% 1.17s 4.41% main.broadcastFrames
0.20s 0.75% 94.84% 0.20s 0.75% runtime.asyncPreempt
0.17s 0.64% 95.48% 0.17s 0.64% runtime.madvise
0.17s 0.64% 96.12% 0.17s 0.64% runtime.pthread_kill
0.17s 0.64% 96.76% 0.17s 0.64% runtime.usleep
On the topic of memory, with millions of particles the server barely breaks 100 MB. Not Rust-good, but Node would be well over a few gigs and I cannot imagine Python would be much lighter.
I waffled about again for well over a day trying to get some Go assembly to work for those sweet SIMD gains, but it wasn't to be. There are some libraries I tried too, but none of them did better than what I already had. There are some interesting things with Go's compiler around optimizing away dereferences and bounds checks, though, which gave me a 30% boost.
For example, before I was updating the particles like so
for i := job.startIndex; i < job.endIndex; i++ {
	particles[i].x += particles[i].dx
	// etc
But by doing this
for i := job.startIndex; i < job.endIndex; i++ {
	p := &particles[i]
	p.x += p.dx
Big boost. It makes sense: bounds checking was happening way too often. I did a few more micro-optimizations which helped, but nothing significant. I did try SoA too and that actually did worse than AoS. Go figure.
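For reference, the SoA (structure of arrays) layout I tried was roughly this, versus the AoS Particle struct from earlier:

// AoS: one struct per particle (what I kept).
// SoA: one slice per field (what I tried and reverted).
type ParticlesSoA struct {
	x  []float32
	y  []float32
	dx []float32
	dy []float32
}

func (s *ParticlesSoA) update(start, end int, friction float32) {
	for i := start; i < end; i++ {
		s.x[i] += s.dx[i]
		s.y[i] += s.dy[i]
		s.dx[i] *= friction
		s.dy[i] *= friction
	}
}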
I tried to use a fast inverse square root function, but for some reason the Go version of it didn't work at all, which is weird because JS land had no issue. Oh well.
The previous profile samples were taken simulating a dozen or so clients with a few million particles. If I bump that up to 20m particles, I get this.
Showing top 10 nodes out of 21
flat flat% sum% cum cum%
113.16s 93.06% 93.06% 115.22s 94.75% main.worker
2.27s 1.87% 94.93% 2.27s 1.87% runtime.pthread_cond_wait
2.04s 1.68% 96.60% 2.04s 1.68% runtime.asyncPreempt
1.23s 1.01% 97.62% 1.23s 1.01% runtime.pthread_cond_signal
0.63s 0.52% 98.13% 1.56s 1.28% runtime.mapaccess2_fast64
0.30s 0.25% 98.38% 1.92s 1.58% main.broadcastFrames
The simulation is the clear bottleneck, and without SIMD I am not sure how much better I can get it.
It works though. Time to deploy it to the cloud.
I was going to use a DigitalOcean droplet but it would cost more than an Equinox gym membership for anything faster than a toaster. I ended up spending ~$8/month at netcup for a 16 GB, 10-core ARM VM in Manassas, Virginia with a ping of over 300ms. Fantastic. Not sure why the DNS on their panel says Germany though.
I opened a few ports under root like the psychopath I am and gave it a spin.
Pretty slick. More clients barely impact CPU usage. But will it scale to 1k clients? I am not sure.
The netcup VPS has a 2.5 Gbit pipe. At 1920x1080 with 1 bit per pixel at 30 frames/s, each client needs roughly 60 Mbit/s, so the pipe tops out around 40 full-HD clients. Now, if these were mobile devices with a fraction of that resolution, say 390x844 or so (we are not talking native), each client drops to about 10 Mbit/s and a couple hundred clients starts to look realistic.
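The back-of-the-envelope, again as a quick Go sanity check (numbers only, not project code):

// Rough per-client bandwidth at 1 bit per pixel, 30 fps, against a 2.5 Gbit pipe.
package main

import "fmt"

func main() {
	const pipeMbit = 2500.0 // 2.5 Gbit/s uplink

	for _, res := range [][2]float64{{1920, 1080}, {390, 844}} {
		bitsPerFrame := res[0] * res[1] // 1 bit per pixel
		mbitPerSec := bitsPerFrame * 30 / 1e6
		fmt.Printf("%.0fx%.0f: %.1f Mbit/s per client, ~%.0f clients\n",
			res[0], res[1], mbitPerSec, pipeMbit/mbitPerSec)
	}
}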
It is worth noting it does suffer from head-of-line blocking at times, especially as the resolution gets cranked up. Clients with a resolution larger than the max particle grid are ignored, so your 5K display won't work.
I did a bit of polishing on the client, edge scrolling, panning, mobile support, etc., but this post isn't about JavaScript, it is about golang.
Go is able to simulate 2.5 million particles at 60fps while sending data at 30 fps to, in theory, a few hundred clients. And because this all happens on the server it even runs on a smart TV. Don't believe me? Just visit howfastisgo.dev and see for yourself or watch the video below.
While go certainly isn't a compute workhorse, that is pretty impressive. Anywhere a browser runs this works! Go is just THAT fast.
This is just so cursed. I love it.
Until next time.