Voice management in game audio

In my experience, one of the major themes in audio programming for games is voice management. Voice management, as I understand it, is the act of managing the set of playing sounds (voices) and deciding which of those voices should be heard and which should not. A voice represents a single sound file being played back.

A couple of terms I would like to clarify before we begin:

  • a playing voice or real voice is a sound that is currently playing back and contributing to the audio output of the game
  • a virtual voice is a sound that should be playing but is not actively contributing to the audio mix. It may continue to advance its playback position until the sound either stops or de-virtualizes, turning it into a playing voice.
  • a sound behaviour can be a set of data – crafted by a sound designer – defining how sounds should be played back. This might include dynamic systems like parameter driven audio properties, multi-layered blending sounds and more

Let’s dive into WHY we want to do voice management:

Esthetic of the mix

One of the biggest benefits gained from employing voice management is that the mix can be cleaned up. In the full post-production workflow the sound effect editors will cut sound effects for everything, and it’s the mixer’s responsibility to adjust volumes, change equalization and mute sounds so that the most important sounds can shine. For storytelling, removing the sounds that do not support the narrative can be extremely beneficial, and oftentimes “less is more”.

An additional perspective is Walter Murch’s rule of two and a half.


Playing a sound and adding its sample data to the audio mix costs CPU time, so by not playing all the sounds a game may request of the audio engine, the audio system can optimize its performance.

Once we decide a voice shall not be heard, we can either virtualize it (still keeping track of its playback position) so that it might de-virtualize again later, or we can simply stop the voice. Stopping the sound can be a good choice when the sound is not looping and is fairly short, but this behaviour can also be exposed to technical sound designers.
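As a minimal sketch of that decision, assuming a made-up `Sound` record and a hypothetical one-second cutoff for what counts as "fairly short" (in practice this would be a setting exposed to sound designers):

```python
from dataclasses import dataclass


@dataclass
class Sound:
    name: str
    duration_s: float
    looping: bool


def action_when_culled(sound: Sound, short_one_shot_s: float = 1.0) -> str:
    """Return "stop" or "virtualize" for a voice that should not be heard."""
    # Short one-shots are simply stopped: by the time they could
    # de-virtualize they would most likely be over anyway.
    if not sound.looping and sound.duration_s <= short_one_shot_s:
        return "stop"
    # Loops and long sounds keep a tracked playback position instead.
    return "virtualize"
```

With these assumptions a 0.3 second footstep is stopped, while an engine loop or a long dialogue line is virtualized.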


Performance is dependent on the hardware we measure on so we would also take into account other resources such as memory usage, the availability of SPUs/RSX memory and hardware decompression.

Decoding audio from a compressed format before playback and mixing is one of the more processing-intensive operations an audio engine will do, so leveraging hardware decoders, which spare the CPU from decoding in software, is extremely beneficial.

On iOS this would mean using MP3, on Xbox One it’s XMA2, PS4 has AT9 and the Switch has hardware support for the Opus codec.

Setting a maximum amount of voices

To begin with, we can set a certain number of voices which we want to be able to play back simultaneously. Let’s call that our “Maximum Voice Count” or “real” voice limit. Our system shall never have more playing voices than this number. Should more voices be requested by newly playing sound behaviours, we must either stop or virtualize already playing voices. Alternatively, we might ignore the new sound triggers.
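Here is a rough sketch of such a limit. The class name `VoiceBudget` and the three policy strings are invented for illustration, and "oldest voice first" is just the simplest possible victim selection, not a recommendation:

```python
from collections import deque

POLICIES = ("ignore_new", "stop_oldest", "virtualize_oldest")


class VoiceBudget:
    """Never allows more than max_real_voices playing voices."""

    def __init__(self, max_real_voices, on_full="virtualize_oldest"):
        if max_real_voices < 1:
            raise ValueError("max_real_voices must be at least 1")
        if on_full not in POLICIES:
            raise ValueError(f"unknown policy: {on_full}")
        self.max_real_voices = max_real_voices
        self.on_full = on_full
        self.real = deque()   # playing voices, oldest first
        self.virtual = []     # voices that were virtualized

    def request(self, voice_id):
        """Handle a new sound trigger; return "playing" or "ignored"."""
        if len(self.real) >= self.max_real_voices:
            if self.on_full == "ignore_new":
                return "ignored"
            oldest = self.real.popleft()
            if self.on_full == "virtualize_oldest":
                self.virtual.append(oldest)
            # With "stop_oldest" the oldest voice is simply dropped.
        self.real.append(voice_id)
        return "playing"
```

With a budget of two voices, a third request pushes the oldest voice into the virtual list, or is ignored outright under the "ignore_new" policy.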

One might also define a “Max Virtual Voice Count” to limit the amount of updates on virtual voice objects, but this can get a bit complicated.


The first approach to virtualizing or stopping a sound could be based on its audibility. We don’t need to render a sound if it is silent because it’s outside of hearing range (based on 3D distance attenuation) or because a dynamic behaviour is manipulating its volume. This could be implemented with a simple volume threshold defining what makes a sound inaudible.
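A simple threshold check could look something like this. The linear rolloff curve, the default distances and the -60 dB threshold are all assumptions picked for the example; a real engine would use whatever attenuation curve the sound designer authored:

```python
def distance_gain(distance, min_dist=1.0, max_dist=50.0):
    """Linear rolloff: full volume up to min_dist, silent from max_dist on."""
    if distance <= min_dist:
        return 1.0
    if distance >= max_dist:
        return 0.0
    return 1.0 - (distance - min_dist) / (max_dist - min_dist)


def db_to_linear(db):
    """Convert a level in decibels to a linear gain factor."""
    return 10.0 ** (db / 20.0)


def is_audible(volume, distance, threshold_db=-60.0):
    """True if the attenuated gain is above the inaudibility threshold."""
    return volume * distance_gain(distance) > db_to_linear(threshold_db)
```

A sound beyond `max_dist` is never audible here, and neither is a nearby sound whose volume a dynamic behaviour has pulled below the threshold.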

A more sophisticated method could also look at other sounds that are playing and judge whether a sound might be masked by a louder sound. One could take the masking approach a step further and use an FFT to determine masking based not only on volume but also on frequency content.
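The volume-only version of that idea can be sketched very crudely. This is not a psychoacoustic model: it just treats any voice that sits more than a margin below the loudest voice as masked, and the 30 dB margin is an arbitrary assumption for the example:

```python
import math


def masked_voices(gains, margin_db=30.0):
    """Return names of voices more than margin_db below the loudest voice.

    gains maps a voice name to its current linear gain.
    """
    levels_db = {name: 20.0 * math.log10(g) for name, g in gains.items() if g > 0.0}
    if not levels_db:
        return set(gains)  # everything is silent
    loudest_db = max(levels_db.values())
    return {
        name
        for name in gains
        if name not in levels_db or levels_db[name] < loudest_db - margin_db
    }
```

Next to a full-volume explosion, a footstep at a gain of 0.01 (about -40 dB) would be reported as masked, while a rifle at 0.5 (about -6 dB) would not.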


Controlling the makeup of the mix can be achieved with a trivial-to-implement system of priorities, where sound designers assign a priority value to each sound and we render the highest-priority sounds first. If some of those were inaudible or we didn’t use up all the available voices, we’ll render less important voices as well.

This approach can be especially fruitful for esthetic choices and storytelling. We can ensure that the sounds that are most important (music, dialogue, cut scene SFX etc.) can be heard, no matter what the game is doing at that moment. We reserve the highest priority for sounds which are of utmost importance. These voices should never be virtualized.
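Put into code, a priority pass might look like the sketch below. The `Voice` fields, the convention that a higher number means more important, and the tie-break on gain are my own assumptions for the example:

```python
from dataclasses import dataclass


@dataclass
class Voice:
    name: str
    priority: int   # assumption: higher means more important
    gain: float     # current linear gain after attenuation


def select_real_voices(voices, max_real_voices, min_gain=0.001):
    """Split voices into (real, virtual) by priority, skipping inaudible ones."""
    audible = [v for v in voices if v.gain > min_gain]
    # Highest priority first; among equal priorities the louder voice wins.
    ranked = sorted(audible, key=lambda v: (v.priority, v.gain), reverse=True)
    real = ranked[:max_real_voices]
    real_ids = {id(v) for v in real}
    virtual = [v for v in voices if id(v) not in real_ids]
    return real, virtual
```

Note that this sketch does not enforce the "never virtualize" rule on its own: if more top-priority voices are requested than the limit allows, some of them will still end up virtual, so that guarantee has to come from how the priorities and the limit are budgeted.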

Depending on the type of game, we can extend this approach by changing a sound’s priority dynamically. This could be done by checking whether a sound’s position is on screen (for 2D games) or whether a sound is within the listening angle of a cone on the listener object.

In a first person shooter setting with lots of explosions, a cone based approach could prioritize the explosions actually visible to the player over those outside the player’s field of view, comparable to frustum culling in the 3D rendering world.
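A sketch of the cone check, using plain tuples for positions and directions. The 45 degree half-angle and the flat priority boost of 20 are placeholder values I picked for illustration:

```python
import math


def angle_to_source_deg(listener_pos, listener_forward, source_pos):
    """Angle in degrees between the listener's forward vector and the source."""
    to_source = [s - l for s, l in zip(source_pos, listener_pos)]
    dist = math.sqrt(sum(c * c for c in to_source))
    fwd_len = math.sqrt(sum(c * c for c in listener_forward))
    if dist == 0.0 or fwd_len == 0.0:
        return 0.0  # source sits on the listener: treat it as inside the cone
    cos_a = sum(a * b for a, b in zip(to_source, listener_forward)) / (dist * fwd_len)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def dynamic_priority(base_priority, angle_deg, cone_half_angle_deg=45.0, boost=20):
    """Boost the priority of sounds inside the listener cone."""
    if angle_deg <= cone_half_angle_deg:
        return base_priority + boost
    return base_priority
```

An explosion straight ahead gets its priority raised, while one directly behind the listener keeps its base value and is more likely to lose out when voices get tight.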

Cones like these can be used in a variety of ways including adding directionality to sounds (like a speaker in the real world) or using the angle as an input to a filter but that’s a topic for another day.

Voice- & Instance-Limiting

When a sound behaviour has multiple blending sounds it can take up many voices at once. We can define the number of voices that all instances of that sound are allowed to use. Once a sound behaviour exceeds that number, we use the same audibility & priority factors to determine which voices shall be freed. Voice limiting can also apply across all instances of a sound: if a certain sound gets played over and over, the audio engine will only play a certain number of voices related to that sound behaviour.

A similar approach to voice limiting is instance limiting, but instead of limiting how many voices can be associated with a sound behaviour, we limit the number of simultaneous instances of that behaviour. If a sound behaviour uses more than one voice per instance (blending sounds together), we can control how many of these instances we allow to exist at once.
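Instance limiting could be sketched as a small per-behaviour bookkeeping class. `InstanceLimiter` and its `steal_oldest` flag are hypothetical names, and the caller is assumed to stop or virtualize whatever instance gets evicted:

```python
from collections import defaultdict, deque


class InstanceLimiter:
    """Caps the number of simultaneous instances per sound behaviour."""

    def __init__(self, max_instances, steal_oldest=True):
        if max_instances < 1:
            raise ValueError("max_instances must be at least 1")
        self.max_instances = max_instances
        self.steal_oldest = steal_oldest
        self.active = defaultdict(deque)  # behaviour name -> instance ids, oldest first

    def try_start(self, behaviour, instance_id):
        """Return (started, evicted_instance_id_or_None)."""
        queue = self.active[behaviour]
        evicted = None
        if len(queue) >= self.max_instances:
            if not self.steal_oldest:
                return False, None      # ignore the new instance
            evicted = queue.popleft()   # caller stops or virtualizes this one
        queue.append(instance_id)
        return True, evicted

    def finished(self, behaviour, instance_id):
        """Forget an instance that ended on its own."""
        if instance_id in self.active[behaviour]:
            self.active[behaviour].remove(instance_id)
```

With a limit of two, a third gunshot either evicts the first one or is rejected, depending on the flag. Voice limiting would follow the same pattern but count voices rather than instances.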

When applying either of these limiting approaches it is important to define how to proceed once the limit is reached.

  • Should a new playing sound instance be ignored when the limit is reached?
  • Or should the oldest instance be stopped – or virtualized?
  • Do voice- and instance-limiting clash or create undesired behaviour in this combination?
  • How does this system interact with the others?

Decoupling Voices from rendering

To enable proper voice management it is important to decouple the idea of a voice (what should be playing back) from the audio rendering object. In Unity the AudioSource component is responsible for rendering audio and submitting the sample data to the Mixer.

Moona, our tool at A Shell In The Pit Audio, has a pool of AudioSource objects which voices can take from. If the pool is empty, a voice will request to steal an AudioSource from a voice of lower priority. If the pool still can’t provide a source, the voice will go into a virtualizing state which determines whether the voice shall remain virtual or stop.

Voices in this case are C# objects that have little to do with the underlying audio engine. They are simply a control structure that might be controlling an AudioSource. Upon entering a virtualized state, the voice will return the AudioSource to the pool for reuse.
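To illustrate the general pattern (this is a Python sketch, not Moona's actual C# code), here is a pool where plain strings stand in for AudioSource objects. All class and method names are made up for the example:

```python
class PooledVoice:
    """Control structure that may or may not hold a rendering source."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.source = None

    @property
    def is_virtual(self):
        return self.source is None


class SourcePool:
    def __init__(self, size):
        self.free = [f"source_{i}" for i in range(size)]
        self.owners = []  # voices currently holding a source

    def acquire(self, voice):
        """Give the voice a source, stealing from a lower priority if needed."""
        if not self.free:
            victim = min(self.owners, key=lambda v: v.priority, default=None)
            if victim is None or victim.priority >= voice.priority:
                return False  # nothing to steal: the voice stays virtual
            self.release(victim)
        voice.source = self.free.pop()
        self.owners.append(voice)
        return True

    def release(self, voice):
        """Return the voice's source to the pool (the voice becomes virtual)."""
        if voice.source is not None:
            self.free.append(voice.source)
            self.owners.remove(voice)
            voice.source = None
```

In a pool of two sources holding a footstep and an ambience, an incoming dialogue voice steals the footstep's source, while an even less important voice gets nothing and would have to decide whether to stay virtual or stop.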

When aiming for sample-accurate playback upon de-virtualizing a voice, or when dealing with timing-critical material (e.g. music), one might want to pay attention to tracking the virtual voice’s playback position in relation to the pitch the voice is playing at. Additionally, many decoding algorithms cannot easily seek to specific sample positions but will seek to the nearest decode block in the file.
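Both points can be sketched in a few lines. I'm assuming here that pitch is a plain playback-rate multiplier and that the codec uses fixed-size decode blocks; the second function seeks to the start of the block containing the target and reports how many decoded samples to throw away to land exactly on it:

```python
def advance_virtual_position(position, dt_s, sample_rate, pitch, length, looping):
    """Advance a virtual voice's position; return (new_position, finished).

    position and length are in samples; pitch is a playback-rate multiplier.
    """
    position += dt_s * sample_rate * pitch
    if position >= length:
        if looping:
            return position % length, False
        return float(length), True  # a one-shot ran out while virtual
    return position, False


def seek_target(position, block_size):
    """Return (block_start, samples_to_skip) for a block-based codec."""
    sample = int(position)
    block_start = (sample // block_size) * block_size
    return block_start, sample - block_start
```

For example, half a second at 48 kHz and a pitch of 2.0 moves the position by 48000 samples, and with 1024-sample blocks that position maps to block start 47104 with 896 samples to skip.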

Closing Notes

Voice management can use various factors to decide which voices to play and which should be virtual. All of these systems should go hand in hand to achieve the best possible mix and keep hardware requirements down.

Once we have determined which voices we actually want to hear playing, we can save computing resources, clean up the mix and focus the player’s attention on what’s important.

Many thanks to the members of the game audio coding community at audiocoders.slack.com, especially Aaron McLeran and Guy Somberg. Your input was invaluable <3

On Audio in VR

This is a paragraph from my website that I wrote a long time ago. Maybe it sits better here now:

With the revolutionary development of modern Virtual Reality headsets, players have the ability to transcend the screen and step into the game world with never-before-seen fidelity. The ability to invoke the feeling of “presence” is only achievable with the help of high quality spatialized audio. The auditory system communicates with the human brain in a much more direct and unfiltered way.

It is a shockingly different approach to creating experiences. The fact that the listener is physically in the game space (with head and hand tracking in room scale VR) makes being creative in this space extremely rewarding. More on that topic over here.

The problems with spatializing audio and creating a sound field that adheres to the expectations we have in the real world are a fun and exciting challenge.

It’s really interesting to read this after working in VR for a bit over a year. Back when I wrote this I knew audio was important to giving presence, but I had no idea how complex the systems would have to become to really sell the experience.

To see what’s currently hot in VR audio land, check out this talk about the latest audio tech Oculus has been developing (October 2017):

Pretty neat stuff!

What I find so fascinating is that the technology to render audio for this type of game is still in constant flux. Ambisonics has been resurrected from audio obscurity to a bona fide industry tool – yet the standardization is a mess. Every company uses different encoding & decoding formats, there are various channel layouts and a whole range of proprietary formats has entered the market as well.

In the Unity VR projects I am currently working on we are mostly using Steam Audio (rebranded Phonon) – one of the best sounding, freely available spatializers around. Unity is adding native Ambisonics support. Unreal is adding native VR audio features.

I hope this was interesting or – whatever 😛

Practicing Mindfulness

I spend all day in front of computer screens typing. This is not a good thing for my body. What I should do is take regular breaks, take a walk every day, stretch, do yoga, all that stuff. The issue is that I sometimes just enter that zone and dive deep into my work – oblivious of my surroundings and my body.

To combat this “work frenzy” I have started practicing mindfulness and I want to share the first step on this path with you. It’s been working beautifully for me 🙂

I have a Tibetan bell called Tingsha (A tweet with video) which I used to use to meditate. It’s rather heavy so I don’t want to carry it with me everywhere. What I ended up doing was record a single ringing sound (download links below), put it on my phone and let it remind me – once every hour – that an hour has passed. It reminds me to check my mindset, reset, take a deep breath. I take more breaks now and even though they mostly just last seconds or a couple minutes I have realized there has been a major change in my state of mind.

If I am stressed when the bell goes off I will calm myself. If I am sore I will stretch. It has improved my daily life and my all around happiness so here it is:

I recommend you do the same.

I use the app Hourly Reminder for Android (Play Store, FDroid Store) which allows me to set an interval as well as a range of hours through the day during which the sound should play. You can choose any of your phone’s ringtones or any sound file (I use this one DOWNLOAD: Dropbox, GDrive). I’m sure you will be able to find something equivalent for your phone’s operating system.

I hope you find this useful and can use this practice to improve your everyday life :)