Voice management in game audio

In my experience, one of the major themes in audio programming for games has been voice management: the act of managing the set of playing sounds (voices) and deciding which of those voices should be heard and which should not. A voice represents a single sound file being played back.

A couple of terms I would like to clarify before we begin:

  • a playing voice or real voice is a sound that is currently playing back and contributing to the audio output of the game
  • a virtual voice is a sound that should be playing but is not actively contributing to the audio mix. It may continue to advance its playback position until the sound either stops or de-virtualizes, turning it back into a playing voice.
  • a sound behaviour is a set of data – crafted by a sound designer – defining how sounds should be played back. This might include dynamic systems like parameter-driven audio properties, multi-layered blending sounds and more.

Let’s dive into WHY we want to do voice management:

Esthetic of the mix

One of the biggest benefits of employing voice management is that the mix can be cleaned up. In a full post-production workflow the sound effect editors cut sound effects for everything, and it is the mixer's responsibility to adjust volumes, change equalization and mute sounds so that the most important sounds can shine. For storytelling, removing the sounds that do not support the narrative can be extremely beneficial – oftentimes "less is more".

An additional perspective is Walter Murch's rule of two and a half: once more than about two similar sound layers play at the same time, the listener stops tracking them as individual sounds, so stacking further layers adds little.


Performance

Playing a sound and adding its sample data to the audio mix costs CPU time, so by not playing all the sounds a game may request of the audio engine, the audio system can optimize its performance.

Once we decide a voice shall not be heard, we can choose to virtualize it (still keeping track of its playback position) so that it may later de-virtualize, or we can simply stop the voice. Stopping can be a good choice when the sound is not looping and fairly short, but this decision can also be exposed to technical sound designers.
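As a sketch, the virtualize-or-stop decision for an inaudible voice might look like this. The duration cutoff and the designer override parameter are assumptions for illustration, not taken from any particular engine:

```python
MAX_ONESHOT_SECONDS = 2.0   # "fairly short" cutoff, an arbitrary assumption

def on_voice_inaudible(is_looping, duration_seconds, designer_override=None):
    """Return "virtualize" or "stop" for a voice that should not be heard."""
    if designer_override in ("virtualize", "stop"):
        return designer_override      # behaviour exposed to technical sound designers
    if not is_looping and duration_seconds <= MAX_ONESHOT_SECONDS:
        return "stop"                 # short one-shots are cheap to simply drop
    return "virtualize"               # keep tracking the playback position
```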


Performance depends on the hardware we measure on, so we should also take other resources into account, such as memory usage, the availability of SPUs/RSX memory, and hardware decompression.

Decoding audio from a compressed format before playback and mixing is one of the most processing-intensive operations an audio engine performs, so leveraging hardware decoders – sparing the CPU from decoding in software – is extremely beneficial.

On iOS this would mean using MP3, on Xbox One it’s XMA2, PS4 has AT9 and the Switch has hardware support for the Opus codec.

Setting a maximum amount of voices

To begin with, we can set a certain number of voices which we want to be able to play back simultaneously. Let's call that our "Maximum Voice Count" or "real" voice limit. Our system shall never have more playing voices than this number. Should more voices be requested by newly playing sound behaviours, we must either stop or virtualize already playing voices. Alternatively, we might ignore the new sound triggers.

One might also define a "Max Virtual Voice Count" to limit the number of updates on virtual voice objects, but this can get a bit complicated.
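A minimal sketch of enforcing such a cap might look like the following. When the cap is hit, the oldest playing voice is virtualized to make room; oldest-first is just one possible policy, and all names here are illustrative:

```python
from collections import deque

class VoiceCap:
    """Minimal sketch of a "Maximum Voice Count" (oldest-first eviction)."""

    def __init__(self, max_real_voices):
        self.max_real_voices = max_real_voices
        self.real = deque()     # playing voices, oldest on the left
        self.virtual = []       # voices tracked but no longer rendered

    def request_play(self, voice_id):
        if len(self.real) >= self.max_real_voices:
            self.virtual.append(self.real.popleft())  # virtualize, don't stop
        self.real.append(voice_id)
```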


Audibility

The first approach to virtualizing or stopping a sound could be based on its audibility. We don't need to render a sound if it is silent, whether because it is outside of hearing range (based on 3D distance attenuation) or because a dynamic behaviour is manipulating its volume. This could be implemented with a simple volume threshold defining what makes a sound inaudible.
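A volume-threshold check of this kind could be sketched as follows; the threshold value and the linear falloff curve are assumptions for illustration:

```python
AUDIBLE_THRESHOLD = 0.001   # roughly -60 dBFS, an assumed cutoff

def distance_attenuation(distance, min_dist, max_dist):
    """Linear falloff between min_dist and max_dist, silent beyond max_dist."""
    if distance <= min_dist:
        return 1.0
    if distance >= max_dist:
        return 0.0
    return 1.0 - (distance - min_dist) / (max_dist - min_dist)

def is_audible(behaviour_volume, distance, min_dist=1.0, max_dist=50.0):
    """True if the voice's effective volume clears the audibility threshold."""
    gain = behaviour_volume * distance_attenuation(distance, min_dist, max_dist)
    return gain > AUDIBLE_THRESHOLD
```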

A more sophisticated method could also look at the other sounds that are playing and judge whether a sound might be masked by a louder one. One could take the masking approach a step further and use an FFT to determine masking not only based on volume but also based on frequency content.


Priority

Controlling the makeup of the mix can be achieved with a trivial-to-implement system of priorities, where sound designers assign a priority value to each sound and we render the highest-priority sounds first. If some of those turn out to be inaudible, or we haven't used up all the available voices, we render less important voices as well.

This approach can be especially fruitful for esthetic choices and storytelling. We can ensure that the sounds that are most important (music, dialogue, cut scene SFX etc.) can be heard, no matter what the game is doing at that moment. We reserve the highest priority for sounds which are of utmost importance; these voices should never be virtualized.
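The priority scheme, including a reserved never-virtualized tier, might be sketched like this. The reserved value of 100 and the voice tuple layout are arbitrary choices for the example:

```python
RESERVED_PRIORITY = 100   # never virtualized (music, dialogue, cut scene SFX)

def select_real_voices(voices, max_real):
    """voices: list of (priority, audible, name) tuples. Returns the names
    to actually render. Reserved-priority voices are always kept; the rest
    compete on priority, and inaudible voices are skipped entirely."""
    reserved = [v for v in voices if v[0] >= RESERVED_PRIORITY]
    rest = [v for v in voices if v[0] < RESERVED_PRIORITY and v[1]]
    rest.sort(key=lambda v: v[0], reverse=True)   # loudest claim first
    budget = max(0, max_real - len(reserved))
    return [v[2] for v in reserved] + [v[2] for v in rest[:budget]]
```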

Depending on the type of game, we can extend this approach by changing a sound's priority dynamically. This could be done by checking whether a sound's position is on screen (for 2D games) or whether a sound is within the listening angle of a cone on the listener object.

In a first-person shooter setting with lots of explosions, a cone-based approach could prioritize the explosions actually visible to the player over those behind the player's field of view, comparable to frustum culling in the 3D rendering world.
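A cone check like this boils down to comparing the angle between the listener's forward vector and the direction to the sound. Here is one sketch; the half-angle and boost amount are made-up values:

```python
import math

def cone_priority_boost(listener_forward, to_sound, half_angle_deg=60.0, boost=10):
    """Return a priority boost when the sound lies inside the listener's cone.
    Both arguments are (x, y, z) tuples; to_sound points listener -> sound."""
    def normalize(v):
        length = math.sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v) if length > 0 else (0.0, 0.0, 0.0)
    f = normalize(listener_forward)
    d = normalize(to_sound)
    cos_angle = sum(a * b for a, b in zip(f, d))   # dot product
    return boost if cos_angle >= math.cos(math.radians(half_angle_deg)) else 0
```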

Cones like these can be used in a variety of ways including adding directionality to sounds (like a speaker in the real world) or using the angle as an input to a filter but that’s a topic for another day.

Voice- & Instance-Limiting

When a sound behaviour has multiple blending sounds, it can take up many voices at once. We can define a number of voices an instance of that sound is allowed to use. Once a sound behaviour exceeds that amount, we use the same audibility and priority factors to determine which voices shall be freed. Voice limiting can also apply across all instances of a sound: if a certain sound gets played over and over, the audio engine will only play a certain number of voices related to that sound behaviour.

A similar approach to voice limiting is instance limiting, but instead of limiting how many voices can be associated with a sound behaviour, we limit the number of simultaneous instances of that behaviour. If a sound behaviour uses more than one voice per instance (blending sounds together), we can control how many of these sounds we allow to exist at once.

When applying either of these limiting approaches, it is important to define how to proceed once the limit is reached.

  • Should a new playing sound instance be ignored when the limit is reached?
  • Or should the oldest instance be stopped – or virtualized?
  • Do voice- and instance-limiting clash or create undesired behaviour in this combination?
  • How does this system interact with the others?
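One way to answer the first two questions is to make the overflow policy explicit per sound behaviour. A sketch of instance limiting with a configurable policy (the names and the two policies shown are illustrative):

```python
from collections import deque

class InstanceLimiter:
    """Sketch: cap simultaneous instances of one sound behaviour."""

    def __init__(self, max_instances, policy="steal_oldest"):
        self.max_instances = max_instances
        self.policy = policy            # "steal_oldest" or "ignore_new"
        self.instances = deque()        # oldest instance on the left

    def try_play(self, instance_id):
        """Returns the instance that was dropped, or None if none was."""
        if len(self.instances) < self.max_instances:
            self.instances.append(instance_id)
            return None
        if self.policy == "ignore_new":
            return instance_id                   # new trigger is ignored
        stolen = self.instances.popleft()        # stop (or virtualize) oldest
        self.instances.append(instance_id)
        return stolen
```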

Decoupling Voices from rendering

To enable proper voice management it is important to decouple the idea of a voice (what should be playing back) from the audio rendering object. In Unity the AudioSource component is responsible for rendering audio and submitting the sample data to the Mixer.

Moona – our tool at A Shell In The Pit Audio – has a pool of AudioSource objects which voices can take from. If the pool is empty, a voice will request to steal an AudioSource from a voice of lower priority. If the pool still can't provide a source, the voice will go into a virtualizing state which determines whether the voice shall remain virtual or stop.

Voices in this case are C# objects that have little to do with the underlying audio engine. They are simply a control structure that might be controlling an AudioSource. Upon entering a virtualized state, the voice will return the AudioSource to the pool for reuse.
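An engine-agnostic sketch of this pool-and-steal flow might look as follows. In Unity the pooled object would be the AudioSource; none of these names are Moona's actual API:

```python
class Voice:
    """Control structure that may or may not hold a rendering source."""
    def __init__(self, priority):
        self.priority = priority
        self.source = None

class SourcePool:
    def __init__(self, size):
        self.free = [f"source_{i}" for i in range(size)]

    def acquire(self, voice, active_voices):
        """Give the voice a source, stealing from a lower-priority voice
        if the pool is empty. Returns False if the voice must virtualize."""
        if self.free:
            voice.source = self.free.pop()
            return True
        victim = min(active_voices, key=lambda v: v.priority, default=None)
        if victim is not None and victim.priority < voice.priority:
            voice.source, victim.source = victim.source, None   # steal
            return True
        return False    # caller enters the virtualizing state
```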

When aiming for sample-accurate playback upon de-virtualizing a voice, or when dealing with timing-critical material (e.g. music), one might want to track the virtual voice's playback position in relation to the pitch the voice is playing at. Additionally, many decoding algorithms cannot easily seek to specific sample positions and will instead seek to the nearest decode block in the file.
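A sketch of the bookkeeping involved, assuming a fixed decode block size (the 1024-sample block is an assumption; real codecs vary):

```python
DECODE_BLOCK_SAMPLES = 1024   # assumed codec block size

def advance_virtual(position_samples, dt_seconds, pitch, sample_rate,
                    total_samples, looping):
    """Advance a virtual voice's playback position; pitch scales speed."""
    position_samples += dt_seconds * pitch * sample_rate
    if looping:
        position_samples %= total_samples
    return min(position_samples, total_samples)   # clamp one-shots at the end

def devirtualize_seek(position_samples):
    """Snap to the nearest decode-block start the codec can actually reach."""
    block = round(position_samples / DECODE_BLOCK_SAMPLES)
    return block * DECODE_BLOCK_SAMPLES
```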

Closing Notes

Voice management can use various factors to decide which voices to play and which should be virtual. All of these systems should go hand in hand to achieve the best possible mix and keep hardware requirements down.

Once we have determined which voices we actually want to hear, we can save computing resources, clean up the mix and focus the player's attention on what's important.

Many thanks to the members of the game audio coding community at audiocoders.slack.com, especially Aaron McLeran and Guy Somberg. Your input was invaluable <3