Planet Gamedev

Game AI for Developers

BROADCAST: Applying MCTS to a Real-Time Combat Scenario (August 22nd)

by Alex J. Champandard at August 21, 2014 04:00 AM

This upcoming broadcast on Friday, August 22nd at 18:00 UTC will take place online within your browser using streaming audio/video:

“This isn't a typical broadcast. This time, we'll be streaming our attempts to apply MCTS to a real-time Capture The Flag-style scenario! It's a notoriously difficult problem that hasn't received much attention in either research or the games community, but using the tools we've built as part of The AI Sandbox, we'll learn something new at least. Join us to see what happens...”

To subscribe for email reminders and check the exact time in your current timezone, visit this broadcast's page on

iPhone Development Tutorials and Programming Tips

Guide: Everything You Need To Know About Swift Arrays

by Johann at August 20, 2014 06:26 PM

Post Category    Featured iPhone Development Resources,iOS Development Tutorials,Swift

I’ve mentioned a number of Swift resources, and yesterday a tutorial covering Swift basics through the creation of a Tetris-style game.

Here’s a nice guide to Swift arrays from Coding Explorer that explains some fundamental differences between NSArray and Swift arrays, includes a chart comparing the two, and lists the different methods available for working with Swift arrays.

There’s also a nice step-by-step guide working through some of the array basics:

- Creating arrays
- Accessing array values
- Adding to arrays
- Removing from arrays
- Creating new arrays based on previous ones

You can find the tutorial on the Coding Explorer blog.

A nicely laid out guide on everything you’d want to know about Swift arrays.


Original article: Guide: Everything You Need To Know About Swift Arrays

©2014 iOS App Dev Libraries, Controls, Tutorials, Examples and Tools. All Rights Reserved.

Game Design Aspect of the Month

Game Art Explained

by (Sande Chen) at August 20, 2014 05:35 PM

In this article, aspiring game designer Gabby Taylor explains the role game art plays in a game's overall design.

Graphics are usually the center of the console versus PC argument (next to overall cost). We argue about how noticeable pixels should be, acceptable frame rates, and sometimes even perspective. And nearly everyone comes up with one reason or another why they’re right, but so few people seem to actually understand what role game art plays.

Our eyes play a huge role in how we explore the world around us, but it’s how the world is perceived that gives this visual information its potency. This psychological phenomenon is what game art plays on. It’s what causes you to beeline for the one lit area in an otherwise blacked out room. It’s what makes you feel confused or alarmed when things are blurred and red, and what makes the scene with a faintly violet glow seem enchanting. Horror games also use this to clash against our idea of what should be, thus making it creepy. There are a thousand different ways game art assists the design, for those who know how it works.

Game art can be broken down into two basic building blocks: color and design. Each area in a game is carefully crafted using these two principles in order to achieve the feeling the game design needs to successfully immerse the player in the gameplay and story. Think back to Batman: Arkham Asylum. If the place were brighter, or painted in pastel colors, do you think it would have felt the same? No, of course not. It sounds simple, and in theory it is, but applying these principles effectively requires lots of practice and expertise. Let’s clear the air of all the nonsense arguments, and briefly examine what each game artist has to know.

Colors typically have a meaning attached to them, which can vary by culture. In the United States, for example, white stands for innocence and purity. They can also have physical effects on us, like increased heart rate when we see the color red. Colors affect everything from our appetites, to how heavy something appears, to what emotion we feel when we look at it. The use of color is further broken down into values, hues, and contrast. Hues are the colors themselves, whereas the value is how light or dark that color is. For example, pink is a lighter value of red, and navy is a darker value of blue. A scene with all light values is called high key, and a scene with all dark values is, you guessed it, low key. Colors complement and contrast one another, creating different effects. Using yellow and red, for instance, creates a much different scene than using green and blue, or purple and brown, or even shades of grey.

Design is more about how the scene and the colors in it are arranged. This is not to be confused with overall game design, though; this is the artistic design of a scene or area. When designing the scene, you’re focusing on the lines and objects to which the colors are applied. You can think of design as the skeleton on which color is the fleshy bits. Going back to the Batman: Arkham Asylum reference, the whole game would have a different feel to it if Calendar Man didn’t cover the walls of his cell with calendar pages, or if Poison Ivy didn’t have plants everywhere, or if the whole place looked like it had been meticulously cleaned.

These two elements come together to create game art, which is necessary for the game design to convey the intended message and emotions. That’s all it is. There is no superior art style, no acceptable frame rate ceiling. It’s the emotions, the perception, that are important to the game’s success at allowing the player to delve into the world set before them.

Gabby Taylor is an aspiring game designer and head of GreyBox Studio. When not making design documents, she contemplates going outside, and sometimes even takes a few steps when feeling particularly frisky.

Geeks3D Forums

10 Ways CUDA 6.5 Improves Performance and Productivity

August 20, 2014 03:52 PM

Today we’re excited to announce the release of the CUDA Toolkit version 6.5.

CUDA 6.5 adds a number of features and improvements to the CUDA platf...


Slides from 2014 SIGGRAPH Khronos Group BOF presentations now online

August 20, 2014 12:45 PM

The Khronos Group has posted some of the SIGGRAPH 2014 BOF presentation slides online. More to come!

Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP

by dom at August 20, 2014 08:40 AM


In the course of less than a decade, Graphics Processing Units (GPUs) have evolved from narrowly scoped application specific accelerators to general-purpose parallel machines capable of accommodating an ever-growing set of algorithms. At the same time, programming GPUs appears to have become trapped around an attractor characterised by ad-hoc practices, non-portable implementations and inexact, uninformative performance reporting. The purpose of this paper is two-fold, on one hand pursuing an in-depth look at GPU hardware and its characteristics, and on the other demonstrating that portable, generic, mathematically grounded programming of these machines is possible and desirable. An agent-based meta-heuristic, the Max-Min Ant System (MMAS), provides the context. The major contributions brought about by this article are the following: (1) an optimal, portable, generic-algorithm based MMAS implementation is derived; (2) an in-depth analysis of AMD’s Graphics Core Next (GCN) GPU and the C++ AMP programming model is supplied; (3) a more robust approach to performance reporting is presented; (4) novel techniques for raising the abstraction level without sacrificing performance are employed. This represents the first implementation of an algorithm from the Ant Colony Optimisation (ACO) family using C++ AMP, whilst at the same time being one of the first uses of the latter programming environment.

(A. Voicu: “Accelerated Combinatorial Optimization using Graphics Processing Units and C++ AMP ”. International Journal of Computer Applications 100(6):21-30, August 2014. [DOI])

CUDA Course Sept 23 – 26, 2014, Frankfurt

by dom at August 20, 2014 08:38 AM

This hands-on four day course teaches how to write and optimize applications that fully leverage the multi-core processing capabilities of the GPU. More details and registration:

Timothy Lottes

Next Generation OpenGL Initiative Details from Khronos BOF

by Timothy Lottes at August 20, 2014 08:00 AM

OpenGL Ecosystem BOF 2014


Cross vendor project between OpenGL and OpenGL ES working groups:
- Chair = Tom Olson (ARM)
- IL Group Chair = Bill Licea-Kane (Qualcomm)
- API Spec Editors = Graham Sellers (AMD) and Jeff Bolz (NVIDIA)

Committed to adopting a portable intermediate language for shaders.
Compatibility break from existing OpenGL.
Starting from first principles.
Multi-thread friendly.
Greatly reduced CPU overhead.
Full support for tiled and direct renderers.
Explicit control: application tells driver what it wants.

iPhone Development Tutorials and Programming Tips

Swift Based Network Abstraction Layer For Simplifying Work With Complex APIs

by Johann at August 20, 2014 06:13 AM

Post Category    Featured iPhone Development Resources,Swift

Earlier this month I mentioned the Swift-based networking library Alamofire, created to be AFNetworking reimagined for the conventions of the Swift language.

Here’s a library from Ash Furrow created in Swift called Moya that utilizes Alamofire, providing a network abstraction layer that makes it easier to work with more complex APIs, especially those where an action can lead to different endpoints.

As the readme states, some of the requirements met by Moya include:

- Treat test stubs as first-class citizens.
- Only allow endpoints clearly defined through Moya to be accessed through Moya, enforced by the compiler.
- Allow iterating through all potential API requests at runtime for API sanity checks.
- Keep track of inflight requests and don’t support duplicates.

And it takes things a step further:

We actually go a bit farther than other network libraries by abstracting the API endpoints away, too. Instead of dealing with endpoints (URLS) directly, we use targets, which represent actions to be taken on the API. This is beneficial if one action can hit different endpoints; for example, if you want to GET a user profile, but the endpoint differs depending on if that user is a friend or not.

You can find Moya on Github here.

An interesting library for those working with complicated web APIs.


Timothy Lottes


by Timothy Lottes at August 20, 2014 12:17 AM

Link to the Shadertoy example.

Growing up in the era of the CRT "CGA" Arcade Monitor was just awesome. Roughly 320x240 or lower resolution at 60 Hz with a low persistence display. Mix that with stunning pixel art. One of the core reasons I got into the graphics industry.

Built the above Shadertoy example to show what I personally like when attempting to simulate that old look and feel on modern LCD displays. The human mind is exceptionally good at filling in hidden visual information. The dark gaps between scanlines enable the mind to reconstruct a better image than what is actually there. The rightmost panel adds a quick attempt at a shadow mask. It is nearly impossible to do a good job simulating that because the LCD cannot get bright enough. The compromise in the shader example is to rotate the mask 90 degrees to reduce chromatic aberration. The mask could definitely be improved, but this is a great place to start...

Feel free to use/modify the shader. Hopefully I'll get lucky and have the option to turn on the vintage scanline look when I play those soon to be released games with awesome pixel art!

iPhone Development Tutorials and Programming Tips

Beginners Swift Tutorial Showing How To Create A Tetris Style Game

by Johann at August 19, 2014 09:41 PM

Post Category    Featured iPhone Development Resources,Game Programming,iOS Development Tutorials,Objective-C,Swift

Last month I mentioned a nice beginners tutorial showing how to create a Candy Crush style game in Swift.

Here’s another tutorial providing a fun way to increase your Swift knowledge in the building of a Tetris style game from Bloc.

The tutorial utilizes Swift and SpriteKit, explaining every step along the way, from the setup of your project and loading of assets through to setting up the actual game and some special effects.  Everything is nicely explained in a very clean format that allows you to jump between each step along the way.

Here’s an image from the tutorial showing the tutorial in action:

You can find the Swiftris tutorial over on the Bloc blog.

A nice tutorial for learning some Swift and Sprite Kit basics.


Game From Scratch

Adventures in Phaser with TypeScript–Audio

at August 19, 2014 08:13 PM


Now that we’ve covered input and graphics, it’s time we move on to audio.  Here is a bit of a warning upfront…


HTML5 audio sucks.


Yeah, that’s about it in a nutshell.  This is one of those areas where games in HTML5 really suffer, and sadly it doesn’t seem to be going away any time soon.  There are two major things to be aware of right up front.


First, audio file formats.  Different browsers support different file formats, but sadly not all the same ones.  You can read a pretty good chart on the topic right here.  In a nutshell, some support ogg, some support mp3, some support wav and some support m4a.  To make things even more fun, mp3 is a license-encumbered format that could result in heavy licensing fees if your game is successful.  In reality, it’s never really enforced, but just the possibility should be enough to make you wary.  Fortunately, Phaser provides a way to deal with all of this that you will see shortly.


Second, well, Safari on iOS really kinda stinks at audio for a couple of glaring reasons.  First and foremost, you can only play one sound or video at a time…  Yeah, really.  If you play a second sound, the first immediately stops.  Kinda. Big. Deal.  You can however work around this using an audio sprite, which is basically a bunch of audio files smashed together into a single file.  A second major failing is that you can’t play audio on page load in Safari.  This means if you want to load your game up to a title screen with some music playing… you can’t.  Audio can only be started in response to a user’s touch.  Yeah, this sucks too.  This one you unfortunately cannot work around, short of using a wrapper like CocoonJS to “natively” deploy your HTML5 game.


Through this all, you are probably going to need a tool to either merge your audio or convert it to a variety of formats.  Fortunately there is an excellent free audio editor named Audacity that makes the process fairly painless.  In order to follow this tutorial, you are going to have to get an audio file, save it in mp3 and ogg format, and add it to your project.


OK, enough about the HTML5 audio warts, let’s get to some code! 


/// <reference path="phaser.d.ts"/>
class SimpleGame {
    game: Phaser.Game;
    sound: Phaser.Sound;

    constructor() {
        this.game = new Phaser.Game(640, 480, Phaser.AUTO, 'content',
            { create: this.create, preload: this.preload });
    }

    preload() {
        // Provide the same song in every format you support; Phaser picks the best one
        this.game.load.audio("GameMusic", ["song.mp3", "song.ogg"]);
    }

    create() {
        this.sound = this.game.add.audio('GameMusic');
        this.sound.play();
    }
}

window.onload = () => {
    var game = new SimpleGame();
};


It’s important to note that this example won’t actually work in Safari on iOS due to the limitation of only being able to play audio in response to a user touch.  You would have to change it so the play() call is done in a touch handler.


Here what you are doing is preloading audio using this.game.load.audio().  The parameters are the key used to refer to the audio file, and a list of files to load.  This is where the magic happens: you should provide all supported file formats, and Phaser will serve the one that performs best on the browser the user is using.  M4A files perform the best on iOS, and don’t have the legal encumbrance that MP3 does, so OGG and M4A would probably be all you need, but I may be wrong here.  If in doubt, provide all 3.


Now let’s look at an example that allows you to control playback of the sound file:


/// <reference path="phaser.d.ts"/>
class SimpleGame {
    game: Phaser.Game;
    sound: Phaser.Sound;
    playButton: Phaser.Button;
    pauseButton: Phaser.Button;
    stopButton: Phaser.Button;
    volUpButton: Phaser.Button;
    volDownButton: Phaser.Button;
    muteButton: Phaser.Button;

    constructor() {
        this.game = new Phaser.Game(640, 480, Phaser.AUTO, 'content',
            { create: this.create, preload: this.preload, render: this.render });
    }

    preload() {
        this.game.load.audio("GameMusic", ["song.mp3", "song.ogg"]);
        this.game.load.image("button", "button.png", false);
    }

    render() {
        this.game.debug.soundInfo(this.sound, 0, 100);
    }

    create() {
        this.sound = this.game.add.audio('GameMusic');

        // Set up sound event handlers on the sound object
        this.sound.onPlay.add(() => { alert("Played"); });
        this.sound.onPause.add(() => { alert("Paused"); });

        // Play Button
        this.playButton = this.game.add.button(0, 0, "button", () => {
            if (this.sound.currentTime > 0)
                this.sound.resume();
            else
                this.sound.play();
        }, this);
        this.playButton.addChild(new Phaser.Text(this.game, 17, 18, "Play", { fill: '#ff0000' }));

        // Pause Button
        this.pauseButton = this.game.add.button(100, 0, "button", () => {
            this.sound.pause();
        }, this);
        this.pauseButton.addChild(new Phaser.Text(this.game, 12, 18, "Pause", { fill: '#ff0000' }));

        // Stop Button
        this.stopButton = this.game.add.button(200, 0, "button", () => {
            if (this.sound.isPlaying) {
                this.sound.stop();
                this.sound.currentTime = 0;
            }
        }, this);
        this.stopButton.addChild(new Phaser.Text(this.game, 17, 18, "Stop", { fill: '#ff0000' }));

        // Volume Up Button
        this.volUpButton = this.game.add.button(300, 0, "button", () => {
            this.sound.volume += 0.1;
        }, this);
        this.volUpButton.addChild(new Phaser.Text(this.game, 17, 18, "Vol +", { fill: '#ff0000' }));

        // Volume Down Button
        this.volDownButton = this.game.add.button(400, 0, "button", () => {
            this.sound.volume -= 0.1;
        }, this);
        this.volDownButton.addChild(new Phaser.Text(this.game, 17, 18, "Vol -", { fill: '#ff0000' }));

        // Mute Button
        this.muteButton = this.game.add.button(500, 0, "button", () => {
            // Global mute!  Use this.sound.mute to mute a single sound
            this.game.sound.mute = !this.game.sound.mute;
        }, this);
        this.muteButton.addChild(new Phaser.Text(this.game, 17, 18, "Mute", { fill: '#ff0000' }));
    }
}

window.onload = () => {
    var game = new SimpleGame();
};


Here is the resulting app:



It’s a bit of a monster, but most of the code is actually just wiring up the buttons.  There are a few important takeaway points here.  First, you can work with the local sound object, or with all sounds globally using game.sound.  Second, each action on the sound file (play, resume, stop, etc.) has an appropriate event handler you can implement; I only did a couple in this example.  Finally, volume is a value from 0 to 1.  You can go above or below this range, but it won’t actually do anything.  All said, once you get over the file format issues, playing audio is relatively straightforward.


Finally, let’s touch on audiosprites again for a second.  As mentioned earlier, an audiosprite is actually a bunch of audio files smashed together into a single file.


Consider the following sound effects: a gun cocking and a gun shot.  Load your first sound effect into Audacity, like so:



Take note of the end of the file, in this case 0.785 seconds.


Now, select File->Import-> Audio



Import your additional sound effects, in this case I’m opening the gunshot file.  You should now have two tracks in Audacity, like so:



Now select the bottom track by double clicking.  It should highlight in gray, like so:


Now copy the selection ( CTRL + C ), click at the end of the first track, and paste ( CTRL + V, or Edit->Paste ).  You should now have a single track that looks like this:



Now save this file ( might as well create OGG, MP3 and M4A versions while you are at it ).


Now let’s take a look at how to use it in code.


/// <reference path="phaser.d.ts"/>
class SimpleGame {
    game: Phaser.Game;
    sound: Phaser.Sound;

    constructor() {
        this.game = new Phaser.Game(640, 480, Phaser.AUTO, 'content',
            { create: this.create, preload: this.preload, render: this.render });
    }

    preload() {
        this.game.load.audio("sfx", ["GunSounds.ogg", "GunSounds.wav"]);
    }

    render() {
        this.game.debug.soundInfo(this.sound, 0, 100);
    }

    create() {
        this.sound = this.game.add.audio('sfx');
        // Mark each effect's location within the combined audio file
        this.sound.addMarker("gunCock", 0, 0.785);
        this.sound.addMarker("gunShoot", 0.786, 1.49);
        this.sound.play("gunCock");
        // When the first marker finishes, play the second
        this.sound.onMarkerComplete.add(() => {
            this.sound.play("gunShoot");
        });
    }
}

window.onload = () => {
    var game = new SimpleGame();
};


This plays the first sound, then immediately once it is complete, plays the other.  As you can see, you do so by setting markers within the sound file.  0.785 is the length of the first sound effect, while 1.49 is the length of the entire file.  You simply name each marker, then give its start and end locations within the file.  You can get these values using an audio editor such as Audacity.  This allows you to load all of your similar sound effects in a single, quicker-to-load file, and should help you work around the single-sound-stream limitation of iOS.


GLEW 1.11.0 adds support for OpenGL 4.5

August 19, 2014 05:36 PM

GLEW is a cross-platform (Windows, Linux, Mac, Unix) open-source C/C++ extension loading library for OpenGL. GLEW 1.11.0 adds support for OpenGL 4.5 and various AMD, EXT, INTEL and NV extensions.

NeoAxis 3D Engine 2.2 Released

August 19, 2014 05:26 PM

NeoAxis Group Ltd has released a new version of NeoAxis 3D Engine, the free OpenGL-based universal environment for 3D project development. Version 2.2 includes updated character physics, a profiling tool in the map editor, WPF widget performance improvements, PhysX improvements and bug fixes.

Procedural World

Dynamic Fluids

by Miguel Cepero at August 19, 2014 03:32 PM

You may have caught some of this if you attended SOE Live or if you watched some of the audience videos from the panels.

We have been working for a while on a cellular automata system. The following video shows some early results:

As you can see, the automaton is producing something that resembles flowing water. I guess with different rules we could represent other phenomena, like smoke, clouds, slime and lava. Water in particular is a tough system because it likes to flow pretty fast.

This system is already able to schedule simulation so areas near the player (or players) get more simulation ticks. Simulation frequency is also affected by how much the system "wants" to change in a given area. For instance a stable puddle of water will get far less ticks than water falling from a cliff.

This makes the solution manageable from a computational point of view. This approach does create its own set of problems. For instance, if nobody is there to experience the water, it does not really change. As someone approaches, the simulation frequency will increase gradually. This is enough to deceive players, I hope; the player influence area is large enough to mask the fact that distant water does not update as frequently.
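The scheduling idea described above could be sketched roughly as follows in C++. To be clear, this is my own illustrative guess at the kind of function involved; the name, constants and falloff curve are assumptions, not the engine's actual code.

```cpp
#include <algorithm>

// Hypothetical sketch of distance- and activity-based tick scheduling.
// All names and constants here are illustrative guesses, not the engine's code.
int TicksPerSecond(float distToPlayer, float instability) {
    const int kMaxTicks = 30;  // cap for cells right next to a player
    // Linear falloff: cells beyond 256 units get only the minimum rate.
    float distFactor = std::max(0.0f, 1.0f - distToPlayer / 256.0f);
    // Stable areas (a still puddle) tick rarely; active ones (water falling
    // from a cliff) are boosted toward the full rate.
    float activity = 0.1f + 0.9f * std::min(1.0f, instability);
    // Round to nearest, but never drop below one tick per second.
    return std::max(1, static_cast<int>(kMaxTicks * distFactor * activity + 0.5f));
}
```

With a curve like this, a waterfall next to the player ticks at the full rate, a stable puddle at a tenth of it, and far-away water barely at all, which matches the trade-off the post describes.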

There are a few significant challenges ahead. We need to make sure this scales when we have hundreds of players. Also not everything in a game world is made out of voxels, eventually we would need to make props block and modify the flow of water.

iPhone Development Tutorials and Programming Tips

Example: Source Code Release To Popular iOS RTS Game Warfare Incorporated

by Johann at August 19, 2014 01:36 PM

Post Category    Featured iPhone Development Resources,iOS Development Source Code Examples,Objective-C

As you may know I have been maintaining lists of open source iOS apps, and open source iOS games for about 5 years.

Here’s an iOS source code release of the fantastic RTS game Warfare Incorporated, called Hostile Takeover, from Spiffcode.

As the readme states:

Hostile Takeover is the open source release of the wildly popular mobile Real Time Strategy game Warfare Incorporated. Warfare Incorporated’s developers, grateful for all the contributions of the open source community, are delighted to give something back.

As far as I can tell this is the source code to Warfare Incorporated modified to use the included server rather than the Warfare Incorporated servers, and is BSD licensed.

A screenshot showing Warfare Incorporated in action:

Warfare Incorporated

You can find the Hostile Takeover source code on Github here.

You can find Warfare Incorporated on the App Store.

A great example for those looking to build a real-time strategy game.

Thanks to Nathan for the submission.


Open Source iOS Component For Customizable Action Sheets Without UIAlertController Limitations

by Johann at August 19, 2014 06:15 AM

Post Category    Featured iPhone Development Resources

I’ve mentioned a number of components that simplify the syntax of creating an action sheet, and with iOS 8 Apple has added UIAlertController, which greatly simplifies the syntax when compared to UIActionSheet and UIAlertView, but there are some added limitations.

Here’s a custom, up-to-date action sheet component from Jonas Gessner that allows you to create action sheets with a number of extra features well beyond the standard UIAlertController, while avoiding some of UIAlertController’s limitations.

Some of the features added by JGActionSheet include:

- Unlimited content (thanks to built-in scrolling)
- Multiple sections
- Support for custom views within your action sheets
- The ability to position action sheets precisely on iPad, with arrows pointing to a view like a popover

Here are a couple of images from the readme showing JGActionSheet in action:


You can find JGActionSheet on Github here.

A great highly customizable popover component.


iPhone Development Tutorials and Programming Tips

Open Source Library Providing A Set Of Flat Buttons With Fantastic Transition Animations

by Johann at August 18, 2014 06:00 PM

Post Category    Featured iPhone Development Resources,iOS UI Controls,iPad,iPhone,Objective-C

Earlier this year I mentioned a nice guide on how to create buttons with great-looking transitions, and a set of toolbar buttons with nice animations between them.

Here’s a library providing a nice set of flat buttons with great animations between them called VBPopFlatButton submitted by Victor Baro.

In total the library includes 9 commonly used buttons: forward, back, menu, download, share, add, minus, close, and an up arrow.  The buttons come in two different styles – with or without a rounded background.

You can customize the background color (when a rounded background is used), the line color, the thickness of each line button, and the size of the buttons.

Here’s an animation from the readme going through the 9 different button styles available:


You can find VBPopFlatButton on Github here.

A great set of animated buttons.


The ryg blog


by fgiesen at August 18, 2014 08:58 AM

Last time, we covered the basics of how cache coherency works. Today, let’s talk about some of the primitives necessary to build useful systems on top of a coherent cache, and how they work.

Atomicity and atomic operations

A crucial building block for all of this are atomic operations. This has nothing to do with nuclear physics and everything to do with the root of the word atom, the Ancient Greek “ἄτομος” (atomos, “indivisible”). An atomic operation is one that cannot be divided into any smaller parts, or at least appears to be indivisible for the purposes of other cores in the system. To see why this matters, consider what happens when two cores both try to increment a counter at almost the same time, running the equivalent of the C statement counter++;:

Cycle #  Core 1                 Core 2
   0     reg = load(&counter);
   1     reg = reg + 1;         reg = load(&counter);
   2     store(&counter, reg);  reg = reg + 1;
   3                            store(&counter, reg);

In compiled code, this single statement turns into a load operation, a register increment, and finally a store operation (here written in C-esque pseudocode). These three steps are distinct and will execute in sequence (note that on the micro-architectural level, this is true on x86 as well, even though the instruction set architecture actually features a read-modify-write add [memory], value instruction). And because of this splitting into multiple cycles, it’s possible for Core 2 to read counter after Core 1 has read it (and started incrementing it), but before it has written the result back. The end result is that, even though both cores ran code to increment the counter, the final value of the counter is only incremented by 1; one of the two increment operations was “lost”.

This problem is exactly what atomic operations are there to prevent; if we use an atomic increment (or more generally, atomic add) instead of a regular increment, the active core will make sure that the three steps above (load, add, store) appear to happen as one operation, hence atomic; no other core is allowed to “peek” while the increment is going on.

How atomics are implemented

Now the question is, how is this done? Conceptually, the easiest way to do this is using a locking mechanism: only one core is allowed to execute an atomic operation at any point in time. The core enters the lock before it starts the operation, and leaves it once the operation is complete. This is what the x86 LOCK prefix originally used to mean (approximately; I’m simplifying here). Here, the lock enter operation consists of a message on the bus that says “okay, everyone, back off the bus for a bit” (for our purposes, this means “stop doing memory operations”). Then the core that sent that message needs to wait for all other cores to finish memory operations they’re in the middle of doing, after which they will acknowledge the lock. Only after every other core has acknowledged, the core attempting the locked operation can proceed. Finally, once the lock is released, it again needs to send a message to everyone on the bus saying that “all clear, you can resume issuing requests on the bus now”.

This works. It is also incredibly slow. x86 CPUs still support this (or an equivalent), but only as an absolute emergency, when-all-else-fails fallback path; they need a fallback because the x86 ISA permits very dubious constructs like unaligned atomic operations that cross multiple cache lines, for backwards compatibility. Other architectures generally just don’t allow atomic operations on values that aren’t naturally aligned, nor on values that are “too big”. These constraints guarantee that a single atomic operation always takes place within a single cache line. And once we have that, we’re in good shape: as we saw last time when discussing cache coherency protocols, inter-core communication synchronizes memory at cache line granularity anyway, so in principle we can do complex modifications to a single cache line and then publish all changes at once by pushing the new cache line. Moreover, the MESI state machine features two states (M and E, “modified” and “exclusive”) that guarantee exclusive ownership of a cache line by one core – while a cache line is exclusively owned, no other core can “peek”. We can use that as substitute for our locking protocol.

So here’s the punchline: in a MESI (or derived) system, all we need to do to make sure an operation touching a single cache line is atomic is to a) make sure we issue the right memory barriers so memory operations are correctly ordered with reference to the surrounding code (see previous post), b) acquire exclusive ownership of the cache line before we read anything, c) only write back the results if we had exclusive ownership for the entire duration of the atomic operation. This guarantees that no other core saw any half-finished data. There’s multiple ways to accomplish c). For example, we can build hardware to make a limited set of atomic operations complete in a single bus clock cycle; if we have exclusive ownership of our cache line by the start of a cycle, we can have our modified data by the end of it. Since a cache line can’t possibly “change hands” within a cycle, this is fast enough. Depending on the bus protocol, we might also start playing games where we respond to coherency messages immediately, but might send the data a bit late if we’re in the middle of an atomic operation. Finally, we can just decide not to play games with timing at all; instead we implement steps b) and c) directly: any cache line that’s being used for an atomic operation is “monitored”, and if some other core looks at our cache line before the atomic operation is complete, we need to start over. This is what leads to the load-link/store-conditional (LL/SC) operations present in most RISC CPUs.
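The retry-on-interference behavior of LL/SC is exactly what a C++ `compare_exchange_weak` loop exposes at the language level: on many RISC targets it compiles down to an LL/SC pair, and a "spurious failure" corresponds to losing exclusive ownership of the cache line mid-operation. Here's a sketch (the function name is my own) of an atomic "store maximum" built that way:

```cpp
#include <atomic>

// Atomic "store maximum": keep retrying until our update goes through
// without another core having touched the value in between. On RISC
// targets, compare_exchange_weak typically compiles to an LL/SC pair;
// a failure means another core looked at (or wrote) our cache line
// before the store-conditional, so we start over - step c) above.
void atomic_store_max(std::atomic<int>& target, int value) {
    int observed = target.load(std::memory_order_relaxed);
    while (observed < value &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // On failure, observed is reloaded with the current value
        // and the loop decides whether a retry is still needed.
    }
}
```

Note that the loop re-checks `observed < value` on every retry: if some other core already stored a larger value, there's nothing left to do and we exit without writing.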

And by the way, on the bus side (and hence to other cores), properly aligned atomic operations don’t look any different than normal memory updates. Again, all of the processing is internal to the core that does it; other cores neither know nor care whether memory was updated from an atomic compare-and-swap operation bracketed by memory barriers or a relaxed store.

This all sounds nice and simple, and conceptually it is, but the devil’s in the details. The bad news is that if you’re a CPU architect, every single detail of this process is of crucial importance; your internal implementation of memory operations needs to avoid starvation (every core that wants to gain exclusive access to a cache line should be able to do so, eventually, no matter what the other cores are doing), and make sure that it’s possible to implement certain fundamental atomic operations without deadlocks or livelocks. It sounds obvious, but these guarantees are not automatic for low-level primitives like atomic operations or LL/SC. This stuff is hard to get right, and CPUs need to have an implementation that’s not just correct, it also needs to be fast for “typical cases” (whatever they are). The good news is that if you’re not working at a company designing CPUs, none of this is your problem, and you can rest assured that somebody else has thought this through, backed up by an army of validation engineers trying very hard to find test cases that break it.

The cost of memory operations

Back on the SW side, let’s assume we’re on a typical CPU architecture and are running code on multiple cores. What’s the cost of a memory operation in that environment? It breaks down into several components:

Execution. Executing a memory operation isn’t free. Suppose for now that only one core is active and running single-threaded code; even then, memory access is still complicated. Programs deal with virtual addresses, but coherent caches and memory buses normally deal exclusively in physical memory addresses. So every memory operation first starts with a virtual-to-physical address conversion (these translations are themselves cached in what’s commonly called the Translation Lookaside Buffer, or TLB). If you’re unlucky, that virtual address isn’t currently mapped to physical memory and needs to be brought in from storage; whenever this happens, the OS is going to schedule another thread on the active core for a while, since IO takes a long time in processor terms. But let’s assume that doesn’t happen here.

With the physical address known, the operation starts to go through the memory hierarchy. For example, to complete a memory load, the relevant data normally needs to be brought into the L1 cache first. If it’s not there yet, this can be a multi-step process that – in the worst case – involves a real memory access and then populating all the intermediate cache levels for the relevant cache line. With poor (i.e. not nicely localized) memory access patterns, waiting for cache levels to get populated is one of the main ways a CPU core spends its time. But for now, let’s assume that doesn’t happen (too often) either.

So how fast can we run memory operations if everything goes well? Pretty fast, it turns out. Usually at least one memory operation (load or store) completes per clock cycle, often more than one. Reasonably cache-friendly code will complete billions of memory operations per second on a single 3GHz core.

Memory barriers and atomic read-modify-write operations. For the next step, let’s suppose we’re running code that’s intended for multi-threaded operation, but we’re still doing so on only a single core. As a result, we will now see memory barriers and atomic operations, but no actual interference from another core; let’s just suppose that all relevant cache lines are already exclusively held by our own core. In that situation, how expensive is, say, updating a reference count using an atomic integer addition?

Well, that really depends on the CPU core. In general, micro-architectures with aggressive reordering of memory operations have more expensive memory barriers and atomic operations than those with only slight reordering or in-order execution of memory operations. For example, incrementing a reference count using LOCK INC [mem] on an Intel Atom core (an in-order design) has essentially the same cost as a regular INC [mem] instruction, and somewhat more complicated atomic operations like exchange or exchange-add end up costing about 2x to 3x as much as a “vanilla” memory read-modify-write instruction. By contrast, on Intel and AMD’s out-of-order x86 desktop parts, an atomic increment has about 10x-25x the cost of the non-atomic version; that’s the cost of ensuring proper memory ordering. And again, to reiterate: this is still on code that is executing single-threaded. There’s no actual cross-core communication going on yet; this extra cost is incurred purely within a single core, to make the code safe for multi-core execution.
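To make the comparison concrete, here's a sketch of the two flavors of reference-count increment being compared (function names are my own):

```cpp
#include <atomic>

// The two flavors of reference-count increment compared above.
// On x86, the atomic version compiles to a LOCK-prefixed instruction
// (e.g. LOCK ADD / LOCK XADD); the plain version to a bare ADD.
// Even with no other core touching the cache line, the LOCK'd form
// costs roughly 10x-25x more on out-of-order desktop parts.
int plain_increment(int& refcount) {
    return ++refcount;  // regular read-modify-write
}

int atomic_increment(std::atomic<int>& refcount) {
    // fetch_add returns the old value, so +1 gives the new count.
    return refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
}
```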

Bus traffic and cache coherency. Some percentage of memory accesses actually misses the cache and goes straight to memory; and once cache lines that haven’t been used in a while get evicted, we start getting write-backs. All these events cause bus traffic (and memory traffic). Bus and memory bandwidth is a limited resource; as we start saturating their capacities, things start to get slower.

Moreover, once we switch to running multiple threads of our program on multiple cores, we actually start getting cache coherency traffic, as the cores continually synchronize their respective views of memory. If every thread works on its own independent memory region, this doesn’t really do much; if a given region of memory is only used by one core, then there’s no sharing, and getting exclusive ownership of one of the corresponding cache lines is easy and doesn’t cause any invalidations elsewhere.

By contrast, if two or more cores frequently access the same cache lines, then these cache lines need to be kept synchronized. Updates to one of these cache lines require exclusive ownership, which means all other cores need to invalidate their copies of that cache line first. As a result, the next time that cache line is accessed by another core, its contents need to be sent across the bus. So we get both extra cache misses (on the remote core) and extra bus traffic. This phenomenon of multiple cores hitting a cache line that is being updated regularly is called “cache (line) contention”, and it is probably the easiest way to make parallel code in shared-memory environments crawl.

Cache line contention

To get cache line contention, we need multiple cores frequently accessing the same cache line, with at least some of these regular accesses being writes. Private data (cache lines only accessed by one thread) is never a problem; neither is immutable data (written once and then not modified until the end of its lifetime). What’s tricky is data that is both shared and mutable: doing so requires a lot of communication to maintain a consistent (as per the constraints of the memory model) view of memory between cores, and communication is expensive – and only keeps getting more so as more parties get involved.

How much more expensive are we talking? I wrote a test (for x86/Windows) a few weeks ago. This test is by no means user-friendly or easy to read, and it assumes a quad-core CPU with simultaneous multi-threading, or a similar topology, with at least 8 logical processors. Here’s the gist of it: as explained above, replacing a read-modify-write add of a value in memory with an atomic “add” operation generally makes it about 10x-25x as expensive (how much exactly depends on the micro-architecture). If you need a rule of thumb, just assume about 10x (good enough for Fermi estimation).

Once there is a second core reading that cache line, though, the cost explodes. If there’s another core generating lots of read traffic on the cache line in a tight loop, the atomic add gets more expensive – much more expensive: typically, 4x to 6x more (that’s on top of the ~10x hit we take from using an atomic add to begin with!). And this cost only goes up if there are more readers, and even more so if there are other writers too. Now, please don’t take these values as gospel; this is a completely synthetic benchmark that doesn’t do anything useful. The actual cost of coherency traffic very much depends on context. All I want to do here is give you a very rough feel for the cost of coherency traffic and the communication it does (namely: it’s not negligible).
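This is not the actual test from the post, but a much-simplified sketch (names are my own) of how such a synthetic contention benchmark can be structured: time a batch of atomic adds to a counter while an optional reader thread hammers the same cache line with loads, forcing coherency traffic. Absolute numbers will of course vary wildly by machine.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Time `iterations` atomic adds on a shared counter. If with_reader is
// set, a second thread spins on loads of the same cache line, so every
// write by the main thread must first invalidate the reader's copy.
double time_atomic_adds(long iterations, bool with_reader) {
    std::atomic<long> counter{0};
    std::atomic<bool> stop{false};
    std::thread reader;
    if (with_reader) {
        reader = std::thread([&] {
            // Tight read loop on the contended cache line.
            while (!stop.load(std::memory_order_relaxed))
                (void)counter.load(std::memory_order_relaxed);
        });
    }
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    auto t1 = std::chrono::steady_clock::now();
    stop.store(true);
    if (reader.joinable()) reader.join();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

On a typical desktop part you would expect the `with_reader` variant to come out several times slower per add, in line with the 4x-6x figure above, but treat any single run with suspicion: frequency scaling, thread placement, and SMT siblings all skew the results.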

Some of this communication is not actually necessary. For example, because cache coherency is tracked at cache line granularity, it’s possible to get lots of bogus coherency traffic simply because the different types of data – immutable, private and shared – are intermingled within the same cache line (or similarly, because one cache line contains private data for multiple threads). This is called “false sharing”. Luckily, this kind of problem is fairly easy to find with a profiler, and also relatively straightforward to fix by either reordering the data in memory (possibly while adding padding to make sure two different kinds of data don’t end up in the same cache line) or just removing some of the offending data entirely. My older post “Cores don’t like to share” gives an example.

What’s left over after this process is “real” contention – contended access to shared data. This includes both actual shared mutable data structures and certain kinds of metadata such as locks and other synchronization objects. Exactly how well this works depends on the precise layout of data in memory, as well as the operations used to access it.

In general, the way to get scalable multi-processor code is to avoid contention as much as possible, and to make whatever contention remains pass quickly – in that order. And to do a decent job at this, it’s important to know how cache coherency works (in broad strokes, anyway), what kind of messages cores exchange to maintain memory coherency, and when that coherency traffic happens. Now that we have those basics covered, we can look at somewhat higher levels of the stack. This post is long enough already, but in the next post, I’m planning to have a look at locks and lock-free data structures, and discuss some of the trade-offs. Until then, take care!

GameDev.Net Articles

Introduction to Software Optimization

August 17, 2014 07:06 PM

As a software/game developer, you usually want more and more... of everything, actually! More pixels, more triangles, more FPS, more objects on the screen, more bots, more monsters. Unfortunately you don't have endless resources, so you end up making compromises. The optimization process can help reduce performance bottlenecks and may unlock performance reserves hidden in the code.

Optimization shouldn't be based on random guesses: "oh, I think, if I rewrite this code to SIMD, the game will run a bit faster". How do you know that "this code" actually causes real performance problems? Is investing time there a good option? Will it pay off? It would be nice to have a clear guide, a direction.

To better understand what to improve, you first need to establish a baseline for the system/game. In other words, you need to measure the current state of the system and find hot spots and bottlenecks. Then think about the factors you would like to improve... and then... start optimizing the code! Such a process might not be perfect, but at least you will minimize potential errors and maximize the outcome.

Of course, the process will not be finished in a single iteration. Every time you make a change, the process starts over. Take one small step at a time. Iterate.

In the end, your game/app should still work (hopefully without new bugs) and it should run X times faster. The factor X can even be measured accurately, if you do the optimization right.

The Software Optimization Process

According to this and this book, the process should look like this:

  1. Benchmark
  2. Find hot spots and bottlenecks
  3. Improve
  4. Test
  5. Go back


The whole process should not start only after the implementation is complete (when there is usually no time for it), but should be carried out throughout the project. In the case of our particle system, I tried to think about possible improvements up front.

1. The benchmark

Having a good benchmark is crucial. If you do it wrong, the whole optimization process can even be a waste of time.

From The Software Optimization Cookbook book:

The benchmark is the program or process used to:
  • Objectively evaluate the performance of an application
  • Provide repeatable application behavior for use with performance analysis tools.

The core and required attributes:

  • Repeatable - gives the same results every time you run it.
  • Representative - uses a large portion of the main application's use cases. It would be pointless to focus only on a small part of it. For a game, such a benchmark could include the most common scene or a scene with the maximum number of triangles/objects (that way simpler scenes will also run faster).
  • Easy to run - you don't want to spend hours setting up and running the benchmark. A benchmark is definitely harder to make than a unit test, but it would be nice if it runs as fast as possible. Another point is that it should produce easy-to-read output: for instance an FPS report, a timing report, simple logs... but not hundreds of lines of messages from internal subsystems.
  • Verifiable - make sure the benchmark produces valid and meaningful results.
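A minimal benchmark harness along these lines can be very small. Here's a sketch (names are my own) that aims at the "repeatable" attribute by running the workload several times and reporting the best time, which tends to be the most stable statistic run-to-run:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Run the workload several times and report the minimum wall-clock
// time; the minimum filters out interference from the OS and other
// processes, making results more repeatable than a single run.
double benchmark_seconds(const std::function<void()>& workload,
                         int repetitions = 5) {
    double best = 1e300;
    for (int i = 0; i < repetitions; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) best = s;
    }
    std::printf("best of %d runs: %.6f s\n", repetitions, best);
    return best;
}
```

A real game benchmark would of course drive a whole scene rather than a lambda, but the shape is the same: fixed input, repeated runs, one easy-to-read number out.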

2. Find hot spots and bottlenecks


When you run your benchmark you will get some output. You can also run profiling tools and get more detailed results of how the application is performing.

Having the data is one thing; it is more important to understand it, analyze it, and draw good conclusions. You need to find the problem that blocks the application from running at full speed.

Just to summarize:

  • bottleneck - a place in the system that makes the whole application slower. Like the weakest link in a chain. For instance, you can have a powerful GPU, but without enough memory bandwidth you will not be able to feed this GPU monster with data - it will sit idle, waiting.
  • hot spot - a place in the system that does crucial, intensive work. If you optimize such a module, the whole system should run faster. For instance, if the CPU is doing too much work, maybe offload some of it to the GPU (if it has free compute resources available).

This part may be the hardest. In a simple system it is easy to see the problem, but in large-scale software it can be quite tough. Sometimes the culprit is a single small function; sometimes it is the overall design, or an algorithm used.

Usually it is better to use a top-down approach. For example:

Your framerate is too low. Measure your CPU/GPU utilization, then look at the CPU or GPU side. If CPU: think about your main subsystems - is it the animation module, AI, physics? Or maybe your driver cannot process that many draw calls? If GPU: are you vertex or fragment bound? Go down into the details.

3. Improve


Now the fun part! Improve something and the application should work better :)

What you can improve:

  • at the system level - look at the utilization of your whole app. Are any resources idle (CPU or GPU waiting)? Do you use all the cores?
  • at the algorithmic level - do you use the proper data structures and algorithms? Maybe instead of an O(n) solution you can reduce it to O(log n)?
  • at the micro level - the 'funniest' part, but do it only when the first two levels are exhausted. If you are sure that nothing can be designed better, you need to use some dirty code tricks to make things faster.
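The algorithmic level usually gives the biggest wins. A classic illustration (a sketch; the function names are my own) is a membership test on sorted data - no micro-tricks, just a better algorithm:

```cpp
#include <algorithm>
#include <vector>

// O(n): scan every element until we find a match.
bool linear_contains(const std::vector<int>& sorted, int value) {
    for (int v : sorted)
        if (v == value) return true;
    return false;
}

// O(log n): the data is already sorted, so binary search does the
// same job with exponentially fewer comparisons as n grows.
bool binary_contains(const std::vector<int>& sorted, int value) {
    return std::binary_search(sorted.begin(), sorted.end(), value);
}
```

For a few dozen elements the linear scan may actually be faster (better cache behavior, no branches mispredicted on the bisection), which is exactly why you measure instead of guessing.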

One note: instead of rewriting everything in assembly, use your tools first. Today's compilers are powerful optimizers as well. Another issue here is portability: a trick that works on one platform might not work on another.

4. Test

After you make a change, test how the system behaves. Did you get a 50% speed increase? Or is it maybe even slower?

Besides performance testing, please make sure you are not breaking anything! I know that making a system 10% faster is nice, but your boss will not be happy if, thanks to this improvement, you introduce several hard-to-find bugs!

5. Go back


After you are sure everything works even better than before... just run your benchmark again and repeat the process. It is better to make a small, simple change than a big, complex one. With smaller steps it is harder to make a mistake, and it is easier to revert the changes.

Profiling Tools

Main methods:

  • custom timers/counters - you can create a separate configuration (based on Release mode) and enable a set of counters or timers. For instance, you can place them in every function of a critical subsystem. You can then generate a call hierarchy and analyze it further.
  • instrumentation - the tool adds special fragments of code to your executable so that it can measure the execution.
  • interception - the tool intercepts API calls (for instance OpenGL - glIntercept, or DirectX) and analyzes the recorded log afterwards.
  • sampling - the tool stops the application at regular intervals and inspects the call stack. This method is usually much lighter-weight than instrumentation.
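A custom timer of the first kind can be as simple as an RAII helper dropped at the top of a function in an instrumented build. A sketch (the class name is my own):

```cpp
#include <chrono>
#include <cstdio>

// Scoped timer: construct it at the top of a function and it prints
// the elapsed time for that scope when it is destroyed on exit.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}

    // Elapsed time so far, in milliseconds.
    double elapsed_ms() const {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - start_).count();
    }

    ~ScopedTimer() {
        std::printf("%s: %.3f ms\n", label_, elapsed_ms());
    }

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
```

In a real codebase you would typically hide this behind a macro so the timers compile to nothing in shipping builds, and log to a buffer rather than printf to keep the measurement overhead low.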

Below is a list of professional tools that can help:

  • Intel® VTune™ Amplifier
  • Visual Studio Profiler
  • AMD CodeXL - FREE. AMD created a good, easy-to-use profiling tool for both CPU and GPU. It does the best job when you also have an AMD CPU (which I don't have ;/), but for Intel CPUs it will at least give you timing reports.
  • Valgrind - runs your app on a virtual CPU and can detect various problems, from memory leaks to performance issues.
  • GProf - Unix, uses a hybrid of sampling and instrumentation.
  • Lots of others... see the list on Wikipedia

Something more


I probably do not need to write this... but the more you automate, the easier your job will be.

This rule applies, nowadays, to almost everything: testing, setup of application, running the application, etc.

Have Fun!

The above process sounds very 'professional' and 'boring'. There is also another factor that plays an important role when optimizing the code: just have fun!

You want to make mistakes, you want to guess what to optimize, and you want to learn new things. In the end, you will still gain new experience (even if you optimized the wrong method).

You might not have enough time for this at your day job, but what about some hobby project?

The more experience with the optimization process you have, the faster your code can run.


Article Update Log

17th August 2014: Initial version, based on post from Code and Graphics blog