Optimizing string function in UE4
2015-10-30 11:42 · 609 views
Pablo Zurita — 4:49 am on June 21, 2014
Tags: optimization, performance, Unreal Engine
Back in March, during GDC, Epic Games released Unreal Engine 4 with a subscription-based license. As soon as it was released I decided to license it so I could learn and exercise my skills. That was something I wasn't able to do back in March, since I was helping EA Sports reach alpha stage on NHL 15, doing back-end work for the UI. As soon as I got the engine I started profiling it to understand its performance characteristics. As this blog goes on you will see that performance is something very important to me =). Since then I have come up with half a dozen optimizations, some of which are already integrated in Epic's Perforce, and some of which are still under review. So today I'm going to talk about a really basic optimization with a considerable impact on performance, which involves converting a character to lower case.
As with any serious performance work, I started by profiling, which provides the most critical data for optimization. I refuse to do any optimization work unless I have profiling data that backs it up. I also add the before/after profiling data to the commit message to make explicit why the work needed to be done, and I make sure I include the specs of the hardware used for profiling. All the profiling that I will show in this post was done on this machine:
The first thing I did was hook up Intel's VTune to Unreal Engine 4, since I also try to avoid non-automated profiling as much as possible. After some work dealing with the Unreal Build Tool I got it integrated properly: I was able to add frame markers and events, and to start and stop the capture of profiling data from code. With that ready, I decided to profile the Elemental demo, a demo first shown back in 2012 to show the potential of Unreal Engine 4. Martin Mittring gave a presentation about it at SIGGRAPH 2012. If you have never seen the demo, here is a video of it:
I decided to profile the Elemental demo from frame 2 to frame 1000, to skip all the data loading in the first frame. That meant nothing would be profiled for the first ~3.6 seconds of the demo; the profiler would then be enabled at the start of the second frame and stopped at the end of frame 1000. Since this is a scripted demo, the run is repeatable, which let me verify that any optimization I made actually improved performance. The profiling showed this:
For now just ignore FBox::TransformBy(), since I will talk about it in another post. After that, the top hotspot was TChar<wchar_t>::ToLower(). Since the Unreal Engine 4 EULA allows me to, I will share this small piece of code.
template <>
inline TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	// compiler generates incorrect code if we try to use TEXT('char')
	// instead of the numeric values directly

	// some special cases
	switch (WIDECHAR(c))
	{
	// these are not 32 apart
	case 159: return 255; // diaeresis y
	case 140: return 156; // digraph ae

	// characters within the 192 - 255 range which have no uppercase/lowercase equivalents
	case 240:
	case 208:
	case 223:
	case 247:
		return c;
	}

	if ((c >= 192 && c < 223) || (c >= LITERAL(CharType, 'A') && c <= LITERAL(CharType, 'Z')))
	{
		return c + UPPER_LOWER_DIFF;
	}

	// no lowercase equivalent
	return c;
}
This converts a wide char from upper case to lower case. Taken on its own, most people wouldn't think of this piece of code as a top hotspot. Unfortunately, programmers in general don't seem to think about the performance implications of their string-handling code. In particular, the Unreal Engine 4 code used too many branches, which are hard for the branch predictor to predict given the random nature of the input. There were two things to be done here: the most obvious one was to reduce the number of calls to the function, but the other was to optimize the function itself by reducing the number of branches. I will cover how I reduced the number of calls in another post, so let's focus on optimizing the actual function.
The code makes it pretty explicit what the valid input is, and what just returns the input char unchanged. Since the range of valid input is small, it really lends itself to a look-up table. The table only needs to be as big as the valid input range, so the single remaining branch just checks that the input char fits within that range. The first step was to create the look-up table, which was really easy.
static const size_t ConversionMapSize = 256U;
static const uint8 ConversionMap[ConversionMapSize] =
{
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 'a',  'b',  'c',  'd',  'e',  'f',  'g',  'h',  'i',  'j',  'k',  'l',  'm',  'n',  'o',
	'p',  'q',  'r',  's',  't',  'u',  'v',  'w',  'x',  'y',  'z',  0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
	0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
	0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x9C, 0x8D, 0x8E, 0x8F,
	0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xFF,
	0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
	0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xDF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};
This table is just 256 bytes long, which is 4 cache lines on the platforms relevant to Unreal Engine 4. That shouldn't be a concern at all, since the use case is converting full strings char by char: the first char may incur a cache miss fetching the table, but the rest of the chars will be cache hits.
Now that we have the look-up table, we can index it with the input char as long as the char is within the range covered by the table. For anything outside that range we just return the input char unchanged.
template <>
inline TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	return (c < static_cast<CharType>(ConversionMapSize))
		? static_cast<CharType>(ConversionMap[c])
		: c;
}
Now we have a single branch. Time to profile the optimization. Since I had everything automated, I ran the exact same process capturing the same number of frames, and this was the performance difference in the function:
Without much effort I decreased the CPU time spent in the function by ~41%. That is a big difference, and it shows how performance can be dominated by branches. It also shows how critical it is to profile before optimizing. I'm pretty sure most programmers would have thought the time would be dominated by some complex subsystem (such as animation) rather than by something as mundane as dealing with chars.
The basic lesson here is that you should profile before doing any optimization work; don't assume that the most complex subsystems are always the bottleneck. Even more important is to be aware of your target platform and to write code that suits it.