
Optimizing Unreal Engine 4’s async file IO thread.

Pablo Zurita, 12:01 pm on June 30, 2014

After the first post I made on the ToLower() optimization, a lot of people rightfully wondered why I optimized the function itself rather than the number of calls made to it. The answer is that I optimized both. While reducing the number of ToLower() calls makes the bigger difference, it is also important for the code that does execute to be optimal, even if it isn't executed often. The sum of all these small inefficiencies is what causes performance problems, because it is hard to see performance degrade gradually as the product evolves. This is what people refer to as "death by a thousand cuts". So I first optimized the function itself to set a minimum bar for the cost of each call, and then went ahead and reduced the number of calls. Of course, depending on your time constraints you may set priorities differently.

As usual I’m going to be profiling on my machine that has the following specs:





So let's begin by looking at the profile. Just like last time, this profile covers the first 1000 frames of the Elemental demo, starting at frame two.





As you can see, a lot of time is being spent in the ToLower() function, something I already optimized, but that change isn't included in this profile because I profile each change on its own. So the approach this time is to reduce the number of calls to the function. Looking at the callstacks, I found that a lot of the calls were coming from the AsyncIOSystem thread.





As the name says, the AsyncIOSystem is an asynchronous IO system. It is cross-platform and runs as a single-threaded runnable. What is a runnable in Unreal Engine 4, you ask? Here is the answer:
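The engine's FRunnable boils down to a handful of lifecycle callbacks that a worker thread drives in order. Here is a self-contained sketch of that shape (simplified from the engine interface, with a std::thread standing in for FRunnableThread, and FToyAsyncIO as a hypothetical stand-in for the AsyncIOSystem runnable):

```cpp
#include <cstdint>
#include <thread>

// Simplified sketch of UE4's FRunnable: an object whose lifecycle callbacks
// are driven by a worker thread. Init() runs once, Run() is the thread body,
// Stop() asks it to finish early, Exit() cleans up after Run() returns.
class FRunnable
{
public:
    virtual bool Init() { return true; }  // one-time setup
    virtual uint32_t Run() = 0;           // thread body; returns an exit code
    virtual void Stop() {}                // request early termination
    virtual void Exit() {}                // cleanup after Run() returns
    virtual ~FRunnable() {}
};

// Toy runnable standing in for the AsyncIOSystem: it "ticks" until stopped.
class FToyAsyncIO : public FRunnable
{
public:
    bool bStop = false;
    int Ticks = 0;

    uint32_t Run() override
    {
        while (!bStop)
        {
            ++Ticks;
            if (Ticks >= 100) { Stop(); }
        }
        return 0;
    }
    void Stop() override { bStop = true; }
};

// Minimal stand-in for FRunnableThread: drive the callbacks on a std::thread.
uint32_t RunOnThread(FRunnable& Runnable)
{
    uint32_t ExitCode = 1;
    std::thread Worker([&] {
        if (Runnable.Init()) { ExitCode = Runnable.Run(); }
        Runnable.Exit();
    });
    Worker.join();
    return ExitCode;
}
```

The real engine adds thread priority, affinity, and synchronization on top, but the shape is the same: the AsyncIOSystem's Tick loop lives inside Run().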


The ElementId is generated by the GetTypedHash() function, which for our type generates a CRC32 hash; this is where all the time in FCrc::Strihash_DEPRECATED&lt;wchar_t&gt;() was being spent. And KeyFuncs::Matches() looks like this:

static FORCEINLINE bool Matches(KeyInitType A, KeyInitType B)
{
    return A == B;
}
While that looks perfectly reasonable for fundamental integral types, for FStrings it ends up calling Stricmp to do a case-insensitive lexicographical comparison. This is where the ToLower() call is made:

template <typename CharType>
static inline int32 Stricmp( const CharType* String1, const CharType* String2 )
{
    // walk the strings, comparing them case insensitively
    for (; *String1 || *String2; String1++, String2++)
    {
        CharType Char1 = TChar<CharType>::ToLower(*String1), Char2 = TChar<CharType>::ToLower(*String2);
        if (Char1 != Char2)
        {
            return Char1 - Char2;
        }
    }
    return 0;
}

So now we know what it costs to find something in the cache, but how often does that happen? The answer is in the FAsyncIOSystemBase::Tick() function, which shows that it happens once per outstanding request, and then once more when a request is pending. I measured the number of requests made before the very first frame was rendered: 2096 requests were queued. Considering that the AsyncIOSystem thread has above-normal priority, this happens pretty often, and the numbers add up quickly. We need to fix this.

To fix this I took a rather simple approach: make sure that finding something in the cache only involves comparisons between integral types. The easiest way was to add another field to the IO request data: a 32-bit hash of the filename. The hash is generated whenever a new IO request is queued up (be it an IO request or a file-handle-destroy request), and that hash is then used to find cached file handles. To generate the hash I decided to use something already found in the engine rather than integrating something like FNV-1 or xxHash, so I used a CRC32 hash.
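The idea can be sketched in a few lines. The type and function names below are hypothetical stand-ins, not the engine's (a minimal CRC32 replaces FCrc, and the case-insensitivity of the old Strihash is preserved by lower-casing once at queue time rather than on every comparison):

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>

// Minimal reflected CRC32 (polynomial 0xEDB88320), standing in for FCrc.
uint32_t Crc32(const std::string& Data)
{
    uint32_t Crc = 0xFFFFFFFFu;
    for (unsigned char Byte : Data)
    {
        Crc ^= Byte;
        for (int Bit = 0; Bit < 8; ++Bit)
            Crc = (Crc >> 1) ^ (0xEDB88320u & (0u - (Crc & 1u)));
    }
    return ~Crc;
}

struct FAsyncIORequestSketch   // hypothetical, trimmed-down request
{
    std::string FileName;
    uint32_t FileNameHash;     // new field: computed once at queue time
};

class FHandleCacheSketch       // file-handle cache keyed on the hash
{
public:
    using FileHandle = int;    // stand-in for the real handle type

    FileHandle* Find(uint32_t Hash)  // integer key: no string comparison
    {
        auto It = Cache.find(Hash);
        return It != Cache.end() ? &It->second : nullptr;
    }
    void Add(uint32_t Hash, FileHandle Handle) { Cache[Hash] = Handle; }

private:
    std::unordered_map<uint32_t, FileHandle> Cache;
};

FAsyncIORequestSketch QueueRequest(const std::string& FileName)
{
    // Case-insensitivity is paid once here, at queue time, rather than on
    // every cache lookup: lower-case the name, then hash it.
    std::string Lower = FileName;
    std::transform(Lower.begin(), Lower.end(), Lower.begin(),
                   [](unsigned char C) { return (char)std::tolower(C); });
    return { FileName, Crc32(Lower) };
}
```

Note that a 32-bit hash can in principle collide; a defensive version would fall back to a full string compare when hashes match, which still avoids the string work on the common miss path.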

So after doing that change let’s look at the profile:





Pretty impressive: the call to ToLower() isn't in the profile anymore because only 0.030ms is spent in it across all 1000 frames, and the call to FCrc::Strihash_DEPRECATED&lt;wchar_t&gt;() isn't there either because only 7.9ms is spent in it across all 1000 frames.

The lesson this time relates to the tendency of developers to hide complexity under very generic functions that have huge performance relevance. In particular, performance was suboptimal because it isn't obvious that A == B in KeyFuncs::Matches implies a Stricmp call for an FString. That's why in my own code I tend not to override operators: they hide complexity, when as programmers we need to be fully aware of the complexity of what we ship. Programmers also forget that our main objective isn't to create generic solutions for problems we may have in the future. Our main objective is to ship the best experience to the end user, and that means writing code that solves the actual problem we need to solve, with the available hardware resources, and within the available time. If you care about the future of your code, then worry about writing optimizable code rather than making a grand design that abstracts complexity behind ten different design patterns. The truth is that the end user doesn't care whether you used ten different patterns in a generic design, but they do care if the load times are high.

And knowing your performance data is critical. In the case of the Elemental demo, 2096 IO requests were made and fulfilled before the first frame was rendered. Being aware of that is critical to making the proper design and performance decisions. In fact, given that data, I would be inclined to optimize this further by changing the AoS nature of FAsyncIORequest to SoA, so that the hashes are all stored together and the CPI of the Tick function goes down, but I will leave that up to you.
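To make that suggestion concrete, here is what the two layouts look like side by side. The field names are illustrative, not the engine's; the point is where the hashes live in memory:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Array-of-Structures (the current layout, sketched): scanning the queue for
// a hash strides over every field of every request, dragging the filename and
// other cold data through the cache alongside the one field we care about.
struct FAsyncIORequestAoS
{
    std::string FileName;
    uint32_t    FileNameHash;
    int64_t     Offset;
    int64_t     Size;
};

// Structure-of-Arrays: the hashes are a dense array of 32-bit integers, so
// the scan in Tick() touches only hot data and the rest of each request
// stays out of the cache until a match is found.
struct FAsyncIORequestsSoA
{
    std::vector<std::string> FileNames;
    std::vector<uint32_t>    FileNameHashes;
    std::vector<int64_t>     Offsets;
    std::vector<int64_t>     Sizes;

    int FindByHash(uint32_t Hash) const
    {
        for (size_t i = 0; i < FileNameHashes.size(); ++i)
            if (FileNameHashes[i] == Hash)
                return (int)i;
        return -1;  // not found
    }
};
```

With 2096 requests in flight before the first frame, the difference between striding over fat structs and walking a contiguous array of uint32s is exactly the kind of thing that shows up in the Tick function's CPI.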


