1 (edited by xenonite 01-09-2015 08:19:47)

Topic: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Hi guys, I've made a few optimized compiles of the latest Open-Source svpflow1.dll library with the 2015 version of the Intel Compiler on Visual Studio 2013.
I have not altered the source code, only set the compiler options and compiled it for the different AVX architectures and to maximize runtime speed.

Since almost all of the computationally intensive code has been hardcoded to SSE assembly, don't expect these builds to dramatically improve SVP's performance. However, they should be the most optimized that you can get for CPUs younger than 5 years old without actually changing the source code.

NB. Mods: If these builds are not useful or break any forum rules, please don't hesitate to delete this thread. I only post them in case there are some users who would like an (at least partially) ICL-compiled SVP.

EDIT: The files have been removed until I can figure out why OpenMP keeps imposing its inclusion. Also, I want to do some further tests with the compiler optimizations, since aggressive loop unrolling seems to hurt, rather than help, SVP's performance (even though the .dll's file size only grew to about 1100kB).

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Xenonite

I got a "Platform returned 126" error. What is that?

Is there any way to test it? I simply tried them by copying them to the plugin folder in the SVP folder, though.

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

xenonite
I've made a few optimized compiles of the latest Open-Source svpflow1.dll library with the 2015 version of the Intel Compiler on Visual Studio 2013

You can just share your settings, 'cause we're using exactly the same toolchain smile
And I don't expect an "AVX" build could be any faster than the "SSE2" one; the main bottleneck is still in the x264 asm code.

I only post them in case there are some users that would like an (at least partially) ICL compiled SVP

SVPflow libs are compiled with ICL since ver.1.1.14 - http://forum.doom9.org/showthread.php?p … ost1718861

4 (edited by xenonite 30-08-2015 19:25:02)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

mashingan wrote:

Xenonite

I got a "Platform returned 126" error. What is that?

Is there any way to test it? I simply tried them by copying them to the plugin folder in the SVP folder, though.

I apologise for responding so late... I forgot to mention that these libraries are for SVP 3.1.7 (although they should also work for SVP4).

I don't know why you got that error... It could be that you are using Windows XP? Or a 32-bit build of Windows? Or maybe an AMD CPU, or a CPU older than Sandy Bridge. Other than that, I can only guess that some configuration of your OS is causing problems (maybe some non-standard SVP installation that wasn't properly uninstalled?).

Either way, I did test that they do actually work on my development laptop (a 17" Sager Haswell desktop replacement system), on an old Ivy Bridge based 3770k system with SLI 780 Ti GPUs that I have lying around and, of course, also on my 5960X-based HTPC, so they should just work without any other configuration.
I do not, however, have any OS other than Windows 7 (I have no intention to "upgrade" to Windows 10, except maybe for a dedicated DirectX 12 gaming box somewhere down the road), so that might be the cause of some of your issues?

Chainik
Wow, I was completely unaware of that. It does, however, explain why my MSVC builds are consistently slower than your default builds (I have been poring over my development environment, checking and rechecking every single setting to find out why I can't reproduce it, just because I assumed you had been using MSVC all along).

Anyway, thank you very much for providing such optimized builds. I can't tell you how much it means to me that the developers of an application actually took the time to compile an optimised binary (virtually all video processing apps/utilities and plugins are just compiled with MSVC or GCC and left at that; sometimes even without any /Ox compiler flags or /arch designations).

I'll post my compiler settings when I get home (posting from my phone atm) so that you (if you have time) can tell me what you think, and I'll also try to benchmark some important permutations to see if I can find a way to significantly (sustained average >10% performance improvement) improve on the default distribution.

I'm still very new to general applications programming (I mainly work on the hardware side of projects, with some minimal assembly hand-tuning of a particularly problematic ISR here and there), but I will also take a look to see if there are any easy ways to optimise some of the most expensive functions and then post the code (if any; I'm probably being way too optimistic) for review. I can't run any complete performance profiling tests since I only have the svpflow1 code to work with, but I believe most of the work is done in PlaneOfBlocks, yes?

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

xenonite
I will also take a look to see if there are any easy ways to optimise some of the most expensive functions and then post the code (if any; I'm probably being way too optimistic) for review.

I believe we've already optimized that code as much as possible

w/o changing the algorithm, of course

Still it'd be very interesting if you could achieve better results smile

I believe most of the work is done in PlaneOfBlocks, yes?

To be more specific, in the "search.cpp" file

6 (edited by Nintendo Maniac 64 30-08-2015 23:51:34)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

All AMD CPUs and APUs since Piledriver support AVX; more important, Intel Celerons and Pentiums do not support AVX, including the G3258 "Pentium Anniversary Edition".


(Bulldozer also supported AVX, but Llano and Bobcat did not)

7 (edited by xenonite 31-08-2015 01:32:26)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Chainik
Here are my current C++ compiler options (as taken from the VS2013 project property page, under C++ > Command Line):

All Options:
/MP /GS- /W3 /QxCORE-AVX2 /Gy /Zc:wchar_t /I"C:\Users\Xenophos\svpflow\src\jsoncpp\include" /I"C:\Users\Xenophos\svpflow\src\jsoncpp\Release" /Zi /O3 /Ob2 /Fd"Release\vc120.pdb" /Quse-intel-optimized-headers /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "SVPFLOW1_EXPORTS" /D "_WINDLL" /D "_MBCS" /Qipo /Zc:forScope /arch:CORE-AVX2 /Gd /Oy /Oi /MT /Fa"Release\" /EHsc /nologo /Qparallel /Fo"Release\" /Qprof-dir "Release\" /Ot /Fp"Release\svpflow1.pch"

Additional Options:
/Qipo:128 /Qunroll:256 /fast /QxCORE-AVX-I /Qip /Qopt-args-in-regs:all /Qopt-class-analysis /Qopt-dynamic-align /Qopt-jump-tables:large /Qopt-mem-layout-trans:3 /Qopt-prefetch:4 /Qsimd /Qunroll-aggressive /Qinline-calloc /Qinline-forceinline /Qinline-max-per-compile- /Qinline-max-per-routine- /Qinline-max-size- /Qinline-max-total-size- /Qinline-min-size=16384 /Qinline-dllimport

I'm not at all certain that these options are a member (let alone the only member) of the set of compiler options that achieves the best possible performance for a typical SVP workload, but I do think they are a decent starting point for benchmarking the effect the compiler has on SVP's performance.

Chainik wrote:

xenonite
I believe we've already optimized that code as much as it's possible

Yea that's what I thought too... can't hurt to try though!  big_smile
I don't see any vectorizing or parallelizing hints to the compiler in the source code, so I might try to better guide its compilation before I start trying to dig through the x264 assembly source.

One more question: have you ever run a performance profiler on the complete SVP program and, if so, approximately what fraction of the total time taken per frame was spent within Search.cpp?

EDIT:

Nintendo Maniac 64 wrote:

All AMD CPUs and APUs since Piledriver support AVX

This is true, however, the Intel compiler has been known to artificially gimp performance on AMD processors. It would be a far better idea to compile with GCC for AMD processors.
Please see here: http://www.agner.org/optimize/blog/read.php?i=49 for a description of the problem,
and here: https://software.intel.com/en-us/articl … ice#opt-en for the official Intel comment on this issue.

Nintendo Maniac 64 wrote:

more importantly is that Intel Celerons and Pentiums do not support AVX

Yes, but I don't see the problem? SVP by default is compiled to run on all CPUs from at least the last 5 years, back to and including the first Pentium 4 processors of 14 years ago (i.e. anything with SSE2). Sure, these DLLs aren't usable on those platforms, but why would you even want to use them there?

I completely understand supporting legacy architectures, especially in the age of "good enough {insert anything here}" where most people don't regularly upgrade their hardware anymore.
The problem I do have is with the people who cannot achieve sufficient performance with their current hardware. In my case, I completely stopped watching series and movies for a period of 3 years (I can't just ignore or 'watch around' the things that bother me) while I worked as hard as I possibly could to make enough money to buy the absolute 'fastest' components currently available.
Then imagine my horror when I realized that all my hard work was basically for naught. I still could not achieve a 'good' level of quality and not just because "CPU performance has completely stagnated". No, the high-end Haswell (and Skylake) CPUs have the potential to double their throughput of 8-bit integer operations (the kind that is used to find motion vectors for 8-bit-per-component video data) compared to high-end Sandy- and Ivy-Bridge processors.
This basically told me: "No matter how hard you work, you will never be able to buy the level of performance you desire, since the maximum performance is limited by those who don't really care about achieving the maximum possible image quality".

I'm sorry for going off on a rant there but, as you probably gathered, this is quite important to me. It's not simply a matter of some 1%'er bragging about the amount of money he can spend on some 'completely over-the-top PC' to the detriment of those who simply cannot afford the same things due to factors completely out of their control. This is about not being able to achieve anything, no matter how much money you are eventually able to make.
And it's not specific to SVP at all; how many applications can you list that make use of the newest advanced processor features? I'm not even talking about 'difficult' things such as parallel programming or GPGPU; just simply caring enough to recompile your code or (if you have code that is sufficiently sensitive to have needed hand-tuned assembly optimization) to add some functions that make use of these new instructions.
[Ps. Yes, technically you can (and good software engineering policy dictates that you should) support all the CPUs from both manufacturers to their fullest performance potential without penalizing any of them by doing so. It just makes writing a simple program so much more difficult that it becomes quite prohibitive for a small group of volunteers, working mostly in their off-time, to achieve their goals in a reasonable time frame. In those cases, it's simply a matter of being able to get the app out the door in the first place, which, I believe, is a much better excuse than simply optimizing for the lowest common denominator and not caring about the rest.]

So, in a nutshell, this is why I decided to start properly learning C++ programming.
Do I think that I will be able to make a notable difference, even if I successfully master modern multi-threaded programming? No, of course not; I'm not THAT stupid... but that won't stop me from at least trying! tongue
Also, Chainik, you guys absolutely rock. This really doesn't get said enough, but cleaning up the MVTools code-base to the point that it is at today was some truly amazing work.

8 (edited by Nintendo Maniac 64 31-08-2015 03:42:26)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

....uhhh, my point was just to point out a possible reason why mashingan could have been having issues.

Also I'm not sure why you're bringing up old CPU architectures like the Pentium 4 in regards to my mention of processor models that are only a year old or so like the Pentium G3258...

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

xenonite
I don't see any vectorizing or parallelizing hints to the compiler in the source code
... /Qipo

parallelization is cheating big_smile
we're already in a heavily multi-threaded environment and we're not interested in single-threaded performance

Have you ever run a performance profiler on the complete SVP program

very long time ago, I don't remember actual numbers

10 (edited by sparktank 31-08-2015 09:20:26)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

I tried them in SVP 3.1.7 and SVP 4 TP and couldn't get them to work.

Windows 7 (x64, but SVP is x86).
Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz
Intel Sandy Bridge rev. 09
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX

I have SVP installed to default location (Program Files (x86)\SVP).
After renaming the .dll, it just says it can't find the required library.
I close SVP and the media players completely when I try them.

Running each of the .dll's through Dependency Walker shows one error:
"Error: At least one required implicit or forwarded dependency was not found."
Module: libiomp5md.dll "Error opening file. "The system cannot find the file specified (2).""

libiomp5md.dll is nowhere on my system.

It seems there are no longer any static OpenMP libraries available for distribution:
https://software.intel.com/en-us/articl … ft-windows
https://software.intel.com/en-us/articl … -libraries

EDIT:
Not that I'm in a real rush to compare anything.
I seem to remember reading, back when I wanted to get into programming (the main thing I wanted to learn was to update some things for the AVX extension), a lot of material (which I mostly don't remember) saying it's not really worth it.
Short/long math is where it really counts?
And in most cases, for AVS users, it doesn't count for us so much.
I remember something to that effect out of the myriad of information I read through.

11 (edited by xenonite 31-08-2015 12:56:29)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Nintendo Maniac 64 wrote:

Also I'm not sure why you're bringing up old CPU architectures like the Pentium 4 in regards to my mention of processor models that are only a year old or so like the Pentium G3258...

Because it's the first in the list of CPUs that SVP is designed for (i.e. SSE2). Also, I'm sorry, that rant wasn't aimed at you at all. I realize that you were only asking about the impact of building AVX2 libraries on 'lower end' CPUs. Luckily, after looking through the code a bit, it seems like SVP is already written to be able to use AVX2 instructions without interfering with the standard SSE2 codepath (at least for the computationally intensive pixel metric calculations). It shouldn't be too difficult to follow the same template for other functions. At worst, the installer can detect your CPU and install the correct library if it were to be compiled for different architectures.

Chainik wrote:

parallelization is cheating
we're already in a heavily multi-threaded environment and we're not interested in single-threaded performance

Yes, I understand that, but Avisynth's multithreading environment seems kind of funky to me (I have run some scaling tests on a 4-core 3770k with the default libs, like the ones you posted a while back with the AMD CPU, and have found the optimum number of Avisynth 'threads' to be 22. Setting threads=8 leads to more than 50% less performance. I'll post my scaling results for Avisynth 2.5.8 and 2.6.0 shortly).

Actually, after reviewing the code a bit, it seems like all the framework is already in place for proper AVX2 support on the assembly level.
Do you know of any work that was, or is being, done to enable AVX2 SAD calculation?
One more question: it seems your code also has 'support' for plain AVX instructions in calculating pixel metrics. Do you have any idea why someone would have put floating-point AVX together with the other integer SSE2 optimizations?

sparktank wrote:

I seem to remember reading when I wanted to get into programming, that main thing I wanted to learn was to update some things for AVX extension, but then read a lot of things (I mostly don't remember) that said it's not really worth it.
Short/long math is where it really counts?
And in most cases for AVS users, it doesn't count for us so much.

Yes, plain AVX instructions only work on floating point data (32/64-bit fractional numbers), while video data is stored as 8-bit unsigned integers (256 numbers from 0 to 255). That means AVX only benefits workflows that require the precision of floating point numbers (which are much slower to work with than integers).
AVX2, on the other hand, does work on 'packed' 8-bit data, so it would provide a very nice speedup over 'legacy' SSE~SSE4.2 instructions.

libiomp5md.dll is nowhere on my system.

it seems there are no more static libraries available for distribution regarding OpenMP.

Ah, I don't know why it did that (I didn't enable the OpenMP language in the project settings, and SVP doesn't contain any OpenMP directives whatsoever), but thank you very much for reporting back.

I'll try to get the compiler to not link against OpenMP and test the resulting libraries by uninstalling all dev kits on the old 3770k-based system and running it there. If I get it to behave, I'll certainly post the new libraries for you to test as well.

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Just to clarify about SSE2 vs AVX and stuff - I'm guessing none of the newer SSE instructions (save for AVX) would be useful?

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

xenonite

it seems your code also has 'support' for plain AVX instructions in calculating pixel metrics. Do you have any idea why someone would have put floating-point AVX together with the other integer SSE2 optimizations?

I prefer to think that x264 guys are extremely experienced with all that stuff big_smile

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Nintendo Maniac 64 wrote:

Just to clarify about SSE2 vs AVX and stuff - I'm guessing none of the newer SSE instructions (save for AVX) would be useful?

Ugh, it seems my reply never actually posted!
Mostly yes, that is correct. Even AVX itself won't help much; it's actually the second revision, AVX2, that contains the required instructions.

SSE3->4.2 can also provide a bit of a speedup, but it would take the same amount of work as implementing AVX2 for about 20% (maybe even less) of the gain.
The FMA instructions, however, do have the potential for some very nice gains in code of the form x = (a+b)*(c+d), which is very common in any video or audio processing filter (indeed, video upscaling and downscaling, as well as audio upsampling, downsampling and equalizing, consist almost entirely of loops of millions of x = a + b*c or x += b*c).
But for SVP, it seems that AVX2 would bring the most benefit (though actually processing the SAD calculations as entire frames and kernel blocks, on the GPU, would be capable of speeding up SVP's main bottleneck by at least an order of magnitude, over and above what AVX2 would be able to do. It's just also about 10x harder to actually implement hmm ).


Chainik wrote:

I prefer to think that x264 guys are extremely experienced with all that stuff

Yes, of course. I was just wondering what those x264 calculations would bring to an integer codepath. It seemed strange for SVP, that's all... Then again, SVP itself comes from the very 'strange' MVTools, so there's that. smile

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

xenonite wrote:
Nintendo Maniac 64 wrote:

Just to clarify about SSE2 vs AVX and stuff - I'm guessing none of the newer SSE instructions (save for AVX) would be useful?

The FMA instructions, however, do have the potential for some very nice gains in code of the form x = (a+b)*(c+d), which is very common in any video or audio processing filter (indeed, video upscaling and downscaling, as well as audio upsampling, down sampling and equalizing, almost entirely consist of loops of millions of x = a + b*c or x+=b*c).
But for SVP, it seems that AVX2 would bring the most benefit (though actually processing the SAD calculations as entire frames and kernel blocks, on the GPU, would be able of speeding up SVP's main bottleneck by at least an order of magnitude, over and above what AVX2 would be able to do. Its just also about 10x harder to actually implement hmm ).

As an AMD user (AMD got on board with FMA years before Intel did; on the Intel side, FMA requires Haswell+), I'd love to see FMA and GPU SAD (especially as that could be used to efficiently detect and delete duplicate frames, further smoothing out video) come into use. I'm not feeling the AVX2 as much, as that has a far smaller "install base" at the moment and I'm not in it sad

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

VB_SVP wrote:

I'd love to see FMA and GPU SAD (especially as that could be used to efficiently detect and delete duplicate frames, further smoothing out video) come into use.

Me too. Even though I have 'invested' in AVX2, I'd much rather leave my CPU 'under-utilized' if that means we can get GPU-based pixel metrics calculation. Heck, without having to do all the pixel metric calculations, the CPU would be much less loaded than it is now, so AVX2 would be less necessary as a result. GPU-based calculations would also open the door to much better image quality (think SSD/SATD instead of SAD).

Actually, FMA would also help to compute such pixel metrics much more efficiently (although not nearly on the same level as GPU acceleration would). The only problem is that the performance required (even with FMA) would probably put it out of practical reach for all AMD owners anyway. Maybe Zen will make it more practical?
FMA also doesn't really help much with standard SAD calculation and would require someone to write the entire assembly functions from scratch (I don't think the x264 codebase includes any FMA pixel metrics), but it would be of great benefit to things such as bicubic frame resizing, gamma correction and colorspace conversion (if and when SVP makes use of such features (actually, SVP already uses frame resizing, but that could maybe be implemented more efficiently in the svpflow2.dll GPU-accelerated part of SVP)).

VB_SVP wrote:

As an AMD user (AMD got on board with FMA years before Intel did)

That really is extremely sad. Even though I have only once owned an AMD CPU (the original FX series), I really hoped that the Bulldozer architecture would encourage developers to change their coding styles towards relying more on specific per-processor optimizations and less on outdated, general instructions sets. Since the 'thin-and-light' craze has capped CPU performance, the only big performance advances have been, and will continue to be, made by developing increasingly specialized and exclusive instruction sets that are optimized for specific classes of problems.
If FMA had caught on, Bulldozer's performance would have exceeded Intel's in all video processing (and gaming post-processing) applications, among others, which would have forced a new 'instruction set arms race' and continued linear performance scaling together with Moore's law.

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

The problem is that Bulldozer's general performance was worse than even a Phenom II x6 in several cases, so considering how much faster the likes of an Intel i7 was, devs saw no reason to purchase such a processor and to develop for it.

I believe a similar thing is going on with HSA currently - devs see no reason to purchase a Kaveri APU when it'll be sub-par at everything else that they do, so they don't even bother.  In this case we probably won't see more widespread usage of HSA until we have Zen APUs, and that's assuming that Jim Keller is as good as he has been in the past (a game console or server using an HSA-enabled APU would also very likely take advantage of HSA, but that's beside the point here).

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Nintendo Maniac 64 wrote:

The problem is that Bulldozer's general performance was worse than even a Phenom II x6 in several cases, so considering how much faster the likes of an Intel i7 was, devs saw no reason to purchase such a processor and to develop for it.

I believe a similar thing is going on with HSA currently - devs see no reason to purchase a Kaveri APU when it'll be sub-par at everything else that they do, so they don't even bother.  In this case we probably won't see more widespread usage of HSA until we have Zen APUs, and that's assuming that Jim Keller is as good as he has been in the past (a game console or server using an HSA-enabled APU would also very likely take advantage of HSA, but that's beside the point here).

There was always a glimmer of hope for Bulldozer when it, in multithreaded tasks, was shown punching up against i7s (in video encoding it still beats any Intel processor that I could have bought for the same price).  I've long heard it said in the tech sector that had developers embraced FMA then it would have done much to alleviate Bulldozer's somewhat lacking floating-point performance but we might never know.  In a sense, Bulldozer is AMD's "Itanium".  IMO the problem is that AMD expected devs to use new instruction sets and HSA but didn't really provide them with the tools to do so; this thread's title is a sad reminder of that lol.

Though it's been reported that console developers are finally, after several years of having the functionality just sitting there, starting to use HSA and seeing gains of "30% GPU performance".

But the biggest problem with all of this is availability.  Nvidia's GPUs aren't much for this sort of task and they have 81% of the GPU market, and Piledriver+/Haswell+ CPUs still aren't all that ubiquitous.

19 (edited by Nintendo Maniac 64 02-09-2015 06:31:13)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

81% of the discrete GPU market; for the total consumer GPU market one must also consider integrated graphics.  Steam's hardware survey is probably the best-case scenario for Nvidia outside of professional markets, yet even then Nvidia "only" has 52% market share while Intel has 20% and AMD has 27%:
http://store.steampowered.com/hwsurvey/


Oh, and I think you posted the wrong link - Async Shaders are something quite different from HSA.

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

Nintendo Maniac 64 wrote:

81% of the discrete GPU market; for the total consumer GPU market one must also consider integrated graphics.  Steam's hardware survey is probably the best-case scenario for Nvidia outside of professional markets, yet even then Nvidia "only" has 52% market share while Intel has 20% and AMD has 27%:
http://store.steampowered.com/hwsurvey/


Oh, and I think you posted the wrong link - Async Shaders are something quite different from HSA.

Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU's performance (and it is functionality that Nvidia GPUs don't have). 

With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or "pull an Nvidia" and emulate it, which lacks the performance.  Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012.  Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs. 

I'm not sure how feasible it is for SVP to scan for and eliminate duplicate frames, but now that GPUs, at least GCN 1.1+ GPUs, have hardware-accelerated SAD functionality I'd really like to see it happen, as duplicate frames are becoming a serious problem nowadays in animation (not even SVP can save a 24 FPS video that has every frame triplicated, so it's effectively 8 FPS).

21 (edited by xenonite 03-09-2015 02:06:43)

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

VB_SVP wrote:

Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU's performance (and it is functionality that Nvidia GPUs don't have). With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or "pull an Nvidia" and emulate it, which lacks the performance.  Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012.  Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs.

Firstly, I would just like to clear up any confusion regarding Nvidia's support for asynchronous computation, see here: Nvidia supports Async Compute

I still think this whole iGPU/APU thing is a very bad idea (as long as people keep demanding they be low power as well). Think about it: a CPU has to process all the sequential program instructions (even modern software like new game engines that fully utilize 4 cores does so with four threads doing a lot of sequential computation), and you want it to be very good at that sort of thing to be able to keep up with modern GPUs (especially since transistor geometry scaling still brings huge performance gains to GPUs).
GPUs, on the other hand, aren't very good at dealing with things like logic expressions, branching ('if' , 'else') and nested 'for' loops. To use your example of SVP with duplicate frame detection by using the iGPU, the basic program flow of the most computationally intensive subroutine would look something like this:

>Get the current frame, the next frame, and their representations as a plane of blocks.

>Now get the 'quality' of the current-to-next and the next-to-current motion vectors by running a convolution with each one of the current frame's blocks over the regions of the next frame that correspond to the center location of the current block being tested + the requested motion-vector range to be searched, in every direction. Do the same with the next frame's blocks convolving over the current frame to get the 'reverse' motion-vectors.

>Select the 'best' motion-vector for each block, for both the next-to-current and the current-to-next frames.

>Interpolate a frame somewhere between the current and next frames by 'moving' one of the following blocks by an amount and direction dictated by a function of how good the motion-vector for the next-to-current and the current-to-next frames are:
#If there is occluded motion, use only the block from frame with the unoccluded pixels;
#If one block's motion-vector is much 'better' than the other frame's corresponding block's motion-vector, only use the 'good' one;
#If both motion-vectors are of a similar quality, compose the new frame of a weighted combination of both, where the weight takes into account how 'good' the motion-vector is, as well as how close (temporally) the frame that the block is being taken from is to the interpolated frame (both frames will be equally close for simple frame doubling).
(In reality, you should loop through each pixel of the 'new' frame that sits between the current and the next frames, and compute the interpolated value as a function of all the motion-vectors that can reach that pixel from either frame. If you only compare the two corresponding blocks from each frame and then move the best one to the position in the interpolated frame dictated by its motion-vector, you may get regions of pixels without any value, i.e. 'holes' in your interpolated frame.)
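The motion-vector search at the heart of the steps above can be sketched in plain Python. This is purely illustrative: the block size, search range and SAD (sum of absolute differences) cost are my own toy choices, and the real routines in svpflow1.dll are hand-written SSE assembly.

```python
# Toy exhaustive block-matching search: for one block of the current
# frame, try every candidate motion vector within the search range and
# keep the one with the lowest SAD cost.

def sad(cur, nxt, ax, ay, bx, by, size):
    """SAD between the size x size block at (ax, ay) in `cur`
    and the block at (bx, by) in `nxt`."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[ay + dy][ax + dx] - nxt[by + dy][bx + dx])
    return total

def best_motion_vector(cur, nxt, bx, by, size, search):
    """Exhaustive search for the best (vx, vy) within +/- `search` pixels."""
    h, w = len(nxt), len(nxt[0])
    best, best_cost = (0, 0), sad(cur, nxt, bx, by, bx, by, size)
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            tx, ty = bx + vx, by + vy
            if 0 <= tx and tx + size <= w and 0 <= ty and ty + size <= h:
                cost = sad(cur, nxt, bx, by, tx, ty, size)
                if cost < best_cost:
                    best, best_cost = (vx, vy), cost
    return best, best_cost

# A 4x4 'next' frame that is the 'current' frame shifted right by one
# pixel should yield a (1, 0) vector with zero cost:
cur = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
nxt = [[1, 1, 2, 3], [5, 5, 6, 7], [9, 9, 10, 11], [13, 13, 14, 15]]
mv, cost = best_motion_vector(cur, nxt, 0, 0, 2, 1)  # mv == (1, 0), cost == 0
```

Note how even this simplified version is dominated by nested loops and a data-dependent comparison inside the innermost loop, which is exactly the kind of control flow the rest of this post argues GPUs dislike.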

Now, you would like to check for duplicate frames, where a large majority of the motion-vectors are simply (0,0). That check is only possible near the end of the algorithm (once you know the 'best' motion-vectors for each direction). You would then 'discard' the next frame and redo the calculation with the current frame and the frame after the 'discarded' one (unless the number of repeated frames stays constant throughout the video, in which case you can simply use something like one or more Avisynth calls to SelectEven() or SelectOdd() before the video reaches SVP's interpolation functions). This loop continues until you find a next frame with some useful motion between it and the current frame. If you then interpolate the number of discarded frames multiplied by your interpolation factor, 'jump over' those 'discarded' frames, and continue your interpolation from the 'next' frame you used for the calculation, you end up with the same number of frames as if you had interpolated across all the duplicates. The computational cost also comes to about the same, but it requires extra logical comparisons to decide whether or not to drop each frame, and those checks also have to run for frames that actually do contain motion, further decreasing performance. Now, this wouldn't be too bad on a CPU, but here is where the problem with APU processing comes in:
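That duplicate-skipping loop can be sketched minimally in Python. The 95% zero-vector threshold and the data layout are my own illustrative assumptions, not anything from SVP's actual source:

```python
def frame_is_duplicate(vectors, zero_fraction=0.95):
    """Call the next frame a duplicate when nearly all of its block
    motion-vectors came out as (0, 0)."""
    zeros = sum(1 for v in vectors if v == (0, 0))
    return zeros / len(vectors) >= zero_fraction

def next_distinct_frame(vector_fields, start):
    """Walk forward from frame `start`, skipping frames whose vector
    field marks them as duplicates; return (index, frames_skipped).
    The caller would then interpolate (frames_skipped + 1) * factor
    frames before resuming from the returned index."""
    i = start + 1
    skipped = 0
    while i < len(vector_fields) and frame_is_duplicate(vector_fields[i]):
        skipped += 1
        i += 1
    return i, skipped

# Frame 1 is a duplicate of frame 0; frame 2 finally contains motion:
fields = [[(0, 0)] * 10, [(0, 0)] * 10, [(1, 0)] * 10]
idx, skipped = next_distinct_frame(fields, 0)  # idx == 2, skipped == 1
```

The `while` condition and the per-frame branch are the extra logical comparisons mentioned above: cheap on a CPU, but awkward to express efficiently on a GPU.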

GPUs are very good at doing things like massive matrix calculations, image convolutions, interpolated frame composition, etc., but they slow down immensely if you don't frame the problem as a matrix computation. Basically, let's say you compute the absolute difference between two frames, pixel by pixel (as an unsigned value at least 16 bits in size), via two methods.

First method: you do everything on the GPU via two loops (the outer one over, say, the columns and the nested one over the rows), and inside the inner (nested) loop you calculate:
a = pixel1 - pixel2; if (a < 0) then return -a, else return a;

Second method: you only do the following on the GPU: a1 = frame1 - frame2; a2 = frame2 - frame1; (each done as one massive SIMD command).
Thereafter, on the CPU, you loop through every pixel of a1 via two 'for' loops (just like in the GPU method above) and for each pixel you do the following:
if (pixel_a1 < 0) then pixel_a1 = pixel_a2;

If done on a discrete, separate CPU and GPU, the speedup of method 2 over method 1 can be anywhere from 100% to 1000%. Newer GPU compute architectures (and especially NVIDIA's CUDA compiler) are getting better and better at reducing this difference, but the core principle remains the same.
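The two methods can be contrasted in plain Python (both obviously run on the CPU here; the point is the code structure — a branchy per-pixel loop versus two branch-free whole-frame passes plus a cheap select):

```python
def absdiff_branchy(f1, f2):
    """Method 1: a per-pixel loop with a data-dependent branch --
    exactly the kind of control flow GPUs handle poorly."""
    out = []
    for p1, p2 in zip(f1, f2):
        a = p1 - p2
        out.append(-a if a < 0 else a)
    return out

def absdiff_branchless(f1, f2):
    """Method 2: two whole-frame subtractions (each would map onto one
    massive SIMD/GPU operation), then an elementwise select that can
    be left to the CPU."""
    a1 = [p1 - p2 for p1, p2 in zip(f1, f2)]  # frame1 - frame2
    a2 = [p2 - p1 for p1, p2 in zip(f1, f2)]  # frame2 - frame1
    return [y if x < 0 else x for x, y in zip(a1, a2)]

frame1, frame2 = [10, 3, 7], [4, 9, 7]
# Both produce [6, 6, 0]; only their structure differs.
```

In pure Python the two are equally slow, but on real GPU hardware the whole-frame subtractions in method 2 vectorize trivially, while the per-pixel branch in method 1 forces divergent execution.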

If, however, you have an iGPU, then porting some calculations over to it from the CPU may or may not bring a performance advantage. Since you are stuck with OpenCL for iGPU programming, you either need to have practically been on the design team of the GPU you are programming (to be able to transform problematic code into the most optimal form for that iGPU), or you need to avoid such code entirely (as method 2 does).
{Yes, I know, a lot of people try desperately to justify their purchases by saying things like "OpenCL is just as good as CUDA for X, but supports everything, so it's better". Well, technically yes, you can do almost everything in OpenCL that you can do in CUDA; the problem is the difficulty of doing so. GPU programming is DIFFICULT, very difficult, which is the main reason we don't see apps utilizing the enormous computational potential inside these chips more often.
Trying to get the same performance from the same NVIDIA GPU (or an AMD one with equivalent theoretical computational throughput) with OpenCL as you got from a CUDA program makes a difficult task so much harder that almost no one manages to reach the same ultimate performance in a real-world application.}

Finally, even with optimal code, an iGPU has to split a very limited power budget with the CPU. Compared to the same CPU and GPU as discrete products, even an optimally coded 50-50 algorithm (one that needs exactly equal amounts of CPU and GPU power) will take anywhere from 50% to even 100% longer on an integrated APU comprised of exactly the same processors. However, if you are against even mildly overclocking your setup (for some strange reason some people, who are not even electrical engineers themselves, seem to think that a moderate overclock appreciably reduces the lifespan of your components), then the difference will of course be much smaller. You can even go further and underclock the discrete setup to match the integrated one's performance, but that only further shows how power-limited APUs are.

It's also not just limited power that constrains performance. Having to share the CPU's memory means the iGPU has to live with a minuscule fraction of a similar dGPU's memory bandwidth. On-package Crystalwell-style L4 caches can help in this regard, but they are so tiny compared to discrete GPU memory that doing any sort of video processing through that cache alone is completely impossible.

APUs are a nice idea, don't get me wrong. It's easy to think of an iGPU as something like an enhanced instruction set, such as AVX or FMA, that can massively improve a CPU's performance in certain tasks. But in practice, having the CPU keep the iGPU on constant life support (with woeful GPU memory bandwidth, and power-gating killing entire blocks every few milliseconds) just seems to negate all the advantages of having it so close to the CPU. If, however, the choice is between two otherwise similar processors, where one has an iGPU and the other doesn't, things look rather different.
Also, I really don't get this reluctance to develop for a 'closed' architecture. No matter what percentage of the market a certain GPU has, why not develop your application for the one that can actually deliver the required performance? If people really want to use a certain application, they will buy the required hardware. This is how it works in industry, and also in the consumer market for most non-software things (say you're following a recipe that calls for baking a cake in an oven, but you don't have an oven. Is it really necessary to rewrite the recipe to obtain at least something resembling a cake using another, more readily available, piece of hardware? A hot stone, perhaps?).

I just think that this pathological fear of 'vendor lock-in' is unnecessarily limiting the applications, and the features within applications, that are available to us. I mean, how many programs are there that can, for instance, do what you want (i.e. interpolate a video stream containing many duplicate frames of varying lengths)?

Thinking about this another way paints iGPUs in a much better light. What about systems that have a discrete GPU, but where the CPU just happens to also have an iGPU? With the massive transistor budgets available these days, simply integrating an iGPU onto the CPU die shouldn't sacrifice much potential performance (if any).
Now we have a scenario where people have capable CPUs and discrete GPUs, but where coding in OpenCL lets developers tap a second, integrated GPU when and where it would be most beneficial to do so.
Through OpenCL, an AMD iGPU and an NVIDIA dGPU can even work simultaneously on the exact same problem, with each GPU doing the work best suited to its compute architecture!
The CPU's available power budget can also, theoretically, be utilized much more completely and effectively by interleaving CPU and GPU commands (for example, doing some iGPU work while the CPU waits for data to come in from the dGPU). If the iGPU's wake-up and sleep sequences are fast enough, it could even be used in place of SIMD instructions like SSE, AVX and FMA. The potential benefits in this scenario are simply unimaginable.
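The interleaving idea can be sketched conceptually in Python. The "dGPU" and "iGPU" tasks below are hypothetical stand-ins (plain functions on CPU threads), just to show the shape of launching a long-latency job asynchronously and filling the wait with other useful work:

```python
from concurrent.futures import ThreadPoolExecutor

def dgpu_job(x):
    """Hypothetical stand-in for a slow dGPU kernel plus readback."""
    return x * 2

def igpu_or_simd_work(xs):
    """Hypothetical stand-in for iGPU/AVX work done in the meantime."""
    return sum(xs)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(dgpu_job, 21)          # launch, return immediately
    partial = igpu_or_simd_work([1, 2, 3])   # useful work while waiting
    result = fut.result() + partial          # join when the dGPU data arrives
# result == 48
```

In a real OpenCL program the same overlap comes from enqueueing kernels and transfers on separate command queues and only blocking on the events when the results are actually needed.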

Maybe I'm just a bit of a pessimist, but it doesn't seem like we will see much use of this second HSA usage scenario. I just don't get why, though. Maybe the lifetime of a GPU compute architecture is just too short for developers to spend the time optimizing an OpenCL workflow for that specific processor?
Or maybe most people still aren't even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?

Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll

That Reddit source doesn't disprove the assertion that Nvidia can't perform true async compute. Given its current "closing statement", I believe it misunderstands what "async compute" means in the current context. I'd trust AMD staking its reputation on such a claim over some newbie dev cobbling together a most dubious test (even that thread admits the test is dubious). Besides, Nvidia's whole MO is to highly optimize its drivers on a per-game basis for DX11, and it wasn't betting on DX12 being widely adopted, so why would it have baked real support for it into its GPUs?

CUDA's supposed advantages are rapidly becoming irrelevant: Apple has thrown it out of its newest MacBooks and replaced it with modern(ish) OpenCL-capable hardware instead, something that really needed to happen (and should have happened years ago, TBH).

Or maybe most people still aren't even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?

I believe it is the latter, combined with the lack of convenient tools put out by the hardware makers (basically just AMD) to facilitate HSA. Consoles have had HSA capabilities for years now, and devs are only now getting around to using the GPU as a "coprocessor"; it's unclear if they've finally started to embrace multithreaded rendering on those machines (which they need, and were designed for).