<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[SmoothVideo Project — Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
		<link>https://www.svp-team.com/forum/viewtopic.php?id=2752</link>
		<atom:link href="https://www.svp-team.com/forum/extern.php?action=feed&amp;tid=2752&amp;type=rss" rel="self" type="application/rss+xml" />
		<description><![CDATA[The most recent posts in Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll.]]></description>
		<lastBuildDate>Thu, 03 Sep 2015 05:47:40 +0000</lastBuildDate>
		<generator>PunBB</generator>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53135#p53135</link>
			<description><![CDATA[<p>That Reddit source doesn&#039;t disprove the assertion that Nvidia can&#039;t perform true async compute.&nbsp; Given its current &quot;closing statement&quot;, I believe it is misunderstanding what &quot;async compute&quot; means in the current context.&nbsp; I&#039;d trust AMD staking its rep on such a claim over some newbie dev cobbling together a most dubious test (even that thread admits the test is dubious).&nbsp; Besides, Nvidia&#039;s whole MO is to highly optimize its drivers on a per-game basis for DX11, and it wasn&#039;t betting on DX12 being widely adopted, so why would they have baked real support for it into their GPUs?</p><p>CUDA&#039;s supposed advantages are rapidly becoming irrelevant because Apple has thrown it out and replaced it with modern(ish) OpenCL-capable hardware in its newest MacBooks, something that really needed to happen (and should have happened years ago TBH).</p><div class="quotebox"><blockquote><p>Or maybe most people still aren&#039;t even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?</p></blockquote></div><p>I believe it is the latter, combined with the lack of convenient tools put out by the hardware makers (basically just AMD) to facilitate HSA.&nbsp; Consoles have had HSA capabilities for years now and devs are only now getting around to using the GPU as a &quot;coprocessor&quot;, and it&#039;s unclear if they&#039;ve finally started to embrace multithreaded rendering on those machines (which they need and were designed for).</p>]]></description>
			<author><![CDATA[null@example.com (VB_SVP)]]></author>
			<pubDate>Thu, 03 Sep 2015 05:47:40 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53135#p53135</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53130#p53130</link>
			<description><![CDATA[<div class="quotebox"><cite>VB_SVP wrote:</cite><blockquote><p>Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU&#039;s performance (and it is functionality that Nvidia GPUs don&#039;t have). With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or &quot;pull an Nvidia&quot; and emulate it, which lacks the performance.&nbsp; Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012.&nbsp; Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs.</p></blockquote></div><p>Firstly, I would just like to clear up any confusion regarding Nvidia&#039;s support for asynchronous computation; see here: <a href="https://www.reddit.com/r/nvidia/comments/3j5e9b/analysis_async_compute_is_it_true_nvidia_cant_do/">Nvidia supports Async Compute</a></p><p>I still think this whole iGPU/APU thing is a very bad idea (as long as people keep demanding they be low power as well). Think about it: a CPU has to process all the sequential program instructions (even modern software like new game engines that fully utilize four cores does so with four threads performing a lot of sequential computation), and you want it to be very good at that sort of thing to be able to keep up with modern GPUs (especially since transistor geometry scaling still brings huge performance gains to GPUs).<br />GPUs, on the other hand, aren&#039;t very good at dealing with things like logic expressions, branching (&#039;if&#039;, &#039;else&#039;) and nested &#039;for&#039; loops. 
To use your example of SVP with duplicate frame detection by using the iGPU, the basic program flow of the most computationally intensive subroutine would look something like this:</p><p>&gt;Get the current frame, the next frame, and their representations as a plane of blocks.</p><p>&gt;Now get the &#039;quality&#039; of the current-to-next and the next-to-current motion vectors by running a convolution with each one of the current frame&#039;s blocks over the regions of the next frame that correspond to the center location of the current block being tested + the requested motion-vector range to be searched, in every direction. Do the same with the next frame&#039;s blocks convolving over the current frame to get the &#039;reverse&#039; motion-vectors.</p><p>&gt;Select the &#039;best&#039; motion-vector for each block, for both the next-to-current and the current-to-next frames. </p><p>&gt;Interpolate a frame somewhere between the current and next frames by &#039;moving&#039; one of the following blocks by an amount and direction dictated by a function of how good the motion-vectors for the next-to-current and the current-to-next frames are: <br />#If there is occluded motion, use only the block from the frame with the unoccluded pixels; <br />#If one block&#039;s motion-vector is much &#039;better&#039; than the other frame&#039;s corresponding block&#039;s motion-vector, only use the &#039;good&#039; one;<br />#If both motion-vectors are of a similar quality, compose the new frame of a weighted combination of both, where the weight takes into account how &#039;good&#039; the motion-vector is, as well as how close (temporally) the frame that the block is being taken from is to the interpolated frame (both frames will be equally close for simple frame doubling). 
<br />(In reality, you should loop through each pixel of the &#039;new&#039; frame that sits between the current and the next frames, and compute the interpolated value as a function of all the motion vectors that can reach that pixel from both frames. If you only compare the two corresponding blocks from each frame and then move the best one to the position in the interpolated frame that is dictated by its motion-vector, then you may get regions of pixels without any value, i.e. &#039;holes&#039; in your interpolated frame.)</p><p>Now, you would like to check for a duplicate frame, which would have a large majority of the motion-vectors simply being (0,0). That would only be possible near the end of the algorithm (when you know the &#039;best&#039; motion-vectors for each direction), where you would then choose to &#039;discard&#039; the next frame and redo the calculation with the current frame and the frame after the &#039;discarded&#039; one (except if the number of repeated frames remains constant throughout the video, in which case you can simply use something like one or more Avisynth calls to SelectEven() or SelectOdd(), before the video gets to SVP&#039;s interpolation functions). This loop will continue until you find a next frame with some useful motion between it and the current frame. If you then simply interpolate the number of discarded frames multiplied by your interpolation factor, then &#039;jump over&#039; those &#039;discarded&#039; frames and continue your interpolation from the &#039;next&#039; frame you used for the calculation, then you should have the same number of frames as if you had done the interpolation for all those duplicate frames. The computational cost would also come to about the same, but it would require some more logical comparisons to check whether or not you need to drop the frame. Those checks also have to be done for all frames that actually <strong>do</strong> contain motion, further decreasing performance. 
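A minimal sketch of that skip-and-interpolate loop, in plain Python (illustrative only: frame_sad and interpolate are simplified stand-ins I made up for the real motion-compensated pipeline, and frames are flat lists of pixel values):

```python
# Hypothetical sketch of the duplicate-frame skip loop described above.
# A frame is treated as a "duplicate" when its SAD against the current
# frame is at or below a threshold (a stand-in for "most motion-vectors
# are (0,0)").

def frame_sad(a, b):
    """Sum of absolute differences between two equally sized frames."""
    return sum(abs(p - q) for p, q in zip(a, b))

def interpolate(cur, nxt, t):
    """Placeholder interpolator: linear blend at temporal position t."""
    return [round((1 - t) * p + t * q) for p, q in zip(cur, nxt)]

def interpolate_with_dupe_skip(frames, factor, sad_threshold=0):
    out = []
    i = 0
    while i < len(frames) - 1:
        cur = frames[i]
        j = i + 1
        # Skip over duplicates: keep advancing while the candidate
        # "next" frame shows no motion relative to the current one.
        while j < len(frames) - 1 and frame_sad(cur, frames[j]) <= sad_threshold:
            j += 1
        nxt = frames[j]
        # Emit (j - i) * factor frames to cover the skipped duplicates,
        # spreading them evenly between cur and nxt.
        steps = (j - i) * factor
        for k in range(steps):
            out.append(interpolate(cur, nxt, k / steps))
        i = j
    out.append(frames[-1])
    return out
```

Note that the output length is factor*(n-1)+1 whether or not duplicates were skipped, which is the "same number of frames" property described above.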
Now, this wouldn&#039;t be too bad on a CPU, but here is where the problem with APU processing comes in:</p><p>GPUs are <strong>very</strong> good at doing things like massive matrix calculations, image convolutions, interpolated frame composition, etc., but slow down <strong>immensely</strong> if you don&#039;t frame the problem as a matrix computation. Basically, let&#039;s say you compute the absolute difference between two frames, as an unsigned value at least 16-bits in size, pixel by pixel, via two methods.</p><p>First method: You do everything on the GPU via two loops (the outer one for, say, the columns and the nested one for the rows) and inside the inner (nested) loop you calculate:<br />a=pixel1-pixel2; if(a&lt;0) then return -a, else return a;</p><p>Second method: You only do the following on the GPU: a1 = frame1-frame2; a2 = frame2-frame1; (each done as one massive SIMD command).<br />Thereafter, on the CPU, you loop through every pixel in a1 via two &#039;for&#039; loops (just like in the above method for the GPU) and for each pixel you do the following:<br />if(pixel_a1&lt;0) then pixel_a1=pixel_a2;</p><p>If done on a discrete, separate CPU and GPU, the speedup of method 2 over method 1 can be anywhere from 100% to 1000%. Newer GPU compute architectures (and especially NVIDIA&#039;s CUDA compiler) are getting better and better at reducing this difference, but the core principle remains the same.</p><p>If, however, you have an iGPU, then porting some calculations over to it from the CPU <strong>may or may not</strong> bring a performance advantage. 
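The two methods can be sketched in plain Python standing in for the GPU and CPU kernels (illustrative only; abs_diff_branchy and abs_diff_twopass are made-up names, and frames are lists of rows):

```python
# Illustrative sketch of the two approaches described above.

def abs_diff_branchy(frame1, frame2):
    """Method 1: branch on every pixel inside the inner loop -- the
    divergent control flow that GPUs handle poorly."""
    rows, cols = len(frame1), len(frame1[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            a = frame1[r][c] - frame2[r][c]
            out[r][c] = -a if a < 0 else a
    return out

def abs_diff_twopass(frame1, frame2):
    """Method 2: two branch-free whole-frame subtractions (the
    "GPU" part, one SIMD-style pass each), then a cheap per-pixel
    selection pass (the "CPU" part)."""
    rows, cols = len(frame1), len(frame1[0])
    # "GPU" part: a1 = frame1 - frame2, a2 = frame2 - frame1
    a1 = [[frame1[r][c] - frame2[r][c] for c in range(cols)] for r in range(rows)]
    a2 = [[frame2[r][c] - frame1[r][c] for c in range(cols)] for r in range(rows)]
    # "CPU" part: pick the non-negative difference for each pixel
    for r in range(rows):
        for c in range(cols):
            if a1[r][c] < 0:
                a1[r][c] = a2[r][c]
    return a1
```

Both produce the same result; the point is that method 2 keeps all branching off the GPU.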
Since you are stuck with OpenCL for iGPU programming, you either need to have basically been on the design team for the GPU you are programming (to be able to transform problematic code into the most optimal form for the iGPU you are targeting), or you need to avoid such code entirely (such as in method 2).<br />{Yes, I know, a lot of people try desperately to justify their purchases by saying things like &quot;OpenCL is just as good as CUDA for X, but supports everything so it&#039;s better&quot;. Well, technically yes, you can do almost everything in OpenCL that you can do in CUDA; the problem is the difficulty of doing so. GPU programming is <strong>DIFFICULT</strong>, it&#039;s <strong>very</strong> difficult, which is the main reason why we don&#039;t see apps utilizing the enormous potential computational performance inside them more often.<br />Trying to get the same performance from the same NVIDIA GPU (or an AMD one with an equivalent theoretical computational throughput) with OpenCL that you got from a CUDA program is just making a difficult task so much harder that almost no one manages to get the same ultimate performance in a real-world application.}</p><p>Finally, even with optimal code, an iGPU has to split its <strong>very</strong> limited power with the CPU. Compared to the same CPU and GPU as discrete products, even an optimally coded 50-50 algorithm (one that needs an exactly equal amount of CPU and GPU power) will take anywhere from 50% to even 100% longer to run on an integrated APU composed of exactly the same processors. However, if you are against even mildly overclocking your setup (for some strange reason some people, who are not even electrical engineers themselves, seem to think that a moderate overclock reduces the lifespan of your components by an appreciable amount), then the difference will of course be much smaller. 
You can even go further and <strong>underclock</strong> the discrete setup to achieve the same performance as the integrated setup, but that only serves to further show how power limited APUs are.</p><p>It&#039;s also not just limited power that constrains performance. Having to share the CPU&#039;s memory means the iGPU has to live with a minuscule portion of a similar dGPU&#039;s memory bandwidth. On-chip Crystal Well-style L4 memory caches can help in this regard, but they are so tiny in comparison to discrete GPU memory that doing any sort of video processing through only that cache is completely impossible.</p><p>APUs are a nice idea, don&#039;t get me wrong. It&#039;s easy to think of an iGPU as something like an enhanced instruction set, such as AVX or FMA, that can serve to massively improve upon a CPU&#039;s performance in certain tasks. But in practice, it just seems like having the CPU keep the iGPU on constant life support (with woeful GPU memory bandwidth and power-gating killing entire blocks every few milliseconds) negates all of the advantages of having it so close to the CPU. If, however, the choice is between two otherwise similar processors, one with an iGPU and one without, the picture changes.<br />Also, I really don&#039;t get this reluctance to develop for a &#039;closed&#039; architecture. No matter what percentage of the market a certain GPU has, why not develop your application for the one that can actually deliver the required performance? If people really want to use a certain application, then they will buy the required hardware. This is how it works in the industry, and also in the consumer market for most non-software things (say you&#039;re following a cooking recipe, which calls for baking a cake in an oven, but you don&#039;t have an oven. Is it really necessary to rewrite the recipe to obtain at least something resembling a cake by using another, more readily available, piece of hardware? 
A hot stone, perhaps?).</p><p>I just think that this pathological fear of &#039;vendor lock-in&#039; is unnecessarily limiting the applications, and features in applications, that are available to us. I mean, how many programs are there that can, for instance, do what you want (i.e. interpolate a video stream containing many duplicate frames of varying lengths)?</p><p>Thinking about this another way paints iGPUs in a <strong>much</strong> better light. What about systems that have a discrete GPU, but where the CPU also happens to have an iGPU? With the massive transistor budgets available these days, simply integrating an iGPU onto the CPU die shouldn&#039;t sacrifice too much potential performance (if any).<br />Now we have a scenario where people have capable CPUs and discrete GPUs, but where coding in OpenCL allows developers to access a second, integrated iGPU when and where it would be most beneficial to do so.<br />Through OpenCL programming, an AMD iGPU and an NVIDIA dGPU can even be working simultaneously on the exact same problem, with each GPU doing work that is most suited to its compute architecture! <br />The CPU&#039;s available power budget can also, theoretically, be utilized much more completely and effectively by interleaving CPU and GPU commands (such as doing some GPU work while the CPU waits for data to come in from the dGPU, for example). If the iGPU&#039;s wake-up and sleep sequences are fast enough, it can even be used <strong>in place of</strong> SIMD instructions like SSE, AVX and FMA. The potential benefits in this scenario are simply unimaginable.</p><p>Maybe I&#039;m just a bit of a pessimist, but it doesn&#039;t seem like we will be seeing much use of this second usage scenario of HSA. I just don&#039;t get why, though. 
Maybe the lifetime of a GPU compute architecture is just too short for developers to spend the time on optimizing an OpenCL workflow for that specific processor?<br />Or maybe most people still aren&#039;t even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?</p>]]></description>
			<author><![CDATA[null@example.com (xenonite)]]></author>
			<pubDate>Thu, 03 Sep 2015 02:06:02 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53130#p53130</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53129#p53129</link>
			<description><![CDATA[<div class="quotebox"><cite>Nintendo Maniac 64 wrote:</cite><blockquote><p>81% of the <em>discrete</em> GPU market; for the total consumer GPU market one must also consider integrated graphics.&nbsp; Steam&#039;s hardware survey is probably the best-case scenario for Nvidia outside of professional markets, yet even then Nvidia &quot;only&quot; has 52% market share while Intel has 20% and AMD has 27%:<br /><a href="http://store.steampowered.com/hwsurvey/">http://store.steampowered.com/hwsurvey/</a></p><br /><p>Oh, and I think you posted the wrong link - Async Shaders are something quite different from HSA.</p></blockquote></div><p>Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU&#039;s performance (and it is functionality that Nvidia GPUs don&#039;t have).&nbsp; </p><p>With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or &quot;pull an Nvidia&quot; and emulate it, which lacks the performance.&nbsp; Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012.&nbsp; Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs.&nbsp; </p><p>I&#039;m not sure how feasible it is for SVP to scan for and eliminate duplicate frames, but now that GPUs, at least GCN 1.1+ GPUs, have hardware-accelerated SAD functionality I&#039;d really like to see it happen, as duplicate frames are becoming a serious problem nowadays in animation (not even SVP can save a 24 FPS video that has every frame triplicated so it&#039;s effectively 8 FPS).</p>]]></description>
			<author><![CDATA[null@example.com (VB_SVP)]]></author>
			<pubDate>Wed, 02 Sep 2015 21:42:04 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53129#p53129</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53120#p53120</link>
			<description><![CDATA[<p>81% of the <em>discrete</em> GPU market; for the total consumer GPU market one must also consider integrated graphics.&nbsp; Steam&#039;s hardware survey is probably the best-case scenario for Nvidia outside of professional markets, yet even then Nvidia &quot;only&quot; has 52% market share while Intel has 20% and AMD has 27%:<br /><a href="http://store.steampowered.com/hwsurvey/">http://store.steampowered.com/hwsurvey/</a></p><br /><p>Oh, and I think you posted the wrong link - Async Shaders are something quite different from HSA.</p>]]></description>
			<author><![CDATA[null@example.com (Nintendo Maniac 64)]]></author>
			<pubDate>Wed, 02 Sep 2015 06:18:18 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53120#p53120</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53119#p53119</link>
			<description><![CDATA[<div class="quotebox"><cite>Nintendo Maniac 64 wrote:</cite><blockquote><p>The problem is that Bulldozer&#039;s general performance was worse than even a Phenom II x6 in several cases, so considering how much faster the likes of an Intel i7 was, devs saw no reason to purchase such a processor and to develop for it.</p><p>I believe a similar thing is going on with HSA currently - devs see no reason to purchase a Kaveri APU when it&#039;ll be sub-par at everything else that they do, so they don&#039;t even bother.&nbsp; In this case we probably won&#039;t see more widespread usage of HSA until we have Zen APUs, and that&#039;s assuming that Jim Keller is as good as he has been in the past (a game console or server using an HSA-enabled APU would also very likely take advantage of HSA, but that&#039;s beside the point here).</p></blockquote></div><p>There was always a glimmer of hope for Bulldozer when it, in multithreaded tasks, was shown punching up against i7s (in video encoding it still beats any Intel processor that I could have bought for the same price).&nbsp; I&#039;ve long heard it said in the tech sector that had developers embraced FMA then it would have done much to alleviate Bulldozer&#039;s somewhat lacking floating-point performance but we might never know.&nbsp; In a sense, Bulldozer is AMD&#039;s &quot;Itanium&quot;.&nbsp; IMO the problem is that AMD expected devs to use new instruction sets and HSA but didn&#039;t really provide them with the tools to do so; this thread&#039;s title is a sad reminder of that lol.</p><p>Though it&#039;s been <a href="http://wccftech.com/async-shaders-give-amd-big-advantage-dx12-performance-oxide-games/">reported</a> that console developers are finally, after several years of having the functionality just sitting there, starting to use HSA and seeing gains of &quot;30% GPU performance&quot;.</p><p>But the biggest problem with all of this is availability.&nbsp; Nvidia&#039;s GPUs aren&#039;t much 
use for this sort of task, they have 81% of the GPU market, and Piledriver+/Haswell+ CPUs aren&#039;t all that ubiquitous yet.</p>]]></description>
			<author><![CDATA[null@example.com (VB_SVP)]]></author>
			<pubDate>Wed, 02 Sep 2015 04:21:04 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53119#p53119</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53112#p53112</link>
			<description><![CDATA[<p>The problem is that Bulldozer&#039;s general performance was worse than even a Phenom II x6 in several cases, so considering how much faster the likes of an Intel i7 was, devs saw no reason to purchase such a processor and to develop for it.</p><p>I believe a similar thing is going on with HSA currently - devs see no reason to purchase a Kaveri APU when it&#039;ll be sub-par at everything else that they do, so they don&#039;t even bother.&nbsp; In this case we probably won&#039;t see more widespread usage of HSA until we have Zen APUs, and that&#039;s assuming that Jim Keller is as good as he has been in the past (a game console or server using an HSA-enabled APU would also very likely take advantage of HSA, but that&#039;s beside the point here).</p>]]></description>
			<author><![CDATA[null@example.com (Nintendo Maniac 64)]]></author>
			<pubDate>Tue, 01 Sep 2015 18:40:07 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53112#p53112</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53094#p53094</link>
			<description><![CDATA[<div class="quotebox"><cite>VB_SVP wrote:</cite><blockquote><p>I&#039;d love to see FMA and GPU SAD (especially as that could be used to efficiently detect and delete duplicate frames, further smoothing out video) come into use.</p></blockquote></div><p>Me too. Even though I have &#039;invested&#039; in AVX2, I&#039;d much rather leave my CPU &#039;under-utilized&#039; if that means we can get GPU-based pixel metrics calculation. Heck, without having to do all the pixel metric calculations, the CPU would be much less loaded than it is now, so AVX2 would be less necessary as a result. GPU-based calculations would also open the door to <strong>much</strong> better image quality (think SSD/SATD instead of SAD).</p><p>Actually, FMA would also help to compute such pixel metrics much more efficiently (although not nearly on the same level as GPU acceleration would). The only problem is that the performance required (even with FMA) would probably put it out of practical reach for all AMD owners anyway. Maybe Zen would make it more practical?<br />FMA also doesn&#039;t really help much with standard SAD calculation and would require someone to write the entire assembly functions from scratch (I don&#039;t think the x264 codebase includes any FMA pixel metrics), but it would be of great benefit to things such as bicubic frame resizing, gamma correction and colorspace conversion (if and when SVP makes use of such features (actually, SVP already uses frame resizing, but that could maybe be implemented more efficiently in the svpflow2.dll GPU-accelerated part of SVP)).</p><div class="quotebox"><cite>VB_SVP wrote:</cite><blockquote><p>As an AMD user who got with the FMA program years before Intel did</p></blockquote></div><p>That really is extremely sad. 
Even though I have only once owned an AMD CPU (the original FX series), I really hoped that the Bulldozer architecture would encourage developers to change their coding styles towards relying more on specific per-processor optimizations and less on outdated, general instruction sets. Since the &#039;thin-and-light&#039; craze has capped CPU performance, the only big performance advances have been, and will continue to be, made by developing increasingly specialized and exclusive instruction sets that are optimized for specific classes of problems.<br />If FMA had caught on, Bulldozer&#039;s performance would have exceeded Intel&#039;s in all video processing (and gaming post-processing) applications, among others, which would have forced a new &#039;instruction set arms race&#039; and continued linear performance scaling together with Moore&#039;s law.</p>]]></description>
			<author><![CDATA[null@example.com (xenonite)]]></author>
			<pubDate>Tue, 01 Sep 2015 14:16:34 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53094#p53094</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53086#p53086</link>
			<description><![CDATA[<div class="quotebox"><cite>xenonite wrote:</cite><blockquote><div class="quotebox"><cite>Nintendo Maniac 64 wrote:</cite><blockquote><p>Just to clarify about SSE2 vs AVX and stuff - I&#039;m guessing none of the newer SSE instructions (save for AVX) would be useful?</p></blockquote></div><p>The FMA instructions, however, do have the potential for some very nice gains in code of the form x = (a+b)*(c+d), which is <strong>very</strong> common in any video or audio processing filter (indeed, video upscaling and downscaling, as well as audio upsampling, downsampling and equalizing, almost entirely consist of loops of millions of x = a + b*c or x+=b*c).<br />But for SVP, it seems that AVX2 would bring the most benefit (though actually processing the SAD calculations as entire frames and kernel blocks, on the GPU, would be capable of speeding up SVP&#039;s main bottleneck by at least an order of magnitude, over and above what AVX2 would be able to do. It&#039;s just also about 10x harder to actually implement <img src="https://www.svp-team.com/forum/img/smilies/hmm.png" width="15" height="15" alt="hmm" /> ).</p></blockquote></div><p>As an AMD user who got with the FMA program <em>years</em> before Intel did (on Intel, FMA requires Haswell+), I&#039;d love to see FMA and GPU SAD (especially as that could be used to efficiently detect and delete duplicate frames, further smoothing out video) come into use.&nbsp; I&#039;m not feeling the AVX2 as much as that has a far smaller &quot;install base&quot; at the moment and I&#039;m not in it <img src="https://www.svp-team.com/forum/img/smilies/sad.png" width="15" height="15" alt="sad" /></p>]]></description>
			<author><![CDATA[null@example.com (VB_SVP)]]></author>
			<pubDate>Tue, 01 Sep 2015 08:24:33 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53086#p53086</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53084#p53084</link>
			<description><![CDATA[<div class="quotebox"><cite>Nintendo Maniac 64 wrote:</cite><blockquote><p>Just to clarify about SSE2 vs AVX and stuff - I&#039;m guessing none of the newer SSE instructions (save for AVX) would be useful?</p></blockquote></div><p>Ugh, it seems my reply never actually posted!<br />Mostly yes, that is correct. Even AVX itself won&#039;t help much; it&#039;s actually the second revision, AVX2, that contains the required instructions.</p><p>SSE3-&gt;4.2 can also provide a bit of a speedup, but it would be the same amount of work as implementing AVX2 for about 20% (maybe even less) of the gain.<br />The FMA instructions, however, do have the potential for some very nice gains in code of the form x = (a+b)*(c+d), which is <strong>very</strong> common in any video or audio processing filter (indeed, video upscaling and downscaling, as well as audio upsampling, downsampling and equalizing, almost entirely consist of loops of millions of x = a + b*c or x+=b*c).<br />But for SVP, it seems that AVX2 would bring the most benefit (though actually processing the SAD calculations as entire frames and kernel blocks, on the GPU, would be capable of speeding up SVP&#039;s main bottleneck by at least an order of magnitude, over and above what AVX2 would be able to do. It&#039;s just also about 10x harder to actually implement <img src="https://www.svp-team.com/forum/img/smilies/hmm.png" width="15" height="15" alt="hmm" /> ).</p><br /><div class="quotebox"><cite>Chainik wrote:</cite><blockquote><p>I prefer to think that x264 guys are extremely experienced with all that stuff</p></blockquote></div><p>Yes, of course. I was just wondering what those x264 calculations would bring to an integer codepath. It seemed strange for SVP, that&#039;s all... Then again, SVP itself comes from the very &#039;strange&#039; MVTools, so there&#039;s that. <img src="https://www.svp-team.com/forum/img/smilies/smile.png" width="15" height="15" alt="smile" /></p>]]></description>
			<author><![CDATA[null@example.com (xenonite)]]></author>
			<pubDate>Tue, 01 Sep 2015 08:15:58 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53084#p53084</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53078#p53078</link>
			<description><![CDATA[<p><strong>xenonite</strong></p><p><em>it seems your code also has &#039;support&#039; for plain AVX instructions in calculating pixel metrics. Do you have any idea why someone would have put floating-point AVX together with the other integer SSE2 optimizations?</em></p><p>I prefer to think that x264 guys are extremely experienced with all that stuff <img src="https://www.svp-team.com/forum/img/smilies/big_smile.png" width="15" height="15" alt="big_smile" /></p>]]></description>
			<author><![CDATA[null@example.com (Chainik)]]></author>
			<pubDate>Mon, 31 Aug 2015 20:57:03 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53078#p53078</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53074#p53074</link>
			<description><![CDATA[<p>Just to clarify about SSE2 vs AVX and stuff - I&#039;m guessing none of the newer SSE instructions (save for AVX) would be useful?</p>]]></description>
			<author><![CDATA[null@example.com (Nintendo Maniac 64)]]></author>
			<pubDate>Mon, 31 Aug 2015 20:33:47 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53074#p53074</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53063#p53063</link>
			<description><![CDATA[<div class="quotebox"><cite>Nintendo Maniac 64 wrote:</cite><blockquote><p>Also I&#039;m not sure why you&#039;re bringing up old CPU architectures like the Pentium 4 in regards to my mention of processor models that are only a year old or so like the Pentium G3258...</p></blockquote></div><p>Because it&#039;s the first in the list of CPUs that SVP is designed for (i.e. SSE2). Also, I&#039;m sorry, that rant wasn&#039;t aimed at you at all. I realize that you were only asking about the impact of building AVX2 libraries on &#039;lower end&#039; CPUs. Luckily, after looking through the code a bit, it seems like SVP is already written to be able to use AVX2 instructions without interfering with the standard SSE2 codepath (at least for the computationally intensive pixel metric calculations). It shouldn&#039;t be too difficult to follow the same template for other functions. At worst, the installer could detect your CPU and install the correct library if it were compiled for different architectures.</p><div class="quotebox"><cite>Chainik wrote:</cite><blockquote><p>parallelization is cheating  <br />we&#039;re already in a heavily multi-threaded environment and we&#039;re not interested in single-threaded performance</p></blockquote></div><p>Yes, I understand that, but Avisynth&#039;s multithreading environment seems kind of funky to me (I have run some scaling tests on a 4-core 3770K with the default libs, like the one you posted a while back with the AMD CPU, and found the optimum number of Avisynth &#039;threads&#039; to be 22. Setting threads=8 leads to more than 50% less performance; I&#039;ll post my scaling results for Avisynth 2.5.8 and 2.6.0 shortly).</p><p>Actually, after reviewing the code a bit, it seems like all the framework is already in place for proper AVX2 support at the assembly level. 
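To make the pixel-metric discussion concrete, here is a minimal scalar sketch of the per-block SAD (sum of absolute differences) metric in question; the function name and block dimensions are assumed for illustration, not taken from SVP&#039;s sources:

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for the SAD block metric used in motion-vector search.
 * AVX2's vpsadbw computes the same absolute-difference sums for a whole
 * 32-byte vector in one instruction, which is where the expected speedup
 * over the SSE2 codepath comes from. */
static unsigned sad_block(const uint8_t *cur, const uint8_t *ref,
                          int stride, int w, int h)
{
    unsigned sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
        cur += stride;  /* advance both blocks one row within the frame */
        ref += stride;
    }
    return sum;
}
```

The motion search evaluates this sum for a huge number of candidate blocks per frame, which is why vectorizing this one inner loop matters so much.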
<br />Do you know of any work that was, or is being, done to enable AVX2 SAD calculation?<br />One more question: it seems your code also has &#039;support&#039; for plain AVX instructions in calculating pixel metrics. Do you have any idea why someone would have put floating-point AVX together with the other integer SSE2 optimizations?</p><div class="quotebox"><cite>sparktank wrote:</cite><blockquote><p>I seem to remember, back when I wanted to get into programming, that the main thing I wanted to learn was how to update things for the AVX extension, but then I read a lot of things (most of which I don&#039;t remember) saying it&#039;s not really worth it.<br />Short/long math is where it really counts?<br />And in most cases for AVS users, it doesn&#039;t count for us so much.</p></blockquote></div><p>Yes, plain AVX instructions only work on floating-point data (32/64-bit fractional numbers), while video data is stored as 8-bit unsigned integers (256 values from 0 to 255). That means AVX only benefits workflows that require the precision of floating-point numbers (which are much slower to work with than integers).<br />AVX2, on the other hand, <strong>does</strong> work on &#039;packed&#039; 8-bit data, so it would provide a very nice speedup over &#039;legacy&#039; SSE~SSE4.2 instructions.</p><div class="quotebox"><blockquote><p>libiomp5md.dll is nowhere on my system.</p><p>It seems there are no longer static OpenMP libraries available for distribution.</p></blockquote></div><p>Ah, I don&#039;t know why it did that (I didn&#039;t enable the OpenMP language option in the project settings, nor does SVP contain any OpenMP instructions whatsoever), but thank you very much for reporting back.</p><p>I&#039;ll try to get the compiler to not link against OpenMP and test the resulting libraries by uninstalling all dev kits on the old 3770K-based system and running it there. 
If I get it to behave, I&#039;ll certainly post the new libraries for you to test as well.</p>]]></description>
			<author><![CDATA[null@example.com (xenonite)]]></author>
			<pubDate>Mon, 31 Aug 2015 12:50:12 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53063#p53063</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53062#p53062</link>
			<description><![CDATA[<p>I tried them in SVP 3.17 and SVP 4 TP and couldn&#039;t get them to work.</p><p>Windows 7 (x64, but SVP is x86).<br />Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz<br />Intel Sandy Bridge rev. 09<br />MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX</p><p>I have SVP installed to the default location (Program Files (x86)\SVP).<br />After renaming the .dll, it just says it can&#039;t find the required library.<br />I close SVP and the media players completely when I try them.</p><p>Running each of the .dlls through Dependency Walker shows one error:<br />&quot;Error: At least one required implicit or forwarded dependency was not found.&quot;<br />Module: libiomp5md.dll &quot;Error opening file. &quot;The system cannot find the file specified (2).&quot;&quot;</p><p>libiomp5md.dll is nowhere on my system.</p><p>It seems there are no longer static OpenMP libraries available for distribution.<br /><a href="https://software.intel.com/en-us/articles/openmp-static-library-deprecation-in-intelr-mkl-on-microsoft-windows">https://software.intel.com/en-us/articl … ft-windows</a><br /><a href="https://software.intel.com/en-us/articles/openmp-support-changes-in-intel-performance-libraries">https://software.intel.com/en-us/articl … -libraries</a></p><p>EDIT:<br />Not that I&#039;m in a real rush to compare anything.<br />I seem to remember, back when I wanted to get into programming, that the main thing I wanted to learn was how to update things for the AVX extension, but then I read a lot of things (most of which I don&#039;t remember) saying it&#039;s not really worth it.<br />Short/long math is where it really counts?<br />And in most cases for AVS users, it doesn&#039;t count for us so much.<br />I remember something to that effect out of the myriad of information I read through.</p>]]></description>
			<author><![CDATA[null@example.com (sparktank)]]></author>
			<pubDate>Mon, 31 Aug 2015 09:17:44 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53062#p53062</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53060#p53060</link>
			<description><![CDATA[<p><strong>xenonite</strong><br /><em>I don&#039;t see any vectorizing or parallelizing hints to the compiler in the source code<br />... /Qipo</em></p><p>parallelization is cheating <img src="https://www.svp-team.com/forum/img/smilies/big_smile.png" width="15" height="15" alt="big_smile" /> <br />we&#039;re already in a heavily multi-threaded environment and we&#039;re not interested in single-threaded performance</p><p><em>Have you ever run a performance profiler on the complete SVP program</em></p><p>A very long time ago; I don&#039;t remember the actual numbers</p>]]></description>
			<author><![CDATA[null@example.com (Chainik)]]></author>
			<pubDate>Mon, 31 Aug 2015 08:56:19 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53060#p53060</guid>
		</item>
		<item>
			<title><![CDATA[Re: Intel Parallel Studio 2015 Optimized AVX Builds of svpflow1.dll]]></title>
			<link>https://www.svp-team.com/forum/viewtopic.php?pid=53057#p53057</link>
			<description><![CDATA[<p>...uhh, my point was just to point out a possible reason why mashingan could have been having issues.</p><p>Also I&#039;m not sure why you&#039;re bringing up old CPU architectures like the Pentium 4 in regards to my mention of processor models that are only a year old or so like the Pentium G3258...</p>]]></description>
			<author><![CDATA[null@example.com (Nintendo Maniac 64)]]></author>
			<pubDate>Mon, 31 Aug 2015 03:39:38 +0000</pubDate>
			<guid>https://www.svp-team.com/forum/viewtopic.php?pid=53057#p53057</guid>
		</item>
	</channel>
</rss>
